1. Introduction
Visual object tracking (VOT) [1,2,3,4,5,6,7] is a fundamental task in remote sensing and computer vision that involves predicting the position and shape of a target in each subsequent frame based on its position and shape in an initial frame. This task has numerous applications, including aerial or satellite platform surveillance, human–computer interaction [8], mobile robots [9], and autonomous driving [10]. Although mainstream tracking algorithms can balance accuracy and speed [11,12,13,14,15,16,17], object tracking still faces significant challenges, such as fast-moving targets, background interference, occlusion, shape changes, and target loss. Designing a robust, accurate, and real-time tracker therefore remains a demanding task. VOT methods generally follow a multi-level pipeline [17,18,19] that comprises three parts. The first part is feature extraction, which extracts generic features from both the template and the search region. The second part is information interaction, which aggregates the feature information of the template and the search region. The final part is bounding box and location estimation, typically achieved through a task-specific head that allows for precise target positioning and bounding box estimation.
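For illustration, the three-part pipeline can be sketched in PyTorch-style pseudocode as follows; the module names (backbone, fusion, box_head) are generic placeholders for the three stages and do not correspond to the components of any specific tracker cited above:

```python
import torch
import torch.nn as nn

class GenericTracker(nn.Module):
    """Illustrative three-stage VOT pipeline: feature extraction -> interaction -> head."""
    def __init__(self, backbone: nn.Module, fusion: nn.Module, box_head: nn.Module):
        super().__init__()
        self.backbone = backbone   # 1) generic feature extraction
        self.fusion = fusion       # 2) template/search information interaction
        self.box_head = box_head   # 3) bounding box and location estimation

    def forward(self, template: torch.Tensor, search: torch.Tensor) -> torch.Tensor:
        z = self.backbone(template)   # template features
        x = self.backbone(search)     # search-region features
        fused = self.fusion(z, x)     # aggregate template information into search features
        return self.box_head(fused)   # predict the target state (e.g., a bounding box)
```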
The classic VOT algorithms, including correlation-based methods such as SiamFC (Siamese Fully Convolutional Network) [20], SiamRPN (Siamese Region Proposal Network) [21], C-RPN (Cascaded Region Proposal Network) [22], SiamFC++ [17], SiamBAN (Siamese Box Adaptive Network) [23], SiamCAR (Siamese Classification and Regression network) [24], and Ocean (Object-aware Anchor-free Network) [13], as well as online learning algorithms such as MOSSE (Minimum Output Sum of Squared Error filter) [25], KCF (Kernelized Correlation Filter) [26], CSR-DCF (Channel Spatial Reliability-Discriminative Correlation Filter) [27], ATOM (Accurate Tracking by Overlap Maximization) [28], DiMP (Discriminative Model Prediction) [29], FCOT (Fully Convolutional Online Tracking) [14], and ECO (Efficient Convolution Operators) [30], generally use CNNs as the basic components for feature extraction and information interaction. However, CNNs are not always effective in modeling long-range dependencies between image content and time-series features, which may result in target loss in some cases (e.g., when the target is briefly occluded). In sequence modeling, the transformer [31] overcomes this problem with its global and dynamic modeling capabilities. For tracking tasks [32,33], the transformer can extract discriminative spatial and temporal features, making it a fundamental component of modern high-precision trackers. Although the flexibility of attention in vision transformers (VIT) [34,35] enables higher performance, a main challenge of current VIT-based trackers [19] is their slow training convergence caused by the large amount of input image content, which places a heavier burden on computing resources [32].
The current algorithms used for object tracking on unmanned aerial vehicle (UAV) platforms face several challenges, as highlighted in previous studies [36,37]. These challenges can be summarized in two points. First, UAV videos are prone to motion blur, occlusion, and background clutter due to the rapid movement of the drone. The altitude variation of the drone's six degrees of freedom (6 DOF) motion further complicates the task by causing objects to appear at different scales. As a result, extracting stable features from UAV videos is considerably harder than from videos in other domains. Second, object tracking in UAV videos requires the rapid and accurate localization of moving objects, making real-time performance a critical requirement. Therefore, reducing the convergence difficulty of a modern tracker is necessary before it can be deployed on a UAV platform. To address the challenges faced by existing tracking methods, we design a new tracker called ParallelTracker, which offers several key advantages. First, we design the prior knowledge extractor module (PEM) to extract spatial information from the template image, addressing the lack of spatial prior information in the VIT architecture. This spatial knowledge assists the tracker in both training and inference while reducing the computational burden during training and maintaining strong performance. Second, we design the template features parallel enhancing module (TPM), a parallel attention module that utilizes prior knowledge from the template image to capture more target-oriented discriminative features and enhance feature extraction. Furthermore, we design the template prior knowledge merge module (TPKM), based on TPM, to further extract features and facilitate information communication between the template and the search region. TPKM uses position self-attention and channel self-attention to extract deep features from both the template and the search region in parallel, allowing information interaction in both the position and channel dimensions through parallel attention operations. This improves the acquisition of correlations between the template and the search region. Finally, we use a simple localization head to complete the tracking process. Additionally, we design a score prediction head based on parallel attention to implement the target template update strategy. Our ParallelTracker requires fewer training epochs than existing VIT-based trackers [18,19,38], effectively decreasing the difficulty of convergence and striking a balance between performance and speed.
The main contributions of this paper can be divided into two aspects. First, we propose a simple, concise, and effective VIT-based tracker called ParallelTracker that aims to enhance the accuracy of VIT-based tracking algorithms for UAV video by addressing the challenges of motion blur, camera motion, occlusion, and slow convergence. We design PEM, TPM, and TPKM to assist feature extraction and information interaction, leading to faster convergence and better performance. Second, we demonstrate that ParallelTracker achieves state-of-the-art performance on the UAV tracking benchmark UAV123, faster convergence, and real-time performance of 25 FPS on an NVIDIA GeForce RTX 2070 GPU.
2. Related Literature
Early Siamese tracking networks, such as SiamFC [20], SiamRPN [21], and SiamBAN [23], first extract features from the template and the search region using CNN branches with identical structures and parameters. The template and search-region features are then fused using correlation calculation to estimate the subsequent target state. These networks typically adopt CNNs, including ResNet [39] and others [40], as the backbone for extracting image features.
The transformer [31], originally proposed for machine translation tasks in NLP (natural language processing), is now widely adopted for various vision tasks. It introduced a multi-head attention mechanism to enable information interaction among the elements of a sequence, endowing it with unique memory and global computing capabilities. Compared to a CNN with a local receptive field, the core component of the transformer, self-attention, is capable of capturing global information and computing interdependencies among image features. The DETR model [41] introduced the transformer into object detection and surpassed many CNN-based methods. Many researchers [19,32,36] have also introduced DETR into the field of visual object tracking, achieving excellent results. OSTrack [18] proposed a fully end-to-end tracker that uses the transformer to perform feature extraction and information interaction simultaneously. TrDiMP [33] integrated transformer structures into tracking frameworks by separating the encoder and decoder of the transformer and exploiting temporal information between frames, which effectively enhanced the robustness of Siamese-like trackers. TrTr [42] utilized contextual information extracted from the learned template image in the encoder and transmitted it to the decoder for information exchange with the search region. TransT [32] proposed an attention mechanism for nonlinear feature fusion, which effectively integrated global information from both template images and search regions by capturing associated information from long-range features. STARK [19] concatenated template images and search regions to learn spatiotemporal features jointly across frames. ParallelTracker differs fundamentally from these existing trackers in several aspects. Specifically, instead of using self-attention as in Dual_VIT [43], our designed modules employ a parallel attention structure for feature extraction and information interaction. Moreover, we adopt multiple templates and search regions as inputs and utilize a corner-based localization head to generate bounding boxes.
3. Methodology
We propose an end-to-end tracking framework named ParallelTracker, which comprises the PEM module for extracting the spatial prior of the template, as well as the TPM and TPKM modules for enhancing the features and fusing the target information, respectively. In addition, ParallelTracker incorporates a corner-based fully convolutional localization head that estimates the bounding box of the tracked object. Moreover, we introduce a confidence-based score head, which offers a specially designed mechanism for updating the online template to address potential deformations of the tracked object. The overall architecture of ParallelTracker is illustrated in Figure 1.
3.1. PEM, TPM and TPKM
We design PEM, as shown in Figure 2a, to extract prior knowledge from object templates.
We propose PEM to extract local spatial information from the template image, boosting the capacity of the tracker while mitigating the computational burden during training. The utilization of local spatial information in the vision transformer enables the model to capture specific local features of the object. Additionally, this approach imbues the model with inductive biases similar to those of CNNs, such as locality and spatial invariance. Prior knowledge can be defined as the local spatial feature of the target, which informs the model's understanding of the target's spatial context. The specific operation of PEM is as follows. First, we obtain downsampled template tokens using a fully convolutional patch embedding layer, which is composed of convolutional layers that increase the channel dimension while reducing the spatial resolution. Specifically, given $T$ templates (i.e., the first template and online templates) of size $H_z \times W_z \times 3$, we first map them into patch embeddings using a convolutional layer with stride 4 and kernel size 7. Then, we flatten the patch embeddings, resulting in a token sequence of size $\frac{H_z}{4}\frac{W_z}{4} \times C$, where $C$ represents the number of channels and is set to 96, and $H_z$ and $W_z$ denote the height and width of the template, respectively, both set to 128. Next, these tokens are fed into an average pooling layer, which reduces their resolution while preserving vital information. The target query is initialized as a random vector and upsampled through bicubic interpolation to match the shape of the template tokens. To map the template tokens and the target query to the same linear space while retaining the spatial information of the template in the target query through a self-attention mechanism, we incorporate a linear mapping (LM) layer into our model. This allows us to extract local spatial information as spatial prior knowledge, which guides subsequent modules in feature extraction, feature enhancement, and information interaction. As a result, prior information, called prior tokens (PT), can be incorporated into the tracking network, which is a distinctive feature of our model and sets it apart from other approaches in the field. As shown in (1),

$$\mathrm{PT} = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V, \qquad (1)$$

where $d$ represents the dimension of the key, $q$ represents the target query, and $z$ represents the template tokens. The learnable target token $q$ is interpolated to match the spatial shape of the template tokens $z$, and then all tokens are linearly mapped to obtain $Q$, $K$, and $V$. We utilize a self-attention mechanism to extract spatial prior information from the template tokens and fuse it into the target tokens, resulting in the generation of tokens with template prior knowledge.
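The PEM computation described above can be summarized in the following minimal PyTorch-style sketch. It assumes single-head attention, a pooling factor of 2, and an initial 4 x 4 target query; these values, the layer names, and the exact projection layout are illustrative assumptions rather than the authors' released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PEM(nn.Module):
    """Prior knowledge extractor: distills spatial priors from template tokens."""
    def __init__(self, dim: int = 96, pool: int = 2):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=7, stride=4, padding=3)
        self.pool = nn.AvgPool2d(pool)                  # reduce resolution, keep vital info
        self.q_proj = nn.Linear(dim, dim)               # linear mappings (LM) to a shared space
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.query = nn.Parameter(torch.randn(1, dim, 4, 4))  # randomly initialized target query

    def forward(self, template: torch.Tensor) -> torch.Tensor:
        # template: (B, 3, Hz, Wz) -> pooled template tokens: (B, N, C)
        feat = self.pool(self.patch_embed(template))
        B, C, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)
        # upsample the target query via bicubic interpolation to the template-token shape
        query = F.interpolate(self.query, size=(h, w), mode="bicubic", align_corners=False)
        query = query.expand(B, -1, -1, -1).flatten(2).transpose(1, 2)
        # self-attention fuses template spatial information into the target query
        q, k, v = self.q_proj(query), self.k_proj(tokens), self.v_proj(tokens)
        attn = (q @ k.transpose(-2, -1)) / (C ** 0.5)
        prior_tokens = attn.softmax(dim=-1) @ v         # prior tokens (PT)
        return prior_tokens
```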
TPM is designed to enrich template features, as illustrated in Figure 2b. TPM consists of two parallel attention modules and two feed-forward layers (FFN), with layer normalization (LN) applied before the FFN and LM layers. The TPM module utilizes the prior tokens (PT) extracted by the PEM module to enhance the template features, thereby improving the ability of the extracted features to attend to both spatial and semantic information in the target template. The first attention module enhances the spatial representation in the template tokens, while the other attention module extracts key information from the template features and stores it in the enhanced prior tokens. The implementation of these two attentions is shown in Equation (2).
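Since Equation (2) is not reproduced in the text above, the following is a hedged sketch of our reading of the TPM structure: two cross-attention branches that run in parallel between template tokens and prior tokens, each followed by an FFN. The attention directions, head count, and FFN width are assumptions for illustration only:

```python
import torch
import torch.nn as nn

class TPM(nn.Module):
    """Template features parallel enhancing module: two attention branches in parallel."""
    def __init__(self, dim: int = 96, heads: int = 4, ffn_mult: int = 4):
        super().__init__()
        self.norm_t = nn.LayerNorm(dim)
        self.norm_p = nn.LayerNorm(dim)
        # branch 1: prior tokens enrich the spatial representation of the template tokens
        self.attn_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        # branch 2: key template information is distilled back into the prior tokens
        self.attn_p = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn_t = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, ffn_mult * dim),
                                   nn.GELU(), nn.Linear(ffn_mult * dim, dim))
        self.ffn_p = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, ffn_mult * dim),
                                   nn.GELU(), nn.Linear(ffn_mult * dim, dim))

    def forward(self, template_tokens, prior_tokens):
        t, p = self.norm_t(template_tokens), self.norm_p(prior_tokens)
        # the two branches are independent, so they can be evaluated in parallel
        t_out, _ = self.attn_t(query=t, key=p, value=p)   # template attends to priors
        p_out, _ = self.attn_p(query=p, key=t, value=t)   # priors attend to template
        template_tokens = template_tokens + t_out
        prior_tokens = prior_tokens + p_out
        return (template_tokens + self.ffn_t(template_tokens),
                prior_tokens + self.ffn_p(prior_tokens))
```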
TPKM, as illustrated in Figure 2c, closely resembles the TPM module. TPKM consists of two multi-head attention layers and three feed-forward layers (FFN). Layer normalization is applied before the FFN and the attention layers. Parallel attention is utilized to engage with distinct dimensions of the sequence features, thereby intensifying the communication of fused feature information, as shown in (3), where $T$ and $S$ denote the feature sequences of the template and the search region, respectively. The fused features are represented by $x$, which is obtained by concatenating the online template/template tokens and the search tokens along the same dimension. To perform information interaction, we employ parallel position attention and channel attention.
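To make the "position plus channel" interaction concrete, a compact sketch is given below. It treats position attention as self-attention over the token dimension and channel attention as self-attention over the transposed, channel dimension; the single-head formulation and the additive merge of the two branches are assumptions for illustration:

```python
import torch
import torch.nn as nn

def scaled_attention(q, k, v):
    # generic scaled dot-product attention over the last two dimensions
    attn = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
    return attn.softmax(dim=-1) @ v

class ParallelPositionChannelAttention(nn.Module):
    """Core of a TPKM-style block: position and channel attention run in parallel."""
    def __init__(self, dim: int = 96):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, template_tokens, search_tokens):
        # fuse template and search tokens along the sequence (position) dimension
        x = torch.cat([template_tokens, search_tokens], dim=1)   # (B, Nt + Ns, C)
        x = self.norm(x)
        pos = scaled_attention(x, x, x)                          # token-to-token interaction
        xc = x.transpose(1, 2)                                   # (B, C, Nt + Ns)
        chn = scaled_attention(xc, xc, xc).transpose(1, 2)       # channel-to-channel interaction
        return self.proj(pos + chn)                              # merge the two parallel branches
```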
3.2. Detailed Structure of ParallelTracker
The proposed end-to-end ViT-based tracking network, ParallelTracker, embedded with PEM, TPM, and TPKM (as illustrated in Figure 2), consists of three parts: a backbone, a localization head, and a confidence prediction head. By integrating feature extraction and information interaction, our tracker achieves faster convergence speed. Moreover, it decouples the image content features from prior knowledge, without requiring any additional post-processing or integration modules.
Backbone. The backbone employs a step-by-step multi-block architecture strategy consisting of three distinct blocks. These blocks operate on feature maps that have been down-sampled to the same scale and have the same number of channels. Each block varies slightly in structure, incorporating overlapping patch embedding layers and a different number of PEMs, TPMs, or TPKMs.
In our approach, we input $T$ templates of size $128 \times 128 \times 3$ and a search region of size $320 \times 320 \times 3$ into the tracker. The $T$ templates comprise the template generated from the initial frame and the online templates selected by the confidence prediction head; in this study, $T$ is fixed. The search region is cropped according to the previous target state. To map the input images to tokens and increase the channel dimension while reducing the spatial resolution of the image features, we utilize the patch embedding layer, which comprises a convolutional layer, batch normalization, and ReLU. Subsequently, in the first block, the image features are converted through the embedding layer into template tokens and search tokens of shapes $\frac{H_z}{4} \times \frac{W_z}{4} \times C$ and $\frac{H_x}{4} \times \frac{W_x}{4} \times C$, respectively, where $C$ is 96, $H_z$ and $W_z$ are 128, and $H_x$ and $W_x$ are 320. The template tokens are then fed into the PEM to obtain the template prior knowledge.
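A small sketch of the patch embedding layer and the resulting token shapes for the stated input sizes is given below; the padding value is a plausible assumption chosen to produce the x4 spatial down-sampling, not the authors' exact configuration:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Overlapping patch embedding: conv + batch norm + ReLU, spatial /4, channels -> 96."""
    def __init__(self, in_ch: int = 3, dim: int = 96):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_ch, dim, kernel_size=7, stride=4, padding=3),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        x = self.proj(x)                        # (B, 96, H/4, W/4)
        return x.flatten(2).transpose(1, 2)     # token sequence (B, H/4 * W/4, 96)

embed = PatchEmbed()
template_tokens = embed(torch.randn(1, 3, 128, 128))   # -> (1, 32*32, 96) = (1, 1024, 96)
search_tokens = embed(torch.randn(1, 3, 320, 320))     # -> (1, 80*80, 96) = (1, 6400, 96)
print(template_tokens.shape, search_tokens.shape)
```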
In the second block, the image features of the templates and the search region are further down-sampled by the embedding layer, and TPM is then employed to enhance the template token features. In the final block, the down-sampled features from the first two blocks are concatenated to generate the fused feature token sequence, and TPKM is employed to perform the information interaction. Before being fed into the localization head, the fused feature token sequence is first split into template tokens, online template tokens, and search tokens; only the search tokens are input to the localization head and the score head. The search tokens are then converted back into a spatial feature map of the search region. Here, $H_x$ and $W_x$ indicate the spatial dimensions of the search region, whereas $H_z$ and $W_z$ indicate the spatial dimensions of the template. Unlike other trackers [36,38], we do not employ a multi-scale feature aggregation strategy.
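The split of the fused token sequence before the heads can be sketched as follows; the x16 total down-sampling factor, the number of templates, and the token ordering (all template tokens first, then search tokens) are hypothetical assumptions for illustration:

```python
import torch

def split_and_reshape(fused, num_templates, hz, wz, hx, wx, dim):
    """Split fused tokens into template / online-template / search parts and
    reshape only the search tokens into a spatial map for the heads."""
    n_template = num_templates * hz * wz          # assumed ordering: template tokens first
    search_tokens = fused[:, n_template:, :]      # (B, hx*wx, C); template tokens are dropped here
    b = fused.shape[0]
    return search_tokens.transpose(1, 2).reshape(b, dim, hx, wx)

# e.g., with a hypothetical total stride of 16: template 128/16 = 8, search 320/16 = 20
fused = torch.randn(2, 3 * 8 * 8 + 20 * 20, 96)   # 3 templates (hypothetical T) + 1 search region
search_map = split_and_reshape(fused, num_templates=3, hz=8, wz=8, hx=20, wx=20, dim=96)
print(search_map.shape)   # torch.Size([2, 96, 20, 20])
```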
Head. The proposed tracking network employs a corner-based fully convolutional localization head for predicting the bounding box, which resembles the STARK method [19]. The localization head relies exclusively on a simple convolutional network to predict the upper-left and lower-right corners, after which it computes the expected probability distribution of the corners to predict the bounding box. Using the localization head to estimate the bounding box constitutes Stage 1 of our tracking system. The effectiveness of tracking is significantly impacted by low-quality online templates. In view of this, we draw inspiration from the scoring modules in contemporary trackers [19,32,36] and propose a parallel scoring prediction head (PSH) for selecting reliable online templates based on predicted confidence scores, as shown in Figure 3.
The PSH is composed of three attention blocks, including two position attentions and one channel attention, and a three-layer perceptron. Initially, a learnable score token is used as a query to attend to the search ROI tokens, allowing the score token to encode the extracted target information. The attended score token then compares the extracted target with the tracked target at all positions of the initial target, and the scores are generated using MLP layers and a sigmoid activation. Any online template with a predicted score below 0.5 is deemed negative. The utilization of PSH to derive scores constitutes Stage 2 of our tracking process.
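A hedged sketch of this scoring mechanism is shown below. For brevity, only the two position-attention blocks are modeled and the channel attention is omitted; the head count, MLP widths, and ROI-token extraction are assumptions rather than the paper's exact design:

```python
import torch
import torch.nn as nn

class ScoreHead(nn.Module):
    """PSH-style confidence head: a learnable score token attends to search ROI tokens,
    then a three-layer MLP with sigmoid yields a reliability score in [0, 1]."""
    def __init__(self, dim: int = 96, heads: int = 4):
        super().__init__()
        self.score_token = nn.Parameter(torch.randn(1, 1, dim))
        self.attn1 = nn.MultiheadAttention(dim, heads, batch_first=True)  # position attention
        self.attn2 = nn.MultiheadAttention(dim, heads, batch_first=True)  # position attention
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, roi_tokens):
        # roi_tokens: (B, N, C) tokens cropped around the predicted target
        q = self.score_token.expand(roi_tokens.shape[0], -1, -1)
        q, _ = self.attn1(q, roi_tokens, roi_tokens)   # encode target evidence into the token
        q, _ = self.attn2(q, roi_tokens, roi_tokens)
        return self.mlp(q.squeeze(1))                  # per-sample confidence score

# usage: a candidate online template is kept only when its score exceeds 0.5
# score = ScoreHead()(roi_tokens); update = score > 0.5
```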
3.3. Training and Inference
Training. The training procedure of ParallelTracker generally adheres to the standard training protocol employed by current trackers [19]. We begin by pre-training the backbone network of ParallelTracker using the Dual_VIT model. Subsequently, we fine-tune the tracking network using labeled tracking datasets. We employ a combination of L1 loss and GIoU [44] loss as the objective function:

$$L_{loc} = \lambda_{GIoU}\, L_{GIoU}(B, \hat{B}) + \lambda_{L_1}\, L_1(B, \hat{B}),$$

where $\lambda_{GIoU}$ and $\lambda_{L_1}$ are the weight values of the two losses, $B$ is the ground-truth bounding box of the target, and $\hat{B}$ is the predicted regression box.
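This combined objective can be sketched with torchvision's generalized IoU loss as follows; the weight values used here are placeholders, not the paper's settings:

```python
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou_loss

def localization_loss(pred_boxes, gt_boxes, lambda_giou=2.0, lambda_l1=5.0):
    """Weighted sum of GIoU and L1 losses; boxes are (x1, y1, x2, y2)."""
    giou = generalized_box_iou_loss(pred_boxes, gt_boxes, reduction="mean")
    l1 = F.l1_loss(pred_boxes, gt_boxes)
    return lambda_giou * giou + lambda_l1 * l1
```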
Template online update. The template is an image patch of size $128 \times 128 \times 3$ cropped from the initial frame of a video. The online templates are image patches of the same size, but they are dynamically selected from the current frames by PSH. Online templates play a crucial role in extracting target-related information from the time series. Nevertheless, tracking may encounter issues such as motion blur or objects going beyond the field of view, which decrease the similarity between online templates and the target. To address this issue, we propose the parallel scoring prediction head (PSH), as illustrated in Figure 3, to select reliable online templates using confidence scores. We train PSH separately using the standard cross-entropy loss:

$$L_{score} = -\sum_{i}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right],$$

where $y_i$ is the ground-truth label, $p_i$ is the predicted confidence score, and $i$ indexes the samples.
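Training the score head with this binary cross-entropy and applying the 0.5 acceptance threshold at inference can be sketched as follows; the labeling rule (label 1 when the candidate online template actually contains the target) is an assumption for illustration:

```python
import torch
import torch.nn.functional as F

def score_loss(pred_scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy over predicted template-reliability scores in [0, 1]."""
    return F.binary_cross_entropy(pred_scores.squeeze(-1), labels.float())

def should_update_template(score: torch.Tensor, threshold: float = 0.5) -> bool:
    # accept a candidate online template only when its predicted score exceeds 0.5
    return bool(score.item() > threshold)
```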
5. Conclusions
We have developed a new end-to-end tracking algorithm, called ParallelTracker, which incorporates prior knowledge and parallel attention mechanisms to integrate image priors with the feature extraction and interaction processes. Specifically, the PEM module addresses the lack of spatial prior information in the VIT-based tracker and enhances efficiency; the TPM module leverages the prior information in the template image to extract object-oriented and discriminative features in searched frames; and the TPKM module facilitates information exchange between templates and search regions. In addition, PSH incorporates an object template update strategy to enhance the tracker's ability to accommodate changes in object shape and occlusions.
Experimental results demonstrate that ParallelTracker outperforms state-of-the-art algorithms on UAV videos. ParallelTracker also maintains accuracy comparable to the latest methods in close-range video tracking scenarios, demonstrating its strong generalization ability. Moreover, ParallelTracker significantly reduces the number of training epochs that popular methods require to converge, which notably lowers the convergence difficulty of the VIT-based tracker.