Article

FETrack: Feature-Enhanced Transformer Network for Visual Object Tracking

by Hang Liu 1, Detian Huang 1,2,* and Mingxin Lin 1

1 College of Engineering, Huaqiao University, Quanzhou 362021, China
2 Quanzhou Digital Institute, Quanzhou 362021, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(22), 10589; https://doi.org/10.3390/app142210589
Submission received: 28 August 2024 / Revised: 2 November 2024 / Accepted: 14 November 2024 / Published: 17 November 2024
(This article belongs to the Special Issue Applications in Computer Vision and Image Processing)

Abstract:
Visual object tracking is a fundamental task in computer vision, with applications ranging from video surveillance to autonomous driving. Despite recent advances in transformer-based one-stream trackers, unrestricted feature interactions between the template and the search region often introduce background noise into the template, degrading the tracking performance. To address this issue, we propose FETrack, a feature-enhanced transformer-based network for visual object tracking. Specifically, we incorporate an independent template stream into the encoder of the one-stream tracker to acquire high-quality template features while effectively suppressing harmful background noise. Then, we employ a sequence-learning-based causal transformer in the decoder to generate the bounding box autoregressively, simplifying the prediction head network. Further, we present a dynamic threshold-based online template-updating strategy and a template-filtering approach to boost tracking robustness and reduce redundant computations. Extensive experiments demonstrate that our FETrack achieves a superior performance over state-of-the-art trackers. Specifically, the proposed FETrack achieves a 75.1% AO on GOT-10k, 81.2% AUC on LaSOT, and 89.3% $P_{norm}$ on TrackingNet.

1. Introduction

As a crucial task in computer vision, visual object tracking (VOT) [1] has been widely applied in video surveillance [2], autonomous driving [3], human–computer interaction [4], medical imaging [5], etc. VOT aims to estimate the state (i.e., position and scale) of the target across subsequent frames of a video sequence by constructing an appearance model based on the target provided in the first frame. Despite the numerous excellent VOT methods proposed in past decades, designing a robust and accurate tracker still faces significant challenges due to various complex real-world scenarios such as appearance change, occlusion, background clutter, and similar objects. Benefiting from the powerful feature representation capabilities of Convolutional Neural Networks (CNNs) and transformers, numerous deep-learning-based trackers have achieved remarkable success. In particular, transformer-based trackers [6] have shown an impressive performance in dealing with challenging real-world scenarios.
Classic tracking paradigms (e.g., Siamese network and two-stream pipeline trackers) can be roughly divided into three stages, i.e., feature extraction, feature fusion, and target prediction. In general, robust trackers possess powerful feature extraction and accurate feature fusion capabilities [7]. Although CNN-based trackers have achieved remarkable results, they suffer from weak global interaction capabilities and tend to obtain only locally optimal solutions because CNNs are limited to local feature extraction. In recent years, benefiting from the advantages of capturing contextual information, handling long sequential data, and parallel computing, transformers have shown great potential in the field of natural language processing (NLP) [8]. Inspired by this, many efforts have been devoted to adopting the transformer architecture to construct tracking networks, thereby improving global interaction capabilities. Specifically, benefiting from their powerful global feature modeling, transformers have gradually been applied in the feature extraction and feature fusion stages of tracking networks. At present, most mainstream transformer-based trackers [9,10,11,12] adopt the one-stream one-stage tracking pipeline. Such trackers leverage the flexibility of the attention mechanism and unify feature extraction and interaction between the template and the search region in a Vision Transformer (ViT) [13] backbone. By performing extensive feature matching between the template and the search region at multiple levels, the one-stream tracker acquires discriminative features owing to its ability to suppress redundant non-target features. Nevertheless, there are often significant scale gaps between the target and the background in the search region [14,15], where interfering objects with an appearance similar to the target may be present. As a result, the deep interaction between the template and the search region may introduce background noise into the template features. This degrades the template feature quality and hinders the tracking performance, especially for trackers that rely heavily on similarity matching. On the other hand, the undesirable interaction between the template and the search region may cause the enhanced features of similar interfering objects to contain target features from the template, resulting in confusion between the target and the background during subsequent target prediction. Especially when numerous similar objects surround the target, such objects impair the awareness ability of the tracker [16], leading to tracking drift or even loss of the target. Additionally, most existing trackers decompose the prediction process into a series of subtasks, such as foreground–background classification, scale estimation, and center point location. These subtasks usually require specific head networks and various post-processing techniques, which tend to complicate the tracking framework and decrease the speed of training and inference. Meanwhile, the loss function of each head network introduces additional hyperparameters and must be trained and evaluated separately, inevitably increasing the consumption of computational resources. This runs counter to the simple and efficient end-to-end framework required in real-world applications.
To address the above issues, we propose a feature-enhanced transformer-based network for robust visual object tracking (FETrack). The proposed FETrack employs a transformer-based encoder–decoder structure, where the encoder is responsible for extracting visual features of the template and the search region and the decoder is responsible for predicting the target bounding box autoregressively. To prevent the templates from being corrupted by background noise, we embed an independent template stream in the original one-stream encoder, in which we employ the self-attention mechanism to enhance the learned template features and alleviate interference from the search region. Additionally, we adopt cross-attention to enhance both the template and search region features so as to extract target-oriented features under mutual guidance. Then, we propose a template-filtering approach to obtain a smaller yet more accurate template cropped from the template image, which improves tracking accuracy and reduces the computational effort. Subsequently, to simplify the prediction network, we adopt a sequence-learning-based method that converts the four values of the target bounding box into discrete token sequences in the decoder. Specifically, a causal transformer is used to autoregressively generate the bounding box tokens one by one and fuse them with visual features from the encoder via a cross-attention layer. As a result, the proposed FETrack dispenses with both customized head networks and complicated post-processing techniques. Additionally, during inference, different from existing trackers that rely on a two-stage inference update or a two-stage training network, we design a novel online template-updating strategy. This strategy calculates a dynamic threshold adaptively to obtain high-quality templates, so as to prevent incorrect template updating due to long-term occlusion and the disappearance of the target.
We evaluate the performance of our tracker on six benchmarks (i.e., GOT-10k [17], LaSOT [18], TrackingNet [19], LaSOT$_{ext}$ [20], OTB100 [21], and NFS30 [22]), and the experimental results show that our FETrack achieves state-of-the-art performance. In summary, the main contributions of this paper are as follows:
  • To tackle the performance degradation of one-stream trackers due to unrestrained information interaction, we propose FETrack, a feature-enhanced transformer-based network for visual object tracking. Experimental results on six benchmarks validate the superiority of our FETrack and the effectiveness of the proposed modules.
  • An independent template stream is incorporated into the one-stream encoder, which effectively protects the quality of the template features from background interference.
  • A dynamic threshold-based online template-updating strategy is presented to adaptively select the template with the highest similarity, and a template-filtering approach is designed to alleviate the background noise and substantially reduce the computational cost.

2. Related Work

2.1. Visual Object Tracking

Siamese network-based trackers [15,23,24] accomplish object tracking by comparing two input images (i.e., the template image and the search image). Initially, CNNs with a shared structure and parameters are used to extract features from both the template and the search region. Subsequently, a convolution operation computes the similarity between these features to locate the target, as sketched below. However, traditional CNNs cannot fully exploit global context information, and the convolution operation is insufficient to capture the nonlinear interactions between the template and the search region, which limits the performance of Siamese-based trackers. In recent years, transformer [25] architectures have made remarkable progress in visual tasks. Numerous trackers [12,26,27,28] improve the Siamese network by introducing stacked transformer layers, which substantially boosts the tracking performance.
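As a concrete illustration of this similarity computation, the following minimal PyTorch sketch slides the template feature map over the search feature map as a cross-correlation kernel, in the spirit of SiamFC [15]; the feature shapes are assumptions for illustration, and this is not the code of any specific tracker discussed here.

```python
import torch
import torch.nn.functional as F

def siamese_response(template_feat, search_feat):
    """Cross-correlation between template and search features.

    The template feature map (1, C, h, w) is used as a convolution kernel over
    the search feature map (1, C, H, W); the peak of the resulting response map
    indicates the most likely target location.
    """
    return F.conv2d(search_feat, template_feat)  # -> (1, 1, H-h+1, W-w+1)

# Toy usage with assumed feature sizes
resp = siamese_response(torch.randn(1, 256, 6, 6), torch.randn(1, 256, 22, 22))
print(resp.shape, int(resp.flatten().argmax()))
```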

2.2. Transformer in Tracking

Transformers, originally developed for translation tasks in natural language processing (NLP) [8], have recently gained prominence in object tracking due to their exceptional global modeling capabilities. The pioneering work TransT [26] follows the two-stream architecture of the hybrid CNN-transformer [29]: it constructs an attention-based feature fusion network by stacking a series of self-attention and cross-attention layers and employs a Siamese-like feature extraction backbone. Furthermore, DualTFR [30] constructs a Siamese-like dual-branch network based on pure transformers, which utilizes a series of transformer-based local-global attention blocks for feature extraction and feature fusion. However, the two-stream tracker pipeline lacks the ability to perceive the target due to its independent feature extraction. To address this limitation, MixFormer [27] proposes a one-stream pipeline that leverages the flexibility of the self-attention mechanism and introduces a mixed attention module (MAM) for simultaneous feature extraction and target information integration. Similarly, OSTrack [10] also develops a one-stream pipeline in which the template and search region are split, flattened, and linearly projected, and the resulting image embeddings are concatenated and fed into the encoder layers for joint feature learning and relationship modeling. Follow-up trackers [11,12,31] generally adopt a similar one-stream architecture. However, existing one-stream trackers allow free interaction between the template and all search tokens, which may lead to confusion between the target and the background when the extracted template features lack sufficient discriminative power. To this end, GRM [14] selects appropriate search region tokens to interact with template tokens, restricts unnecessary information flow between the template and search region tokens, and achieves a more flexible and effective modeling scheme. OIFTrack [32] explores the information flow of tokens in depth: by dividing the search tokens in the deeper encoder layers into target search tokens and non-target search tokens, it enables bidirectional interaction between target search tokens and template tokens while preventing interaction between non-target search tokens and template tokens, which improves the tracking accuracy. Wang et al. [33] designed a Discriminative Distractors Mining (DDM) module for the search region, which enables the tracker to utilize the background for target localization more efficiently by refining the background prior knowledge and discarding meaningless background regions. SuperSBT [34] employs a hierarchical structure to improve shallow features with a local modeling layer, thereby enhancing the ability to distinguish the target from similar interfering objects during the feature extraction stage. The recently reported SeqTrack [35] and ARTrack [36] restrict the information flow between tokens in the decoder layers to prevent tokens from attending to subsequent tokens. However, these trackers leave the encoder unimproved. In contrast, we construct an encoder modeling scheme with an independent template stream for more accurate tracking.

2.3. Sequence Learning

Sequence learning, originally designed for natural language modeling tasks [8,37], has recently been explored in the context of visual tasks. Pix2Seq [38] is a prominent example that applies sequence learning to object detection by representing object descriptions (e.g., bounding boxes and category labels) as discrete tokens. Building on this foundation, subsequent work [39] extends sequence learning to other visual tasks such as instance segmentation and key point detection, providing a unified framework for multiple visual tasks. Inspired by Pix2Seq, recent studies [35,36] treat object tracking as a sequence generation problem. Such studies discretize the target bounding box and unify multiple subtasks of the tracking prediction network into an intra-frame sequence model in an autoregressive manner after acquiring visual features.

3. Proposed Method

In this section, we first provide a brief overview of our tracker and then detail its network architecture. Finally, we introduce the online template-updating strategy and the template-filtering strategy, along with the training and inference pipeline.

3.1. Overview

To cope with the performance degradation of one-stream trackers caused by unconstrained interaction, a feature-enhanced transformer-based network for robust visual object tracking (FETrack) is proposed in this paper. Specifically, the proposed FETrack adopts a transformer-based encoder–decoder structure, as illustrated in Figure 1. The encoder is used to extract visual features of the template and the search images, while the decoder is used to fuse the extracted visual features and generate the bounding box token sequence autoregressively. Figure 2 depicts the structure of the encoder and decoder modules. In the encoder, we leverage the advantages of the one-stream and two-stream pipelines to construct a modeling scheme that consists of an independent template stream and a bidirectional template-search stream. This scheme prevents the template features from suffering from background noise while realizing the deep interaction between the template and the search region features. In the decoder, a causal transformer is employed to generate the bounding box. Specifically, a causal self-attention mask restricts the current token to attend only to preceding tokens, and visual features from the encoder are leveraged to increase the accuracy of the bounding box prediction. After obtaining the predicted bounding box, we evaluate the result to determine whether to use the currently predicted search image as a new dynamic template.

3.2. Network Architecture

The structure of the encoder and decoder modules is presented in Figure 2. Before describing the flows of the encoder and decoder, we first introduce their input representations.
Input Representation. The inputs to FETrack include two template images (the initial template $it_{img} \in \mathbb{R}^{3 \times H_t \times W_t}$ and the dynamic template $dt_{img} \in \mathbb{R}^{3 \times H_t \times W_t}$) and a search image $s_{img} \in \mathbb{R}^{3 \times H_s \times W_s}$. Similar to ViT [13], we split and flatten the input images into sequences of patches, i.e., $it_p \in \mathbb{R}^{N_t \times (3 \cdot P^2)}$, $dt_p \in \mathbb{R}^{N_t \times (3 \cdot P^2)}$, and $s_p \in \mathbb{R}^{N_s \times (3 \cdot P^2)}$, where $P \times P$ is the resolution of each patch and $N_t = H_t W_t / P^2$ and $N_s = H_s W_s / P^2$ denote the numbers of patches of the template and the search region, respectively. These patch sequences are then fed into a linear projection layer to obtain the D-dimensional patch embeddings. Subsequently, after adding the learnable position embeddings, the patch embeddings are fed into the encoder. The inputs to the decoder comprise the visual features of the search region extracted by the encoder and the discretized bounding box tokens. For the discretized representation of the bounding box values, similar to [35], we evenly discretize the bounding box values corresponding to the target position and scale into a vocabulary $V$ of integers in [1, 4000]. In this way, the input sequence to the decoder can be expressed as [start, x, y, w, h], where the special token start also corresponds to a learnable embedding. These embeddings are then fed into the decoder.
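To make the coordinate discretization concrete, the minimal sketch below converts a box into tokens of the vocabulary $V$ and back; normalizing by the search-image size is an assumption for illustration, as the text only specifies an even discretization into [1, 4000].

```python
import torch

NUM_BINS = 4000  # vocabulary size for coordinate values, as stated in the paper

def box_to_tokens(box_xywh, img_size):
    """Discretize a box (x, y, w, h) in pixels into integer tokens in [1, NUM_BINS].

    Normalization by the search-image size is an assumption; the paper only
    states that the four values are evenly discretized into [1, 4000].
    """
    box = torch.as_tensor(box_xywh, dtype=torch.float32) / img_size  # -> [0, 1]
    return (box * (NUM_BINS - 1)).round().long() + 1                 # -> [1, 4000]

def tokens_to_box(tokens, img_size):
    """Inverse mapping back to continuous pixel coordinates."""
    return (tokens.float() - 1) / (NUM_BINS - 1) * img_size

# Example: a 256x256 search image with a box at (64, 32) of size 128x96
tokens = box_to_tokens([64, 32, 128, 96], img_size=256)
print(tokens, tokens_to_box(tokens, img_size=256))
```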
Encoder. The modeling scheme of the one-stream pipeline allows deep bidirectional information interaction between the template and the search region, which dynamically acquires discriminative target-oriented features through mutual guidance. However, such extensive and unrestricted information interaction potentially corrupts the quality of the template features, leading to tracking drift or even failure. As shown in Figure 3, for the one-stream pipeline, the template features are progressively contaminated during the bidirectional information interaction due to interference from the search region. In contrast, the modeling scheme of the two-stream pipeline effectively avoids this contamination by processing the template and the search region separately, thus ensuring that the template features are more focused on the target. Nevertheless, the template features extracted by the two-stream pipeline possess low discriminative power and struggle to handle challenges such as appearance change and occlusion. To address the inherent defects of these two modeling schemes as well as to leverage their strengths, we propose a novel encoder modeling scheme that employs an independent template stream and a bidirectionally interacting template-search stream to acquire visual features in parallel. In addition, we take the vanilla ViT [13] as the backbone and employ the proposed encoder to replace the traditional ViT-based encoder. We also remove the class token and add a linear projection layer at the last layer of the encoder to align the feature scales of the encoder and decoder.
As shown in Figure 1, the input of the encoder is composed of the initial template $it \in \mathbb{R}^{N_t \times D}$, the dynamic template $dt \in \mathbb{R}^{N_t \times D}$, and the search region $s \in \mathbb{R}^{N_s \times D}$. Firstly, the independent template stream is intended to preserve the purity of the template features and minimize potential interference from irrelevant search regions. The structure of the independent template stream is illustrated in Figure 2. Both the initial and dynamic template tokens are utilized as queries for attention enhancement. Since the information interaction between the initial and dynamic templates strengthens the extracted target features, we perform a self-attention operation with the initial and dynamic templates jointly serving as keys and values for each other (the dynamic template is the same as the initial template during initialization). Specifically, the self-attention feature for the initial template is calculated by obtaining the query $q_i$ from the initial template tokens and the key $k_i$ and value $v_i$ from the initial and dynamic template tokens. This process can be expressed as follows:
$q_i \leftarrow it$
$k_i, v_i \leftarrow [it; dt]$
$it = it + \mathrm{MHSA}(q_i, k_i, v_i)$
where $\mathrm{MHSA}(\cdot)$ denotes the multi-head self-attention (MHSA) operation.
Similarly, the self-attention feature for the dynamic template is calculated by obtaining the query $q_d$ from the dynamic template tokens and the key $k_d$ and value $v_d$ from the initial and dynamic template tokens. This process can be expressed as follows:
$q_d \leftarrow dt$
$k_d, v_d \leftarrow [it; dt]$
$dt = dt + \mathrm{MHSA}(q_d, k_d, v_d)$
As a result, both template images are enhanced through self-attention to acquire pure template features, thereby reducing interference from the search region. At the same time, we take advantage of the bidirectional information interaction possessed by the one-stream pipeline and perform a deep interaction between the template and the search region features through cross-attention operations to extract target-oriented features in a mutually guided manner. Specifically, we obtain the query $q_c$ from the dynamic template and search region tokens and then obtain the key $k_c$ and value $v_c$ from the initial template, dynamic template, and search region tokens. The cross-attention feature extraction process can be expressed as follows:
$q_c \leftarrow [dt; s]$
$k_c, v_c \leftarrow [it; dt; s]$
$A_c = \mathrm{Softmax}\left(\dfrac{q_c k_c^T}{\sqrt{d}}\right) v_c$
where $\mathrm{Softmax}(\cdot)$ represents the softmax activation function and $A_c$ denotes the output of the cross-attention, including the enhanced dynamic template features and search region features. Here, we select the dynamic template and the search region to perform the cross-attention operations. The reason is that the dynamic template enhanced by self-attention carries both the feature information of the initial template and the appearance change information of the target. Meanwhile, the search region features benefit from the aggregated features of both the initial and dynamic templates, resulting in more discriminative target-oriented features. Eventually, the enhanced search region features are fed into the decoder for subsequent prediction.
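For clarity, the following PyTorch sketch illustrates the attention routing of one encoder layer under the above equations. It shares a single attention module across both streams and omits LayerNorm and the FFN of a full encoder block; it is a simplified illustration under these assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FeatureEnhancedAttention(nn.Module):
    """Attention routing of one encoder layer (simplified sketch).

    Independent template stream: the initial/dynamic templates attend only to
    each other. Template-search stream: the dynamic template and the search
    region attend to [initial template; dynamic template; search region].
    """
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, it, dt, s):
        tpl = torch.cat([it, dt], dim=1)          # keys/values of the template stream
        it = it + self.attn(it, tpl, tpl)[0]      # self-attention for the initial template
        dt = dt + self.attn(dt, tpl, tpl)[0]      # self-attention for the dynamic template
        # Cross stream: queries = [dt; s], keys/values = [it; dt; s]
        q = torch.cat([dt, s], dim=1)
        kv = torch.cat([it, dt, s], dim=1)
        out = q + self.attn(q, kv, kv)[0]
        dt_out, s_out = out[:, :dt.shape[1]], out[:, dt.shape[1]:]
        return it, dt_out, s_out

# Toy shapes: 64 template tokens, 256 search tokens, embedding dimension 768
it, dt, s = torch.randn(1, 64, 768), torch.randn(1, 64, 768), torch.randn(1, 256, 768)
it2, dt2, s2 = FeatureEnhancedAttention()(it, dt, s)
print(it2.shape, dt2.shape, s2.shape)
```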
Decoder. As shown in Figure 2, the decoder of FETrack is a causal transformer. Each transformer block contains a masked multi-head self-attention (MMHSA) layer, a multi-head cross-attention layer, and a feed-forward network (FFN). Specifically, the decoder receives the token embeddings from the previous transformer block and uses the causality of MMHSA to ensure that the output at each token position depends only on its preceding input tokens, thereby eliminating any interference from subsequent tokens. The output of the MMHSA is then fused with visual features from the encoder through the cross-attention layer to predict a more accurate bounding box. Finally, the FFN generates the token embeddings for the next transformer block.
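A minimal sketch of such a block is given below, using the decoder hyperparameters reported in Section 4.1 (hidden size 256, 8 heads); the FFN width, normalization choices, and other details are illustrative assumptions rather than the authors' code.

```python
import torch
import torch.nn as nn

class CausalDecoderBlock(nn.Module):
    """One decoder block: masked self-attention over the box tokens,
    cross-attention to the encoder's search-region features, then an FFN."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, tokens, enc_feats):
        L = tokens.shape[1]
        # Causal mask: position j may attend only to positions <= j
        mask = torch.triu(torch.ones(L, L, dtype=torch.bool, device=tokens.device), diagonal=1)
        tokens = tokens + self.self_attn(tokens, tokens, tokens, attn_mask=mask)[0]
        tokens = tokens + self.cross_attn(tokens, enc_feats, enc_feats)[0]
        return tokens + self.ffn(tokens)

# [start, x, y, w, h] embeddings (length 5) and 256 search-region feature tokens
out = CausalDecoderBlock()(torch.randn(1, 5, 256), torch.randn(1, 256, 256))
print(out.shape)  # torch.Size([1, 5, 256])
```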

3.3. Online Template Updating

Due to factors such as the shooting angle, fast motion, and occlusion, the target appearance tends to change drastically during tracking, making it unreliable to rely solely on the initial template. To deal with this problem, numerous trackers adopt online template-updating strategies to capture the appearance changes. However, most trackers set a static threshold and update the template only when the confidence score exceeds the set threshold or at a predetermined interval. The deficiency of such trackers is that when the threshold is set unreasonably or the tracker cannot keep up with rapid changes in the target, they are prone to tracking drift or loss of the target due to error accumulation from low-quality templates. To overcome this limitation, we design a dynamic threshold-based online template-updating strategy, which improves the robustness of our FETrack.
Regarding online template updating, the following two principles are usually followed. (1) The quality of the updated templates should be as high as possible, with excellent discrimination and accuracy, which means that the threshold should not be set too small. (2) The updated template should be as close as possible to the target of the current frame, which means that the updated template has the highest similarity to the target of the current frame. Based on the above criteria, we design the following dynamic threshold:
$\sigma_i = \dfrac{\sigma_{i-1}}{\sigma_{i-1} + s_i}\,\sigma_{i-1} + \dfrac{s_i}{\sigma_{i-1} + s_i}\,s_i$
where $\sigma_i$ denotes the dynamic threshold, $i$ refers to the current frame (starting from 2), and $s_i$ denotes the confidence score of the current frame. Essentially, we add a dynamic weight to update the threshold adaptively. If the confidence score $s_i$ of the current frame is relatively low, it means that the target may have undergone significant changes and the previous template is no longer applicable to the target of the current frame. As a result, we assign the previous template a low weight to update a new dynamic template as soon as possible.
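The update is plain arithmetic and can be written directly from the formula; the short snippet below simply traces a few frames, with the initial value $\sigma_1 = 1$ taken from Section 4.1.

```python
def update_threshold(sigma_prev, score):
    """Dynamic threshold update: a score-weighted combination of the previous
    threshold and the current confidence score, following the formula above."""
    w = sigma_prev + score
    return (sigma_prev / w) * sigma_prev + (score / w) * score

# sigma_1 is initialized to 1; a low confidence score pulls the threshold down
sigma = 1.0
for s in [0.9, 0.85, 0.3]:
    sigma = update_threshold(sigma, s)
    print(round(sigma, 3))
```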

3.4. Template Filtering

Currently, most mainstream trackers utilize similarity matching for object tracking, which involves measuring feature similarities between the template and the search images. However, the template image may contain substantial background information, as shown in the first row of Figure 4. In this case, due to the significant scale gap between the target and the background, using self-attention to highlight the target features also amplifies the background noise. Moreover, performing self-attention calculations between the template and all search region tokens leads to excessive redundant computations.
To alleviate the above issues while providing more refined target information to the transformer backbone, we propose a template-filtering (TF) approach. As depicted in the second row of Figure 4, we crop a smaller yet more accurate target region $t^* \in \mathbb{R}^{3 \times H_{t^*} \times W_{t^*}}$ from the template image according to the size of the target bounding box and serialize it into image patches $t_p^* \in \mathbb{R}^{N_{t^*} \times (3 \cdot P^2)}$, where $N_{t^*} = H_{t^*} W_{t^*} / P^2$. Next, the image patches are fed into a linear projection layer, and the resulting mapped features are summed with the position embedding to obtain sequences that are more focused on the target. Consequently, the proposed TF approach mitigates the background noise so that the template image can focus more on the target. This enables FETrack to increase the tracking accuracy while eliminating a large amount of redundant computation.
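A minimal sketch of this crop-and-patchify step is shown below; snapping the crop to the patch grid is an assumption made only so that the region splits evenly into $P \times P$ patches, and the exact cropping rule of the paper may differ.

```python
import torch

def filter_template(template_img, box_xywh, patch=16):
    """Crop the target region given by the bounding box from the template image
    and serialize it into P x P patches (template filtering, sketched)."""
    x, y, w, h = (int(v) for v in box_xywh)
    # Snap the crop size to the patch grid so it can be split evenly
    w, h = max(patch, (w // patch) * patch), max(patch, (h // patch) * patch)
    crop = template_img[:, y:y + h, x:x + w]                        # (3, H*, W*)
    patches = crop.unfold(1, patch, patch).unfold(2, patch, patch)  # (3, H*/P, W*/P, P, P)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3 * patch * patch)
    return patches                                                  # (N_t*, 3*P^2)

# A 128x128 template with a 64x48 target box -> 12 patches of dimension 768
print(filter_template(torch.randn(3, 128, 128), (32, 32, 64, 48)).shape)
```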

3.5. Training and Inference

Training. Similar to NLP modeling [8], our FETrack does not need various complex loss functions and adopts only a simple cross-entropy loss. The goal of training is to maximize the log-likelihood of the target tokens conditioned on the previous sequence and the input video frame. Our objective function can be expressed as
$\text{maximize} \sum_{j=1}^{L} \log Q\left(\hat{z}_j \mid s, it, dt, \hat{z}_{<j}\right)$
where $Q(\cdot)$ represents the softmax probability, $s$ is the search image, $it$ is the initial template, $dt$ is the dynamic template, $\hat{z}_j$ represents the target token at position $j$, and $L$ denotes the length of the target sequence. $\hat{z}_{<j}$ denotes the subsequence of previously generated tokens on which the prediction of the current token $\hat{z}_j$ is conditioned. The input sequence is the coordinates describing the template, and the target sequence is the coordinates of the search region predicted by the network. The entire training process aims to obtain the final predicted target sequence from the input sequence under the constraints of the objective function.
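In code, this objective is simply the token-level cross-entropy over the generated coordinate tokens; the sketch below assumes the 1-based token convention from Section 3.2 and a logits tensor produced by the decoder head, so the shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def sequence_loss(logits, target_tokens):
    """Cross-entropy over the box tokens, i.e. the negative log-likelihood above.

    `logits` has shape (B, L, V) with V = 4000 and `target_tokens` has shape
    (B, L) holding values in [1, V]; subtracting 1 converts to 0-based classes.
    """
    V = logits.shape[-1]
    return F.cross_entropy(logits.reshape(-1, V), (target_tokens - 1).reshape(-1))

# Toy check: batch of 2 sequences of 4 coordinate tokens over a 4000-word vocabulary
loss = sequence_loss(torch.randn(2, 4, 4000), torch.randint(1, 4001, (2, 4)))
print(loss.item())
```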
Inference. In the inference stage, the encoder first perceives and extracts the features of the template images (the initial dynamic template is the same as the initial template) and the search image of the subsequent video frame. Then, given the initial input start of the decoder, this input instructs the model to start generating the predicted target sequence token by token. Each coordinate token is sampled by the model from the vocabulary $V$ according to the maximum likelihood, i.e., $\hat{z}_j = \arg\max_{z_j} Q(z_j \mid s, it, dt, \hat{z}_{<j})$, where $z_j$ is also a value in $V$. After the predicted target sequence is obtained, it is scored. When the score is higher than the threshold, the dynamic template is updated: according to the predicted bounding box, a new dynamic template is cropped from the original image and then fed into the encoder for feature extraction of the next frame.
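The greedy decoding loop below sketches this procedure; `decoder` and `embed` are hypothetical stand-ins for the causal decoder with its embedding-to-word head and the token-embedding lookup, not the authors' API.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def generate_box_tokens(decoder, embed, enc_feats, start_embed, num_tokens=4):
    """Greedy autoregressive decoding: starting from the `start` embedding, each
    coordinate token is chosen by maximum likelihood conditioned on the
    previously generated tokens and the encoder's search-region features."""
    embeds, out = [start_embed], []
    for _ in range(num_tokens):                            # x, y, w, h
        logits = decoder(torch.stack(embeds, dim=1), enc_feats)
        next_tok = logits[:, -1].argmax(dim=-1)            # maximum-likelihood token
        out.append(next_tok)
        embeds.append(embed(next_tok))                     # feed back as the next input
    return torch.stack(out, dim=1)                         # (B, 4) discrete box tokens

# Toy usage with random stand-ins for the decoder and the vocabulary embedding
V, D = 4000, 256
embed = nn.Embedding(V + 1, D)
decoder = lambda t, f: torch.randn(t.shape[0], t.shape[1], V)
print(generate_box_tokens(decoder, embed, torch.randn(1, 256, D), torch.randn(1, D)).shape)
```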

4. Experiments

4.1. Implementation Details

Model. We utilize the vanilla ViT-Base model, pre-trained with MAE on ImageNet, as the backbone of our FETrack encoder. The number of encoder layers N is 12, the encoder hidden size is 768, and the number of attention heads is 12. The resolutions of the template and the search images are 128 × 128 and 256 × 256, respectively. All input images are split into patches of size 16 × 16. The decoder consists of two transformer blocks, the decoder hidden size is 256, and the number of attention heads is 8. The word-to-embedding module is a linear projection layer with an embedding dimension of 256, which is consistent with the hidden size of the decoder. The embedding-to-word sampling module is a three-layer perceptron followed by a softmax. The hidden dimension of the perceptron is 256, and the output dimension is 4000. Finally, the word with the maximum likelihood is sampled as the output. To verify the flexibility of the proposed FETrack, we also test a high-resolution variant with template and search images of size 192 × 192 and 384 × 384, respectively. Our models are trained on a GeForce RTX3090 24GB GPU (NVIDIA Corporation, Santa Clara, CA, USA). All models are implemented using Python 3.8 and PyTorch 1.11.0.
Training. Our training data cover four datasets, namely GOT-10k [17], TrackingNet [19], COCO [40], and LaSOT [18]. Considering the VOT2020 evaluation protocol [41], we remove 1k forbidden videos in GOT-10k during training. For the evaluation of the GOT-10k test set, we follow the official one-shot protocol and only use the training split of GOT-10k for training. The template and the search images are obtained by expanding the bounding box by two and four times. The detailed training settings are shown in Table 1.
Inference. In the inference stage, we adopt the proposed dynamic threshold-based online template-updating strategy to update the dynamic template. Specifically, the dynamic threshold $\sigma_1$ is initialized to 1 to avoid unnecessary template updates in the initial stage. The thresholds $\sigma_i$ in subsequent stages are determined by the confidence score $s_i$ of the current frame and the threshold $\sigma_{i-1}$ of the previous frame, with $i$ starting from 2. For the window penalty, we impose penalties on the predicted coordinate values to suppress large displacements of the target between consecutive frames. Specifically, the output of the predicted coordinates after the embedding-to-word module is a softmax score vector of size 4000, which is then multiplied by a 1-D Hanning window of size 4000 to implement the window penalty.
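As a minimal illustration of this step (element-wise multiplication of the 4000-dimensional score vector with a Hanning window of the same length), consider the snippet below; how the window is centered with respect to the coordinate bins is not specified in the text and is left at the library default here.

```python
import torch

def apply_window_penalty(score_vec, window=None):
    """Multiply a softmax score vector by a 1-D Hanning window of the same size,
    which suppresses bins far from the window center and hence large
    inter-frame displacements (sketch of the window penalty described above)."""
    if window is None:
        window = torch.hann_window(score_vec.shape[-1])
    return score_vec * window

scores = torch.softmax(torch.randn(4000), dim=-1)
penalized = apply_window_penalty(scores)
print(int(scores.argmax()), int(penalized.argmax()))
```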

4.2. Comparison with State of the Arts

We compare our FETrack with state-of-the-art trackers on six benchmark datasets, including four large-scale datasets and two small-scale ones. Table 2 shows the comparison results on the large-scale GOT-10k, LaSOT, LaSOT$_{ext}$, and TrackingNet benchmarks, and Table 3 presents the comparison results on the small-scale NFS30 and OTB100 benchmarks. For the comparison methods, we obtained the source codes and pre-trained models released by their authors and ran them on our GPU to produce the experimental results while maintaining the original experimental settings.
Evaluation protocol: Following other VOT methods, we use the average overlap (AO), the success rate at the 0.5 overlap threshold ($SR_{0.5}$), and the success rate at the 0.75 overlap threshold ($SR_{0.75}$) to measure the tracking performance on the GOT-10k dataset. For the other benchmarks, we use the area under the success rate curve (AUC) score, precision (P), and normalized precision ($P_{norm}$) to measure the performance.
GOT-10k. GOT-10k [17] is a large-scale short-term tracking dataset comprising over 10,000 video clips of moving objects in real-world scenarios. The object categories in the training set and test set are completely non-overlapping. The test set contains 180 videos, covering various challenging scenarios common in tracking. We strictly follow the official requirements to train our tracker only on the GOT-10k training split and submit the tracking results to the official evaluation server. As can be seen from Table 2, the proposed FETrack-256 achieves the best performance on two of the tested metrics. For example, our FETrack-256 improves the AO metric by 2.2% compared with VideoTrack and improves the $SR_{0.75}$ metric by 3.5% compared with TATrack. This indicates that our FETrack is able to identify and locate the target more accurately.
LaSOT. LaSOT [18] is a large-scale long-term tracking dataset that covers 1400 video sequences, of which the test set contains 280 videos with an average length of 2448 frames. As can be seen from Table 2, our FETrack-256 slightly outperforms the latest ARTrack-256. With the same ViT-B encoder architecture and input resolution, the proposed FETrack-256 achieves a 0.2% and 0.4% improvement over ARTrack-256 in terms of the AUC score and the $P_{norm}$ metric, respectively. In addition, the high-resolution version, FETrack-384, obtains an AUC score of 71.8%, outperforming all comparison trackers. These results indicate that our FETrack is capable of improving stability and accuracy in long-term tracking scenarios.
LaSOT$_{ext}$. LaSOT$_{ext}$ [20] is a recently released extension of LaSOT, which contains 150 extra video sequences from 15 object categories. Due to the presence of many similar interfering objects, LaSOT$_{ext}$ is more challenging than LaSOT. From Table 2, we can see that our FETrack-256 is superior to state-of-the-art trackers. For example, our FETrack achieves 61.3% in the $P_{norm}$ metric, 0.5% higher than SeqTrack. Moreover, our FETrack-384 significantly outperforms the comparison trackers on all three tested metrics. The above results indicate that our FETrack possesses not only a significant generalization capability but also a strong anti-interference capability.
TrackingNet. TrackingNet [19] is a large short-term tracking dataset containing 511 test video sequences and covering various object categories and scenarios. We submit the tracking results to the official evaluation server. From Table 2, we can observe that our FETrack-384 obtains 84.1% and 89.3% in the AUC and $P_{norm}$ metrics, respectively, exhibiting a competitive performance compared to the comparison trackers. This suggests that our FETrack also has advantages in short-term tracking scenarios.
OTB100 and NFS30. We also evaluate the proposed FETrack on two small-scale benchmarks (i.e., NFS30 [22] and OTB100 [21]). As can be seen from Table 3, our FETrack-256 and FETrack-384 perform excellently on both benchmarks and achieve the state-of-the-art performance. These results further validate the flexibility and effectiveness of FETrack across different scales.

4.3. Ablation Study and Analysis

We conduct a series of ablation studies and extensive analysis to verify the effectiveness of the proposed modules.
Study on the encoder. The encoder, as a crucial component of FETrack, is responsible for integrating the dependency between the template and the search region features. To verify the effectiveness of the independent template stream embedded in the encoder, we conduct experiments on the GOT-10k and LaSOT datasets. Specifically, we compare the encoder architectures with different modeling schemes. (1) Two-stream (TS) scheme. Similar to a Siamese network architecture, TS uses two transformer blocks with shared weights to extract the template and the search region features separately and then feeds the search region features into the decoder after a simple fusion module. (2) One-stream (OS) scheme. Similar to the encoder used in OSTrack [10], OS feeds the template and the search region tokens together into the encoder to extract visual features and then feeds the search region features into the decoder. (3) The proposed encoder modeling scheme. In addition to free interaction between the template and the search region in the encoder, we utilize the self-attention mechanism to mitigate interference from the search region. Table 4 shows comparison results of the above modeling schemes. As can be seen from Table 4, our proposed scheme achieves a superior performance on both GOT-10k and LaSOT, validating its effectiveness and advantages.
Study on the template-updating strategy. We evaluate the proposed online template-updating strategy on the short-term GOT-10k dataset and the long-term LaSOT dataset. As shown in Table 5, we conduct four sets of experiments with different template-updating strategies. (1) Only use the initial template without template updating. (2) Crop the prediction region of the previous frame and use it as the dynamic template. (3) Set the mean of the confidence scores of the prediction results of historical frames as the threshold for updating the template. (4) The proposed online template-updating strategy. As can be observed from Table 5, the tracker with the first strategy performs mediocrely, and the tracker with the second strategy performs even worse; the prediction results of the previous frame may be poor, resulting in low-quality cropped templates and hence the worst performance. The tracker with the third strategy obtains the best performance on the short-term GOT-10k dataset but performs poorly on the long-term LaSOT dataset. The reason is that although the mean-based threshold appropriately suppresses low-quality templates and keeps the dynamic template close to the target object of the current frame, it fails to resist long-term target disappearance when the mean becomes too low. In contrast, the proposed strategy effectively copes with long-term low-quality templates through the dynamic threshold, prompting our tracker to obtain more reliable results. These results show that our tracker performs well on long-term sequences while maintaining a good performance on short-term sequences.
Study on the template-filtering approach. Since the quality of template features is crucial for the tracking performance, we evaluate the impact of the purity of the template feature on the tracker. Table 6 reports quantitative metrics of the tracker with and without the proposed template-filtering (TF) approach. It can be observed from Table 6 that the tracker with TF improves the tracking robustness compared with the one without TF.

4.4. Visualization and Qualitative Analysis

To further validate the effectiveness and superiority of the proposed FETrack, we subjectively compare the performance of various trackers. First, to verify the contribution of the proposed encoder to our FETrack, we visualize the attention maps of different layers of the encoder in Figure 5. From Figure 5, we can observe that compared with the encoder adopting the free bidirectional interaction stream paradigm, our encoder focuses on the target object in the search region and suppresses the interference in the background by retaining the pure template features, which is beneficial to locate a more accurate target position. Secondly, to investigate how the sequence-learning-based causal transformer acquires the target state, Figure 6 visualizes the cross-attention maps of the last decoder block. From Figure 6, it can be seen that the attention gradually changes from the dispersed state to the focused state and ultimately focuses on the key position of the target as the bounding box tokens are generated sequentially. Notably, our FETrack pays attention to the edge of the target adaptively when predicting the coordinates in various cases. This also indicates that our tracker possesses the ability to accurately locate the target.
To visualize the tracking performance, Figure 7 compares the tracking results of the proposed FETrack with three state-of-the-art trackers on four sequences of the LaSOT dataset. These sequences involve challenges such as similar targets, occlusion, and scale variation, which may occur individually or simultaneously. As can be seen from Figure 7, the other trackers may lose the target during tracking, while our FETrack achieves superior tracking results. Specifically, the proposed FETrack accurately discriminates similar targets in the bird-2 sequence, overcomes the occlusion challenge in the crab-3 sequence, and successfully adapts to the scale variation in the kite-10 sequence. In particular, our FETrack simultaneously overcomes the challenges of occlusion and similar targets and achieves stable tracking of the target in the glraffa-2 sequence. These results also indicate that the proposed FETrack can effectively deal with the challenges of similar targets, occlusion, and scale variation.
In addition to the overall performance, we compare the success rates of different trackers on eleven challenging attributes of the OTB dataset, and the results are shown in Figure 8. It can be seen from Figure 8 that our FETrack shows an extremely competitive performance in most challenging tracking scenarios compared with other transformer-based one-stream trackers, such as OSTrack, GRM, and SeqTrack. Notably, compared with other state-of-the-art methods, our FETrack achieves the best success rate on the in-plane rotation, scale variation, deformation, and occlusion attributes, obtaining success rates of 71.2%, 73.1%, 67.5%, and 69.8%, respectively. This is attributed to the fact that we incorporate the independent template stream into our FETrack to ensure the quality of the template features and combine it with the template-search stream to capture appearance changes of the target. In addition, our FETrack also achieves the best overall performance when all the attributes are evaluated comprehensively. The above results further validate that the proposed FETrack possesses excellent adaptability and robustness.

5. Conclusions

To tackle the template feature quality degradation of one-stream trackers caused by unrestrained feature interaction, we propose FETrack, a feature-enhanced transformer-based network for visual object tracking. First, we embed an independent template stream in the one-stream encoder to protect the templates from potential interference from cluttered search regions. Second, we adopt a sequence-learning-based causal transformer in the decoder to generate the target bounding box in an autoregressive manner, which greatly simplifies the tracking framework. In addition, we design a dynamic threshold-based online template-updating strategy and a template-filtering approach, which enable our FETrack to effectively handle complex scenes, such as those with scale variation, occlusion, and out-of-view targets. Extensive experiments show that the proposed FETrack is superior to state-of-the-art trackers. Nevertheless, while the proposed FETrack utilizes dynamic templates to integrate the information of historical frames, it still faces challenges in dealing with prolonged occlusion and out-of-view targets. In our future work, we will focus on exploring a more effective template-updating strategy by comprehensively considering factors such as appearance changes, key features of the target, and dynamic thresholds. Moreover, we will address the computational overhead while maintaining the tracking accuracy, enabling the proposed FETrack to be deployed on embedded devices with limited computational resources.

Author Contributions

Conceptualization, H.L.; methodology, H.L. and D.H.; software, H.L.; validation, H.L. and M.L.; formal analysis, H.L.; writing—original draft preparation, H.L.; writing—review and editing, H.L. and D.H.; visualization, H.L. and M.L.; funding acquisition, D.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Key R&D Program of China under grant 2021YFE0205400, in part by the National Natural Science Foundation of China under grant 61901183 and grant 61976098, in part by the Fundamental Research Funds for the Central Universities under grant ZQN-921, in part by the Collaborative Innovation Platform Project of Fujian Province under grant 2021FX03, in part by the Natural Science Foundation of Fujian Province under grant 2023J01140, in part by the Key Project of Quanzhou Science and Technology Plan under grant 2023C007R, and in part by the Key Science and Technology Project of Xiamen City under grant 3502Z20231005.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study. However, if you need the code of the relevant module, you can contact the author by email: [email protected].

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Javed, S.; Danelljan, M.; Khan, F.S.; Khan, M.H.; Felsberg, M.; Matas, J. Visual Object Tracking with Discriminative Filters and Siamese Networks: A Survey and Outlook. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 6552–6574. [Google Scholar] [CrossRef] [PubMed]
  2. Choubisa, M.; Kumar, V.; Kumar, M.; Khanna, S. Object Tracking in Intelligent Video Surveillance System Based on Artificial System. In Proceedings of the 2023 International Conference on Computational Intelligence, Communication Technology and Networking (CICTN), Ghaziabad, India, 7–8 December 2023; pp. 160–166. [Google Scholar]
  3. Barbu, T.; Bejinariu, S.I.; Luca, R. Transfer Learning-Based Framework for Automatic Vehicle Detection, Recognition and Tracking. In Proceedings of the 2024 International Conference on Electronics, Computers and Artificial Intelligence (ECAI), Iasi, Romania, 27–28 June 2024; pp. 1–6. [Google Scholar]
  4. Cao, X. Eye Tracking in Human-computer Interaction Recognition. In Proceedings of the 2023 IEEE International Conference on Sensors, Electronics and Computer Engineering (ICSECE), Jinzhou, China, 29–31 September 2023; pp. 203–207. [Google Scholar]
  5. Ibragimov, B.; Mello-Thoms, C. The Use of Machine Learning in Eye Tracking Studies in Medical Imaging: A Review. IEEE J. Biomed. Health Inform. 2024, 28, 3597–3612. [Google Scholar] [CrossRef] [PubMed]
  6. Kugarajeevan, J.; Kokul, T.; Ramanan, A.; Fernando, S. Transformers in Single Object Tracking: An Experimental Survey. IEEE Access 2023, 11, 80297–80326. [Google Scholar] [CrossRef]
  7. Deng, A.; Liu, J.; Chen, Q.; Wang, X.; Zuo, Y. Visual Tracking with FPN Based on Transformer and Response Map Enhancement. Appl. Sci. 2022, 12, 6551. [Google Scholar] [CrossRef]
  8. Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv 2014, arXiv:1406.1078. [Google Scholar]
  9. Chen, B.; Li, P.; Bai, L.; Qiao, L.; Shen, Q.; Li, B.; Gan, W.; Wu, W.; Ouyang, W. Backbone Is All Your Need: A Simplified Architecture for Visual Object Tracking. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 375–392. [Google Scholar]
  10. Ye, B.; Chang, H.; Ma, B.; Shan, S.; Chen, X. Joint Feature Learning and Relation Modeling for Tracking: A One-Stream Framework. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 341–357. [Google Scholar]
  11. He, K.; Zhang, C.; Xie, S.; Li, Z.; Wang, Z. Target-aware tracking with long-term context attention. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 773–780. [Google Scholar]
  12. Xie, F.; Chu, L.; Li, J.; Lu, Y.; Ma, C. VideoTrack: Learning to Track Objects via Video Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 22826–22835. [Google Scholar]
  13. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  14. Gao, S.; Zhou, C.; Zhang, J. Generalized relation modeling for transformer tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 18686–18695. [Google Scholar]
  15. Bertinetto, L.; Valmadre, J.; Henriques, J.F.; Vedaldi, A.; Torr, P.H. Fully-convolutional siamese networks for object tracking. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–10 and 15–16 October 2016; Springer: Cham, Switzerland, 2016; pp. 850–865. [Google Scholar]
  16. Choi, J. Target-Aware Feature Bottleneck for Real-Time Visual Tracking. Appl. Sci. 2023, 13, 10198. [Google Scholar] [CrossRef]
  17. Huang, L.; Zhao, X.; Huang, K. GOT-10k: A Large High-Diversity Benchmark for Generic Object Tracking in the Wild. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 1562–1577. [Google Scholar] [CrossRef]
  18. Fan, H.; Lin, L.; Yang, F.; Chu, P.; Deng, G.; Yu, S.; Bai, H.; Xu, Y.; Liao, C.; Ling, H. LaSOT: A High-Quality Benchmark for Large-Scale Single Object Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5374–5383. [Google Scholar]
  19. Muller, M.; Bibi, A.; Giancola, S.; Alsubaihi, S.; Ghanem, B. Trackingnet: A Large-Scale Dataset and Benchmark for Object Tracking in the wild. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 300–317. [Google Scholar]
  20. Fan, H.; Bai, H.; Lin, L.; Yang, F.; Chu, P.; Deng, G.; Yu, S.; Harshit; Huang, M.; Liu, J.; et al. LaSOT: A High-quality Large-scale Single Object Tracking Benchmark. Int. J. Comput. Vis. 2021, 129, 439–461. [Google Scholar] [CrossRef]
  21. Wu, Y.; Lim, J.; Yang, M.H. Object Tracking Benchmark. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1834–1848. [Google Scholar] [CrossRef]
  22. Kiani Galoogahi, H.; Fagg, A.; Huang, C.; Ramanan, D.; Lucey, S. Need for Speed: A Benchmark for Higher Frame Rate Object Tracking. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1125–1134. [Google Scholar]
  23. Zhang, Z.; Peng, H.; Fu, J.; Li, B.; Hu, W. Ocean: Object-aware anchor-free tracking. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 771–787. [Google Scholar]
  24. Xu, Z.; Huang, D.; Huang, X.; Song, J.; Liu, H. DLUT: Decoupled Learning-Based Unsupervised Tracker. Sensors 2024, 24, 83. [Google Scholar] [CrossRef] [PubMed]
  25. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  26. Chen, X.; Yan, B.; Zhu, J.; Wang, D.; Yang, X.; Lu, H. Transformer Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 8126–8135. [Google Scholar]
  27. Cui, Y.; Jiang, C.; Wang, L.; Wu, G. MixFormer: End-to-End Tracking with Iterative Mixed Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 13608–13618. [Google Scholar]
  28. Gao, S.; Zhou, C.; Ma, C.; Wang, X.; Yuan, J. AiATrack: Attention in Attention for Transformer Visual Tracking. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 146–164. [Google Scholar]
  29. Ma, Z. Hybrid Transformer-CNN Feature Enhancement Network for Visual Object Tracking. In Proceedings of the 2024 5th International Seminar on Artificial Intelligence, Networking and Information Technology (AINIT), Nanjing, China, 22–24 March 2024; pp. 1917–1921. [Google Scholar]
  30. Xie, F.; Wang, C.; Wang, G.; Yang, W.; Zeng, W. Learning Tracking Representations via Dual-Branch Fully Transformer Networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, BC, Canada, 11–17 October 2021; pp. 2688–2697. [Google Scholar]
  31. Lan, J.P.; Cheng, Z.Q.; He, J.Y.; Li, C.; Luo, B.; Bao, X.; Xiang, W.; Geng, Y.; Xie, X. Procontext: Exploring Progressive Context Transformer for Tracking. In Proceedings of the ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
  32. Kugarajeevan, J.; Kokul, T.; Ramanan, A.; Fernando, S. Optimized Information Flow for Transformer Tracking. arXiv 2024, arXiv:2402.08195. [Google Scholar]
  33. Wang, Z.; Zhou, Z.; Chen, F.; Xu, J.; Pei, W.; Lu, G. Robust Tracking via Fully Exploring Background Prior Knowledge. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 3353–3367. [Google Scholar] [CrossRef]
  34. Xie, F.; Yang, W.; Wang, C.; Chu, L.; Cao, Y.; Ma, C.; Zeng, W. Correlation-Embedded Transformer Tracking: A Single-Branch Framework. arXiv 2024, arXiv:2401.12743. [Google Scholar] [CrossRef] [PubMed]
  35. Chen, X.; Peng, H.; Wang, D.; Lu, H.; Hu, H. SeqTrack: Sequence to Sequence Learning for Visual Object Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 14572–14581. [Google Scholar]
  36. Wei, X.; Bai, Y.; Zheng, Y.; Shi, D.; Gong, Y. Autoregressive Visual Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 9697–9706. [Google Scholar]
  37. Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to Sequence Learning with Neural Networks. Adv. Neural Inf. Process. Syst. 2014, 27, 3104–3112. [Google Scholar]
  38. Chen, T.; Saxena, S.; Li, L.; Fleet, D.J.; Hinton, G. Pix2seq: A Language Modeling Framework for Object Detection. arXiv 2021, arXiv:2109.10852. [Google Scholar]
  39. Chen, T.; Saxena, S.; Li, L.; Lin, T.Y.; Fleet, D.J.; Hinton, G.E. A Unified Sequence Interface for Vision Tasks. Adv. Neural Inf. Process. Syst. 2022, 35, 31333–31346. [Google Scholar]
  40. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
  41. Kristan, M.; Leonardis, A.; Matas, J.; Felsberg, M.; Pflugfelder, R.; Kämäräinen, J.K.; Danelljan, M.; Zajc, L.Č.; Lukežič, A.; Drbohlav, O.; et al. The Eighth Visual Object Tracking VOT2020 Challenge Results. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 547–601. [Google Scholar]
  42. Yan, B.; Peng, H.; Fu, J.; Wang, D.; Lu, H. Learning Spatio-Temporal Transformer for Visual Tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 10448–10457. [Google Scholar]
  43. Cai, Y.; Liu, J.; Tang, J.; Wu, G. Robust Object Modeling for Visual Tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Vancouver, BC, Canada, 18–22 June 2023; pp. 9589–9600. [Google Scholar]
  44. Zhu, J.; Chen, X.; Diao, H.; Li, S.; He, J.Y.; Li, C.; Luo, B.; Wang, D.; Lu, H. Exploring Dynamic Transformer for Efficient Object Tracking. arXiv 2024, arXiv:2403.17651. [Google Scholar]
  45. Wang, N.; Zhou, W.; Wang, J.; Li, H. Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 1571–1580. [Google Scholar]
Figure 1. Network architecture of the proposed FETrack. FETrack consists of a transformer-based encoder and a decoder: the former captures visual features of the input images, while the latter autoregressively generates the sequence of bounding-box tokens.
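To make the autoregressive generation in Figure 1 concrete, the following minimal PyTorch sketch greedily decodes one coordinate token at a time. The `decoder` interface, the start-token convention, and the coordinate vocabulary of 4000 bins are illustrative assumptions in the spirit of sequence-to-sequence trackers such as SeqTrack [35], not the released FETrack implementation.

```python
import torch

def greedy_decode_box(decoder, memory, start_token=0, num_bins=4000, num_coords=4):
    """Greedily generate a bounding-box token sequence [x, y, w, h].

    `decoder(tokens, memory)` is assumed to return next-token logits of shape
    [1, len(tokens), num_bins]; `num_bins` is the assumed coordinate vocabulary size.
    """
    tokens = torch.tensor([[start_token]])             # [1, 1] start-of-sequence token
    for _ in range(num_coords):
        logits = decoder(tokens, memory)               # [1, T, num_bins]
        next_token = logits[:, -1].argmax(dim=-1)      # most likely bin for the next coordinate
        tokens = torch.cat([tokens, next_token[:, None]], dim=1)
    # Drop the start token and map discrete bins back to normalized coordinates in [0, 1].
    return tokens[0, 1:].float() / (num_bins - 1)
```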
Figure 2. The structure of the encoder and decoder modules. In the encoder, an independent template stream is embedded to prevent potential background interference. In the decoder, a masked multi-head self-attention layer maintains the causality of the token sequence, and a multi-head cross-attention layer integrates visual features from the encoder. The symbols C and + denote concatenation and addition, respectively.
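A minimal PyTorch sketch of one decoder block as described in Figure 2, combining masked multi-head self-attention (for causality) with multi-head cross-attention to the encoder's visual features. The embedding dimension, head count, and pre-norm layout are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CausalDecoderLayer(nn.Module):
    """One decoder block as sketched in Figure 2: masked multi-head self-attention over
    the token sequence, multi-head cross-attention to the encoder's visual features,
    and a feed-forward network. Dimensions here are illustrative assumptions."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, tokens, visual_feats):
        # tokens: [B, T, dim] bounding-box token embeddings; visual_feats: [B, N, dim].
        T = tokens.size(1)
        # True above the diagonal blocks attention to future tokens (causality).
        causal_mask = torch.triu(torch.ones(T, T, device=tokens.device), diagonal=1).bool()
        q = self.norm1(tokens)
        tokens = tokens + self.self_attn(q, q, q, attn_mask=causal_mask)[0]
        q = self.norm2(tokens)
        tokens = tokens + self.cross_attn(q, visual_feats, visual_feats)[0]
        return tokens + self.ffn(self.norm3(tokens))
```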
Figure 3. Comparison of template features between two-stream (T-S) and one-stream (O-S) pipelines.
Figure 4. Schematic diagram of the proposed template-filtering (TF) approach. The TF reduces the background noise, enabling the template image to focus more on the target.
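The precise filtering rule is given in the method section; purely as an illustration of the idea conveyed by Figure 4, the sketch below discards template patch tokens whose patches do not overlap the annotated target box. The patch size, grid size, and keep-rule are assumptions for illustration only, not the paper's implementation.

```python
import torch

def filter_template_tokens(template_tokens, target_box, patch_size=16, grid=8):
    """Keep only template patch tokens whose patches overlap the target box.

    template_tokens: [N, C] tokens from a (grid x grid) patch grid of the template image.
    target_box: (x1, y1, x2, y2) in template-image pixel coordinates.
    """
    x1, y1, x2, y2 = target_box
    ys, xs = torch.meshgrid(torch.arange(grid), torch.arange(grid), indexing="ij")
    px1, py1 = xs * patch_size, ys * patch_size          # top-left pixel of each patch
    px2, py2 = px1 + patch_size, py1 + patch_size        # bottom-right pixel of each patch
    overlaps = (px1 < x2) & (px2 > x1) & (py1 < y2) & (py2 > y1)
    keep = overlaps.flatten().nonzero(as_tuple=True)[0]  # indices of retained tokens
    return template_tokens[keep], keep
```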
Figure 5. Comparison of attention maps from different layers of the encoder. For each search region, the two rows show the attention from the search region to the template in different encoder blocks of our FETrack and OSTrack, respectively.
Figure 6. Cross-attention maps of the decoder. (a) Search region. (b–e) Cross-attention maps of the target token to the search region in the last layer of the decoder.
Figure 7. Visual comparisons between our FETrack and other state-of-the-art trackers on four representative sequences (i.e., bird-2, crab-3, giraffe-2, and kite-10) of the LaSOT dataset. Rectangular boxes of different colors mark the results obtained by different trackers, and the frame number is displayed in the upper left corner of each frame. These results validate the robustness of our tracker against challenges such as scale variation, similar targets, and occlusion.
Figure 8. Comparison of the success rate with state-of-the-art trackers on eleven challenging attributes of the OTB100 dataset (illumination variation, scale variation, occlusion, deformation, motion blur, fast motion, in-plane rotation, out-of-plane rotation, out of view, background clutter, and low resolution), as well as on all attributes (ALL). Lines of different colors represent the success rate obtained by each tracker on the corresponding attribute, and the value below each attribute is the best success rate achieved among all trackers on that attribute.
Table 1. Detailed training settings.
Parameter | Value
Data augmentation | Horizontal flipping and brightness jitter
Batch size | 32
Optimizer | Adam
Encoder's learning rate | 1 × 10⁻⁵
Decoder's learning rate | 1 × 10⁻⁴
Weight decay | 1 × 10⁻⁴
Training epochs | 500
Sample images per epoch | 60,000
Learning rate decay epoch | 400
Decay rate | 0.1
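The settings in Table 1 map directly onto a standard optimizer and step schedule. The sketch below mirrors the table (Adam, learning rates of 1 × 10⁻⁵ for the encoder and 1 × 10⁻⁴ for the decoder, weight decay of 1 × 10⁻⁴, and a decay by 0.1 at epoch 400 of 500); the `model.encoder` and `model.decoder` attribute names are assumptions, not those of the released code.

```python
import torch

def build_optimizer_and_scheduler(model):
    """Optimizer and schedule mirroring Table 1 (assumed module names)."""
    param_groups = [
        {"params": model.encoder.parameters(), "lr": 1e-5},   # encoder learning rate
        {"params": model.decoder.parameters(), "lr": 1e-4},   # decoder learning rate
    ]
    optimizer = torch.optim.Adam(param_groups, weight_decay=1e-4)
    # Multiply both learning rates by 0.1 once training reaches epoch 400 (of 500 total).
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[400], gamma=0.1)
    return optimizer, scheduler
```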
Table 2. Comparison with state-of-the-art methods on four large-scale benchmarks, including GOT-10k, LaSOT, TrackingNet, and LaSOT_ext. The symbol * indicates the model trained only with the GOT-10k training set. The best and the second-best results are marked in red and blue, respectively.
Method | Source | GOT-10k * (AO / SR0.5 / SR0.75, %) | LaSOT (AUC / Pnorm / P, %) | TrackingNet (AUC / Pnorm / P, %) | LaSOT_ext (AUC / Pnorm / P, %)
STARK [42] | ICCV21 | 68.8 / 78.1 / 64.1 | 67.1 / 77.0 / 72.2 | 82.0 / 86.9 / 79.1 | - / - / -
TransT [26] | CVPR21 | 67.1 / 76.8 / 60.9 | 64.9 / 73.8 / 69.0 | 81.4 / 86.7 / 80.3 | 45.1 / 51.3 / 51.2
OSTrack-256 [10] | ECCV22 | 71.0 / 80.4 / 68.2 | 69.1 / 78.7 / 75.2 | 83.1 / 87.8 / 82.0 | 47.4 / 57.3 / 53.3
Mixformer-1k [27] | CVPR22 | 71.2 / 79.9 / 65.8 | 67.9 / 77.3 / 73.9 | 82.6 / 87.7 / 81.2 | - / - / -
SwinTrack [43] | NIPS22 | 72.4 / 80.5 / 67.8 | 71.3 / 79.4 / 76.5 | 84.0 / 88.6 / 82.8 | 49.1 / 59.2 / 55.6
TATrack [11] | AAAI23 | 73.0 / 83.3 / 68.5 | 69.4 / 78.2 / 74.1 | 83.5 / 88.3 / 81.8 | - / - / -
GRM [14] | CVPR23 | 73.4 / 82.9 / 70.4 | 69.9 / 79.3 / 75.8 | 84.0 / 88.7 / 83.3 | - / - / -
VideoTrack [12] | CVPR23 | 72.9 / 81.9 / 69.8 | 70.2 / 79.2 / 76.4 | 83.8 / 88.7 / 83.1 | - / - / -
SeqTrack-256 [35] | CVPR23 | 74.7 / 84.7 / 71.8 | 69.9 / 79.7 / 76.3 | 83.3 / 88.3 / 82.2 | 49.5 / 60.8 / 56.3
ARTrack-256 [36] | CVPR23 | 73.5 / 82.2 / 70.9 | 70.4 / 79.5 / 76.6 | 84.2 / 88.7 / 83.5 | 46.4 / 56.5 / 52.3
DyTrack [44] | Preprint24 | 71.4 / 80.2 / 68.5 | 69.2 / 78.9 / 75.2 | 82.9 / 87.3 / 81.2 | 48.1 / 58.1 / 54.6
OIFTrack [32] | Preprint24 | 74.6 / 85.6 / 71.9 | 69.6 / 79.5 / 75.4 | 84.1 / 89.0 / 82.8 | - / - / -
FETrack-256 | Ours | 75.1 / 85.3 / 72.0 | 70.6 / 79.9 / 76.5 | 83.6 / 88.5 / 82.4 | 49.8 / 61.3 / 56.7
FETrack-384 | Ours | 74.9 / 84.8 / 71.7 | 71.8 / 81.2 / 77.9 | 84.1 / 89.3 / 83.8 | 50.7 / 61.9 / 57.7
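For readers unfamiliar with the GOT-10k protocol, AO is the mean per-frame IoU between predicted and ground-truth boxes, and SR_t is the fraction of frames whose IoU exceeds the threshold t (0.5 or 0.75). A minimal reference computation, with boxes in (x1, y1, x2, y2) format, is shown below; it reflects the standard metric definitions rather than the official evaluation toolkit.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def got10k_metrics(pred_boxes, gt_boxes):
    """Average overlap (AO) and success rates SR_0.5 / SR_0.75 over one sequence."""
    ious = np.array([iou(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    return ious.mean(), (ious > 0.5).mean(), (ious > 0.75).mean()
```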
Table 3. Comparison results of FETrack on two small-scale benchmarks, including NFS30 and OTB100. The tracking results are evaluated using the AUC (%) score. The best and the second-best results are marked in red and blue, respectively.
Benchmark | STARK [42] | TransT [26] | TrDiMP [45] | OSTrack-256 [10] | Mixformer [27] | AiATrack [28] | SeqTrack [35] | FETrack-256 (Ours) | FETrack-384 (Ours)
OTB100 | 68.5 | 69.4 | 70.8 | 68.1 | 70.4 | 69.6 | 68.3 | 71.5 | 70.9
NFS30 | 65.2 | 65.7 | 65.8 | 66.5 | 66.4 | 67.9 | 67.6 | 68.2 | 69.1
Table 4. Ablation study of the encoder modeling scheme. The best results are in bold fonts. The symbol * indicates the model trained only with the GOT-10k training set.
Method | GOT-10k * (AO / SR0.5 / SR0.75, %) | LaSOT (AUC / Pnorm / P, %)
TS | 71.6 / 80.3 / 68.2 | 65.7 / 74.1 / 71.6
OS (w/o CE) | 74.7 / 84.7 / 71.8 | 69.9 / 79.7 / 76.3
OS (w/ CE) | 74.7 / 84.8 / 71.8 | 70.2 / 79.7 / 76.2
Ours | 75.1 / 85.3 / 72.0 | 70.6 / 79.9 / 76.5
Table 5. Ablation study of the template-updating strategy. The best results are in bold fonts. The symbol * indicates the model trained only with the GOT-10k training set.
Method | GOT-10k * (AO / SR0.5 / SR0.75, %) | LaSOT (AUC / Pnorm / P, %)
w/o Update | 72.5 / 82.3 / 68.3 | 69.2 / 77.0 / 74.8
w/ Previous | 69.8 / 78.7 / 62.8 | 63.5 / 70.1 / 68.5
Mean | 75.4 / 85.5 / 72.6 | 68.6 / 77.7 / 73.6
Ours | 75.1 / 85.3 / 72.0 | 70.6 / 79.9 / 76.5
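The "Ours" row relies on the dynamic threshold-based online update introduced earlier. As a rough illustration of the idea only, not the paper's exact rule, the template could be refreshed when the tracker's confidence for the current frame exceeds a threshold adapted from recent confidence history; the window size, margin, and floor below are assumed values.

```python
from collections import deque

class DynamicThresholdUpdater:
    """Illustrative online template-update rule: refresh the template only when the
    current confidence exceeds a threshold adapted from recent confidence history."""

    def __init__(self, window=30, margin=0.05, floor=0.5):
        self.history = deque(maxlen=window)
        self.margin = margin
        self.floor = floor

    def should_update(self, confidence):
        if self.history:
            # Threshold follows the running mean confidence plus a margin, never below the floor.
            threshold = max(self.floor, sum(self.history) / len(self.history) + self.margin)
        else:
            threshold = self.floor
        self.history.append(confidence)
        return confidence > threshold
```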
Table 6. Ablation study of the template-filtering (TF) approach. The best results are in bold fonts. The symbol * indicates the model trained only with the GOT-10k training set.
Method | GOT-10k * (AO / SR0.5, %) | LaSOT (AUC / Pnorm, %) | MACs (G)
w/o TF | 75.0 / 85.1 | 70.3 / 79.0 | 34.1
w/ TF | 75.1 / 85.3 | 70.6 / 79.9 | 27.6