FETrack: Feature-Enhanced Transformer Network for Visual Object Tracking
Abstract
1. Introduction
- To tackle the performance degradation of one-stream trackers caused by unrestricted information interaction between the template and the search region, we propose FETrack, a feature-enhanced transformer-based network for visual object tracking. Experimental results on six benchmarks validate the superiority of FETrack and the effectiveness of the proposed modules.
- An independent template stream is incorporated into the one-stream encoder, which effectively prevents the template features from being degraded by background interference.
- A dynamic threshold-based online template-updating strategy is presented to adaptively select the candidate template with the highest similarity, and a template-filtering approach is designed to suppress background noise and substantially reduce the computational cost (a minimal sketch of the update rule is given after this list).
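Since the paper's code is not reproduced here, the following is only a minimal sketch of how such a dynamic threshold-based update could work; the feature pooling, the 20-frame window, and the 50/50 blend of a fixed floor with recent similarities are illustrative assumptions, not the authors' exact rule.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_template(template: torch.Tensor,    # (C,) pooled feature of the current template
                    candidates: torch.Tensor,  # (K, C) pooled features of candidate templates
                    sim_history: list,
                    base_thr: float = 0.6) -> torch.Tensor:
    """Adopt the most similar candidate only if it clears a dynamic threshold.

    Illustrative sketch: the threshold blends a fixed floor with the mean of
    recent similarities, so the criterion tightens while tracking is stable
    and relaxes after hard frames.
    """
    sims = F.cosine_similarity(candidates, template.unsqueeze(0), dim=-1)  # (K,)
    best_sim, best_idx = sims.max(dim=0)

    recent = sim_history[-20:] or [base_thr]  # fall back to the floor on the first frame
    dyn_thr = 0.5 * base_thr + 0.5 * sum(recent) / len(recent)

    sim_history.append(best_sim.item())
    return candidates[best_idx] if best_sim.item() >= dyn_thr else template
```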
2. Related Work
2.1. Visual Object Tracking
2.2. Transformer in Tracking
2.3. Sequence Learning
3. Proposed Method
3.1. Overall
3.2. Network Architecture
3.3. Online Template Updating
3.4. Template Filtering
3.5. Training and Inference
4. Experiments
4.1. Implementation Details
4.2. Comparison with the State of the Art
4.3. Ablation Study and Analysis
4.4. Visualization and Qualitative Analysis
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Javed, S.; Danelljan, M.; Khan, F.S.; Khan, M.H.; Felsberg, M.; Matas, J. Visual Object Tracking with Discriminative Filters and Siamese Networks: A Survey and Outlook. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 6552–6574.
- Choubisa, M.; Kumar, V.; Kumar, M.; Khanna, S. Object Tracking in Intelligent Video Surveillance System Based on Artificial System. In Proceedings of the 2023 International Conference on Computational Intelligence, Communication Technology and Networking (CICTN), Ghaziabad, India, 7–8 December 2023; pp. 160–166.
- Barbu, T.; Bejinariu, S.I.; Luca, R. Transfer Learning-Based Framework for Automatic Vehicle Detection, Recognition and Tracking. In Proceedings of the 2024 International Conference on Electronics, Computers and Artificial Intelligence (ECAI), Iasi, Romania, 27–28 June 2024; pp. 1–6.
- Cao, X. Eye Tracking in Human-computer Interaction Recognition. In Proceedings of the 2023 IEEE International Conference on Sensors, Electronics and Computer Engineering (ICSECE), Jinzhou, China, 18–20 August 2023; pp. 203–207.
- Ibragimov, B.; Mello-Thoms, C. The Use of Machine Learning in Eye Tracking Studies in Medical Imaging: A Review. IEEE J. Biomed. Health Inform. 2024, 28, 3597–3612.
- Kugarajeevan, J.; Kokul, T.; Ramanan, A.; Fernando, S. Transformers in Single Object Tracking: An Experimental Survey. IEEE Access 2023, 11, 80297–80326.
- Deng, A.; Liu, J.; Chen, Q.; Wang, X.; Zuo, Y. Visual Tracking with FPN Based on Transformer and Response Map Enhancement. Appl. Sci. 2022, 12, 6551.
- Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv 2014, arXiv:1406.1078.
- Chen, B.; Li, P.; Bai, L.; Qiao, L.; Shen, Q.; Li, B.; Gan, W.; Wu, W.; Ouyang, W. Backbone Is All Your Need: A Simplified Architecture for Visual Object Tracking. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 375–392.
- Ye, B.; Chang, H.; Ma, B.; Shan, S.; Chen, X. Joint Feature Learning and Relation Modeling for Tracking: A One-Stream Framework. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 341–357.
- He, K.; Zhang, C.; Xie, S.; Li, Z.; Wang, Z. Target-aware tracking with long-term context attention. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 773–780.
- Xie, F.; Chu, L.; Li, J.; Lu, Y.; Ma, C. VideoTrack: Learning to Track Objects via Video Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 22826–22835.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
- Gao, S.; Zhou, C.; Zhang, J. Generalized relation modeling for transformer tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 18686–18695.
- Bertinetto, L.; Valmadre, J.; Henriques, J.F.; Vedaldi, A.; Torr, P.H. Fully-convolutional siamese networks for object tracking. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–10 and 15–16 October 2016; Springer: Cham, Switzerland, 2016; pp. 850–865.
- Choi, J. Target-Aware Feature Bottleneck for Real-Time Visual Tracking. Appl. Sci. 2023, 13, 10198.
- Huang, L.; Zhao, X.; Huang, K. GOT-10k: A Large High-Diversity Benchmark for Generic Object Tracking in the Wild. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 1562–1577.
- Fan, H.; Lin, L.; Yang, F.; Chu, P.; Deng, G.; Yu, S.; Bai, H.; Xu, Y.; Liao, C.; Ling, H. LaSOT: A High-Quality Benchmark for Large-Scale Single Object Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5374–5383.
- Muller, M.; Bibi, A.; Giancola, S.; Alsubaihi, S.; Ghanem, B. TrackingNet: A Large-Scale Dataset and Benchmark for Object Tracking in the Wild. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 300–317.
- Fan, H.; Bai, H.; Lin, L.; Yang, F.; Chu, P.; Deng, G.; Yu, S.; Harshit; Huang, M.; Liu, J.; et al. LaSOT: A High-quality Large-scale Single Object Tracking Benchmark. Int. J. Comput. Vis. 2021, 129, 439–461.
- Wu, Y.; Lim, J.; Yang, M.H. Object Tracking Benchmark. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1834–1848.
- Kiani Galoogahi, H.; Fagg, A.; Huang, C.; Ramanan, D.; Lucey, S. Need for Speed: A Benchmark for Higher Frame Rate Object Tracking. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1125–1134.
- Zhang, Z.; Peng, H.; Fu, J.; Li, B.; Hu, W. Ocean: Object-aware anchor-free tracking. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 771–787.
- Xu, Z.; Huang, D.; Huang, X.; Song, J.; Liu, H. DLUT: Decoupled Learning-Based Unsupervised Tracker. Sensors 2024, 24, 83.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
- Chen, X.; Yan, B.; Zhu, J.; Wang, D.; Yang, X.; Lu, H. Transformer Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 8126–8135.
- Cui, Y.; Jiang, C.; Wang, L.; Wu, G. MixFormer: End-to-End Tracking with Iterative Mixed Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 13608–13618.
- Gao, S.; Zhou, C.; Ma, C.; Wang, X.; Yuan, J. AiATrack: Attention in Attention for Transformer Visual Tracking. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 146–164.
- Ma, Z. Hybrid Transformer-CNN Feature Enhancement Network for Visual Object Tracking. In Proceedings of the 2024 5th International Seminar on Artificial Intelligence, Networking and Information Technology (AINIT), Nanjing, China, 22–24 March 2024; pp. 1917–1921.
- Xie, F.; Wang, C.; Wang, G.; Yang, W.; Zeng, W. Learning Tracking Representations via Dual-Branch Fully Transformer Networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, QC, Canada, 11–17 October 2021; pp. 2688–2697.
- Lan, J.P.; Cheng, Z.Q.; He, J.Y.; Li, C.; Luo, B.; Bao, X.; Xiang, W.; Geng, Y.; Xie, X. ProContEXT: Exploring Progressive Context Transformer for Tracking. In Proceedings of the ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5.
- Kugarajeevan, J.; Kokul, T.; Ramanan, A.; Fernando, S. Optimized Information Flow for Transformer Tracking. arXiv 2024, arXiv:2402.08195.
- Wang, Z.; Zhou, Z.; Chen, F.; Xu, J.; Pei, W.; Lu, G. Robust Tracking via Fully Exploring Background Prior Knowledge. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 3353–3367.
- Xie, F.; Yang, W.; Wang, C.; Chu, L.; Cao, Y.; Ma, C.; Zeng, W. Correlation-Embedded Transformer Tracking: A Single-Branch Framework. arXiv 2024, arXiv:2401.12743.
- Chen, X.; Peng, H.; Wang, D.; Lu, H.; Hu, H. SeqTrack: Sequence to Sequence Learning for Visual Object Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 14572–14581.
- Wei, X.; Bai, Y.; Zheng, Y.; Shi, D.; Gong, Y. Autoregressive Visual Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 9697–9706.
- Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to Sequence Learning with Neural Networks. Adv. Neural Inf. Process. Syst. 2014, 27, 3104–3112.
- Chen, T.; Saxena, S.; Li, L.; Fleet, D.J.; Hinton, G. Pix2seq: A Language Modeling Framework for Object Detection. arXiv 2021, arXiv:2109.10852.
- Chen, T.; Saxena, S.; Li, L.; Lin, T.Y.; Fleet, D.J.; Hinton, G.E. A Unified Sequence Interface for Vision Tasks. Adv. Neural Inf. Process. Syst. 2022, 35, 31333–31346.
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 740–755.
- Kristan, M.; Leonardis, A.; Matas, J.; Felsberg, M.; Pflugfelder, R.; Kämäräinen, J.K.; Danelljan, M.; Zajc, L.Č.; Lukežič, A.; Drbohlav, O.; et al. The Eighth Visual Object Tracking VOT2020 Challenge Results. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 547–601.
- Yan, B.; Peng, H.; Fu, J.; Wang, D.; Lu, H. Learning Spatio-Temporal Transformer for Visual Tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 10448–10457.
- Cai, Y.; Liu, J.; Tang, J.; Wu, G. Robust Object Modeling for Visual Tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 9589–9600.
- Zhu, J.; Chen, X.; Diao, H.; Li, S.; He, J.Y.; Li, C.; Luo, B.; Wang, D.; Lu, H. Exploring Dynamic Transformer for Efficient Object Tracking. arXiv 2024, arXiv:2403.17651.
- Wang, N.; Zhou, W.; Wang, J.; Li, H. Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 1571–1580.
| Parameter | Value |
| --- | --- |
| Data augmentation | Horizontal flipping and brightness jitter |
| Batch size | 32 |
| Optimizer | Adam |
| Encoder’s learning rate | |
| Decoder’s learning rate | |
| Weight decay | |
| Training epochs | 500 |
| Sample images per epoch | 60,000 |
| Learning rate decay epoch | 400 |
| Decay rate | 0.1 |
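Read as a PyTorch recipe, the schedule in the table amounts to the sketch below. The concrete learning rates and weight decay are placeholders (the table leaves them blank), and the encoder/decoder modules are stand-ins that merely mark the two parameter groups.

```python
import torch
import torch.nn as nn

# Stand-in modules marking the two parameter groups; not the real FETrack network.
model = nn.ModuleDict({"encoder": nn.Linear(256, 256), "decoder": nn.Linear(256, 4)})

# Placeholder values: the table above omits the concrete rates and decay.
ENC_LR, DEC_LR, WEIGHT_DECAY = 1e-5, 1e-4, 1e-4

optimizer = torch.optim.Adam(
    [
        {"params": model["encoder"].parameters(), "lr": ENC_LR},
        {"params": model["decoder"].parameters(), "lr": DEC_LR},
    ],
    weight_decay=WEIGHT_DECAY,
)

# Decay both learning rates by a factor of 0.1 at epoch 400 of 500, per the table.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[400], gamma=0.1)

for epoch in range(500):
    # ... one epoch over 60,000 sampled images with batch size 32 (loop elided) ...
    scheduler.step()
```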
| Method | Source | GOT-10k * AO (%) | GOT-10k * SR0.5 (%) | GOT-10k * SR0.75 (%) | LaSOT AUC (%) | LaSOT Pnorm (%) | LaSOT P (%) | TrackingNet AUC (%) | TrackingNet Pnorm (%) | TrackingNet P (%) | LaSOText AUC (%) | LaSOText Pnorm (%) | LaSOText P (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| STARK [42] | ICCV21 | 68.8 | 78.1 | 64.1 | 67.1 | 77.0 | 72.2 | 82.0 | 86.9 | 79.1 | - | - | - |
| TransT [26] | CVPR21 | 67.1 | 76.8 | 60.9 | 64.9 | 73.8 | 69.0 | 81.4 | 86.7 | 80.3 | 45.1 | 51.3 | 51.2 |
| OSTrack-256 [10] | ECCV22 | 71.0 | 80.4 | 68.2 | 69.1 | 78.7 | 75.2 | 83.1 | 87.8 | 82.0 | 47.4 | 57.3 | 53.3 |
| Mixformer-1k [27] | CVPR22 | 71.2 | 79.9 | 65.8 | 67.9 | 77.3 | 73.9 | 82.6 | 87.7 | 81.2 | - | - | - |
| SwinTrack [43] | NIPS22 | 72.4 | 80.5 | 67.8 | 71.3 | 79.4 | 76.5 | 84.0 | 88.6 | 82.8 | 49.1 | 59.2 | 55.6 |
| TATrack [11] | AAAI23 | 73.0 | 83.3 | 68.5 | 69.4 | 78.2 | 74.1 | 83.5 | 88.3 | 81.8 | - | - | - |
| GRM [14] | CVPR23 | 73.4 | 82.9 | 70.4 | 69.9 | 79.3 | 75.8 | 84.0 | 88.7 | 83.3 | - | - | - |
| VideoTrack [12] | CVPR23 | 72.9 | 81.9 | 69.8 | 70.2 | 79.2 | 76.4 | 83.8 | 88.7 | 83.1 | - | - | - |
| SeqTrack-256 [35] | CVPR23 | 74.7 | 84.7 | 71.8 | 69.9 | 79.7 | 76.3 | 83.3 | 88.3 | 82.2 | 49.5 | 60.8 | 56.3 |
| ARTrack-256 [36] | CVPR23 | 73.5 | 82.2 | 70.9 | 70.4 | 79.5 | 76.6 | 84.2 | 88.7 | 83.5 | 46.4 | 56.5 | 52.3 |
| DyTrack [44] | Preprint24 | 71.4 | 80.2 | 68.5 | 69.2 | 78.9 | 75.2 | 82.9 | 87.3 | 81.2 | 48.1 | 58.1 | 54.6 |
| OIFTrack [32] | Preprint24 | 74.6 | 85.6 | 71.9 | 69.6 | 79.5 | 75.4 | 84.1 | 89.0 | 82.8 | - | - | - |
| FETrack-256 | Ours | 75.1 | 85.3 | 72.0 | 70.6 | 79.9 | 76.5 | 83.6 | 88.5 | 82.4 | 49.8 | 61.3 | 56.7 |
| FETrack-384 | Ours | 74.9 | 84.8 | 71.7 | 71.8 | 81.2 | 77.9 | 84.1 | 89.3 | 83.8 | 50.7 | 61.9 | 57.7 |
| Benchmark (AUC, %) | STARK [42] | TransT [26] | TrDiMP [45] | OSTrack-256 [10] | Mixformer [27] | AiATrack [28] | SeqTrack [35] | FETrack-256 (Ours) | FETrack-384 (Ours) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OTB100 | 68.5 | 69.4 | 70.8 | 68.1 | 70.4 | 69.6 | 68.3 | 71.5 | 70.9 |
| NFS30 | 65.2 | 65.7 | 65.8 | 66.5 | 66.4 | 67.9 | 67.6 | 68.2 | 69.1 |
| Method | GOT-10k * AO (%) | GOT-10k * SR0.5 (%) | GOT-10k * SR0.75 (%) | LaSOT AUC (%) | LaSOT Pnorm (%) | LaSOT P (%) |
| --- | --- | --- | --- | --- | --- | --- |
| TS | 71.6 | 80.3 | 68.2 | 65.7 | 74.1 | 71.6 |
| OS (w/o CE) | 74.7 | 84.7 | 71.8 | 69.9 | 79.7 | 76.3 |
| OS (w/ CE) | 74.7 | 84.8 | 71.8 | 70.2 | 79.7 | 76.2 |
| Ours | 75.1 | 85.3 | 72.0 | 70.6 | 79.9 | 76.5 |
| Method | GOT-10k * AO (%) | GOT-10k * SR0.5 (%) | GOT-10k * SR0.75 (%) | LaSOT AUC (%) | LaSOT Pnorm (%) | LaSOT P (%) |
| --- | --- | --- | --- | --- | --- | --- |
| w/o Update | 72.5 | 82.3 | 68.3 | 69.2 | 77.0 | 74.8 |
| w/ Previous | 69.8 | 78.7 | 62.8 | 63.5 | 70.1 | 68.5 |
| Mean | 75.4 | 85.5 | 72.6 | 68.6 | 77.7 | 73.6 |
| Ours | 75.1 | 85.3 | 72.0 | 70.6 | 79.9 | 76.5 |
| Method | GOT-10k * AO (%) | GOT-10k * SR0.5 (%) | LaSOT AUC (%) | LaSOT Pnorm (%) | MACs (G) |
| --- | --- | --- | --- | --- | --- |
| w/o TF | 75.0 | 85.1 | 70.3 | 79.0 | 34.1 |
| w/ TF | 75.1 | 85.3 | 70.6 | 79.9 | 27.6 |
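The MACs reduction in the table above is consistent with discarding low-value template tokens before attention; the sketch below shows one plausible form of such filtering. The attention-based score and the keep ratio are our assumptions, not the paper's exact criterion.

```python
import torch

def filter_template_tokens(tokens: torch.Tensor,  # (N, C) template tokens
                           scores: torch.Tensor,  # (N,) per-token relevance, e.g., attention received
                           keep_ratio: float = 0.7) -> torch.Tensor:
    """Keep only the top-scoring template tokens.

    Dropping low-scoring (typically background) tokens both removes noise
    from the template and shrinks every subsequent attention computation,
    which is where the MACs saving comes from.
    """
    k = max(1, int(keep_ratio * tokens.shape[0]))
    keep = scores.topk(k).indices
    return tokens[keep]
```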
Share and Cite
Liu, H.; Huang, D.; Lin, M. FETrack: Feature-Enhanced Transformer Network for Visual Object Tracking. Appl. Sci. 2024, 14, 10589. https://doi.org/10.3390/app142210589