STFTrack: Spatio-Temporal-Focused Siamese Network for Infrared UAV Tracking
Abstract
1. Introduction
1. We propose a spatio-temporal-focused Siamese network (STFTrack) for infrared UAV tracking, which incorporates spatio-temporal constraints into a Siamese-based global search tracking framework through a two-level global-to-local focusing strategy.
2. To achieve search-region focusing, we propose a spatio-temporal information-guided region proposal network (STG-RPN), which applies an adaptive region selection strategy to generate high-quality candidate targets.
3. To improve the accuracy of target discrimination, we propose an instance discrimination R-CNN (ID-RCNN), which focuses on the true target among the candidates; metric learning further enhances the discriminability of the target features.
2. Proposed Method
2.1. Deep Feature Extraction Based on the Siamese Network
2.2. Candidate Target Generation Based on STG-RPN
2.2.1. Target Location Prediction
2.2.2. Anchor Shape Prediction
2.2.3. Feature Adaptation
2.3. Target Discrimination Based on ID-RCNN
2.3.1. QG-RCNN Branch
2.3.2. DML Branch
2.3.3. Target Discrimination
2.4. Loss Function
2.5. Tracking Process
Algorithm 1: STFTrack: Spatio-temporal-focused Siamese network for infrared UAV tracking.

Input: Tracking sequence of length L; initial target bounding box.
Output: Target bounding box for each frame.

1:  for f = 1 to L do
2:      if f == 1 then
3:          Construct multi-scale template image features;
4:          Extract multi-scale template target features using ROIAlign;
5:          Initialize the Kalman filter;
6:      else
7:          Compute the multi-scale search feature map;
8:          Compute the target location map by Equations (2), (6) and (7), and set anchor points in the region above threshold t;
9:          Predict the anchor shape at each anchor point to obtain the focused anchor boxes;
10:         Obtain the adapted feature by Equation (9);
11:         Using the focused anchor boxes and the adapted feature, obtain ROIs with candidate features through the RPN;
12:         Calculate the modulated feature by Equation (10);
13:         Using the modulated feature, obtain the classification scores and regression parameters of each candidate through the RCNN;
14:         Calculate the similarity between each candidate and the template target by Equation (12);
15:         Obtain the discriminant scores by Equation (13);
16:         Choose the candidate with the highest discriminant score as the tracking target and output the refined bounding box;
17:         Update the Kalman filter;
18:     end if
19: end for
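The control flow of Algorithm 1 can be sketched as a short Python loop. The helper interfaces (`extract_feat`, `propose`, `discriminate`) are placeholders for the Siamese backbone, the STG-RPN, and the ID-RCNN; their signatures, and the simple exponential motion-state update standing in for the Kalman filter, are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def track(frames, init_box, extract_feat, propose, discriminate):
    """Minimal sketch of the STFTrack loop in Algorithm 1.

    frames: iterable of frame data; init_box: (x, y, w, h) for frame 1.
    extract_feat / propose / discriminate stand in for the Siamese
    backbone, STG-RPN, and ID-RCNN; interfaces are assumptions.
    """
    boxes = [tuple(init_box)]
    # Frame 1: build the template features from the initial box.
    template = extract_feat(frames[0], init_box)
    # Crude stand-in for the Kalman filter state (box only, no velocity).
    state = np.array(init_box, dtype=float)
    for f in frames[1:]:
        search = extract_feat(f, None)          # multi-scale search features
        candidates = propose(search, state)     # STG-RPN: focused candidates
        scores = [discriminate(template, c) for c in candidates]
        best = int(np.argmax(scores))           # highest discriminant score
        box = np.array(candidates[best], dtype=float)
        state = 0.5 * state + 0.5 * box         # stand-in motion-model update
        boxes.append(tuple(state))
    return boxes
```

The key structural point is the two-level focus: `propose` narrows the global search region to a few candidates, and `discriminate` picks the target among them.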
3. Experiments
3.1. Datasets and Evaluation Metrics
3.1.1. Datasets
3.1.2. Evaluation Metrics
3.2. Implementation Details
- CPU: Intel Core @ …70 GHz
- GPU: 4× NVIDIA GeForce RTX 2080 Ti graphics cards, each with 11 GB of GDDR6 memory
- Storage: 1 TB solid-state drive
- Memory: 64 GB DDR4
- Operating system: Windows 10
- Programming environment: Python 3.7 and PyTorch 1.10
3.3. Quantitative Evaluation
3.4. Attribute-Based Evaluation
3.5. Qualitative Evaluation
3.6. Ablation Analysis
3.7. Tracking Generalizability Evaluation
4. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Fan, J.; Yang, X.; Lu, R.; Xie, X.; Li, W. Design and Implementation of Intelligent Inspection and Alarm Flight System for Epidemic Prevention. Drones 2021, 5, 68.
- Filkin, T.; Sliusar, N.; Ritzkowski, M.; Huber-Humer, M. Unmanned Aerial Vehicles for Operational Monitoring of Landfills. Drones 2021, 5, 125.
- Svanström, F.; Alonso-Fernandez, F.; Englund, C. Drone Detection and Tracking in Real-Time by Fusion of Different Sensing Modalities. Drones 2022, 6, 317.
- Dewangan, V.; Saxena, A.; Thakur, R.; Tripathi, S. Application of Image Processing Techniques for UAV Detection Using Deep Learning and Distance-Wise Analysis. Drones 2023, 7, 174.
- Luo, J.; Wang, Z. A Review of Development and Application of UAV Detection and Counter Technology. Control Decis. 2022, 37, 530–544.
- Bertinetto, L.; Valmadre, J.; Henriques, J.; Vedaldi, A.; Torr, P. Fully-Convolutional Siamese Networks for Object Tracking. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 850–865.
- Li, B.; Yan, J.; Wu, W.; Zhu, Z.; Hu, X. High Performance Visual Tracking with Siamese Region Proposal Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8971–8980.
- Danelljan, M.; Bhat, G.; Khan, F.S.; Felsberg, M. ATOM: Accurate Tracking by Overlap Maximization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 4660–4669.
- Bhat, G.; Danelljan, M.; Gool, L.; Timofte, R. Learning Discriminative Model Prediction for Tracking. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6182–6191.
- Li, B.; Wu, W.; Wang, Q.; Zhang, F.; Xing, J.; Yan, J. SiamRPN++: Evolution of Siamese Visual Tracking with Very Deep Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 4277–4286.
- Fan, H.; Ling, H. SANet: Structure-Aware Network for Visual Tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 42–49.
- Wang, C.; Shi, Z.; Meng, L.; Wang, J.; Wang, T.; Gao, Q.; Wang, E. Anti-Occlusion UAV Tracking Algorithm with a Low-Altitude Complex Background by Integrating Attention Mechanism. Drones 2022, 6, 149.
- Bhat, G.; Danelljan, M.; Van Gool, L.; Timofte, R. Know Your Surroundings: Exploiting Scene Information for Object Tracking. In Proceedings of the European Conference on Computer Vision, Online, 23–28 August 2020; pp. 205–221.
- Zhang, H.; Li, X.; Zhu, B.; Zhang, Y. Two-Stage Object Tracking Method Based on Siamese Neural Network. Infrared Laser Eng. 2021, 50, 20200491.
- Sun, L.; Zhang, J.; Yang, Z.; Fan, B. A Motion-Aware Siamese Framework for Unmanned Aerial Vehicle Tracking. Drones 2023, 7, 153.
- Kalal, Z.; Mikolajczyk, K.; Matas, J. Tracking-Learning-Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 1409–1422.
- Yan, B.; Zhao, H.; Wang, D.; Lu, H.; Yang, X. ‘Skimming-Perusal’ Tracking: A Framework for Real-Time and Robust Long-Term Tracking. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2385–2393.
- Dai, K.; Zhang, Y.; Wang, D.; Li, J.; Lu, H.; Yang, X. High-Performance Long-Term Tracking with Meta-Updater. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 6298–6307.
- Zhao, J.; Zhang, X.; Zhang, P. A Unified Approach for Tracking UAVs in Infrared. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 1213–1222.
- Huang, L.; Zhao, X.; Huang, K. GlobalTrack: A Simple and Strong Baseline for Long-Term Tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 11037–11044.
- Voigtlaender, P.; Luiten, J.; Torr, P.; Leibe, B. Siam R-CNN: Visual Tracking by Re-Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 6578–6588.
- Fang, H.; Wang, X.; Liao, Z.; Chang, Y.; Yan, L. A Real-Time Anti-Distractor Infrared UAV Tracker with Channel Feature Refinement Module. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 1240–1248.
- Chen, J.; Huang, B.; Li, J.; Wang, Y.; Ren, M.; Xu, T. Learning Spatio-Temporal Attention Based Siamese Network for Tracking UAVs in the Wild. Remote Sens. 2022, 14, 1797.
- Shi, X.; Zhang, Y.; Shi, Z.; Zhang, Y. GASiam: Graph Attention Based Siamese Tracker for Infrared Anti-UAV. In Proceedings of the International Conference on Computer Vision, Image and Deep Learning & International Conference on Computer Engineering and Applications (CVIDL & ICCEA), Changchun, China, 20–22 May 2022; pp. 986–993.
- Lin, T.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
- Wang, J.; Chen, K.; Yang, S.; Loy, C.; Lin, D. Region Proposal by Guided Anchoring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 2965–2974.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
- Song, G.; Liu, Y.; Jiang, M.; Wang, Y.; Yan, J.; Leng, B. Beyond Trade-Off: Accelerate FCN-Based Face Detector with Higher Accuracy. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7756–7764.
- Cakir, F.; He, K.; Xia, X.; Kulis, B.; Sclaroff, S. Deep Metric Learning to Rank. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 1861–1870.
- Lin, T.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 318–327.
- Tychsen-Smith, L.; Petersson, L. Improving Object Localization with Fitness NMS and Bounded IOU Loss. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6877–6885.
- Cheng, D.; Gong, Y.; Zhou, S.; Wang, J.; Zheng, N. Person Re-Identification by Multi-Channel Parts-Based CNN with Improved Triplet Loss Function. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 27–30 June 2016; pp. 1335–1344.
- Schroff, F.; Kalenichenko, D.; Philbin, J. FaceNet: A Unified Embedding for Face Recognition and Clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 815–823.
- Jiang, N.; Wang, K.; Peng, X.; Yu, X.; Han, Z. Anti-UAV: A Large Multi-Modal Benchmark for UAV Tracking. IEEE Trans. Multimed. 2023, 25, 486–500.
- Liu, Q.; Li, X.; He, Z.; Li, C.; Li, J.; Zhou, Z.; Yuan, D.; Li, J.; Yang, K.; Fan, N.; et al. LSOTB-TIR: A Large-Scale High-Diversity Thermal Infrared Object Tracking Benchmark. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 3847–3856.
- Yan, B.; Peng, H.; Fu, J.; Wang, D.; Lu, H. Learning Spatio-Temporal Transformer for Visual Tracking. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10448–10457.
- Zolfaghari, M.; Singh, K.; Brox, T. ECO: Efficient Convolutional Network for Online Video Understanding. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 713–730.
- Danelljan, M.; Häger, G.; Khan, F.; Felsberg, M. Discriminative Scale Space Tracking. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1561–1575.
- Nam, H.; Han, B. Learning Multi-Domain Convolutional Neural Networks for Visual Tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 27–30 June 2016; pp. 4293–4302.
- Wang, Q.; Zhang, L.; Bertinetto, L.; Hu, W.; Torr, P. Fast Online Object Tracking and Segmentation: A Unifying Approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 1328–1338.
- Liu, Q.; Li, X.; He, Z.; Fan, N.; Yuan, D.; Liu, W.; Liang, Y. Multi-Task Driven Feature Models for Thermal Infrared Tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 11604–11611.
| Method | Precision ↑ | Success ↑ | Accuracy ↑ | FPS ↑ |
|---|---|---|---|---|
| DSST | 0.490 | 0.349 | 0.354 | 31.2 |
| SiamFC | 0.510 | 0.369 | 0.375 | 60.2 |
| ECO | 0.618 | 0.437 | 0.444 | 7.5 |
| ATOM | 0.711 | 0.484 | 0.490 | 28.7 |
| SiamRPN++LT | 0.756 | 0.501 | 0.509 | 26.3 |
| STARK | 0.843 | 0.588 | 0.607 | 33.5 |
| GlobalTrack | 0.889 | 0.639 | 0.642 | 9.2 |
| Siam-RCNN | 0.900 | 0.653 | 0.658 | 6.1 |
| Ours | 0.912 | 0.666 | 0.677 | 12.4 |
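The precision and success scores in the table above follow the standard one-pass-evaluation protocol for tracking benchmarks: precision is the fraction of frames whose predicted box center lies within a pixel threshold (commonly 20 px) of the ground truth, and success is the area under the overlap-threshold curve. The paper does not reproduce these formulas here, so the sketch below uses the conventional definitions as an assumption.

```python
import numpy as np

def center_error(b1, b2):
    # b = (x, y, w, h); Euclidean distance between box centers, in pixels
    c1 = (b1[0] + b1[2] / 2, b1[1] + b1[3] / 2)
    c2 = (b2[0] + b2[2] / 2, b2[1] + b2[3] / 2)
    return float(np.hypot(c1[0] - c2[0], c1[1] - c2[1]))

def iou(b1, b2):
    # intersection-over-union of two (x, y, w, h) boxes
    x1, y1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    x2 = min(b1[0] + b1[2], b2[0] + b2[2])
    y2 = min(b1[1] + b1[3], b2[1] + b2[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = b1[2] * b1[3] + b2[2] * b2[3] - inter
    return inter / union if union > 0 else 0.0

def precision_at(preds, gts, tau=20.0):
    # fraction of frames with center error below tau (20 px is the usual default)
    errs = [center_error(p, g) for p, g in zip(preds, gts)]
    return sum(e <= tau for e in errs) / len(errs)

def success_auc(preds, gts, thresholds=np.linspace(0, 1, 21)):
    # mean success rate over IoU thresholds (area under the success plot)
    ious = [iou(p, g) for p, g in zip(preds, gts)]
    return float(np.mean([np.mean([i > t for i in ious]) for t in thresholds]))
```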
| Num. | QG-RPN | STG-RPN | QG-RCNN | ID-RCNN | Precision ↑ | Success ↑ | Accuracy ↑ |
|---|---|---|---|---|---|---|---|
| 1 | √ | | | | 0.838 | 0.586 | 0.594 |
| 2 | | √ | | | 0.865 | 0.623 | 0.631 |
| 3 | √ | | √ | | 0.889 | 0.635 | 0.642 |
| 4 | √ | | | √ | 0.887 | 0.643 | 0.653 |
| 5 | | √ | √ | | 0.903 | 0.654 | 0.663 |
| 6 | | √ | | √ | 0.916 | 0.666 | 0.677 |
| Spatial Location Prediction | Temporal Location Prediction | Precision ↑ | Success ↑ | Accuracy ↑ |
|---|---|---|---|---|
| √ | | 0.890 | 0.645 | 0.658 |
| | √ | 0.723 | 0.484 | 0.491 |
| √ | √ | 0.916 | 0.666 | 0.677 |
| Loss Function | Precision ↑ | Success ↑ | Accuracy ↑ |
|---|---|---|---|
| No metric loss | 0.903 | 0.654 | 0.663 |
| Naive triplet loss | 0.907 | 0.658 | 0.667 |
| Hard triplet loss | 0.913 | 0.662 | 0.674 |
| Constrained triplet loss | 0.916 | 0.666 | 0.677 |
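The triplet-loss variants compared in the table above can be illustrated with a small sketch. The naive form averages the margin loss over all positive/negative pairs, while the hard form mines the farthest positive and closest negative, as in FaceNet-style hard triplets. The paper's constrained triplet loss adds a further constraint term that is not reproduced here; the embedding shapes and `margin` value below are assumptions for illustration.

```python
import numpy as np

def triplet_loss(anchor, positives, negatives, margin=0.5, hard=True):
    """Sketch of naive vs. hard triplet losses over embedding vectors.

    anchor: (d,) vector; positives/negatives: (n, d) arrays of embeddings.
    """
    d_pos = np.linalg.norm(positives - anchor, axis=1)  # anchor-positive distances
    d_neg = np.linalg.norm(negatives - anchor, axis=1)  # anchor-negative distances
    if hard:
        # hard mining: farthest positive, closest negative
        return float(max(0.0, d_pos.max() - d_neg.min() + margin))
    # naive: hinge loss averaged over every positive/negative pair
    losses = np.maximum(0.0, d_pos[:, None] - d_neg[None, :] + margin)
    return float(losses.mean())
```

Hard mining concentrates the gradient on the most confusable candidates, which is why it tends to separate the target from UAV-like distractors better than the naive average.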
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Xie, X.; Xi, J.; Yang, X.; Lu, R.; Xia, W. STFTrack: Spatio-Temporal-Focused Siamese Network for Infrared UAV Tracking. Drones 2023, 7, 296. https://doi.org/10.3390/drones7050296