Toward Versatile Small Object Detection with Temporal-YOLOv8
Abstract
1. Introduction
2. Related Work
2.1. Object Detection and YOLOv8
2.2. Temporal Information and Deep Learning
2.3. Temporal Features in YOLO for Small Object Detection
2.4. Challenges in Small Object Detection
2.5. Datasets for Small Object Detection
3. Methods
3.1. Temporal-YOLOv8
3.1.1. Variants
3.2. Balanced Mosaicking
3.3. Bounding Box Augmentations
3.4. Metrics
4. Experiments and Results
4.1. Datasets
Enhanced Annotation
4.2. Ablation Study
4.3. Dataset Impact Study
4.3.1. Specificity vs. Generalization
4.3.2. Public Datasets
4.4. Detection Characteristics
5. Discussion
5.1. Ablation Study
Dataset Impact Study
5.2. Future Work
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
| Scaling Factor [Range] | Kernel Size [Pixels] |
|---|---|
| 0.00–0.15 | 13 |
| 0.15–0.25 | 11 |
| 0.25–0.50 | 7 |
| 0.50–0.90 | 3 |
| 0.90–1.00 | no blur |
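The table above maps each down-scaling factor range to a Gaussian-blur kernel size. A minimal sketch of this lookup is shown below; the function name and list layout are our own illustration (the paper's actual implementation is not reproduced here), and the returned kernel size would typically be passed to a blur routine such as OpenCV's `cv2.GaussianBlur`.

```python
# Kernel size per down-scaling factor range, per the table above.
# Ranges are treated as half-open [low, high); a factor in [0.90, 1.00]
# receives no blur, encoded here as None.
BLUR_KERNELS = [
    (0.15, 13),   # scale < 0.15 -> 13 px kernel
    (0.25, 11),   # scale < 0.25 -> 11 px kernel
    (0.50, 7),    # scale < 0.50 -> 7 px kernel
    (0.90, 3),    # scale < 0.90 -> 3 px kernel
    (1.00, None), # scale < 1.00 -> no blur
]


def kernel_for_scale(scale: float):
    """Return the Gaussian-blur kernel size for a given scaling factor."""
    for upper, ksize in BLUR_KERNELS:
        if scale < upper:
            return ksize
    return None  # scale == 1.0: image is unscaled, no blur needed
```

The stronger blur for smaller scaling factors mimics the low-pass filtering a real camera applies when an object shrinks to only a few pixels, which helps the detector generalize to heavily downsampled targets.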
| Dataset Aspect | Values |
|---|---|
| Environments | Forest, Desert, Plains, Urban, Port, Sea |
| Targets | Persons, Ships, Civilian and Military Vehicles, Birds, Dogs |
| Cameras | Visual and Infrared |
| Conditions | Strong Winds, Calm Winds, Rain, Sunny, Overcast |
| Viewpoints | Ground, Tower, Drone (20 m) |
Model Variations

| Experiment | Description |
|---|---|
| Proposed | Model trained with our proposed, best-performing hyperparameters. |
| Manyframe-YOLO | The second model variant described in Section 3.1.1, with 11 input channels holding the luminance of 11 video frames. |
| No-BBox-Clip | Model trained without the bounding box augmentation that enforces a minimum box size of 15 × 15 pixels, as described in Section 3.3. |
| Color-T-YOLO | The first model variant described in Section 3.1.1, with 9 input channels holding the 3 color channels of 3 video frames. |
| Singleframe-YOLO | Model trained with the regular 3-channel input (one RGB frame), but with balanced mosaicking and bounding box augmentations. |
| Default-YOLO | Model trained on one RGB frame as input, without the data augmentation techniques presented in this work. |

Dataloader Variations

| Experiment | Description |
|---|---|
| Proposed | Model trained using the dataloader based on balanced mosaicking, as described in Section 3.2. |
| Crop-Mosaic | Model trained using regular mosaicking, with cropping. |
| No-Mosaic | Model trained using only full-resolution frames. |
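The input-tensor assembly for the temporal variants in the table above can be sketched as follows. This is a minimal illustration under stated assumptions: frames are RGB arrays of shape (H, W, 3), the luminance conversion uses the standard ITU-R BT.601 weights, and all function names are ours, not taken from the paper's code.

```python
import numpy as np


def to_luminance(frame: np.ndarray) -> np.ndarray:
    """ITU-R BT.601 luminance from an RGB frame: (H, W, 3) -> (H, W)."""
    return frame @ np.array([0.299, 0.587, 0.114], dtype=np.float32)


def stack_manyframe(frames: list) -> np.ndarray:
    """Manyframe-YOLO input: luminance of 11 frames -> (11, H, W)."""
    assert len(frames) == 11
    return np.stack([to_luminance(f) for f in frames], axis=0)


def stack_color_temporal(frames: list) -> np.ndarray:
    """Color-T-YOLO input: 3 RGB frames -> (9, H, W), channels-first."""
    assert len(frames) == 3
    return np.concatenate([f.transpose(2, 0, 1) for f in frames], axis=0)


def enforce_min_box(cx, cy, w, h, min_size=15.0):
    """Bounding box augmentation (Section 3.3): grow boxes smaller than
    min_size x min_size pixels while keeping the center fixed."""
    return cx, cy, max(w, min_size), max(h, min_size)
```

The key design point these sketches illustrate is that the temporal variants change only the number of input channels of the first convolutional layer; the rest of the YOLOv8 architecture is unchanged, so temporal context is fused at the earliest possible stage.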
| Experiment | mAP [std] |
|---|---|
| Proposed | 0.839 [0.001] |
| Manyframe-YOLO | 0.781 [0.001] |
| Color-T-YOLO | 0.743 [0.001] |
| No-BBox-Clip | 0.673 [0.003] |
| Singleframe-YOLO | 0.583 [0.001] |
| Default-YOLO | 0.465 [0.002] |
| Crop-Mosaic | 0.837 [0.002] |
| No-Mosaic | 0.770 [0.001] |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
van Leeuwen, M.C.; Fokkinga, E.P.; Huizinga, W.; Baan, J.; Heslinga, F.G. Toward Versatile Small Object Detection with Temporal-YOLOv8. Sensors 2024, 24, 7387. https://doi.org/10.3390/s24227387