SF-YOLOv5: A Lightweight Small Object Detection Algorithm Based on Improved Feature Fusion Mode
Abstract
1. Introduction
- In the default network structure of YOLOv5, the network layers originally used to generate the large-object detection feature map are reasonably clipped. While making the model lightweight, this effectively releases the computing resources the model requires and greatly improves its speed.
- The spatial pyramid pooling (SPP [12]) layer at the end of the backbone network is introduced into the feature fusion network and connected with multiple prediction heads, improving the feature expression ability of the final output feature maps and further enhancing the algorithm.
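As a rough illustration of why clipping the large-object branch saves computation, consider the detection grids at YOLOv5's default strides (a sketch we add, assuming the standard 640×640 input; the helper function is ours, not part of YOLOv5):

```python
def grid_sizes(img_size, strides):
    """Side length of the detection feature map produced at each stride."""
    return {s: img_size // s for s in strides}

# Default YOLOv5 detects at strides 8, 16 and 32 (small/medium/large objects).
print(grid_sizes(640, (8, 16, 32)))  # -> {8: 80, 16: 40, 32: 20}
```

The stride-32 branch only yields a coarse 20×20 map aimed at large objects, which small-object scenes rarely need; removing the layers that produce it is what frees parameters and inference time.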
2. Relevant Work
2.1. Some Classical Algorithms
2.2. Basic Idea of YOLOv5
2.3. Network Structure of YOLOv5
2.3.1. Backbone
2.3.2. Neck
3. Approach
- YOLOv5’s feature extraction network sets three sizes of feature map output for detection scenes of various scales. Obtaining these feature maps requires multiple convolutional downsampling operations, which consume considerable computing resources and parameters; however, in actual detection of small targets, are the downsampling operations and the feature fusion process for the high-level feature map necessary?
- The feature fusion network of YOLOv5 combines a top-down path (FPN) with a bottom-up path (PAN), which enhances detection performance on objects of various scales; however, for dense small-object scenes, can we design a new feature fusion path to improve the feature expression potential of the output feature map? Can we fuse features horizontally to further enhance the detection performance of the algorithm?
- YOLOv5 adds an SPPF module at the end of the backbone, which improves the performance of the feature extraction network through multiple convolution and pooling operations of different sizes. Can we use this idea to further tap the feature expression potential of the feature map at the end of the neck?
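The SPP/SPPF idea referenced above rests on a simple equivalence: applying one k = 5 "same" max pool serially three times reproduces the parallel k = 5, 9, 13 pools of the original SPP at lower cost, because max pooling composes into a wider window. A minimal 1-D stand-in (our illustration, not YOLOv5 code; the real modules pool 2-D feature maps and add 1×1 convolutions):

```python
def maxpool1d(x, k):
    """Stride-1 'same' max pooling; pads (k-1)//2 values of -inf on each side."""
    p = (k - 1) // 2
    padded = [float("-inf")] * p + list(x) + [float("-inf")] * p
    return [max(padded[i:i + k]) for i in range(len(x))]

def spp(x):
    # SPP: identity branch plus parallel pools with kernels 5, 9 and 13.
    return [list(x), maxpool1d(x, 5), maxpool1d(x, 9), maxpool1d(x, 13)]

def sppf(x):
    # SPPF: the same four branches from one k=5 pool applied serially.
    y1 = maxpool1d(x, 5)
    y2 = maxpool1d(y1, 5)
    y3 = maxpool1d(y2, 5)
    return [list(x), y1, y2, y3]

x = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9]
assert spp(x) == sppf(x)  # serial pooling matches the parallel pyramid
```

Because the serial version reuses one small pool, it runs faster than the parallel pyramid while producing identical branch outputs, which is why SPPF replaced SPP in recent YOLOv5 releases.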
3.1. Feature Map Clipping
3.2. Improvement for Feature Fusion Path (PB-FPN)
3.3. The Improved Feature Fusion Network (SPPF)
4. Experiments
4.1. Experimental Environment
4.2. Dataset Introduction
4.3. Evaluation Criterion
4.4. Analysis of Experimental Results
4.5. Ablation Experiment
4.6. Comparison with Other Classical Algorithms
4.7. Performance of Novel SF-YOLOv5 on Other Datasets
5. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
- Kowalski, M.; Grudzień, A.; Mierzejewski, K. Thermal–Visible Face Recognition Based on CNN Features and Triple Triplet Configuration for On-the-Move Identity Verification. Sensors 2022, 22, 5012.
- Zhang, Z.; Shen, W.; Qiao, S.; Wang, Y.; Wang, B.; Yuille, A. Robust face detection via learning small faces on hard images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Snowmass Village, CO, USA, 1–5 March 2020; pp. 1361–1370.
- Bai, T.; Gao, J.; Yang, J.; Yao, D. A study on railway surface defects detection based on machine vision. Entropy 2021, 23, 1437.
- Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 2778–2788.
- Zhang, P.; Zhong, Y.; Li, X. SlimYOLOv3: Narrower, faster and better for real-time UAV applications. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27–28 October 2019.
- Gu, Y.; Si, B. A novel lightweight real-time traffic sign detection integration framework based on YOLOv4. Entropy 2022, 24, 487.
- Liu, W.; Liao, S.; Ren, W.; Hu, W.; Yu, Y. High-level semantic feature detection: A new perspective for pedestrian detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 5187–5196.
- Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 3974–3983.
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL, USA, 20–25 June 2009; pp. 248–255.
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; pp. 740–755.
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916.
- Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99.
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2961–2969.
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
- Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271.
- Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
- Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
- Glenn, J. Yolov5-6.1—TensorRT, TensorFlow Edge TPU and OpenVINO Export and Inference. 2022. Available online: https://github.com/ultralytics/yolov5/releases/tag/v6.1 (accessed on 22 February 2022).
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37.
- Singh, B.; Davis, L.S. An analysis of scale invariance in object detection (SNIP). In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 3578–3587.
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
- Ghiasi, G.; Lin, T.Y.; Le, Q.V. NAS-FPN: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 7036–7045.
- Liu, S.; Huang, D.; Wang, Y. Learning spatial fusion for single-shot object detection. arXiv 2019, arXiv:1911.09516.
- Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768.
- Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790.
- Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696.
- Xie, S.; Girshick, R.; Dollar, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1492–1500.
- Zhou, T.; Zhao, Y.; Wu, J. ResNeXt and Res2Net structures for speaker verification. In Proceedings of the 2021 IEEE Spoken Language Technology Workshop (SLT), Shenzhen, China, 19–22 January 2021; pp. 301–307.
- Neubeck, A.; Van Gool, L. Efficient non-maximum suppression. In Proceedings of the 18th International Conference on Pattern Recognition (ICPR’06), Hong Kong, China, 20–24 August 2006; Volume 3, pp. 850–855.
- Wang, C.Y.; Liao, H.Y.M.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 390–391.
- Yang, S.; Luo, P.; Loy, C.C.; Tang, X. WIDER FACE: A face detection benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 5525–5533.
- Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 658–666.
- Yu, X.; Gong, Y.; Jiang, N.; Ye, Q.; Han, Z. Scale match for tiny person detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Snowmass Village, CO, USA, 1–5 March 2020; pp. 1257–1265.
- Zhu, P.; Wen, L.; Du, D.; Bian, X.; Fan, H.; Hu, Q.; Ling, H. Detection and Tracking Meet Drones Challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2021.
- Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. Available online: http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html (accessed on 30 July 2022).
- Tong, K.; Wu, Y.; Zhou, F. Recent advances in small object detection based on deep learning: A review. Image Vis. Comput. 2020, 97, 103910.
Model | Size (Pixels) | mAP @0.5:0.95 | mAP @0.5 | Time CPU b1 (ms) | Time V100 b1 (ms) | Time V100 b32 (ms) | Params (M) | FLOPs @640 (B) |
---|---|---|---|---|---|---|---|---|
YOLOv5n | 640 | 28.0 | 45.7 | 45 | 6.6 | 0.6 | 1.9 | 4.5 |
YOLOv5s | 640 | 37.4 | 56.8 | 98 | 6.4 | 0.9 | 7.2 | 16.5 |
YOLOv5m | 640 | 45.4 | 64.1 | 224 | 8.2 | 1.7 | 21.2 | 49.0 |
YOLOv5l | 640 | 49.0 | 67.3 | 430 | 10.1 | 2.7 | 46.5 | 109.1 |
YOLOv5x | 640 | 50.7 | 68.9 | 766 | 12.1 | 4.8 | 86.7 | 205.7 |
Layer | From | n | Params | Module | Arguments |
---|---|---|---|---|---|
0 | −1 | 1 | 3520 | CBS | [3, 32, 6, 2, 2] |
1 | −1 | 1 | 18,560 | CBS | [32, 64, 3, 2] |
2 | −1 | 1 | 18,816 | C3 | [64, 64, 1] |
3 | −1 | 1 | 73,984 | CBS | [64, 128, 3, 2] |
4 | −1 | 2 | 115,712 | C3 | [128, 128, 2] |
5 | −1 | 1 | 295,424 | CBS | [128, 256, 3, 2] |
6 | −1 | 3 | 625,152 | C3 | [256, 256, 3] |
7 | −1 | 1 | 1,180,672 | CBS | [256, 512, 3, 2] |
8 | −1 | 1 | 1,182,720 | C3 | [512, 512, 1] |
9 | −1 | 1 | 656,896 | SPPF | [512, 512, 5] |
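The Params column for the CBS (Conv + BatchNorm + SiLU) rows above can be checked by hand: a bias-free k × k convolution contributes c_in · c_out · k² weights, and BatchNorm adds 2 · c_out trainable parameters. A sketch under that assumption (the helper name is ours):

```python
def cbs_params(c_in, c_out, k):
    """Trainable parameters of a CBS block: conv weights + BN gamma/beta."""
    return c_in * c_out * k * k + 2 * c_out

print(cbs_params(3, 32, 6))     # layer 0 -> 3520
print(cbs_params(32, 64, 3))    # layer 1 -> 18560
print(cbs_params(128, 256, 3))  # layer 5 -> 295424
```

Each value matches the corresponding CBS row of the table, confirming how the per-layer counts arise.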
Methods | Size | mAP@0.5 | mAP@0.5:0.95 | Parameters (M) | FLOPs (G) | Inference Time (ms) |
---|---|---|---|---|---|---|
YOLOv5s | 640 | 69.7 | 35.5 | 7.01 | 15.8 | 13.1 |
YOLOv5-N5 | 640 | 69.9 | 35.3 | 1.88 | 11.2 | 10.4 |
YOLOv5-PB | 640 | 70.8 | 36.0 | 2.03 | 12.7 | 11.3 |
SF-YOLOv5 | 640 | 71.3 | 36.3 | 2.23 | 13.8 | 12.2 |
improvement | - | +1.6 | +0.8 | −68.2% | −12.7% | −6.9% |
YOLOv5n | 640 | 63.9 | 30.8 | 1.76 | 4.20 | 8.60 |
SF-YOLOv5 | 640 | 71.3 | 36.3 | 2.23 | 13.8 | 12.2 |
improvement | - | +7.4 | +5.5 | +26.7% | +228.6% | +41.9% |
YOLOv7-tiny | 640 | 68.4 | 33.0 | 6.01 | 13.0 | 13.9 |
SF-YOLOv5 | 640 | 71.3 | 36.3 | 2.23 | 13.8 | 12.2 |
improvement | - | +2.9 | +3.3 | −62.9% | +6.2% | −12.2% |
YOLOv3 | 640 | 74.5 | 39.5 | 61.5 | 154.7 | 44.7 |
SF-YOLOv5L | 640 | 75.1 | 39.7 | 15.5 | 91.2 | 49.4 |
improvement | - | +0.6 | +0.2 | −74.8% | −41.0% | +10.5% |
YOLOv7 | 640 | 76.1 | 39.5 | 36.5 | 103.2 | 18.0 |
SF-YOLOv5L | 640 | 75.1 | 39.7 | 15.5 | 91.2 | 49.4 |
improvement | - | −1.0 | +0.2 | −57.5% | −11.6% | +174.4% |
ResNeXt-CSP | 640 | 73.7 | 37.6 | 31.8 | 58.9 | 32.6 |
SF-YOLOv5L | 640 | 75.1 | 39.7 | 15.5 | 91.2 | 49.4 |
improvement | - | +1.4 | +2.1 | −51.3% | +54.8% | +51.5% |
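In the "improvement" rows of these comparisons, the mAP deltas are absolute percentage points, while the Parameters, FLOPs, and inference-time deltas are relative changes against the baseline. A minimal check (helper name ours):

```python
def rel_change(new, old):
    """Relative change in percent, rounded to one decimal place."""
    return round((new - old) / old * 100, 1)

print(rel_change(2.23, 7.01))  # SF-YOLOv5 vs. YOLOv5s parameters -> -68.2
print(rel_change(13.8, 15.8))  # SF-YOLOv5 vs. YOLOv5s FLOPs -> -12.7
print(rel_change(15.5, 36.5))  # SF-YOLOv5L vs. YOLOv7 parameters -> -57.5
```

These reproduce the −68.2%, −12.7%, and −57.5% entries in the tables above.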
Dataset | Methods | Size | mAP@0.5 | mAP@0.5:0.95 | Parameters (M) | FLOPs (G) | Inference Time (ms) |
---|---|---|---|---|---|---|---|
TinyPerson | YOLOv5s | 640 | 18.7 | 6.0 | 7.02 | 15.8 | 12.8 |
 | SF-YOLOv5 | 640 | 20.0 | 6.5 | 2.23 | 13.8 | 11.0 |
 | improvement | - | +1.3 | +0.5 | −68.2% | −12.7% | −14.1% |
VisDrone | YOLOv5s | 640 | 33.0 | 17.9 | 7.04 | 15.9 | 12.6 |
 | SF-YOLOv5 | 640 | 34.3 | 18.2 | 2.24 | 13.8 | 11.5 |
 | improvement | - | +1.3 | +0.3 | −68.2% | −13.2% | −8.7% |
VOC2012 | YOLOv5s | 640 | 60.8 | 37.0 | 7.06 | 15.9 | 25.8 |
 | SF-YOLOv5-P5 | 640 | 61.2 | 38.3 | 4.59 | 15.7 | 24.8 |
 | improvement | - | +0.4 | +1.3 | −34.9% | −1.3% | −3.9% |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Citation: Liu, H.; Sun, F.; Gu, J.; Deng, L. SF-YOLOv5: A Lightweight Small Object Detection Algorithm Based on Improved Feature Fusion Mode. Sensors 2022, 22, 5817. https://doi.org/10.3390/s22155817