A Multiscale Parallel Pedestrian Recognition Algorithm Based on YOLOv5
Abstract
1. Introduction
- An MSP module, built on residual networks and a multiscale parallel design, is fused into the neck region to strengthen the model’s feature extraction, computational efficiency, and accuracy.
- Swin Transformer modules are fused into the backbone and neck regions, further improving feature extraction, computational efficiency, and accuracy.
- The CBAM and C3 modules are combined into a CBAMC3 module that replaces C3 in the backbone to improve the model’s recognition accuracy.
- The WMD-IOU loss function is introduced to account for the loss contribution of shape dimensions and to avoid the equalizing effect of assigning similar weights.
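The WMD-IOU formulation itself is given in Section 2.2.4 of the paper and is not reproduced in this outline. As background, the sketch below shows the standard IoU and the Distance-IoU (DIoU) loss of Zheng et al. (cited in the references) that IoU-variant losses of this kind extend; boxes are assumed to be `(x1, y1, x2, y2)` corner tuples:

```python
def iou(a, b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def diou_loss(a, b):
    """Distance-IoU loss: 1 - IoU + d^2 / c^2, where d is the distance
    between the box centers and c the diagonal of the smallest box
    enclosing both (Zheng et al., AAAI 2020)."""
    cx_a, cy_a = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    cx_b, cy_b = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    d2 = (cx_a - cx_b) ** 2 + (cy_a - cy_b) ** 2
    ex1, ey1 = min(a[0], b[0]), min(a[1], b[1])
    ex2, ey2 = max(a[2], b[2]), max(a[3], b[3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    return 1.0 - iou(a, b) + d2 / c2
```

For perfectly overlapping boxes the loss is 0; as the predicted box drifts from the target, both the IoU term and the normalized center-distance term grow, which is the behavior shape-aware variants such as WMD-IOU build upon.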
2. Materials and Methods
2.1. Network Architecture
2.2. Method
2.2.1. MSP Module
2.2.2. Swin Transformer
2.2.3. CBAMC3
2.2.4. WMD-IOU Loss Function
3. Experiment
3.1. Experiment Environment
3.2. Dataset Details
3.3. Evaluation Metrics
3.4. Experimental Results
3.5. Low-Brightness Experiments
3.6. Ablation Experiments
3.7. Hyperparameter Tuning
3.8. Crossover Experiments
4. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; Volume 1, pp. 886–893.
- Dollár, P.; Tu, Z.; Perona, P.; Belongie, S. Integral channel features. In Proceedings of the British Machine Vision Conference, BMVC 2009, London, UK, 7–10 September 2009.
- Dollár, P.; Appel, R.; Belongie, S.; Perona, P. Fast feature pyramids for object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 1532–1545.
- LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 1–9.
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
- Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
- Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
- Ultralytics. YOLOv5. Available online: https://github.com/ultralytics/yolov5 (accessed on 18 October 2022).
- Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. Scaled-YOLOv4: Scaling cross stage partial network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13029–13038.
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Springer International Publishing: Berlin/Heidelberg, Germany, 2016; pp. 21–37.
- Fu, C.Y.; Liu, W.; Ranga, A.; Tyagi, A.; Berg, A.C. DSSD: Deconvolutional single shot detector. arXiv 2017, arXiv:1701.06659.
- Shen, Z.; Liu, Z.; Li, J.; Jiang, Y.G.; Chen, Y.; Xue, X. DSOD: Learning deeply supervised object detectors from scratch. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1919–1927.
- Jeong, J.; Park, H.; Kwak, N. Enhancement of SSD by concatenating feature maps for object detection. arXiv 2017, arXiv:1705.09587.
- Li, Z.; Zhou, F. FSSD: Feature fusion single shot multibox detector. arXiv 2017, arXiv:1712.00960.
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
- Xie, H.; Xiao, Z.; Liu, W.; Ye, Z. PVNet: A Used Vehicle Pedestrian Detection Tracking and Counting Method. Sustainability 2023, 15, 14326.
- Lan, W.; Dang, J.; Wang, Y.; Wang, S. Pedestrian detection based on YOLO network model. In Proceedings of the 2018 IEEE International Conference on Mechatronics and Automation (ICMA), Changchun, China, 5–8 August 2018; pp. 1547–1551.
- Yang, X.; Wang, Y.; Laganiere, R. A scale-aware YOLO model for pedestrian detection. In Advances in Visual Computing: 15th International Symposium, ISVC 2020, San Diego, CA, USA, 5–7 October 2020; Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 15–26.
- Krišto, M.; Ivasic-Kos, M.; Pobar, M. Thermal object detection in difficult weather conditions using YOLO. IEEE Access 2020, 8, 125459–125476.
- Xue, Y.; Ju, Z.; Li, Y.; Zhang, W. MAF-YOLO: Multi-modal attention fusion based YOLO for pedestrian detection. Infrared Phys. Technol. 2021, 118, 103906.
- Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, 16–20 November 2020; pp. 38–45.
- Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159.
- Lin, M.; Li, C.; Bu, X.; Sun, M.; Lin, C.; Yan, J.; Ouyang, W.; Deng, Z. DETR for crowd pedestrian detection. arXiv 2020, arXiv:2012.06785.
- Pu, Y.; Liang, W.; Hao, Y.; Yuan, Y.; Yang, Y.; Zhang, C.; Hu, H.; Huang, G. Rank-DETR for high quality object detection. Adv. Neural Inf. Process. Syst. 2024, 36, 1–14.
- Srinivasan, A.; Srikanth, A.; Indrajit, H.; Narasimhan, V. A novel approach for road accident detection using DETR algorithm. In Proceedings of the 2020 International Conference on Intelligent Data Science Technologies and Applications (IDSTA), Valencia, Spain, 19–22 October 2020; pp. 75–80.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
- Wang, C.Y.; Liao, H.Y.M.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; pp. 1571–1580.
- Ramachandran, P.; Zoph, B.; Le, Q.V. Searching for activation functions. arXiv 2017, arXiv:1710.05941.
- Glorot, X.; Bordes, A.; Bengio, Y. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, Fort Lauderdale, FL, USA, 11–13 April 2011; pp. 315–323.
- Hu, H.; Gu, J.; Zhang, Z.; Dai, J.; Wei, Y. Relation networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3588–3597.
- Hu, H.; Zhang, Z.; Xie, Z.; Lin, S. Local relation networks for image recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3464–3473.
- Bao, H.; Dong, L.; Wei, F.; Wang, W.; Yang, N.; Liu, X.; Wang, Y.; Gao, J.; Piao, S.; Zhou, M.; et al. UniLMv2: Pseudo-masked language models for unified language model pre-training. In Proceedings of the 37th International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 642–652.
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 5485–5551.
- Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12993–13000.
- Zhang, Y.F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and efficient IOU loss for accurate bounding box regression. Neurocomputing 2022, 506, 146–157.
- Dalal, N.; Triggs, B. INRIA Person Dataset. 2020. Available online: https://paperswithcode.com/dataset/inria-person (accessed on 20 April 2024).
- Foszner, P.; Szczęsna, A.; Ciampi, L.; Messina, N.; Cygan, A.; Bizoń, B.; Cogiel, M.; Golba, D.; Macioszek, E.; Staniszewski, M. CrowdSim2: An open synthetic benchmark for object detectors. arXiv 2023, arXiv:2304.05090.
- KAIST Multispectral Pedestrian Detection Benchmark. 2015. Available online: https://paperswithcode.com/dataset/kaist-multispectral-pedestrian-detection (accessed on 20 April 2024).
Network | Precision | Recall | mAP@IoU = 0.5 | F1 Score | Inference Speed per Image (ms) | FPS |
---|---|---|---|---|---|---|
YOLOv5s | 0.875 | 0.930 | 0.921 | 0.90 | 28.1 | 35 |
YOLOv7-tiny | 0.918 | 0.875 | 0.937 | 0.89 | 27.8 | 36 |
Detr | 0.921 | 0.883 | 0.939 | 0.90 | 26.9 | 37 |
YOLOv8s | 0.925 | 0.909 | 0.952 | 0.92 | 26.5 | 37 |
YOLO-MSP | 0.936 | 0.889 | 0.960 | 0.91 | 25.3 | 39 |
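The F1 Score column follows the usual definition F1 = 2·P·R / (P + R); for example, checking two rows of the table above in Python:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision/recall values taken from the comparison table above.
print(round(f1(0.875, 0.930), 2))  # YOLOv5s  -> 0.90
print(round(f1(0.936, 0.889), 2))  # YOLO-MSP -> 0.91
```

Note that YOLO-MSP trades a small drop in recall for higher precision, which is why its F1 (0.91) sits just below YOLOv8s (0.92) even though its mAP is the highest in the table.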
Scene | YOLOv5s mAP | YOLOv5s Time (ms) | YOLOv7-tiny mAP | YOLOv7-tiny Time (ms) | Detr mAP | Detr Time (ms) | YOLOv8s mAP | YOLOv8s Time (ms) | YOLO-MSP mAP | YOLO-MSP Time (ms) |
---|---|---|---|---|---|---|---|---|---|---|
Figure 1 | 0.902 | 31.3 | 0.915 | 28.7 | 0.913 | 27.2 | 0.937 | 27.0 | 0.909 | 26.3 |
Figure 2 | 0.911 | 30.8 | 0.929 | 31.2 | 0.928 | 26.3 | 0.955 | 25.8 | 0.974 | 25.1 |
Figure 3 | 0.919 | 31.6 | 0.924 | 30.5 | 0.923 | 26.9 | 0.949 | 26.4 | 0.963 | 25.3 |
Figure 4 | 0.807 | 32.5 | 0.811 | 31.8 | 0.820 | 28.7 | 0.820 | 27.8 | 0.730 | 27.0 |
Model | Swin Transformer | MSP | CBAMC3 | WMD-IOU | mAP | F1 |
---|---|---|---|---|---|---|
Model0 | | | | | 0.919 | 0.88 |
Model1 | ✓ | | | | 0.928 | 0.89 |
Model2 | ✓ | ✓ | | | 0.935 | 0.89 |
Model3 | ✓ | ✓ | ✓ | | 0.947 | 0.90 |
YOLO-MSP | ✓ | ✓ | ✓ | ✓ | 0.960 | 0.91 |
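The ablation rows can be read as cumulative gains from adding one module at a time; a small sketch that computes each per-step mAP improvement from the table values:

```python
# mAP values copied from the ablation table, in the order modules are added.
ablation = [
    ("Model0 (baseline)", 0.919),
    ("Model1 (+Swin Transformer)", 0.928),
    ("Model2 (+MSP)", 0.935),
    ("Model3 (+CBAMC3)", 0.947),
    ("YOLO-MSP (+WMD-IOU)", 0.960),
]

# Print the mAP gain contributed by each successive module.
for (_, prev), (name, cur) in zip(ablation, ablation[1:]):
    print(f"{name}: +{cur - prev:.3f} mAP")
```

Every module contributes a positive gain, with CBAMC3 and WMD-IOU accounting for the two largest steps (+0.012 and +0.013 mAP).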
Methods | Image Size | Batch Size | Optimizer | mAP |
---|---|---|---|---|
YOLO-MSP | 1280 | 16 | SGD | 0.960 |
YOLO-MSP | 1280 | 8 | SGD | 0.943 |
YOLO-MSP | 1280 | 32 | SGD | 0.939 |
YOLO-MSP | 1024 | 16 | SGD | 0.921 |
YOLO-MSP | 640 | 16 | SGD | 0.926 |
Model | Precision | Recall | mAP@IoU = 0.5 | F1 Score | Inference Speed per Image (ms) | FPS |
---|---|---|---|---|---|---|
Model1 | 0.931 | 0.871 | 0.958 | 0.90 | 25.3 | 39 |
Model2 | 0.939 | 0.882 | 0.959 | 0.91 | 25.3 | 39 |
Model3 | 0.936 | 0.867 | 0.959 | 0.90 | 25.2 | 39 |
Model4 | 0.931 | 0.871 | 0.947 | 0.90 | 25.1 | 39 |
Model5 | 0.936 | 0.889 | 0.960 | 0.91 | 25.3 | 39 |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Citation
Song, Q.; Zhou, Z.; Ji, S.; Cui, T.; Yao, B.; Liu, Z. A Multiscale Parallel Pedestrian Recognition Algorithm Based on YOLOv5. Electronics 2024, 13, 1989. https://doi.org/10.3390/electronics13101989