CCBA-NMS-YD: A Vehicle Pedestrian Detection and Tracking Method Based on Improved YOLOv7 and DeepSort
Abstract
1. Introduction
- Traditional target detection algorithms, which predate the rise of deep learning, relied heavily on hand-designed feature extraction and classical machine learning classifiers. The core of these methods lies in extracting features that help recognize targets, such as edges, textures, colors, and shapes. Sitaula et al. [2] proposed a novel multi-scale bag of deep visual words (MBoDVW) feature extraction method, which extracts and fuses multi-scale deep features by applying convolution kernels of different sizes (1 × 1, 2 × 2, and 3 × 3) to the feature maps output by the fourth pooling layer of VGG-16. L2 normalization and a k-means clustering algorithm are then used to generate more distinctive feature representations, which in turn feed a Support Vector Machine (SVM) classifier for accurate image classification. This method not only enhances the model's ability to generalize across CXR images of different resolutions but also significantly improves its diagnostic accuracy for COVID-19 and similar diseases (such as pneumonia). Bilal et al. [3] proposed an improved SVM training method using a multi-round bootstrapping process that selects the most relevant negative samples for training instead of a large number of random samples, thus improving the discriminative ability of the model. They also proposed a nonlinear quantization scheme that speeds up kernel SVM evaluation and further reduces the false detection rate without increasing the computational cost. The improved model significantly raises the accuracy and efficiency of the pedestrian detector even under real-time processing constraints and the limited capabilities of embedded systems. (A minimal sketch of this classical feature-plus-SVM pipeline appears after the next paragraph.)
- Deep-learning-based target detection algorithms, such as YOLO, Faster R-CNN, and SSD, employ convolutional neural networks for end-to-end feature extraction and detection, in which the features of interest are learned automatically through training and differentiable updates of the network layers. This makes deep learning detectors more robust to factors such as the shape, size, and pose of the target and lets them exploit more general and flexible features to improve detection performance. Zhai et al. [4] designed a feature extraction network called DenseNet-S-32-1, which borrows the dense connectivity of DenseNet to replace the original VGG-16 backbone of the SSD. In addition, unlike most approaches that use pre-trained models, the DF-SSD model is trained from scratch, which avoids constraining the design of the network structure. With these innovations, DF-SSD improves detection accuracy while reducing the number of model parameters. Masita et al. [5] proposed a deep-learning-based pedestrian detection method using R-CNN as the detector combined with AlexNet as the feature extraction model. The novelty of their method is the use of transfer learning to fine-tune the AlexNet model on a specific pedestrian detection dataset, together with the Edge Boxes algorithm to extract key regions of the image, improving detection accuracy.
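To make the classical pipeline of [2] concrete, the following is a minimal, self-contained sketch of a bag-of-deep-visual-words classifier: local descriptors are quantized against a k-means codebook, the histogram is L2-normalized, and an SVM classifies the image. The deep feature extraction step is abstracted away; the random arrays stand in for descriptors taken from a VGG-16 feature map, and all names are illustrative.

```python
# Minimal sketch of a bag-of-deep-visual-words pipeline (in the spirit of [2]).
# Random arrays stand in for multi-scale deep descriptors from VGG-16.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize
from sklearn.svm import SVC

def bow_histogram(descriptors, codebook):
    """Quantize local descriptors against the codebook and return an
    L2-normalized visual-word histogram."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return normalize(hist.reshape(1, -1))[0]  # L2 normalization, as in [2]

# Toy data: 100 "images", each with 64 local descriptors of dimension 128.
rng = np.random.default_rng(0)
descs = [rng.normal(size=(64, 128)) for _ in range(100)]
labels = rng.integers(0, 2, size=100)

codebook = KMeans(n_clusters=32, n_init=10, random_state=0).fit(np.vstack(descs))
X = np.stack([bow_histogram(d, codebook) for d in descs])
clf = SVC(kernel="rbf").fit(X, labels)  # SVM classifier on histogram features
print(clf.score(X, labels))
```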
- (1) A modified Soft-NMS algorithm, the CCBA attention mechanism, and a multi-scale feature network are used to optimize the YOLOv7 algorithm, enhancing the feature extraction and perception capability of the network and addressing the missed and false detections of vehicles and pedestrians that appear as small targets or within high-density environments.
- (2) The MobileNetV3 lightweight module is introduced into the feature extraction network of DeepSort, further improving the performance of the algorithm while reducing the number of model parameters and the network complexity.
- (3) Based on the optimized YOLOv7 algorithm combined with DeepSort, accurate detection and tracking of vehicles and pedestrians on the road are achieved, providing technical support for the realization of autonomous driving.
2. Materials and Methods
2.1. Overview of YOLOv7
2.2. DeepSort
```
Algorithm 1: DeepSort
Input:  frame sequence F = {F_1, F_2, ..., F_n} with detected bounding boxes B = {B_1, B_2, ..., B_m}
Output: tracked objects T = {T_1, T_2, ..., T_o} with IDs

 1. for each frame F_i ∈ F do
 2.     detect objects and obtain bounding boxes B_i
 3.     extract a feature x_ij for each box B_ij using a CNN
 4.     predict the next state S' of each existing track T_k using a Kalman filter
 5. end for
 6. for each detected bounding box B_ij in B_i do
 7.     compute the cosine distance d_ij between feature x_ij and each track feature X_k
 8.     if d_ij ≤ θ_max then
 9.         associate B_ij with the track T_k having the minimum d_ij
10.     end if
11. end for
12. for each unassociated B_ij do
13.     create a new track T_new from bounding box B_ij and feature x_ij
14. end for
15. apply non-maximum suppression to remove overlapping tracks
16. assign new IDs to unmatched tracks
17. return the list of tracks T with their corresponding IDs
```
- Detection: As in SORT, an object detection algorithm (such as YOLOv7) first detects the target objects in each video frame and obtains the position information and feature vectors of every target.
- Feature extraction: Use deep learning models to extract feature vectors for each target and store them in a feature vector library.
- Matching: Cosine similarity is used to measure the similarity between each newly detected bounding box and every previously tracked target, and a cascade matching algorithm performs the association automatically to increase robustness.
- Filtering and updating: Detections matched to a track update that track's state via the Kalman filter, while the detection-to-track assignment itself is solved with the Hungarian algorithm. Unmatched detections are treated as new targets and added to the trajectory list. (A minimal sketch of the association step follows this list.)
- Clearing: Regularly clear invalid tracks and expired target objects.
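As a concrete illustration of the matching step above, here is a minimal sketch of appearance-based association: cosine distances between detection and track features form a cost matrix, which the Hungarian algorithm (via scipy) solves, with matches above the distance threshold rejected. The full DeepSort cascade, Kalman gating, and Mahalanobis term are omitted, and all names are illustrative.

```python
# Minimal sketch of DeepSort-style appearance association: cosine distance
# plus Hungarian assignment. Kalman gating and cascade matching are omitted.
import numpy as np
from scipy.optimize import linear_sum_assignment

def cosine_cost(track_feats, det_feats):
    """Pairwise cosine distance; rows are tracks, columns are detections."""
    t = track_feats / np.linalg.norm(track_feats, axis=1, keepdims=True)
    d = det_feats / np.linalg.norm(det_feats, axis=1, keepdims=True)
    return 1.0 - t @ d.T

def associate(track_feats, det_feats, max_dist=0.2):
    cost = cosine_cost(track_feats, det_feats)
    rows, cols = linear_sum_assignment(cost)            # Hungarian algorithm
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]
    unmatched = set(range(det_feats.shape[0])) - {c for _, c in matches}
    return matches, sorted(unmatched)                   # unmatched boxes seed new tracks

# Toy example: 3 existing tracks, 4 new detections, 128-D appearance features.
rng = np.random.default_rng(1)
tracks, dets = rng.normal(size=(3, 128)), rng.normal(size=(4, 128))
print(associate(tracks, dets))
```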
3. Our Algorithm
- (1) Road video image input: Road video images are first acquired as raw data for subsequent processing. This can be achieved by installing high-resolution cameras at strategic locations along the road or at intersections. The video stream is then transferred to a processing unit, where it is converted into a series of individual frames for further analysis.
- (2) Vehicle and pedestrian object detection: In this phase, pre-processed frames from the road video are fed into the improved YOLOv7 model. The model combines the optimized Soft-NMS, the CCBA attention mechanism, and multi-scale feature fusion, and is designed to accurately detect and identify vehicles and pedestrians in the scene even under challenging conditions such as dim lighting or high-density traffic. The output of this phase consists of bounding boxes around the detected objects together with their category labels (car, bus, truck, motorcycle, pedestrian, and bicycle) and confidence scores.
- (3) Vehicle and pedestrian target tracking: The detection results are then passed to the lightweight DeepSort model, which combines the spatial information provided by the bounding boxes with temporal information across frames to track the movement of vehicles and pedestrians. The model assigns a unique identifier to each detected object and maintains its trajectory over time, providing a comprehensive view of the motion patterns. (A runnable glue-code sketch of this three-stage pipeline follows the list.)
3.1. Improvement of YOLOv7 Algorithm
3.1.1. Modification of NMS Algorithm
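Since the paper's specific modification is described in this section, a reference implementation of the baseline it starts from is useful: the Gaussian Soft-NMS of Bodla et al. [15] decays overlapping scores instead of discarding boxes, which is why it helps in crowded vehicle-pedestrian scenes. The sketch below is that standard baseline, not the paper's modified variant.

```python
# Baseline Gaussian Soft-NMS (Bodla et al. [15]); the paper's modified
# variant is not reproduced here. Scores of overlapping boxes are decayed
# by exp(-IoU^2 / sigma) rather than set to zero.
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def soft_nms(boxes, scores, sigma=0.5, score_thr=0.001):
    boxes, scores = np.asarray(boxes, float), np.asarray(scores, float).copy()
    keep, idx = [], list(range(len(scores)))
    while idx:
        best = max(idx, key=lambda i: scores[i])   # highest remaining score
        keep.append(best)
        idx.remove(best)
        if idx:
            overlaps = iou_one_to_many(boxes[best], boxes[idx])
            scores[idx] *= np.exp(-(overlaps ** 2) / sigma)  # Gaussian decay
            idx = [i for i in idx if scores[i] > score_thr]  # prune tiny scores
    return keep

# Two heavily overlapping boxes: hard NMS would drop the second outright;
# Soft-NMS merely lowers its score, so a true neighbor can survive.
print(soft_nms([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], [0.9, 0.8, 0.7]))
```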
3.1.2. Improving Attention Mechanisms
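The CCBA module itself is defined in this section of the paper; since its internals are not reproduced in this excerpt, the following is a hedged PyTorch sketch of the standard CBAM channel-plus-spatial attention (cf. [18]) that such combined modules typically build on. All shapes and names are illustrative.

```python
# Standard CBAM-style channel + spatial attention (cf. [18]); a reference
# sketch only, not the paper's CCBA module.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, c, r=16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(c, c // r), nn.ReLU(), nn.Linear(c // r, c))
    def forward(self, x):
        avg = self.mlp(x.mean(dim=(2, 3)))   # global average pooling branch
        mx = self.mlp(x.amax(dim=(2, 3)))    # global max pooling branch
        return torch.sigmoid(avg + mx)[:, :, None, None] * x

class SpatialAttention(nn.Module):
    def __init__(self, k=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, k, padding=k // 2)
    def forward(self, x):
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return torch.sigmoid(self.conv(s)) * x

cbam = nn.Sequential(ChannelAttention(64), SpatialAttention())
print(cbam(torch.rand(2, 64, 32, 32)).shape)  # torch.Size([2, 64, 32, 32])
```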
3.1.3. Introduction of Multi-Scale Feature Network
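The exact multi-scale feature network introduced here is not reproduced in this excerpt; as a reference point, the sketch below shows the common FPN-style top-down fusion (cf. [19,20]) on which such designs are usually based: a 1 × 1 lateral convolution aligns channels, the coarse map is upsampled, and the two are added so semantically strong features reinforce the finer scale where small targets live. Names and channel sizes are illustrative.

```python
# FPN-style top-down multi-scale fusion (cf. [19,20]); a reference sketch,
# not the paper's exact multi-scale network.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFusion(nn.Module):
    """Fuse a coarse feature map into a finer one: 1x1 lateral convs, upsample, add."""
    def __init__(self, c_coarse, c_fine, c_out=256):
        super().__init__()
        self.lat_coarse = nn.Conv2d(c_coarse, c_out, 1)
        self.lat_fine = nn.Conv2d(c_fine, c_out, 1)
    def forward(self, coarse, fine):
        up = F.interpolate(self.lat_coarse(coarse), size=fine.shape[-2:], mode="nearest")
        return self.lat_fine(fine) + up

p5 = torch.rand(1, 512, 10, 10)   # coarse, semantically strong
p4 = torch.rand(1, 256, 20, 20)   # finer, better for small targets
print(TopDownFusion(512, 256)(p5, p4).shape)  # torch.Size([1, 256, 20, 20])
```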
3.2. DeepSort Algorithm Improvement
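Section 3.2 replaces DeepSort's appearance network with a MobileNetV3 module [6,22]. The paper's exact integration is not shown in this excerpt; as a hedged sketch of the idea, the snippet below uses torchvision's MobileNetV3-Small, with the ImageNet head removed, as a lightweight embedding extractor for pedestrian crops.

```python
# Hedged sketch: MobileNetV3-Small as a lightweight appearance-feature
# extractor for DeepSort-style re-identification (cf. [6,22]); not the
# paper's exact network.
import torch
from torchvision.models import mobilenet_v3_small, MobileNet_V3_Small_Weights

backbone = mobilenet_v3_small(weights=MobileNet_V3_Small_Weights.DEFAULT)
backbone.classifier = torch.nn.Identity()     # drop the ImageNet head, keep embeddings
backbone.eval()

crop = torch.rand(1, 3, 128, 64)              # a pedestrian crop, resized to 128x64
with torch.no_grad():
    feat = backbone(crop)                     # 576-D embedding
feat = feat / feat.norm(dim=1, keepdim=True)  # L2-normalize for cosine matching
print(feat.shape)                             # torch.Size([1, 576])
```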
4. Experiment
4.1. Experimental Environment
4.2. Dataset Analysis
4.3. Evaluation Criterion
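The tables below report mAP@0.5 for detection and MOTA/MOTP for tracking. The text of Section 4.3 is not reproduced in this excerpt; assuming the standard definitions (PASCAL-style average precision, cf. [24], and the CLEAR-MOT tracking metrics), they are:

```latex
% Standard metric definitions assumed here.
\mathrm{mAP} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{AP}_i,
\qquad \mathrm{AP}_i = \int_0^1 p_i(r)\,\mathrm{d}r

\mathrm{MOTA} = 1 - \frac{\sum_t \left(\mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t\right)}{\sum_t \mathrm{GT}_t},
\qquad
\mathrm{MOTP} = \frac{\sum_{t,i} d_{t,i}}{\sum_t c_t}
```

Here AP_i is the per-class area under the precision-recall curve; FN_t, FP_t, and IDSW_t are the missed targets, false alarms, and identity switches in frame t; GT_t is the number of ground-truth objects; d_{t,i} is the bounding-box overlap of matched pair i; and c_t is the number of matches in frame t.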
4.4. Ablation Experiment
4.5. Comparative Experiments and Visualization
4.5.1. Target Detection Algorithm Comparison Experiment
4.5.2. Comparative Experiment of Multi-Target Tracking Algorithms
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Li, F.; Zhang, R.; You, F. Fast pedestrian detection and dynamic tracking for intelligent vehicles within V2V cooperative environment. IET Image Process. 2017, 11, 833–840. [Google Scholar] [CrossRef]
- Sitaula, C.; Shahi, T.B.; Aryal, S.; Marzbanrad, F. Fusion of multi-scale bag of deep visual words features of chest X-ray images to detect COVID-19 infection. Sci. Rep. 2021, 11, 23914. [Google Scholar] [CrossRef] [PubMed]
- Bilal, M.; Hanif, M.S. Benchmark revision for HOG-SVM pedestrian detector through reinvigorated training and evaluation methodologies. IEEE Trans. Intell. Transp. Syst. 2019, 21, 1277–1287. [Google Scholar] [CrossRef]
- Zhai, S.; Shang, D.; Wang, S.; Dong, S. DF-SSD: An improved SSD object detection algorithm based on DenseNet and feature fusion. IEEE Access 2020, 8, 24344–24357. [Google Scholar] [CrossRef]
- Masita, K.L.; Hasan, A.N.; Paul, S. Pedestrian detection using R-CNN object detector. In Proceedings of the 2018 IEEE Latin American Conference on Computational Intelligence (LA-CCI), Guadalajara, Mexico, 7–9 November 2018; pp. 1–6. [Google Scholar]
- Koonce, B. MobileNetV3. In Convolutional Neural Networks with Swift for TensorFlow: Image Recognition and Dataset Categorization; Apress: Berkeley, CA, USA, 2021; pp. 125–144. [Google Scholar]
- Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
- Jocher, G.; Chaurasia, A.; Stoken, A.; Borovec, J.; Kwon, Y.; Michael, K.; Fang, J.; Wong, C.; Zeng, Y.; Wang, Z.; et al. ultralytics/yolov5: v6.2 (YOLOv5 Classification Models, Apple M1, Reproducibility, ClearML and Deci.ai Integrations); Zenodo: Geneva, Switzerland, 2022. [Google Scholar]
- Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 7464–7475. [Google Scholar]
- Mahasin, M.; Dewi, I.A. Comparison of CSPDarkNet53, CSPResNeXt-50, and EfficientNet-B0 backbones on YOLO v4 as object detector. Int. J. Eng. Sci. Inf. Technol. 2022, 2, 64–72. [Google Scholar] [CrossRef]
- Wang, C.-Y.; Liao, H.-Y.M.; Wu, Y.-H.; Chen, P.-Y.; Hsieh, J.-W.; Yeh, I.-H. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 390–391. [Google Scholar]
- Zhao, Y.; Shan, Y.; Yuan, J. Wearing Mask Pedestrian Tracking Based on Improved YOLOv7 and DeepSORT. Comput. Eng. Appl. 2023, 59, 221–230. [Google Scholar]
- Jin, L.; Hua, Q.; Guo, B.; Xie, X.; Yan, F.-G.; Wu, B.-T. Multi-target tracking of vehicles based on optimized DeepSort. J. Zhejiang Univ. 2021, 55, 1056–1064. [Google Scholar]
- Zhang, T.; Li, R.; Lang, S.; Zeng, W. Underwater target acoustic image tracking method based on DeepSORT. J. Huazhong Univ. Sci. Technol. 2023, 51, 44–50. [Google Scholar] [CrossRef]
- Bodla, N.; Singh, B.; Chellappa, R.; Davis, L.S. Soft-NMS: Improving object detection with one line of code. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE Press: New York, NY, USA, 2017; pp. 5562–5570. [Google Scholar]
- He, Y.; Zhang, X.; Savvides, M.; Kitani, K.M. Softer-NMS: Rethinking bounding box regression for accurate object detection. arXiv 2018, arXiv:1809.08545. [Google Scholar]
- Niu, Z.; Zhong, G.; Yu, H. A review on the attention mechanism of deep learning. Neurocomputing 2021, 452, 48–62. [Google Scholar] [CrossRef]
- Chen, B.; Dang, Z. Fast PCB defect detection method based on FasterNet backbone network and CBAM attention mechanism integrated with feature fusion module in improved YOLOv7. IEEE Access 2023, 11, 95092–95103. [Google Scholar] [CrossRef]
- Guo, K.; Li, X.; Zhang, M.; Bao, Q.; Yang, M. Real-time vehicle object detection method based on multi-scale feature fusion. IEEE Access 2021, 9, 115126–115134. [Google Scholar] [CrossRef]
- Ma, W.; Wu, Y.; Cen, F.; Wang, G. MDFN: Multi-scale deep feature learning network for object detection. Pattern Recognit. 2020, 100, 107149. [Google Scholar] [CrossRef]
- Multi-scale multi-patch person re-identification with exclusivity regularized softmax. Neurocomputing 2020, 382, 64–70. [CrossRef]
- Qian, S.; Ning, C.; Hu, Y. MobileNetV3 for image classification. In Proceedings of the 2021 IEEE 2nd International Conference on Big Data, Artificial Intelligence and Internet of Things Engineering (ICBAIE), Nanchang, China, 26–28 March 2021; pp. 490–497. [Google Scholar]
- Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The KITTI dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef]
- Padilla, R.; Netto, S.L.; da Silva, E.A.B. A survey on performance metrics for object-detection algorithms. In Proceedings of the 2020 International Conference on Systems, Signals and Image Processing (IWSSIP), Graz, Austria, 9–11 July 2020; pp. 237–242. [Google Scholar]
- Liu, K.; Sun, Q.; Sun, D.; Peng, L.; Yang, M.; Wang, N. Underwater target detection based on improved YOLOv7. J. Mar. Sci. Eng. 2023, 11, 677. [Google Scholar] [CrossRef]
Environment | Detailed Information
---|---
CPU | Intel® Core™ i7-8700 @ 3.20 GHz × 12
OS | Ubuntu 18.04.6
Memory | 32 GB
Integrated environment | Anaconda (Python 3.9.12)
GPU | NVIDIA GeForce RTX 2080
Base | Improved NMS | CCBA | Multi-Scale | mAP@0.5
---|---|---|---|---
√ | × | × | × | 83.29
√ | √ | × | × | 83.95
√ | × | √ | × | 84.34
√ | × | × | √ | 83.51
√ | √ | √ | × | 85.75
√ | √ | × | √ | 85.21
√ | × | √ | √ | 86.36
√ | √ | √ | √ | 87.06
Methods | All | Car | Bus | Truck | Motorcycle | Person | Bicycle
---|---|---|---|---|---|---|---
YOLOv5s | 79.31 | 89.48 | 78.01 | 83.16 | 89.08 | 60.7 | 75.47
Faster R-CNN | 78.6 | 88.23 | 76.62 | 81.26 | 88.33 | 59.65 | 77.54
SSD | 80.18 | 87.16 | 80.72 | 85.7 | 87.28 | 61.13 | 79.1
YOLOv7 | 83.29 | 92.1 | 81.67 | 86.95 | 90.16 | 69.32 | 79.58
YOLOv7-AC | 86.11 | 93.62 | 85.42 | 85.7 | 91.63 | 77.5 | 82.81
W-YOLOv7 | 85.19 | 93.72 | 83.5 | 85.87 | 91.5 | 74.84 | 81.33
Ours | 87.06 | 93.55 | 85.38 | 87.43 | 91.94 | 79.8 | 84.26
Model | MOTA (%) | MOTP (%) | Model Size (MB)
---|---|---|---
SORT | 59.8 | 79.4 | -
StrongSORT | 65.7 | 80.5 | -
DeepSORT | 65.9 | 79.3 | 43.8
Ours | 67.5 | 81.2 | 5.6