Improved Traffic Small Object Detection via Cross-Layer Feature Fusion and Channel Attention
Abstract
1. Introduction
- We introduce a channel attention module into the feature extraction process to strengthen the model’s focus on important information and reduce feature redundancy, ensuring that more attention is allocated to the features that matter most for detection (a minimal sketch of the SE block follows this list);
- We design a novel CAPAN neck network that effectively enriches the feature information of small objects. The network exchanges and fuses spatial and semantic information across different levels, yielding stronger feature representations for accurate detection (see the cross-layer fusion sketch below);
- To optimize model training, we replace the original three coupled detection heads with two decoupled ones. Decoupling allows the classification and localization tasks to be regressed independently, leading to more effective training with fewer model parameters (see the decoupled head sketch below);
- Extensive experiments on the TT100K [11] and BDD100K [12] datasets validate the superiority of the proposed model. CFA-YOLO noticeably outperforms the latest lightweight methods, demonstrating its strong performance on small object detection in traffic scenes.
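The ablation study in Section 4.5 identifies the attention module as an SE (Squeeze-and-Excitation) block. As a rough illustration, a minimal PyTorch version of a standard SE block is shown below; the reduction ratio and its exact placement inside the backbone are assumptions for illustration, not values taken from the paper.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Standard Squeeze-and-Excitation channel attention.

    The reduction ratio r = 16 is a common default, assumed here for
    illustration; the paper may use a different value.
    """
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: global average over H x W
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),  # excitation: per-channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # reweight channels, suppressing redundant ones
```

For example, `SEBlock(256)` maps a `(1, 256, 40, 40)` feature map to a reweighted map of the same shape.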
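The exact CAPAN wiring is specified in Section 3.3 and is not reproduced here. Under that caveat, the sketch below only illustrates the general cross-layer idea in a PAN-style neck: a deep, semantically rich map is upsampled and merged with a shallow, spatially detailed map. The channel widths and the concatenate-then-convolve fusion operator are placeholder assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossLayerFuse(nn.Module):
    """One generic top-down fusion step (FPN/PAN style), for illustration only.

    A deep feature map (strong semantics, coarse resolution) is upsampled to
    the shallow map's resolution (fine spatial detail) and the two are fused.
    """
    def __init__(self, deep_ch: int, shallow_ch: int, out_ch: int):
        super().__init__()
        self.reduce = nn.Conv2d(deep_ch, out_ch, kernel_size=1)  # align channels
        self.fuse = nn.Conv2d(out_ch + shallow_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, deep: torch.Tensor, shallow: torch.Tensor) -> torch.Tensor:
        # upsample deep semantics to the shallow map's resolution, then fuse
        up = F.interpolate(self.reduce(deep), size=shallow.shape[-2:], mode="nearest")
        return self.fuse(torch.cat([up, shallow], dim=1))  # spatial + semantic
```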
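Similarly, the following is a generic sketch of a decoupled detection head, in which classification and box regression run through separate branches rather than one shared convolution. Branch depth, activation choice, and anchor count are illustrative assumptions, not the paper’s configuration.

```python
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    """Generic decoupled head: separate classification and regression branches."""
    def __init__(self, in_ch: int, num_classes: int, num_anchors: int = 3):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, in_ch, kernel_size=1)
        self.cls_branch = nn.Sequential(  # class scores only
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(in_ch, num_anchors * num_classes, 1),
        )
        self.reg_branch = nn.Sequential(  # box (x, y, w, h) + objectness
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(in_ch, num_anchors * 5, 1),
        )

    def forward(self, x: torch.Tensor):
        x = self.stem(x)
        return self.cls_branch(x), self.reg_branch(x)
```

Because the two tasks no longer share parameters, each branch can specialize; the ablation results in Section 4.5 (14.4 M parameters with three heads versus 11.9 M with two) show how dropping one detection scale reduces the model size.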
2. Related Work
2.1. Two-Stage Method in Traffic Scenes
2.2. One-Stage Method in Traffic Scenes
2.3. Attention-Based Method
3. Methodology
3.1. CFA-YOLO Architecture
3.2. Improved Feature Extraction Network
3.3. Cross-Layer Alternating Pyramid Aggregation Network
3.4. Improved Detection Head Structure
3.5. Loss Function
4. Experiments and Results
4.1. Traffic Scenario Dataset
4.2. Evaluation Metric
4.3. Experimental Configuration
4.4. Comparative Experiments
4.5. Ablation Experiments
4.6. Visualization of Experimental Results
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
2. Dai, J.; He, K.; Sun, J. Instance-aware semantic segmentation via multi-task network cascades. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 3150–3158.
3. Kang, K.; Li, H.; Yan, J.; Zeng, X.; Yang, B.; Xiao, T.; Zhang, C.; Wang, Z.; Wang, R.; Wang, X.; et al. T-CNN: Tubelets with convolutional neural networks for object detection from videos. IEEE Trans. Circuits Syst. Video Technol. 2017, 28, 2896–2907.
4. Gu, Y.; Si, B. A novel lightweight real-time traffic sign detection integration framework based on YOLOv4. Entropy 2022, 24, 487.
5. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
6. Liu, Y.; Shi, G.; Li, Y.; Zhao, Z. M-YOLO: Traffic sign detection algorithm applicable to complex scenarios. Symmetry 2022, 14, 952.
7. He, X.; Cheng, R.; Zheng, Z.; Wang, Z. Small object detection in traffic scenes based on YOLO-MXANet. Sensors 2021, 21, 7422.
8. Gong, H.; Mu, T.; Li, Q.; Dai, H.; Li, C.; He, Z.; Wang, W.; Han, F.; Tuniyazi, A.; Li, H.; et al. Swin-Transformer-enabled YOLOv5 with attention mechanism for small object detection on satellite images. Remote Sens. 2022, 14, 2861.
9. Liu, H.; Sun, F.; Gu, J.; Deng, L. SF-YOLOv5: A lightweight small object detection algorithm based on improved feature fusion mode. Sensors 2022, 22, 5817.
10. Qi, G.; Zhang, Y.; Wang, K.; Mazur, N.; Liu, Y.; Malaviya, D. Small object detection method based on adaptive spatial parallel convolution and fast multi-scale fusion. Remote Sens. 2022, 14, 420.
11. Zhu, Z.; Liang, D.; Zhang, S.; Huang, X.; Li, B.; Hu, S. Traffic-sign detection and classification in the wild. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2110–2118.
12. Yu, F.; Xian, W.; Chen, Y.; Liu, F.; Liao, M.; Madhavan, V.; Darrell, T. BDD100K: A diverse driving video database with scalable annotation tooling. arXiv 2018, arXiv:1805.04687.
13. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
14. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
15. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28.
16. Uijlings, J.R.; Van De Sande, K.E.; Gevers, T.; Smeulders, A.W. Selective search for object recognition. Int. J. Comput. Vis. 2013, 104, 154–171.
17. Qian, R.; Liu, Q.; Yue, Y.; Coenen, F.; Zhang, B. Road surface traffic sign detection with hybrid region proposal and Fast R-CNN. In Proceedings of the 2016 12th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD), Changsha, China, 13–15 August 2016; pp. 555–559.
18. Li, J.; Liang, X.; Shen, S.; Xu, T.; Feng, J.; Yan, S. Scale-aware Fast R-CNN for pedestrian detection. IEEE Trans. Multimed. 2017, 20, 985–996.
19. Fan, Q.; Brown, L.; Smith, J. A closer look at Faster R-CNN for vehicle detection. In Proceedings of the 2016 IEEE Intelligent Vehicles Symposium (IV), Gothenburg, Sweden, 19–22 June 2016; pp. 124–129.
20. Zhao, X.; Li, W.; Zhang, Y.; Gulliver, T.A.; Chang, S.; Feng, Z. A Faster R-CNN-based pedestrian detection system. In Proceedings of the 2016 IEEE 84th Vehicular Technology Conference (VTC-Fall), Montreal, QC, Canada, 18–21 September 2016; pp. 1–5.
21. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 779–788.
22. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37.
23. Kim, H.; Lee, Y.; Yim, B.; Park, E.; Kim, H. On-road object detection using deep neural network. In Proceedings of the 2016 IEEE International Conference on Consumer Electronics-Asia (ICCE-Asia), Seoul, Republic of Korea, 26–28 October 2016; pp. 1–4.
24. Xie, L.; Ahmad, T.; Jin, L.; Liu, Y.; Zhang, S. A new CNN-based method for multi-directional car license plate detection. IEEE Trans. Intell. Transp. Syst. 2018, 19, 507–517.
25. Jensen, M.B.; Nasrollahi, K.; Moeslund, T.B. Evaluating state-of-the-art object detector on challenging traffic light data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 9–15.
26. Yang, W.; Zhang, J.; Wang, H.; Zhang, Z. A vehicle real-time detection algorithm based on YOLOv2 framework. In Proceedings of Real-Time Image and Video Processing 2018, Orlando, FL, USA, 15–19 April 2018; Volume 10670, pp. 182–189.
27. Guo, C.; Fan, B.; Zhang, Q.; Xiang, S.; Pan, C. AugFPN: Improving multi-scale feature learning for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12595–12604.
28. Zuo, Z.; Yu, K.; Zhou, Q.; Wang, X.; Li, T. Traffic signs detection based on Faster R-CNN. In Proceedings of the 2017 IEEE 37th International Conference on Distributed Computing Systems Workshops (ICDCSW), Atlanta, GA, USA, 5–8 June 2017; pp. 286–288.
29. Wang, S.Y.; Qu, Z.; Li, C.J.; Gao, L.Y. BANet: Small and multi-object detection with a bidirectional attention network for traffic scenes. Eng. Appl. Artif. Intell. 2023, 117, 105504.
30. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
31. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916.
32. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
33. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768.
34. Song, G.; Liu, Y.; Wang, X. Revisiting the sibling head in object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11563–11572.
35. Wu, Y.; Chen, Y.; Yuan, L.; Liu, Z.; Wang, L.; Li, H.; Fu, Y. Rethinking classification and localization for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10186–10195.
36. Jiang, B.; Luo, R.; Mao, J.; Xiao, T.; Jiang, Y. Acquisition of localization confidence for accurate object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 784–799.
37. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12993–13000.
38. Ruder, S. An overview of gradient descent optimization algorithms. arXiv 2016, arXiv:1609.04747.
39. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162.
40. Sun, P.; Zhang, R.; Jiang, Y.; Kong, T.; Xu, C.; Zhan, W.; Tomizuka, M.; Li, L.; Yuan, Z.; Wang, C.; et al. Sparse R-CNN: End-to-end object detection with learnable proposals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14454–14463.
41. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
42. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696.
43. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626.
Dataset | Training and Validation Images | Test Images | Image Resolution | Categories |
---|---|---|---|---|
TT100K [11] | 8487 | 970 | 2048 × 2048 | 45 |
BDD100K [12] | 72,000 | 8000 | 1280 × 720 | 6 |
Experimental Setting | Configuration |
---|---|
CPU | Intel(R) Core(TM) i7-11700 CPU @2.50 GHz |
GPU | NVIDIA GeForce RTX 3090 |
OS | Ubuntu 20.04 |
Framework | PyTorch 1.11.0 |
Language | Python 3.8 |
Method | Backbone | mAP/% | Param | Speed (ms) |
---|---|---|---|---|
Faster R-CNN [15] | ResNet-50 | 56.8 | 41.4M | 26.5 |
Cascade R-CNN [39] | ResNet-50 | 66.0 | 69.1M | 31.2 |
Sparse R-CNN [40] | ResNet-50 | 64.7 | 106.1M | 32.8 |
RetinaNet [41] | ResNet-50 | 44.8 | 37.1M | 25.6 |
YOLOv5s | CSP-Darknet53-C3 | 62.8 | 7.1M | 9.3 |
YOLOv5m | CSP-Darknet53-C3 | 76.0 | 21.1M | 12.6 |
YOLOv7 [42] | ELAN-Net | 60.7 | 36.7M | 10.4 |
YOLOv3-Tiny | Darknet53 | 61.9 | 8.8M | 3.1 |
YOLOv4-Tiny | CSP-Darknet53 | 56.4 | 3.2M | 4.2 |
YOLOv7-Tiny | ELAN-Net | 49.2 | 6.2M | 7.5 |
YOLO-MAXNet [7] | SA-MobileNeXt | 74.5 | 14.1M | 15.6 |
YOLOv8s | CSP-Darknet53-C2f | 82.9 | 11.2M | 8.2 |
CFA-YOLO (Ours) | CSP-Darknet53-C3 | 84.2 | 11.9M | 10.4 |
Method | Backbone | Traffic Sign/% | Traffic Light/% | mAP/% | Param | Speed (ms) |
---|---|---|---|---|---|---|
Faster R-CNN [15] | ResNet-50 | 66.3 | 52.4 | 65.3 | 41.2M | 34.4 |
Cascade R-CNN [39] | ResNet-50 | 66.1 | 52.1 | 65.2 | 68.9M | 40.1 |
Sparse R-CNN [40] | ResNet-50 | 69.7 | 64.3 | 66.4 | 106.0M | 36.9 |
RetinaNet [41] | ResNet-50 | 64.8 | 52.0 | 63.1 | 36.2M | 34.7 |
YOLOv5s | CSP-Darknet53-C3 | 61.9 | 56.4 | 61.1 | 7.1M | 6.9 |
YOLOv5m | CSP-Darknet53-C3 | 67.3 | 62.1 | 65.9 | 20.9M | 9.2 |
YOLOv3 [30] | Darknet53 | 70.9 | 65.8 | 69.0 | 61.5M | 9.1 |
YOLOv4 [5] | CSP-Darknet53 | 71.1 | 66.0 | 69.4 | 60.4M | 14.0 |
YOLOv3-Tiny | Darknet53 | 34.8 | 27.2 | 39.8 | 8.7M | 2.8 |
YOLOv4-Tiny | CSP-Darknet53 | 40.1 | 29.6 | 41.5 | 3.1M | 3.9 |
YOLOv7-Tiny | ELAN-Net | 57.1 | 52.8 | 59.0 | 6.1M | 7.0 |
YOLO-MAXNet [7] | SA-MobileNeXt | 68.6 | 63.4 | 66.4 | 13.9M | 15.0 |
YOLOv8s | CSP-Darknet53-C2f | 63.9 | 57.9 | 64.8 | 11.1M | 6.8 |
CFA-YOLO (Ours) | CSP-Darknet53-C3 | 71.2 | 67.6 | 67.6 | 11.8M | 9.7 |
Method | SE | SPPF | CAPAN | Decoupled Head | mAP/% | Param |
---|---|---|---|---|---|---|
YOLOv5s | | | | | 62.8 | 7.1 M |
(a) | ✓ | ✓ | | | 64.1 | 7.1 M |
(b) | | | ✓ | | 78.2 | 7.1 M |
(c) | | | ✓ | 3 Heads | 83.9 | 14.4 M |
(d) | | | ✓ | 2 Heads | 83.7 | 11.9 M |
(e) | ✓ | ✓ | ✓ | 2 Heads | 84.2 | 11.9 M |