Visual Multitask Real-Time Model in an Automatic Driving Scene
Abstract
1. Introduction
- (a) We proposed a new scheme that uses CSPNeXt as the backbone of the multitask learning network, which achieves efficiency while simplifying the structural design.
- (b) We employed advanced data augmentation techniques, such as Mosaic filling and Mixup image blending, during data pre-processing, which improved the generalization of the model to different road scenarios.
- (c) We also optimized the loss function of the object detection head by matching the detection-box classification to the shared-weight layers with a soft-label approach, which improves both the accuracy and the speed of the detection head (see the sketch after this list).
- (d) Experimental results showed that the proposed network outperforms the baseline network structure in terms of generalization.
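As a rough illustration of the soft-label idea in (c), and not the authors' exact implementation, the sketch below assumes a quality-focal-loss-style target (cf. GFL [31]): the IoU between each matched prediction and its ground-truth box serves as the soft classification target. The function name, tensor shapes, and the beta parameter are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def soft_label_cls_loss(cls_logits, labels, iou_targets, beta=2.0):
    """Quality-focal-loss-style classification loss with IoU soft labels (illustrative).

    cls_logits:  (N, C) raw class scores for N positive samples
    labels:      (N,)   matched ground-truth class indices
    iou_targets: (N,)   IoU of each prediction with its matched box, in [0, 1]
    """
    # Soft target: the matched class is supervised with the IoU score,
    # every other class is supervised with 0.
    targets = torch.zeros_like(cls_logits)
    idx = torch.arange(cls_logits.size(0), device=cls_logits.device)
    targets[idx, labels] = iou_targets

    pred = cls_logits.sigmoid()
    # Focal-style modulation: predictions far from their soft target weigh more.
    modulating = (pred - targets).abs().pow(beta)
    bce = F.binary_cross_entropy_with_logits(cls_logits, targets, reduction="none")
    return (modulating * bce).sum(dim=1).mean()
```

A complete detection loss would add a box-regression term (for example, an IoU-based loss), which is omitted here.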
2. Related Work
2.1. Traffic Target Detection in Real Time
2.2. Drivable Areas and Lane Splits
2.3. Multitask Learning
3. Our Proposed Methods
3.1. Design Idea
3.2. Network Architecture
3.3. Backbone
3.4. Task Headers
3.5. Loss Function Design
3.6. Algorithm Details Implementation
Algorithm 1. The end-to-end direct training method
Input: the target neural network …; the maximum number of iterations …. Output: …
1: For … = 1 to … do // there are three tasks
2:   For … = 1 to 3 do
3:     …
4:     …
5:   End
6:   …
7:   Random sorting …
8:   Foreach … in … do
9:     Calculate the …
10:    … // x is the input image
11:    …
12:  End
13: End
14: Return …
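As a concrete but hedged illustration of such an end-to-end direct training scheme, and not the authors' exact procedure, the following PyTorch sketch draws one mini-batch per task in every iteration, sums the three task losses, and updates the shared backbone with a single backward pass. The `task=` forward argument, the loader and loss dictionaries, and the optimizer settings are all assumptions.

```python
import torch

def train_end_to_end(model, loaders, loss_fns, max_iters, lr=1e-2, device="cuda"):
    """Jointly train one shared backbone with three task heads
    (object detection, drivable-area segmentation, lane segmentation)."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    tasks = list(loaders.keys())               # e.g. ["detection", "drivable", "lane"]
    iters = {t: iter(loaders[t]) for t in tasks}

    for step in range(max_iters):
        optimizer.zero_grad()
        total_loss = 0.0
        for task in tasks:
            try:
                images, targets = next(iters[task])
            except StopIteration:              # restart an exhausted loader
                iters[task] = iter(loaders[task])
                images, targets = next(iters[task])
            images = images.to(device)         # x: the input image batch
            outputs = model(images, task=task) # hypothetical per-task forward interface
            total_loss = total_loss + loss_fns[task](outputs, targets)
        total_loss.backward()                  # one backward pass over the joint loss
        optimizer.step()
    return model
```

An alternating scheme (one task per iteration, visited in randomly sorted order, as the "Random sorting" step in Algorithm 1 suggests) would simply move the backward and optimizer steps inside the task loop.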
4. Experimental Evaluation
4.1. Data Sets
4.2. Experimental Process
4.3. Training Methods
4.4. Comparison of Experimental Results
4.5. Visualization
4.6. Ablation Studies
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Nomenclature
Term | Definition |
---|---|
ADAS | Advanced driver assistance systems, a technology that assists drivers in the driving process, found in reference [1]. |
ApolloScape | An open-source dataset for autonomous driving research and computer vision tasks, found in reference [30]. |
BasicBlock | A building block of convolutional neural networks used for feature extraction. |
BEV | Bird’s eye view, a perspective view used in computer vision to represent a top-down view of an object or environment, found in reference [33]. |
BiFPN | Bidirectional feature pyramid network, a network used for object detection tasks, found in reference [22]. |
CCP/AP | Pixel accuracy: the number of correctly classified pixels in the lane line segmentation task divided by the total number of pixels (a minimal computation sketch follows this table). |
CenterNet | An object detection framework that uses keypoint estimation to detect objects, found in reference [2]. |
ConvNeXt | A type of convolutional neural network architecture that improves upon traditional convolutional layers by using grouped convolutions, found in reference [27]. |
CSPDarknet | A cross-stage-partial (CSP) convolutional backbone network used for feature extraction in YOLO-family detectors. |
CSPNeXt | A deep neural network model for image classification and object detection tasks. |
CSPNeXtBlock | A building block used in CSPNeXt architectures. |
CSPLayer-T | A layer used in CSPNeXt architectures. |
FCN | Fully convolutional network, a type of neural network commonly used for semantic segmentation tasks, found in reference [15]. |
FPN | Feature pyramid network, a neural network used for object detection and semantic segmentation tasks. |
GFL | Generalized focal loss, a loss function used in object detection tasks, found in reference [31]. |
HybridNets | An end-to-end multitask perception network for traffic object detection, drivable-area segmentation, and lane detection, found in reference [20]. |
LaneNet | A neural network used for lane detection and segmentation tasks, found in reference [17]. |
MultiNet | A multitask learning framework used for various computer vision tasks, found in reference [13]. |
PAN | Path aggregation network, a feature-fusion structure that aggregates features across pyramid levels, commonly used as the neck of object detection networks. |
PolyLaneNet | A neural network used for lane marking detection in autonomous driving. |
PSPNet | Pyramid scene parsing network, a neural network used for semantic segmentation tasks, found in reference [11] |
RepLKNet | A neural network architecture for object detection and instance segmentation tasks. |
RetinaNet | An object detection framework that uses focal loss to address class imbalance issues, found in reference [3]. |
R-CNN | Region-based convolutional neural network, an object detection framework, found in reference [4] |
RTMDet | Real-time models for object detection, an efficient real-time object detection framework, found in reference [27]. |
SAD | Self attention distillation, a method for learning lightweight lane detection convolutional models, found in reference [16]. |
SegNet | A neural network used for semantic segmentation tasks, found in reference [10] |
SepBN Head | Separable batch normalization head, a type of normalization technique used in convolutional neural networks. |
Softmax | A function used to convert a vector of numbers into probabilities that sum to one. |
SPP | Spatial pyramid pooling, a method used to handle variable-sized inputs in neural networks, found in reference [29] |
Swin-T | Swin Transformer, a transformer-based neural network architecture commonly used for computer vision tasks, found in reference [25] |
UNet | A neural network architecture used for image segmentation tasks, found in reference [9] |
ViT | Vision transformer, a transformer-based neural network architecture commonly used for computer vision tasks, found in reference [24]. |
YOLO | You only look once, an object detection framework that predicts bounding boxes and class probabilities directly from the input image, found in reference [6] |
YOLOP | You only look once for panoptic driving perception, a multitask learning model for autonomous driving, found in reference [19] |
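For clarity on how the segmentation metrics in the result tables relate to the CCP/AP entry above, the following is a minimal sketch, under our own assumptions, of pixel accuracy and mean IoU over integer class-index masks; the paper's exact evaluation protocol (thresholds, ignore regions) may differ.

```python
import torch

def pixel_accuracy(pred, target):
    """CCP/AP: correctly classified pixels divided by the total number of pixels."""
    return (pred == target).float().mean().item()

def mean_iou(pred, target, num_classes):
    """mIoU: intersection-over-union averaged over the classes present in either mask."""
    ious = []
    for c in range(num_classes):
        inter = ((pred == c) & (target == c)).sum().item()
        union = ((pred == c) | (target == c)).sum().item()
        if union > 0:                          # skip classes absent from both masks
            ious.append(inter / union)
    return sum(ious) / max(len(ious), 1)
```

Both functions expect `pred` and `target` as same-shaped tensors of per-pixel class indices, for example the argmax of a segmentation head's output.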
References
1. Bengler, K.; Dietmayer, K.; Farber, B.; Maurer, M.; Stiller, C.; Winner, H. Three Decades of Driver Assistance Systems: Review and Future Perspectives. IEEE Intell. Transp. Syst. Mag. 2014, 6, 6–22.
2. Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. CenterNet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019.
3. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 99, 2999–3007.
4. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
5. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
6. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
7. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
8. Lin, G.; Liu, K.; Xia, X.; Yan, R. An Efficient and Intelligent Detection Method for Fabric Defects Based on Improved YOLO v5. Sensors 2023, 23, 97.
9. Huang, H.; Lin, L.; Tong, R.; Hu, H.; Zhang, Q.; Iwamoto, Y.; Han, X.; Chen, Y.-W.; Wu, J. UNet 3+: A full-scale connected UNet for medical image segmentation. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1055–1059.
10. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
11. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890.
12. Parashar, A.; Rhu, M.; Mukkara, A.; Puglielli, A.; Venkatesan, R.; Khailany, B.; Emer, J.; Keckler, S.W.; Dally, W.J. SCNN: An accelerator for compressed-sparse convolutional neural networks. ACM SIGARCH Comput. Archit. News 2017, 45, 27–40.
13. Teichmann, M.; Weber, M.; Zoellner, M.; Cipolla, R.; Urtasun, R. MultiNet: Real-time joint semantic reasoning for autonomous driving. In Proceedings of the 2018 IEEE Intelligent Vehicles Symposium (IV), Changshu, China, 26–30 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1013–1020.
14. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016.
15. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
16. Hou, Y.; Ma, Z.; Liu, C.; Loy, C.C. Learning lightweight lane detection CNNs by self attention distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; IEEE Press: Piscataway, NJ, USA, 2019; pp. 1013–1021.
17. Neven, D.; De Brabandere, B.; Georgoulis, S.; Proesmans, M.; Van Gool, L. Towards end-to-end lane detection: An instance segmentation approach. In Proceedings of the IEEE Intelligent Vehicles Symposium (IV), Changshu, China, 26–30 June 2018; IEEE Press: Piscataway, NJ, USA, 2018; pp. 286–291.
18. Tabelini, L.; Berriel, R.; Paixao, T.M.; Badue, C.; De Souza, A.F.; Oliveira-Santos, T. PolyLaneNet: Lane estimation via deep polynomial regression. In Proceedings of the International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; IEEE Press: Piscataway, NJ, USA, 2021; pp. 6150–6156.
19. Wu, D.; Liao, M.W.; Zhang, W.T.; Wang, X.G.; Bai, X.; Cheng, W.Q.; Liu, W.Y. YOLOP: You only look once for panoptic driving perception. Mach. Intell. Res. 2022, 19, 550–562.
20. Vu, D.; Ngo, B.; Phan, H. HybridNets: End-to-End Perception Network. arXiv 2022, arXiv:2203.09035.
21. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
22. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020.
23. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762.
24. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929.
25. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021.
26. Xu, C. Text/Picture Robots, autopilots and supercomputers approach Tesla AI Day 2022. Microcomputer 2022, 5, 17.
27. Lyu, C.; Zhang, W.; Huang, H.; Zhou, Y.; Wang, Y.; Liu, Y.; Zhang, S.; Chen, K. RTMDet: An Empirical Study of Designing Real-Time Object Detectors. arXiv 2022, arXiv:2212.07784.
28. Han, C.; Zhao, Q.; Zhang, S.; Chen, Y.; Zhang, Z.; Yuan, J. YOLOPv2: Better, Faster, Stronger for Panoptic Driving Perception. arXiv 2022, arXiv:2208.11434.
29. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 37, 1904–1916.
30. Zhang, S.; Ma, Y.; Yang, R. CVPR 2019 WAD Challenge on Trajectory Prediction and 3D Perception. arXiv 2020, arXiv:2004.05966.
31. Li, X.; Lv, C.; Wang, W.; Li, G.; Yang, L.; Yang, J. Generalized Focal Loss: Towards Efficient Representation Learning for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 3139–3153.
32. Krizhevsky, A.; Sutskever, I.; Hinton, G. ImageNet Classification with Deep Convolutional Neural Networks. Adv. Neural Inf. Process. Syst. 2012, 25.
33. Li, H.; Sima, C.; Dai, J.; Wang, W.; Lu, L.; Wang, H.; Xie, E.; Li, Z.; Deng, H.; Tian, H.; et al. Delving into the Devils of Bird’s-eye-view Perception: A Review, Evaluation and Recipe. arXiv 2022, arXiv:2209.05324.
Kernel Size | mAP50:95 | Speed (FPS) |
---|---|---|
YOLOP | 76.8 | 90 |
CSPNeXtBlock 3 × 3 | 77.2 | 133 |
CSPNeXtBlock 7 × 7 | 77.8 | 107 |
CSPNeXtBlock 5 × 5 | 77.5 | 126 |
Model | Input Shape | Parameters | Inference Time (ms) |
---|---|---|---|
YOLOP | 640 × 640 | 7.92 M | 11.1 |
HybridNets | 640 × 640 | 12.84 M | 21.7 |
Our method | 640 × 640 | 6.3 M | 7.6 |
Model | mAP50:95 | Recall |
---|---|---|
MultiNet | 59.1 | 83.2 |
DLT-Net | 68.3 | 86.4 |
HybridNets | 75.4 | 92.3 |
YOLOV5s | 77.9 | 94.3 |
YOLOP | 76.8 | 93.8 |
Our method | 78.6 | 94.6 |
Model | Drivable mIoU |
---|---|
DLT-Net | 79.2 |
HybridNets | 95.4 |
YOLOP | 96.0 |
Our method | 97.1 |
Model | Lane mIoU | Accuracy (CCP/AP) |
---|---|---|
MultiNet | 55.9 | 68.8 |
DLT-Net | 69.4 | 70.9 |
HybridNets | 85.3 | 77.6 |
YOLOP | 85.3 | 76.5 |
Our method | 86.8 | 78.7 |
Method | Inference Speed (FPS) | Object Detection mAP50:95 | Object Detection Recall | Drivable Area mIoU | Lane Detection Accuracy |
---|---|---|---|---|---|
YOLOP (Baseline) | 90 | 76.8 | 93.8 | 96.0 | 76.5 |
+CSPNeXt | 126 | 77.5 | 94.3 | 96.1 | 77.3 |
+Mosaic and Mixup | 124 | 77.8 | 94.8 | 96.1 | 77.5 |
+ConvTranspose2d | 136 | 76.8 | 94.8 | 96.1 | 78.7 |
+SepBN Head | 131 | 78.6 | 94.6 | 96.1 | 78.7 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Citation: Zheng, X.; Lu, C.; Zhu, P.; Yang, G. Visual Multitask Real-Time Model in an Automatic Driving Scene. Electronics 2023, 12, 2097. https://doi.org/10.3390/electronics12092097