Improved Mask R-CNN Multi-Target Detection and Segmentation for Autonomous Driving in Complex Scenes
Abstract
1. Introduction
2. Mask R-CNN Instance Segmentation Algorithm Model
2.1. The Model Structure of Mask R-CNN
2.2. Backbone Feature Extraction Network
2.3. Region Proposal Network (RPN)
2.4. RoI Align
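RoI Align replaces RoI Pooling's coordinate rounding with bilinear interpolation at fractional sampling points, which preserves pixel-level alignment for the mask branch. A minimal NumPy sketch of the bilinear-sampling step (illustrative only; the function name is my own, not from the paper):

```python
import numpy as np

def bilinear_sample(feature, y, x):
    """Sample a 2-D feature map at a fractional (y, x) location
    using bilinear interpolation, as RoI Align does at each of its
    sampling points (no coordinate quantization)."""
    h, w = feature.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = (1 - dx) * feature[y0, x0] + dx * feature[y0, x1]
    bot = (1 - dx) * feature[y1, x0] + dx * feature[y1, x1]
    return (1 - dy) * top + dy * bot
```

In the full operator, each RoI bin averages (or max-pools) several such samples; the key point is that no coordinate is ever rounded to an integer grid cell.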
2.5. Mask Branch
3. Improved Mask R-CNN Model
- (1) In urban traffic scenarios, multiple targets appear in front of the vehicle, such as cars, pedestrians, and trucks. Under traffic congestion or dense pedestrian flow, targets frequently overlap, so individual targets in front of the vehicle may not be detected and segmented accurately.
- (2) In complicated and variable weather, target detection and segmentation are easily affected by lighting and weather conditions, leading to false detections or low detection accuracy.
3.1. ResNet Backbone Network Improvements
3.2. Feature Pyramid Network Improvements
3.3. Efficient Channel Attention (ECA) Module
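The ECA module gates channels with almost no extra parameters: the feature map is globally average-pooled to one value per channel, a shared 1-D convolution of small kernel size k mixes each channel with its k neighbours, and a sigmoid produces per-channel weights. A NumPy sketch of that gating step (the all-ones kernel is purely illustrative; in ECA-Net the kernel is learned and k is derived adaptively from the channel count):

```python
import numpy as np

def eca_weights(channel_means, k=3):
    """Compute ECA-style per-channel gating weights from the
    globally average-pooled channel descriptor: shared 1-D conv
    of kernel size k, then sigmoid."""
    c = len(channel_means)
    pad = k // 2
    padded = np.pad(channel_means, pad, mode="constant")
    kernel = np.ones(k)  # stand-in for the learned 1-D kernel
    conv = np.array([np.dot(padded[i:i + k], kernel) for i in range(c)])
    return 1.0 / (1.0 + np.exp(-conv))  # sigmoid -> weights in (0, 1)
```

Each channel of the feature map is then scaled by its weight, so the module adds only the k kernel parameters, in contrast to the two fully connected layers of SE blocks.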
3.4. Optimization of Loss Functions
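The CIoU loss used here augments the IoU term with a normalized centre-distance penalty and an aspect-ratio consistency term. A self-contained sketch of the standard formulation (box layout `(x1, y1, x2, y2)` is my own convention for the example):

```python
import math

def ciou_loss(box_p, box_g):
    """CIoU loss = 1 - IoU + rho^2/c^2 + alpha*v, where rho is the
    distance between box centres, c the diagonal of the smallest
    enclosing box, and v penalises aspect-ratio mismatch."""
    px1, py1, px2, py2 = box_p
    gx1, gy1, gx2, gy2 = box_g
    # Intersection over union
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    union = (px2 - px1) * (py2 - py1) + (gx2 - gx1) * (gy2 - gy1) - inter
    iou = inter / union
    # Normalised distance between box centres
    rho2 = ((px1 + px2) / 2 - (gx1 + gx2) / 2) ** 2 + \
           ((py1 + py2) / 2 - (gy1 + gy2) / 2) ** 2
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    c2 = cw ** 2 + ch ** 2
    # Aspect-ratio consistency term
    v = (4 / math.pi ** 2) * (math.atan((gx2 - gx1) / (gy2 - gy1)) -
                              math.atan((px2 - px1) / (py2 - py1))) ** 2
    alpha = v / (1 - iou + v) if iou < 1 else 0.0
    return 1 - iou + rho2 / c2 + alpha * v
```

For identical boxes the loss is exactly 0; for non-overlapping boxes the distance term still provides a gradient, which is the practical advantage over plain IoU loss.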
4. Experimental Results and Discussion
4.1. Experimental Environment Configuration
4.2. Dataset
4.3. Evaluation Indicators
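The AP figures in the ablation tables are the area under the interpolated precision-recall curve, averaged over classes to give mAP. A minimal sketch of the all-point AP computation for one class (the input format is my own simplification; real evaluation also matches detections to ground truth by IoU threshold):

```python
def average_precision(scored, num_gt):
    """Area under the interpolated precision-recall curve.
    `scored` is a list of (confidence, is_true_positive) detections;
    `num_gt` is the number of ground-truth instances of the class."""
    scored = sorted(scored, key=lambda t: -t[0])  # by descending confidence
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in scored:
        tp += int(is_tp)
        fp += int(not is_tp)
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Make precision monotonically non-increasing, then integrate.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_r) * p
        prev_r = r
    return ap
```

mAP is then the mean of this value over all classes (cars, pedestrians, trucks, buses, riders in the tables below), computed separately for detection boxes and segmentation masks.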
4.4. Model Training and Experimental Parameter Configuration
4.5. Ablation Experimental Results and Analysis
4.6. Migration Experiment Results and Analysis
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Grigorescu, S.; Trasnea, B.; Cocias, T.; Macesanu, G. A survey of deep learning techniques for autonomous driving. J. Field Robot. 2020, 37, 362–386.
- Janai, J.; Güney, F.; Behl, A.; Geiger, A. Computer vision for autonomous vehicles: Problems, datasets and state of the art. Found. Trends Comput. Graph. Vis. 2020, 12, 1–308.
- Su, L.; Sun, Y.-X.; Yuan, S.-Z. A survey of instance segmentation research based on deep learning. CAAI Trans. Intell. Syst. 2022, 17, 16.
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Part I; Springer International Publishing: Berlin/Heidelberg, Germany, 2016.
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916.
- Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28; MIT Press: Cambridge, MA, USA, 2015.
- Bai, M.; Urtasun, R. Deep watershed transform for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5221–5229.
- Gao, N.-Y.; Shan, Y.; Wang, Y.; Zhao, X.; Yu, Y.; Yang, M.; Huang, K. SSAP: Single-shot instance segmentation with affinity pyramid. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 642–651.
- Dai, J.-F.; He, K.; Sun, J. Instance-aware semantic segmentation via multi-task network cascades. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3150–3158.
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
- Li, Y.; Qi, H.; Dai, J.; Ji, X.; Wei, Y. Fully convolutional instance-aware semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2359–2367.
- Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y.J. YOLACT: Real-time instance segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9157–9166.
- Wang, X.; Kong, T.; Shen, C.; Jiang, Y.; Li, L. SOLO: Segmenting objects by locations. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Part XVIII; Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 649–655.
- Ke, L.; Tai, Y.-W.; Tang, C.-K. Deep occlusion-aware instance segmentation with overlapping bilayers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 4019–4028.
- Zhang, T.; Wei, S.; Ji, S. E2EC: An end-to-end contour-based method for high-quality high-speed instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4443–4452.
- He, J.-J.; Li, P.; Geng, Y.; Xie, X. FastInst: A simple query-based model for real-time instance segmentation. arXiv 2023, arXiv:2303.08594.
- Zhang, H.; Li, F.; Xu, H.; Huang, S.; Liu, S.; Ni, L.M.; Zhang, L. MP-Former: Mask-piloted transformer for image segmentation. arXiv 2023, arXiv:2303.07336.
- Yurtsever, E.; Lambert, J.; Carballo, A.; Takeda, K. A survey of autonomous driving: Common practices and emerging technologies. IEEE Access 2020, 8, 58443–58469.
- Peng, Y.; Liu, X.; Shen, C.; Huang, H.; Zhao, D.; Cao, H.; Guo, X. An improved optical flow algorithm based on Mask R-CNN and K-means for velocity calculation. Appl. Sci. 2019, 9, 2808.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
- Lu, J.-H. Analysis and comparison of three classical color image interpolation algorithms. J. Phys. Conf. Ser. 2021, 1802, 032124.
- Nair, V.; Hinton, G.E. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 21–24 June 2010; pp. 807–814.
- Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
- Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5987–5995.
- Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768.
- Wang, Q.-L.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542.
- Guo, M.-H.; Xu, T.-X.; Liu, J.-J.; Liu, Z.-N.; Jiang, P.-T.; Mu, T.-J.; Zhang, S.-H.; Martin, R.R.; Cheng, M.-M.; Hu, S.-M. Attention mechanisms in computer vision: A survey. Comput. Vis. Media 2022, 8, 331–368.
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
- Zhang, B.; Fang, S.-Q.; Li, Z.-X. Research on surface defect detection of rare-earth magnetic materials based on improved SSD. Complexity 2021, 2021, 4795396.
- Zheng, Z.-H.; Wang, P.; Ren, D.; Liu, W.; Ye, R.; Hu, Q.; Zuo, W. Enhancing geometric factors in model learning and inference for object detection and instance segmentation. IEEE Trans. Cybern. 2022, 52, 8574–8586.
- Lin, T.-Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Part V; Springer International Publishing: Berlin/Heidelberg, Germany, 2014.
- Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223.
Ablation results for object detection (AP, %):

| Model | AP_car | AP_pedestrian | AP_truck | AP_bus | AP_rider | mAP_det |
|---|---|---|---|---|---|---|
| Mask R-CNN | 72.21 | 54.71 | 42.11 | 63.78 | 56.62 | 57.89 |
| Mask R-CNN + ResNeXt | 73.81 | 55.36 | 43.95 | 65.82 | 58.66 | 59.52 |
| Mask R-CNN + ResNeXt + Improved FPN | 75.94 | 57.05 | 46.41 | 64.18 | 60.89 | 60.90 |
| Mask R-CNN + ResNeXt + Improved FPN + ECA | 75.63 | 59.86 | 47.06 | 66.86 | 60.75 | 62.03 |
| Mask R-CNN + ResNeXt + Improved FPN + ECA + CIoU | 75.79 | 60.82 | 46.68 | 68.16 | 61.65 | 62.62 |
Ablation results for instance segmentation (AP, %):

| Model | AP_car | AP_pedestrian | AP_truck | AP_bus | AP_rider | mAP_seg |
|---|---|---|---|---|---|---|
| Mask R-CNN | 66.41 | 46.08 | 42.31 | 60.57 | 52.74 | 53.62 |
| Mask R-CNN + ResNeXt | 68.42 | 49.17 | 43.15 | 60.24 | 54.55 | 55.11 |
| Mask R-CNN + ResNeXt + Improved FPN | 68.97 | 51.55 | 45.07 | 64.18 | 54.46 | 56.85 |
| Mask R-CNN + ResNeXt + Improved FPN + ECA | 68.91 | 54.36 | 45.32 | 63.48 | 55.48 | 57.51 |
| Mask R-CNN + ResNeXt + Improved FPN + ECA + CIoU | 69.14 | 54.17 | 44.74 | 65.20 | 54.67 | 57.58 |
Speed, model size, and accuracy of each individual improvement:

| Model | FPS | Params/M | FLOPs/G | mAP_det/% | mAP_seg/% |
|---|---|---|---|---|---|
| Mask R-CNN | 20.38 | 43.997043 | 194.4708 | 57.89 | 53.62 |
| Mask R-CNN + ResNeXt | 20.11 | 43.468915 | 197.6479 | 59.52 | 55.11 |
| Mask R-CNN + Improved FPN | 19.85 | 46.461245 | 199.4276 | 58.66 | 55.96 |
| Mask R-CNN + ECA | 21.25 | 43.997064 | 194.5018 | 59.41 | 54.43 |
| Mask R-CNN + CIoU | 19.98 | 43.997043 | 194.4708 | 58.17 | 53.90 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Fang, S.; Zhang, B.; Hu, J. Improved Mask R-CNN Multi-Target Detection and Segmentation for Autonomous Driving in Complex Scenes. Sensors 2023, 23, 3853. https://doi.org/10.3390/s23083853