ESAMask: Real-Time Instance Segmentation Fused with Efficient Sparse Attention
Abstract
1. Introduction
- (1) We introduce RSPA into the backbone network; it allocates differentiated attention to different semantic features in a sparse, adaptive manner (see the first sketch after this list).
- (2) We design GSInvSAM, which removes redundant information and strengthens feature associations across channels during bidirectional pyramid feature aggregation (see the second sketch after this list).
- (3) We add MRFCPM to the prototype branch; it models global, regional, and local representations at multiple levels, improving the segmentation of targets at different scales (see the third sketch after this list).
- (4) The entire model and each of its components are designed to be lightweight, effective, and efficient. Experimental results show that our model achieves a better balance between accuracy and efficiency.
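The items above only name the modules; their exact designs are given in Sections 3.2–3.5 and are not reproduced here. As a rough, hedged illustration of the sparse-attention idea behind RSPA, the PyTorch sketch below implements generic top-k sparse self-attention, in which each query attends only to its k most related keys. The class name, layer layout, and default hyperparameters are assumptions made for this sketch, not the paper's RSPA formulation.

```python
import torch
import torch.nn as nn


class TopKSparseAttention(nn.Module):
    """Generic top-k sparse self-attention (illustrative only, not RSPA itself).
    Each query keeps only its k highest-scoring keys, so attention is spent
    sparsely and adaptively on the most related semantic features."""

    def __init__(self, dim: int, num_heads: int = 8, k: int = 4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.k = k
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) tokens flattened from a feature map
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, key, v = qkv.permute(2, 0, 3, 1, 4)             # each: (B, H, N, C/H)

        scores = (q @ key.transpose(-2, -1)) * self.scale  # (B, H, N, N) relevance
        topk = min(self.k, N)
        kth = scores.topk(topk, dim=-1).values[..., -1:]   # k-th largest per query
        scores = scores.masked_fill(scores < kth, float("-inf"))
        attn = scores.softmax(dim=-1)                      # sparse attention weights

        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

For example, applying `TopKSparseAttention(dim=256, k=4)` to flattened backbone features keeps only four keys per query, in line with the small k values ablated in Section 4.4.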
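For the neck, the bottleneck variants ablated in Section 4.4 combine GSConv [28], an inverted convolution with expansion ratio r, and the parameter-free SimAM attention [29]. The sketch below is a hypothetical composition of those three ingredients in that order; it is not the exact GSInvSAM block, and the channel layout and kernel sizes are assumptions.

```python
import torch
import torch.nn as nn


def simam(x: torch.Tensor, e_lambda: float = 1e-4) -> torch.Tensor:
    """Parameter-free SimAM attention: weight each activation by an energy term
    derived from its squared deviation from the per-channel spatial mean."""
    n = x.shape[2] * x.shape[3] - 1
    d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
    v = d.sum(dim=(2, 3), keepdim=True) / n
    return x * torch.sigmoid(d / (4 * (v + e_lambda)) + 0.5)


class GSConv(nn.Module):
    """GSConv-style mix of a dense and a depthwise convolution followed by a
    channel shuffle (simplified; c_out is assumed to be even)."""

    def __init__(self, c_in: int, c_out: int, k: int = 1, s: int = 1):
        super().__init__()
        c_half = c_out // 2
        self.dense = nn.Sequential(
            nn.Conv2d(c_in, c_half, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())
        self.dw = nn.Sequential(
            nn.Conv2d(c_half, c_half, 5, 1, 2, groups=c_half, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1 = self.dense(x)
        out = torch.cat([x1, self.dw(x1)], dim=1)
        b, c, h, w = out.shape                        # interleave the two halves
        return out.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)


class GSInvSAMBottleneck(nn.Module):
    """Hypothetical bottleneck: GSConv -> inverted (expand-then-project)
    depthwise convolution with ratio r -> SimAM, plus a residual connection."""

    def __init__(self, c: int, r: int = 2):
        super().__init__()
        self.gs = GSConv(c, c, k=3)
        self.invert = nn.Sequential(
            nn.Conv2d(c, c * r, 1, bias=False), nn.BatchNorm2d(c * r), nn.SiLU(),
            nn.Conv2d(c * r, c * r, 3, 1, 1, groups=c * r, bias=False),
            nn.BatchNorm2d(c * r), nn.SiLU(),
            nn.Conv2d(c * r, c, 1, bias=False), nn.BatchNorm2d(c))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + simam(self.invert(self.gs(x)))
```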
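MRFCPM is described as modeling global, regional, and local representations at multiple levels, and the kernel-size ablation in Section 4.4 varies a kernel k from 5 to 11. The sketch below shows only one plausible arrangement of a local 3×3 branch, a large-kernel depthwise "regional" branch, and a pooled global gate; the fusion scheme and module name are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn


class MixedReceptiveFieldBlock(nn.Module):
    """Illustrative mixed-receptive-field module: a local 3x3 branch, a regional
    large-kernel depthwise branch (kernel size k), and a global pooled gate,
    fused by a 1x1 convolution. A sketch only, not the exact MRFCPM."""

    def __init__(self, c: int, k: int = 7):
        super().__init__()
        self.local = nn.Conv2d(c, c, 3, 1, 1, bias=False)
        self.regional = nn.Conv2d(c, c, k, 1, k // 2, groups=c, bias=False)
        self.global_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(c, c, 1), nn.Sigmoid())
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * c, c, 1, bias=False), nn.BatchNorm2d(c), nn.SiLU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        fused = self.fuse(torch.cat([self.local(x), self.regional(x)], dim=1))
        return fused * self.global_gate(x)            # global context reweighting
```

Because the regional branch is depthwise, enlarging k widens the receptive field at a small parameter cost, consistent with the modest parameter growth reported in the k ablation below.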
2. Related Work
2.1. Instance Segmentation
2.2. Attention for Instance Segmentation
3. Methods
3.1. Overall Architecture
3.2. Related Semantic Perceived Attention
3.3. GSInvSAM
3.4. Global Content-Aware Module
3.5. Mixed Receptive Field Context Perception Module
4. Experiments
4.1. Dataset and Evaluation Metrics
4.2. Implementation Details
4.3. Main Results
4.4. Ablation Study
4.5. Visualization of Results
5. Limitation and Future Work
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Wu, Y.; Meng, F.; Qin, Y.; Qian, Y.; Xu, F.; Jia, L. UAV imagery based potential safety hazard evaluation for high-speed railroad using Real-time instance segmentation. Adv. Eng. Inform. 2023, 55, 101819. [Google Scholar] [CrossRef]
- Cerón, J.C.Á.; Ruiz, G.O.; Chang, L.; Ali, S. Real-time instance segmentation of surgical instruments using attention and multi-scale feature fusion. Med. Image Anal. 2022, 81, 102569. [Google Scholar] [CrossRef] [PubMed]
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
- Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
- Huang, Z.; Huang, L.; Gong, Y.; Huang, C.; Wang, X. Mask scoring r-cnn. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6409–6418. [Google Scholar]
- Kirillov, A.; Wu, Y.; He, K.; Girshick, R. Pointrend: Image segmentation as rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9799–9808. [Google Scholar]
- Tang, C.; Chen, H.; Li, X.; Li, J.; Zhang, Z.; Hu, X. Look closer to segment better: Boundary patch refinement for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13926–13935. [Google Scholar]
- Cheng, T.; Wang, X.; Huang, L.; Liu, W. Boundary-preserving mask r-cnn. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XIV 16. Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 660–676. [Google Scholar]
- Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y.J. Yolact: Real-time instance segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9157–9166. [Google Scholar]
- Bolya, D.; Zhou, C.; Xiao, F.; Lee, Y.J. Yolact++: Better real-time instance segmentation. arXiv 2019, arXiv:1912.06218. [Google Scholar] [CrossRef] [PubMed]
- Fu, C.Y.; Shvets, M.; Berg, A.C. RetinaMask: Learning to predict masks improves state-of-the-art single-shot detection for free. arXiv 2019, arXiv:1901.03353. [Google Scholar]
- Xie, E.; Sun, P.; Song, X.; Wang, W.; Liu, X.; Liang, D.; Shen, C.; Luo, P. Polarmask: Single shot instance segmentation with polar representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12193–12202. [Google Scholar]
- Chen, H.; Sun, K.; Tian, Z.; Shen, C.; Huang, Y.; Yan, Y. Blendmask: Top-down meets bottom-up for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8573–8581. [Google Scholar]
- Pei, S.; Ni, B.; Shen, T.; Zhou, Z.; Chen, Y.; Qiu, M. RISAT: Real-time instance segmentation with adversarial training. Multimed. Tools Appl. 2023, 82, 4063–4080. [Google Scholar] [CrossRef]
- Jocher, G.; Chaurasia, A.; Qiu, J. YOLO by Ultralytics (Version 8.0.0) [Computer Software]. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 6 March 2023).
- Jocher, G. YOLOv5 by Ultralytics (Version 7.0) [Computer Software]. 2020. Available online: https://zenodo.org/record/7347926 (accessed on 8 October 2020).
- Zheng, J.; Wu, H.; Zhang, H.; Wang, Z.; Xu, W. Insulator-defect detection algorithm based on improved YOLOv7. Sensors 2022, 22, 8801. [Google Scholar] [CrossRef] [PubMed]
- Gallo, I.; Rehman, A.U.; Dehkordi, R.H.; Landro, N.; Grassa, R.L.; Boschetti, M. Deep object detection of crop weeds: Performance of YOLOv7 on a real case dataset from UAV images. Remote Sens. 2023, 15, 539. [Google Scholar] [CrossRef]
- Dewi, C.; Chen, A.P.S.; Christanto, H.J. Deep Learning for Highly Accurate Hand Recognition Based on Yolov7 Model. Big Data Cogn. Comput. 2023, 7, 53. [Google Scholar] [CrossRef]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
- Ke, L.; Danelljan, M.; Li, X.; Tai, Y.; Tang, C.K.; Yu, F. Mask transfiner for high-quality instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 4412–4421. [Google Scholar]
- Fang, Y.; Yang, S.; Wang, X.; Li, Y.; Fang, C.; Shan, Y.; Feng, B.; Liu, W. Instances as queries. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 6910–6919. [Google Scholar]
- Dong, B.; Zeng, F.; Wang, T.; Zhang, X.; Wei, Y. Solq: Segmenting objects by learning queries. Adv. Neural Inf. Process. Syst. 2021, 34, 21898–21909. [Google Scholar]
- Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 1290–1299. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 1–11. [Google Scholar]
- Hassani, A.; Walton, S.; Li, J.; Li, S.; Shi, H. Neighborhood attention transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 6185–6194. [Google Scholar]
- Hassani, A.; Shi, H. Dilated neighborhood attention transformer. arXiv 2022, arXiv:2209.15001. [Google Scholar]
- Li, H.; Li, J.; Wei, H.; Liu, Z.; Zhan, Z.; Ren, Q. Slim-neck by GSConv: A better design paradigm of detector architectures for autonomous vehicles. arXiv 2022, arXiv:2206.02424. [Google Scholar]
- Yang, L.; Zhang, R.Y.; Li, L.; Xie, X. Simam: A simple, parameter-free attention module for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; pp. 11863–11874. [Google Scholar]
- Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
- Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
- Zhang, G.; Lu, X.; Tan, J.; Li, J.; Zhang, Z.; Li, Q.; Hu, X. Refinemask: Towards high-quality instance segmentation with fine-grained features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 6861–6869. [Google Scholar]
- Zhu, C.; Zhang, X.; Li, Y.; Qiu, L.; Han, K.; Han, X. SharpContour: A contour-based boundary refinement approach for efficient and accurate instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 4392–4401. [Google Scholar]
- Lee, Y.; Park, J. Centermask: Real-time anchor-free instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13906–13915. [Google Scholar]
- Wang, X.; Kong, T.; Shen, C.; Jiang, Y.; Li, L. Solo: Segmenting objects by locations. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XVIII 16. Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 649–665. [Google Scholar]
- Wang, X.; Zhang, R.; Kong, T.; Li, L.; Shen, C. Solov2: Dynamic and fast instance segmentation. Adv. Neural Inf. Process. Syst. 2020, 33, 17721–17732. [Google Scholar]
- Sun, P.; Zhang, R.; Jiang, Y.; Kong, T.; Xu, C.; Zhan, W.; Tomizuka, M.; Li, L.; Yuan, Z.; Wang, C.; et al. Sparse r-cnn: End-to-end object detection with learnable proposals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14454–14463. [Google Scholar]
- Li, F.; Zhang, H.; Xu, H.; Liu, S.; Zhang, L.; Ni, L.M.; Shum, H.Y. Mask dino: Towards a unified transformer-based framework for object detection and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 3041–3050. [Google Scholar]
- Nguyen, D.K.; Ju, J.; Booij, O.; Oswald, M.R.; Snoek, C.M. Boxer: Box-attention for 2d and 3d transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 4773–4782. [Google Scholar]
- Lee, Y.; Hwang, J.; Lee, S.; Bae, Y.; Park, J. An energy and GPU-computation efficient backbone network for real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019. [Google Scholar]
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13. Springer International Publishing: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
- Cheng, T.; Wang, X.; Chen, S.; Zhang, W.; Zhang, Q.; Huang, C.; Zhang, Z.; Liu, W. Sparse instance activation for real-time instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 4433–4442. [Google Scholar]
- Zhang, T.; Wei, S.; Ji, S. E2ec: An end-to-end contour-based method for high-quality high-speed instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 4443–4452. [Google Scholar]
- Li, Y.; Chang, Y.; Yu, C.; Yan, L. Close the loop: A unified bottom-up and top-down paradigm for joint image deraining and segmentation. Proc. AAAI Conf. Artif. Intell. 2022, 36, 1438–1446. [Google Scholar] [CrossRef]
Methods | Backbone | Time (ms) | FPS | mAP | AP_S | AP_M | AP_L |
---|---|---|---|---|---|---|---|
PANet [4] | R-50 | 212.8 | 4.7 | 36.6 | 16.3 | 38.1 | 53.1 |
Mask R-CNN [3] | R-101 | 116.3 | 8.6 | 35.7 | 15.5 | 38.1 | 52.4 |
PointRend [6] | R-101 | 100.0 | 10.0 | 38.2 | 19.1 | 40.6 | 55.7 |
RetinaMask [11] | R-101 | 166.7 | 6.0 | 34.7 | 14.3 | 36.7 | 50.5 |
PolarMask [12] | R-101 | 81.3 | 12.3 | 32.1 | 14.7 | 33.8 | 45.2 |
YOLACT [9] | R-101 | 30.3 | 33.0 | 29.8 | 10.1 | 32.2 | 50.1 |
YOLACT++ [10] | R-101 | 36.9 | 27.1 | 34.6 | 11.9 | 36.8 | 55.1 |
SparseInst [42] | R-50 | 25.0 | 40.0 | 37.9 | 15.7 | 39.4 | 56.9 |
E2EC [43] | DLA-34 | 33.2 | 30.1 | 33.8 | - | - | - |
SharpContour [33] | R-50 | 82.6 | 12.1 | 41.9 | 24.3 | 49.4 | 59.1 |
QueryInst [22] | R-101 | 163.9 | 6.1 | 41.7 | 24.2 | 43.9 | 53.9 |
Transfiner [21] | R-101 | 153.8 | 6.5 | 40.7 | 23.1 | 42.8 | 53.8 |
Mask2Former [24] | R-101 | 128.2 | 7.8 | 44.2 | 23.8 | 47.7 | 66.7 |
BoxeR [39] | R-101 | 80.0 | 12.5 | 43.8 | 25.0 | 46.5 | 57.9 |
NA [26] | NAT | 40.2 | 24.9 | 44.5 | - | - | - |
DiNA [27] | DiNAT | 40.0 | 25.0 | 45.1 | - | - | - |
YOLOv5-seg [16] | CSPDarknet | 21.0 | 47.6 | 40.1 | 22.3 | 45.4 | 55.2 |
YOLOv8-seg [15] | CSPDarknet | 20.9 | 47.8 | 42.6 | 23.5 | 47.3 | 57.8 |
Ours (ESAMask) | CSPDarknet | 22.1 | 45.2 | 45.4 | 25.2 | 49.5 | 61.1 |
Methods | Backbone | Time (ms) | FPS | mAP | Params (M) | GFLOPs |
---|---|---|---|---|---|---|
Mask R-CNN [3] | R-101 | 116.3 | 8.6 | 35.7 | 135.0 | - |
PointRend [6] | R-101 | 100.0 | 10.0 | 38.2 | 147.2 | - |
Mask2Former [24] | R-101 | 128.2 | 7.8 | 44.2 | 63.0 | 293.0 |
BoxeR [39] | R-101 | 80.0 | 12.5 | 43.8 | 40.1 | 240.0 |
NA [26] | NAT | 40.2 | 24.9 | 44.5 | 85.0 | 737.0 |
DiNA [27] | DiNAT | 40.0 | 25.0 | 45.1 | 85.0 | 737.0 |
YOLOv5-seg [16] | CSPDarknet | 21.0 | 47.6 | 40.1 | 47.9 | 147.7 |
YOLOv8-seg [15] | CSPDarknet | 20.9 | 47.8 | 42.6 | 43.8 | 220.5 |
Ours (ESAMask) | CSPDarknet | 22.1 | 45.2 | 45.4 | 42.6 | 218.9 |
RSPA | GSInvSAM | MRFCPM | mAP | FPS | Time (ms) | Params (M) | GFLOPs |
---|---|---|---|---|---|---|---|
 | | | 42.6 | 47.8 | 20.9 | 43.84 | 220.5 |
√ | | | 43.8 | 46.9 | 21.3 | 46.61 | 220.9 |
 | √ | | 43.3 | 48.5 | 20.6 | 38.09 | 201.2 |
 | | √ | 43.5 | 45.1 | 22.2 | 45.54 | 237.8 |
√ | √ | | 44.6 | 48.3 | 20.7 | 40.86 | 201.6 |
√ | √ | √ | 45.3 | 45.2 | 22.1 | 42.56 | 218.9 |
M | k | mAP | FPS | Time (ms) |
---|---|---|---|---|
7 | 4 | 43.9 | 45.1 | 22.2 |
8 | 4 | 43.8 | 46.9 | 21.3 |
8 | 6 | 43.9 | 44.8 | 22.3 |
10 | 6 | 43.6 | 45.2 | 22.1 |
Bottleneck | mAP | FPS | Time (ms) | Params (M) | GFLOPs |
---|---|---|---|---|---|
Base | 42.6 | 47.8 | 20.9 | 43.84 | 220.5 |
GSConv + GSConv | 42.1 | 49.1 | 20.4 | 35.10 | 191.1 |
GSConv + InvertConv (r = 2) | 42.8 | 48.8 | 20.5 | 38.10 | 201.2 |
GSConv + InvertConv (r = 4) | 42.9 | 47.9 | 20.9 | 40.53 | 209.4 |
GSConv + InvertConv + SAM | 43.3 | 48.5 | 20.6 | 38.10 | 201.2 |
k | mAP | FPS | Time (ms) | Params (M) |
---|---|---|---|---|
Base | 42.6 | 47.8 | 20.9 | 43.844 |
5 | 43.2 | 45.3 | 22.1 | 45.792 |
7 | 43.5 | 45.1 | 22.2 | 45.795 |
9 | 43.6 | 44.5 | 22.5 | 45.799 |
11 | 43.5 | 43.9 | 22.8 | 45.804 |