CommuSpotter: Scene Text Spotting with Multi-Task Communication
Abstract
1. Introduction
Category | Methods | Backbone | Proposals | Customized | Seg | Rec |
---|---|---|---|---|---|---|
CNN | Qin et al. [5] | CNN | RPN | RoI Masking | Instance | Att |
CNN | Mask TextSpotter v2 [19] | CNN | RPN | Box Detection | Instance | Att |
CNN | Mask TextSpotter v3 [6] | CNN | - | Seg Proposals | Instance | Att |
CNN | ABCNet [8] | CNN | - | Bezier Curve | Instance | CTC |
CNN | MANGO [7] | CNN | - | Mask Attention | Instance | Att |
Trans | SwinTextSpotter [14] | CNN + Trans | Query Box | Recognition Conversion | Instance | Att |
Trans | TextTranSpotter [15] | CNN + Trans | Query Box | Hungarian Match | Instance | Att |
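To make the columns concrete, the snippet below is a minimal PyTorch-style sketch of the generic two-stage spotting pipeline the table summarizes: a backbone extracts shared features, a proposal module (RPN, Bezier curves, or query boxes) localizes candidate text regions, fixed-size RoI features are cropped, and an instance-segmentation head plus a CTC or attention decoder produce per-instance masks and transcriptions. The module names, RoI size, and interfaces are hypothetical placeholders for illustration; this is not the architecture of CommuSpotter or of any method listed above.

```python
import torch.nn as nn
from torchvision.ops import roi_align


class GenericTwoStageSpotter(nn.Module):
    """Illustrative skeleton only; all sub-modules are hypothetical placeholders."""

    def __init__(self, backbone, proposal_head, seg_head, recognizer):
        super().__init__()
        self.backbone = backbone            # CNN or CNN + Transformer ("Backbone")
        self.proposal_head = proposal_head  # RPN / Bezier / query boxes ("Proposals")
        self.seg_head = seg_head            # per-instance text masks ("Seg")
        self.recognizer = recognizer        # CTC or attention decoder ("Rec")

    def forward(self, images):
        feats = self.backbone(images)        # shared feature map (B, C, H', W')
        boxes = self.proposal_head(feats)    # list of per-image (N_i, 4) boxes
        rois = roi_align(                    # crop a fixed-size feature per proposal
            feats, boxes, output_size=(8, 32),
            spatial_scale=feats.shape[-1] / images.shape[-1],
        )
        masks = self.seg_head(rois)          # instance segmentation of each region
        texts = self.recognizer(rois, masks) # transcription per text instance
        return boxes, masks, texts
```

Methods in the Trans category replace the explicit proposal step with learnable queries matched to ground truth (e.g., Hungarian matching in TextTranSpotter), but the division of labor among detection, segmentation, and recognition is the same.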
2. Related Work
2.1. Scene Text Spotter of Two-Stage Paradigm
2.2. Scene Text Spotter of One-Stage Paradigm
2.3. Scene Text Spotter with Back-Propagation
3. Text Spotter with Communication
3.1. Architecture
3.2. Text Detection Communication
3.3. Text Recognition Communication
3.4. Optimization
4. Experiments
4.1. Datasets
4.2. Implementation Details
4.3. Ablation Study
4.4. Incidental Texts
4.5. Arbitrary Texts
4.6. Inference Speed
4.7. Qualitative Results
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Al-Zaidy, R.; Fung, B.; Youssef, A.; Fortin, F. Mining criminal networks from unstructured text documents. Digit. Investig. 2012, 8, 147–160.
- Sivic, J.; Zisserman, A. Video Google: A text retrieval approach to object matching in videos. In Proceedings of the 9th IEEE International Conference on Computer Vision, Nice, France, 14–17 October 2003; pp. 1470–1477.
- Looije, R.; Neerincx, M.; Cnossen, F. Persuasive robotic assistant for health self-management of older adults: Design and evaluation of social behaviors. Int. J. Hum. Comput. Stud. 2010, 68, 386–397.
- Jung, S.; Lee, U.; Jung, J.; Shim, D. Real-time Traffic Sign Recognition system with deep convolutional neural network. In Proceedings of the 13th International Conference on Ubiquitous Robots and Ambient Intelligence (URAI), Xi’an, China, 19–22 August 2016; pp. 31–34.
- Qin, S.; Bissacco, A.; Raptis, M.; Fujii, Y.; Xiao, Y. Towards unconstrained end-to-end text spotting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4704–4714.
- Liao, M.; Pang, G.; Huang, J.; Hassner, T.; Bai, X. Mask TextSpotter v3: Segmentation Proposal Network for Robust Scene Text Spotting. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 706–722.
- Qiao, L.; Chen, Y.; Cheng, Z.; Xu, Y.; Niu, Y.; Pu, S.; Wu, F. Mango: A mask attention guided one-stage scene text spotter. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; pp. 2467–2476.
- Liu, Y.; Chen, H.; Shen, C.; He, T.; Jin, L.; Wang, L. ABCNet: Real-time scene text spotting with adaptive bezier-curve network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 14–19 June 2020; pp. 9809–9818.
- Liu, X.; Liang, D.; Yan, S.; Chen, D.; Qiao, Y.; Yan, J. Fots: Fast oriented text spotting with a unified network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5676–5685.
- Busta, M.; Neumann, L.; Matas, J. Deep textspotter: An end-to-end trainable scene text localization and recognition framework. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2204–2212.
- Cheng, M.; Sun, Y.; Wang, L.; Zhu, X.; Yao, K.; Chen, J.; Song, G.; Han, J.; Liu, J.; Ding, E.; et al. ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 5184–5193.
- Liu, J.; Liu, X.; Sheng, J.; Liang, D.; Li, X.; Liu, Q. Pyramid mask text detector. arXiv 2019, arXiv:1903.11800.
- Lyu, P.; Liao, M.; Yao, C.; Wu, W.; Bai, X. Mask textspotter: An end-to-end trainable neural network for spotting text with arbitrary shapes. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 67–83.
- Huang, M.; Liu, Y.; Peng, Z.; Liu, C.; Lin, D.; Zhu, S.; Yuan, N.; Ding, K.; Jin, L. SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 4593–4603.
- Kittenplon, Y.; Lavi, I.; Fogel, S.; Bar, Y.; Manmatha, R.; Perona, P. Towards Weakly-Supervised Text Spotting using a Multi-Task Transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 4604–4613.
- Long, S.; Qin, S.; Panteleev, D.; Bissacco, A.; Fujii, Y.; Raptis, M. Towards End-to-End Unified Scene Text Detection and Layout Analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 1049–1059.
- Chen, J.; Yu, H.; Ma, J.; Li, B.; Xue, X. Text gestalt: Stroke-aware scene text image super-resolution. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022; pp. 285–293.
- Peng, D.; Wang, X.; Liu, Y.; Zhang, J.; Huang, M.; Lai, S.; Li, J.; Zhu, S.; Lin, D.; Shen, C.; et al. SPTS: Single-Point Text Spotting. In Proceedings of the 30th ACM International Conference on Multimedia, Lisbon, Portugal, 10–14 October 2022; pp. 4272–4281.
- Liao, M.; Lyu, P.; He, M.; Yao, C.; Wu, W.; Bai, X. Mask TextSpotter: An End-to-End Trainable Neural Network for Spotting Text with Arbitrary Shapes. arXiv 2019, arXiv:1908.08207.
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
- Qiao, L.; Tang, S.; Cheng, Z.; Xu, Y.; Niu, Y.; Pu, S.; Wu, F. Text perceptron: Towards end-to-end arbitrary-shaped text spotting. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 11899–11907.
- Sun, Y.; Zhang, C.; Huang, Z.; Liu, J.; Han, J.; Ding, E. Textnet: Irregular text reading from images with an end-to-end trainable network. In Proceedings of the Asian Conference on Computer Vision, Perth, Australia, 2–6 December 2018; pp. 83–99.
- Baek, Y.; Shin, S.; Baek, J.; Park, S.; Lee, J.; Nam, D.; Lee, H. Character region attention for text spotting. In Proceedings of the European Conference on Computer Vision, Virtual, 23–28 August 2020; pp. 504–521.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 10012–10022.
- Xing, L.; Tian, Z.; Huang, W.; Scott, M. Convolutional character networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9126–9136.
- Liu, Y.; Shen, C.; Jin, L.; He, T.; Chen, P.; Liu, C.; Chen, H. Abcnet v2: Adaptive bezier-curve network for real-time end-to-end text spotting. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 8048–8064.
- Wang, H.; Lu, P.; Zhang, H.; Yang, M.; Bai, X.; Xu, Y.; He, M.; Wang, Y.; Liu, W. All you need is boundary: Toward arbitrary-shaped text spotting. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 12160–12167.
- Feng, W.; He, W.; Yin, F.; Zhang, X.; Liu, C. TextDragon: An end-to-end framework for arbitrary shaped text spotting. In Proceedings of the International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9076–9085.
- Wu, J.; Lyu, P.; Lu, G.; Zhang, C.; Yao, K.; Pei, W. Decoupling recognition from detection: Single shot self-reliant scene text spotter. In Proceedings of the 30th ACM International Conference on Multimedia, Lisbon, Portugal, 10–14 October 2022; pp. 1319–1328.
- Zhong, H.; Tang, J.; Wang, W.; Yang, Z.; Yao, C.; Lu, T. Arts: Eliminating inconsistency between text detection and recognition with auto-rectification text spotter. arXiv 2021, arXiv:2110.10405.
- Lin, T.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 1137–1149.
- Kim, W.; Son, B.; Kim, I. Vilt: Vision-and-language transformer without convolution or region supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 5583–5594.
- Zhao, L.; Wu, Z.; Wu, X.; Wilsbacher, G.; Wang, S. Background-Insensitive Scene Text Recognition with Text Semantic Segmentation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 163–182.
- Xu, X.; Zhang, Z.; Wang, Z.; Price, B.; Wang, Z.; Shi, H. Rethinking text segmentation: A novel dataset and a text-specific refinement approach. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 12045–12055.
- Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 11–18 December 2015; pp. 1440–1448.
- Yang, X.; He, D.; Zhou, Z.; Kifer, D.; Giles, C. Learning to Read Irregular Text with Attention Mechanisms. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, Melbourne, Australia, 19–25 August 2017; p. 3.
- He, T.; Tian, Z.; Huang, W.; Shen, C.; Qiao, Y.; Sun, C. An end-to-end textspotter with explicit alignment and attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5020–5029.
- Gupta, A.; Vedaldi, A.; Zisserman, A. Synthetic Data for Text Localisation in Natural Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2315–2324.
- Karatzas, D.; Shafait, F.; Uchida, S.; Iwamura, M.; Bigorda, L.; Mestre, S.; Mas, J.; Mota, D.; Almazan, J.; Heras, L.P.d. ICDAR 2013 robust reading competition. In Proceedings of the 12th International Conference on Document Analysis and Recognition, Washington, DC, USA, 25–28 August 2013; pp. 1484–1493.
- Karatzas, D.; Gomez-Bigorda, L.; Nicolaou, A.; Ghosh, S.; Bagdanov, A.; Iwamura, M.; Matas, J.; Neumann, L.; Chandrasekhar, V.; Lu, S. ICDAR 2015 competition on Robust Reading. In Proceedings of the 13th International Conference on Document Analysis and Recognition (ICDAR), Nancy, France, 23–26 August 2015; pp. 1156–1160.
- Ch’ng, C.; Chan, C. Total-text: A comprehensive dataset for scene text detection and recognition. In Proceedings of the 14th IAPR International Conference on Document Analysis and Recognition, Kyoto, Japan, 9–15 November 2017; pp. 935–942.
- Zhong, Z.; Jin, L.; Zhang, S.; Feng, Z. DeepText: A Unified Framework for Text Proposal Generation and Text Detection in Natural Images. arXiv 2016, arXiv:1605.07314.
- Veit, A.; Matera, T.; Neumann, L.; Matas, J.; Belongie, S. Coco-text: Dataset and benchmark for text detection and recognition in natural images. arXiv 2016, arXiv:1601.07140.
- Nayef, N.; Patel, Y.; Busta, M.; Chowdhury, P.; Karatzas, D.; Khlif, W.; Matas, J.; Pal, U.; Burie, J.; Liu, C.; et al. ICDAR2019 robust reading challenge on multi-lingual scene text detection and recognition—RRC-MLT-2019. In Proceedings of the 2019 International Conference on Document Analysis and Recognition (ICDAR), Sydney, Australia, 20–25 September 2019; pp. 1582–1587.
- Liu, Y.; Jin, L.; Zhang, S.; Luo, C.; Zhang, S. Curved scene text detection via transverse and longitudinal sequence connection. Pattern Recognit. 2019, 90, 337–345.
- Ronen, R.; Tsiper, S.; Anschel, O.; Lavi, I.; Markovitz, A.; Manmatha, R. Glass: Global to local attention for scene-text spotting. In Proceedings of the 17th European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 249–266.
- Long, S.; Ruan, J.; Zhang, W.; He, X.; Wu, W.; Yao, C. Textsnake: A flexible representation for detecting text of arbitrary shapes. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 20–36.
- Wang, W.; Xie, E.; Li, X.; Liu, X.; Liang, D.; Yang, Z.; Lu, T.; Shen, C. Pan++: Towards efficient and accurate end-to-end spotting of arbitrarily-shaped text. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 5349–5367.
Methods | Extensive Representation | CM-P | Conversation Mechanism | F-Score |
---|---|---|---|---|
E2E-baseline |  |  |  | 82.1 |
w/ER | ✓ |  |  | 82.5 |
w/CM-P |  | ✓ |  | 83.7 |
w/ER + CM-P | ✓ | ✓ |  | 84.6 |
Full Setting | ✓ | ✓ | ✓ | 85.8 |
Methods | Extensive Representation | CM-P | Conversation Mechanism | F-Score |
---|---|---|---|---|
E2E-baseline |  |  |  | 77.4 |
w/ER | ✓ |  |  | 78.4 |
w/CM-P |  | ✓ |  | 80.3 |
w/ER + CM-P | ✓ | ✓ |  | 81.7 |
Full Setting | ✓ | ✓ | ✓ | 83.4 |
Methods | Pre-Train Data | Pre-Train Iter. | Mix-Train and/or Fine-Tune Data | Mix-Train and/or Fine-Tune Iter. | GPU |
---|---|---|---|---|---|
MANGO [7] | SynthText | 600 K | CST, COCO, MLT, IC13, IC15, Total | 250 K | ∼1344 |
ABCNet [8] | CST, COCO, MLT | 260 K | Total, CTW | 150 K | 288 |
SwinTextSpotter [14] | CST, MLT, IC13, IC15, TT | 450 K | IC13, IC15, Total, MLT, CTW | 90 K | ∼336 |
TTS [15] | SynthText | - | SynthText, IC13, IC15, Total, SCUT, COCO | - | ∼1200 |
Mask TextSpotter v2 [19] | SynthText | 270 K | SynthText, IC13, IC15, Total, SCUT | 150 K | ∼432 |
Ours | SynthText | 270 K | IC13, IC15, Total, SCUT | 90 K | ∼312 |
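As a rough illustration of how the mix-training pool in the last row could be assembled after SynthText pre-training, the sketch below concatenates per-dataset loaders into a single shuffled loader. The SceneTextStub class, dataset sizes, and batch size are hypothetical placeholders, not details taken from the paper.

```python
from torch.utils.data import Dataset, ConcatDataset, DataLoader


class SceneTextStub(Dataset):
    """Placeholder standing in for a real annotation loader (hypothetical)."""

    def __init__(self, name, num_images):
        self.name, self.num_images = name, num_images

    def __len__(self):
        return self.num_images

    def __getitem__(self, idx):
        # A real loader would return (image, polygons, transcriptions).
        return {"dataset": self.name, "index": idx}


# Mix-training pool mirroring the "Ours" row; sizes are illustrative only.
mix_pool = ConcatDataset([
    SceneTextStub("IC13", 229),
    SceneTextStub("IC15", 1000),
    SceneTextStub("TotalText", 1255),
    SceneTextStub("SCUT", 1000),
])
loader = DataLoader(mix_pool, batch_size=8, shuffle=True)
```

A training loop would then draw batches from this loader for the quoted 90 K iterations.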
Methods | Det. P | Det. R | Det. F | E2E S | E2E W | E2E G |
---|---|---|---|---|---|---|
FOTS [9] | 91.0 | 85.2 | 88.0 | 81.1 | 75.9 | 60.8 |
Qin et al. [5] | 89.4 | 85.8 | 87.5 | 83.4 | 79.9 | 67.9 |
Mask TextSpotter v1 [13] | 91.6 | 81.0 | 86.0 | 79.3 | 73.0 | 62.4 |
Text Perceptron [21] | 91.6 | 81.8 | 86.4 | 80.5 | 76.6 | 65.1 |
TextDragon [28] | 92.5 | 83.8 | 87.9 | 82.5 | 78.3 | 65.2 |
CharNet [25] | 91.2 | 88.3 | 89.7 | 80.1 | 74.5 | 62.2 |
ABCNet v2 [26] | - | - | - | 82.7 | 78.5 | 73.0 |
CRAFTS [23] | 89.0 | 85.3 | 87.1 | 83.1 | - | 74.9 |
Boundary [27] | 89.8 | 87.5 | 88.6 | 79.7 | 75.2 | 64.1 |
Mask TextSpotter v2 [19] | 86.6 | 87.0 | - | 83.0 | 77.7 | 73.5 |
Mask TextSpotter v3 [6] | - | - | - | 83.3 | 78.1 | 74.2 |
MANGO [7] | - | - | - | 81.8 | 78.9 | 67.3 |
SwinTextSpotter [14] | - | - | - | 83.9 | 77.3 | 70.5 |
TTS [15] | - | - | - | 85.2 | 81.7 | 77.4 |
GLASS [48] | - | - | - | 84.7 | 80.1 | 76.3 |
SRSTS [29] | 96.1 | 82.0 | 88.4 | 85.6 | 81.7 | 74.5 |
Ours | 91.4 | 88.3 | 89.8 | 85.8 | 80.2 | 74.9 |
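As a quick consistency check on the detection columns, and assuming the standard ICDAR convention (F is the harmonic mean of precision P and recall R, and S/W/G denote the strong, weak, and generic lexicon settings for end-to-end recognition), the last row with P = 91.4 and R = 88.3 gives

$$
F = \frac{2PR}{P + R} = \frac{2 \times 91.4 \times 88.3}{91.4 + 88.3} \approx 89.8,
$$

which matches the reported value.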
Methods | Det. P | Det. R | Det. F | E2E None | E2E Full |
---|---|---|---|---|---|
TextSnake [49] | 82.7 | 74.5 | 78.4 | - | - |
Mask TextSpotter v1 [13] | 69.0 | 55.0 | 61.3 | 52.9 | 71.8 |
TextNet [22] | 68.2 | 59.5 | 63.5 | 54.0 | - |
Mask TextSpotter v2 [19] | 81.8 | 75.4 | 78.5 | 65.3 | 77.4 |
Mask TextSpotter v3 [6] | - | - | - | 71.2 | 78.4 |
Boundary [27] | 85.2 | 83.5 | 84.3 | 65.0 | 76.1 |
CRAFTS [23] | 89.5 | 85.4 | 87.4 | 78.7 | - |
ABCNet [8] | - | - | - | 64.2 | 75.7 |
ABCNet v2 [26] | - | - | 87.0 | 70.4 | 78.1 |
PAN++ [50] | - | - | - | 68.6 | 78.6 |
MANGO [7] | - | - | - | 71.7 | 82.6 |
SwinTextSpotter-R [14] | - | - | 87.2 | 72.4 | 83.0 |
SwinTextSpotter [14] | - | - | 88.0 | 74.3 | 84.1 |
GLASS [48] | - | - | - | 76.6 | 83.0 |
SRSTS [29] | 92.0 | 83.0 | 87.2 | 78.8 | 86.3 |
TTS [15] | - | - | - | 75.6 | 84.4 |
Ours | 90.4 | 91.5 | 90.1 | 73.0 | 83.4 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).