Unsupervised Scene Image Text Segmentation Based on Improved CycleGAN
Abstract
1. Introduction
- This paper treats scene text segmentation as a style-transfer problem between images: a scene text image with a complex background is transformed into a text image with a clean background, simplifying subsequent processing. This framing offers a novel perspective on the task.
- This paper applies CycleGAN to scene-image text segmentation. Because CycleGAN alone renders the edges of text in the generated segmentation images imprecisely, an ASPP module is introduced into the CycleGAN generator to capture multi-scale text features and preserve edge detail in the generated images.
- This paper conducts experiments on a synthetic dataset and on the IIIT 5K-Word, MACT, and MTWI datasets. The results show that the method segments text effectively while retaining clear edge information.
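The multi-scale idea behind ASPP can be illustrated with a toy NumPy sketch. This is 1-D for brevity; the paper's generator applies 2-D atrous convolutions with learned weights, and the dilation rates below follow common DeepLab defaults (an assumption here, not values from this paper):

```python
import numpy as np

def dilated_conv1d(x, kernel, rate):
    """'Same'-padded 1-D atrous convolution: kernel taps are spaced `rate`
    samples apart, enlarging the receptive field without adding parameters."""
    k = len(kernel)
    span = rate * (k - 1)               # receptive field minus one
    pad = span // 2
    xp = np.pad(x, (pad, span - pad))
    return np.array([sum(kernel[j] * xp[i + j * rate] for j in range(k))
                     for i in range(len(x))])

def aspp1d(x, kernel, rates=(1, 6, 12, 18)):
    """ASPP: apply the same kernel at several dilation rates in parallel and
    stack the responses, capturing features at multiple scales at once."""
    return np.stack([dilated_conv1d(x, kernel, r) for r in rates])
```

Each branch sees the same input at a different effective scale; in the real module the stacked responses would be fused by a 1×1 convolution.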
2. Related Work
2.1. Scene Text Segmentation
2.2. Image-to-Image Style Transfer
3. Scene Text Segmentation Image Generation
3.1. Overview of the Architecture
3.2. Network Architecture
3.2.1. ASPP
3.2.2. Generator
3.2.3. Discriminator
3.3. Loss Functions
3.3.1. Adversarial Loss
3.3.2. Cycle-Consistent Loss
3.3.3. Identity Loss
3.3.4. Overall Loss
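The loss components listed above combine into one generator-side objective. A minimal NumPy sketch, assuming the least-squares adversarial form and CycleGAN's customary weights (cycle weight 10, identity weight 5 — both assumptions here, not values confirmed by this paper):

```python
import numpy as np

def l1(a, b):
    """Mean absolute error, the usual cycle/identity penalty."""
    return float(np.mean(np.abs(a - b)))

def cyclegan_generator_loss(G, F, D_X, D_Y, x, y, lam_cyc=10.0, lam_id=5.0):
    """Generator-side CycleGAN objective with G: X -> Y and F: Y -> X;
    D_X and D_Y score how real a sample looks in each domain.

    adversarial: push D_Y(G(x)) and D_X(F(y)) toward 1 (real)
    cycle:       F(G(x)) should reconstruct x, and G(F(y)) should reconstruct y
    identity:    G should leave y unchanged, and F should leave x unchanged
    """
    adv = float(np.mean((D_Y(G(x)) - 1.0) ** 2) + np.mean((D_X(F(y)) - 1.0) ** 2))
    cyc = l1(F(G(x)), x) + l1(G(F(y)), y)
    idt = l1(G(y), y) + l1(F(x), x)
    return adv + lam_cyc * cyc + lam_id * idt
```

With identity generators and discriminators that already output 1, every term vanishes, which is a quick sanity check on the formulation.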
4. Experiments
4.1. Experimental Setting
4.1.1. Data Preparation
4.1.2. Experimental Parameter Settings
4.1.3. Evaluation Criteria
4.2. Experimental Results
4.2.1. Qualitative Evaluation
4.2.2. Quantitative Evaluation
4.3. Experimental Effects on Other Data Sets
4.3.1. Data Preparation and Training Details
4.3.2. Qualitative Evaluation
4.3.3. Quantitative Evaluation
4.3.4. Synthetic to Real Adaptation
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Jiao, S.; Li, X.; Lu, X. An improved Ostu method for image segmentation. In Proceedings of the 2006 8th International Conference on Signal Processing, Guilin, China, 16–20 November 2006; Volume 2.
- Jaynes, E.T. On the rationale of maximum-entropy methods. Proc. IEEE 1982, 70, 939–952.
- Hageman, L.A.; Young, D.M. Applied Iterative Methods; Courier Corporation: North Chelmsford, MA, USA, 2012.
- Plenge, E.; Poot, D.H.; Bernsen, M.; Kotek, G.; Houston, G.; Wielopolski, P.; van der Weerd, L.; Niessen, W.J.; Meijering, E. Super-resolution methods in MRI: Can they improve the trade-off between resolution, signal-to-noise ratio, and acquisition time? Magn. Reson. Med. 2012, 68, 1983–1993.
- Khurshid, K.; Siddiqi, I.; Faure, C.; Vincent, N. Comparison of Niblack inspired binarization methods for ancient documents. In Proceedings of Document Recognition and Retrieval XVI; SPIE: Bellingham, WA, USA, 2009; Volume 7247, pp. 267–275.
- Kia, O.E.; Sauvola, J. Active multimedia documents for mobile services. In Proceedings of the 1998 IEEE Second Workshop on Multimedia Signal Processing, Redondo Beach, CA, USA, 7–9 December 1998; pp. 227–232.
- Ye, Q.; Gao, W.; Huang, Q. Automatic text segmentation from complex background. In Proceedings of the 2004 International Conference on Image Processing (ICIP '04), Singapore, 24–27 October 2004; Volume 5, pp. 2905–2908.
- Mishra, A.; Alahari, K.; Jawahar, C. An MRF model for binarization of natural scene text. In Proceedings of the 2011 International Conference on Document Analysis and Recognition, Beijing, China, 18–21 September 2011; pp. 11–16.
- Lee, S.; Kim, J.H. Integrating multiple character proposals for robust scene text extraction. Image Vis. Comput. 2013, 31, 823–840.
- Mishra, A.; Alahari, K.; Jawahar, C. Top-down and bottom-up cues for scene text recognition. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 2687–2694.
- Mancas-Thillou, C.; Gosselin, B. Color text extraction with selective metric-based clustering. Comput. Vis. Image Underst. 2007, 107, 97–107.
- Kita, K.; Wakahara, T. Binarization of color characters in scene images using k-means clustering and support vector machines. In Proceedings of the 2010 20th International Conference on Pattern Recognition, Istanbul, Turkey, 23–26 August 2010; pp. 3183–3186.
- Zhu, Y.; Sun, J.; Naoi, S. Recognizing natural scene characters by convolutional neural network and bimodal image enhancement. In Proceedings of the Camera-Based Document Analysis and Recognition: 4th International Workshop, CBDAR 2011, Beijing, China, 22 September 2011; pp. 69–82.
- Zhang, Y.; Qiu, Z.; Yao, T.; Liu, D.; Mei, T. Fully convolutional adaptation networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6810–6818.
- Hoffman, J.; Tzeng, E.; Park, T.; Zhu, J.Y.; Isola, P.; Saenko, K.; Efros, A.; Darrell, T. CyCADA: Cycle-consistent adversarial domain adaptation. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 1989–1998.
- Chen, Y.C.; Lin, Y.Y.; Yang, M.H.; Huang, J.B. CrDoCo: Pixel-level domain transfer with cross-domain consistency. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1791–1800.
- Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232.
- Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848.
- Bonechi, S.; Bianchini, M.; Scarselli, F.; Andreini, P. Weak supervision for generating pixel-level annotations in scene text segmentation. Pattern Recognit. Lett. 2020, 138, 1–7.
- Veit, A.; Matera, T.; Neumann, L.; Matas, J.; Belongie, S. COCO-Text: Dataset and benchmark for text detection and recognition in natural images. arXiv 2016, arXiv:1601.07140.
- Nayef, N.; Yin, F.; Bizid, I.; Choi, H.; Feng, Y.; Karatzas, D.; Luo, Z.; Pal, U.; Rigaud, C.; Chazalon, J.; et al. ICDAR2017 robust reading challenge on multi-lingual scene text detection and script identification—RRC-MLT. In Proceedings of the 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–15 November 2017; Volume 1, pp. 1454–1459.
- Wang, C.; Zhao, S.; Zhu, L.; Luo, K.; Guo, Y.; Wang, J.; Liu, S. Semi-supervised pixel-level scene text segmentation by mutually guided network. IEEE Trans. Image Process. 2021, 30, 8212–8221.
- Hertzmann, A.; Jacobs, C.E.; Oliver, N.; Curless, B.; Salesin, D.H. Image analogies. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, Los Angeles, CA, USA, 12–17 August 2001; pp. 327–340.
- Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1125–1134.
- Zhu, J.Y.; Zhang, R.; Pathak, D.; Darrell, T.; Efros, A.A.; Wang, O.; Shechtman, E. Toward multimodal image-to-image translation. Adv. Neural Inf. Process. Syst. 2017, 30.
- Wang, T.C.; Liu, M.Y.; Zhu, J.Y.; Tao, A.; Kautz, J.; Catanzaro, B. High-resolution image synthesis and semantic manipulation with conditional GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8798–8807.
- Shaohui, L. A Synthetic Data Generator for Text Recognition. Available online: https://gitcode.net/mirrors/Belval/TextRecognitionDataGenerator.git (accessed on 19 May 2024).
- Mishra, A.; Alahari, K.; Jawahar, C. Scene text recognition using higher order language priors. In Proceedings of the British Machine Vision Conference (BMVC), Surrey, UK, 3–7 September 2012.
- Wang, K.; Yi, Y.; Tang, Z.; Peng, J. Multi-scene ancient Chinese text recognition with deep coupled alignments. Appl. Soft Comput. 2021, 108, 107475.
| Method | IS | FID |
|---|---|---|
| UNet | 0.9985 | 248.1416 |
| UNet+ASPP | 2.3317 | 277.6030 |
| The original generator | 1.9735 | 176.2462 |
| Ours | 2.8725 | 155.7249 |
| Dataset | Training Set (Domain X) | Training Set (Domain Y) | Test Set (Domain X) | Test Set (Domain Y) |
|---|---|---|---|---|
| The IIIT 5K-Word | 4000 | 5100 | 1000 | 1140 |
| MACT | 13,000 | 13,463 | 1400 | 1588 |
| Method | IIIT 5K-Word IS | IIIT 5K-Word FID | MACT IS | MACT FID |
|---|---|---|---|---|
| UNet | 3.0557 | 191.0428 | 1.7400 | 208.5827 |
| UNet+ASPP | 3.9968 | 198.7815 | 1.8932 | 192.9745 |
| The original generator | 2.7973 | 247.5222 | 1.9537 | 99.8899 |
| Ours | 4.1251 | 129.5635 | 2.0235 | 89.1945 |
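The FID scores reported above compare Gaussian statistics (mean and covariance) of Inception features from real and generated images. A self-contained NumPy sketch of the Fréchet distance itself, with feature extraction omitted:

```python
import numpy as np

def fid(mu1, cov1, mu2, cov2):
    """Frechet Inception Distance between N(mu1, cov1) and N(mu2, cov2):

        FID = ||mu1 - mu2||^2 + Tr(cov1 + cov2 - 2 (cov1 cov2)^(1/2))

    Tr((cov1 cov2)^(1/2)) is obtained from the eigenvalues of cov1 @ cov2,
    which are real and non-negative for valid (PSD) covariance matrices."""
    diff = mu1 - mu2
    eigvals = np.linalg.eigvals(cov1 @ cov2)
    trace_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    return float(diff @ diff + np.trace(cov1) + np.trace(cov2) - 2.0 * trace_sqrt)
```

Identical statistics give a distance of zero, and lower values mean the generated distribution is closer to the real one, which is why smaller FID entries in the tables indicate better segmentation images.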
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Liu, X.; Yang, F.; Guo, W. Unsupervised Scene Image Text Segmentation Based on Improved CycleGAN. Appl. Sci. 2024, 14, 4420. https://doi.org/10.3390/app14114420