Attention Guided Feature Encoding for Scene Text Recognition
Abstract
1. Introduction
- A novel deep neural network for scene text recognition built on an RNN-based encoder–decoder. The encoder consists of (i) a convolutional neural network equipped with an attention mechanism to extract deep convolutional features, and (ii) bidirectional LSTM layers that convert the input features into a sequence representation. The convolutional structure is a redesigned VGG16 architecture. The decoder is a hierarchy of LSTM layers, and the entire network is trained end-to-end with CTC loss minimization as the learning goal (an illustrative code sketch follows the abstract). Our method demonstrates that spatial-attention-based feature extraction improves the efficacy of feature sequence encoding.
- The proposed design was thoroughly validated on the ICDAR2013, ICDAR2015, IIIT5K, and SVT text datasets, which cover a variety of geometric properties and shapes. The results from different experiments demonstrate that the proposed network is an efficient solution for recognizing natural scene text with fancy, oriented, and curved appearances. Further, the results also establish that the proposed method outperforms many recent methods of scene text recognition.
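To make the pipeline described above concrete, the following PyTorch-style sketch wires together a small VGG-like CNN, a spatial attention mask, a bidirectional LSTM encoder, a stacked LSTM decoder, and CTC loss. It is an illustration under stated assumptions, not the authors' implementation: the layer counts, channel sizes, attention formulation, and the names `SpatialAttention` and `AttnCRNN` are hypothetical, whereas the actual network uses the redesigned VGG16 backbone of Section 3.1 and the attention block of Section 3.2.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Illustrative spatial attention: predicts a per-location mask and
    reweights the convolutional feature map (assumed formulation, not the
    exact block of Section 3.2)."""
    def __init__(self, channels):
        super().__init__()
        self.mask = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, x):                      # x: (B, C, H, W)
        return x * self.mask(x)                # attention-weighted features

class AttnCRNN(nn.Module):
    """Sketch of the encoder-decoder: VGG-style CNN + attention -> BiLSTM
    encoder, stacked LSTM decoder, per-timestep classification for CTC."""
    def __init__(self, num_classes, hidden=256):
        super().__init__()
        self.cnn = nn.Sequential(              # heavily truncated VGG16-like stack
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1), (2, 1)),
        )
        self.attn = SpatialAttention(256)
        self.encoder = nn.LSTM(256, hidden, num_layers=2,
                               bidirectional=True, batch_first=True)
        self.decoder = nn.LSTM(2 * hidden, hidden, num_layers=2, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)   # includes the CTC blank

    def forward(self, images):                 # images: (B, 3, 32, W)
        f = self.attn(self.cnn(images))        # (B, 256, H', W')
        f = f.mean(dim=2).permute(0, 2, 1)     # collapse height -> sequence (B, T, 256)
        enc, _ = self.encoder(f)
        dec, _ = self.decoder(enc)
        return self.classifier(dec)            # (B, T, num_classes)

# CTC training step (blank index 0 assumed; 26 letters + 10 digits + blank).
model = AttnCRNN(num_classes=37)
logits = model(torch.randn(4, 3, 32, 100))                 # (B, T, C)
log_probs = logits.log_softmax(2).permute(1, 0, 2)         # CTC expects (T, B, C)
targets = torch.randint(1, 37, (4, 6))                     # dummy 6-character labels
input_lens = torch.full((4,), log_probs.size(0), dtype=torch.long)
target_lens = torch.full((4,), 6, dtype=torch.long)
loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lens, target_lens)
loss.backward()
```

Collapsing the feature-map height before the recurrent layers is what turns the 2D convolutional features into the left-to-right sequence that the BiLSTM encoder and CTC objective operate on.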
2. Literature Survey
3. Convolutional Recurrent Neural Network (CRNN) for Encoder–Decoder
3.1. The Encoder Design
3.2. Design of the Attention Block
3.3. The Decoder Design
4. Experiments and Analysis
1. ICDAR2013 [35]: A collection of natural images with horizontal and near-horizontal text appearances. The collection consists of 229 training and 233 testing images with character- and word-level bounding box annotations and corresponding transcriptions.
2. ICDAR2015 [36]: Released as the fourth challenge of the ICDAR 2015 robust reading competition (incidental scene text detection). The dataset consists of 1500 images, of which 1000 were used for training and the remaining images were used for testing. The images are real-life scenes captured incidentally with Google Glass, with annotations available as quadrangle text bounding boxes and corresponding Unicode transcriptions.
3. IIIT5K [6]: Contains 2000 training and 3000 testing images collected from the web. Each image is associated with a short 50-word lexicon and a long 1000-word lexicon; the lexicons contain the exact ground-truth word and some randomly selected words.
4. Street-view text (SVT) [28]: Consists of 100 training and 250 testing images gathered from Google Street View; in total, the training and testing sets contain 211 and 514 word images, respectively. The images have annotated axis-aligned bounding boxes around word occurrences with corresponding labels, and each image is additionally annotated with a 50-word lexicon (a sketch of how such lexicons are typically used for constrained recognition follows this list).
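The 50-word and 1000-word lexicons above are used for lexicon-constrained evaluation (the columns labelled '50' and '1K' in the results tables). A common way to apply such a lexicon, shown below as a hedged sketch rather than the exact procedure used in this work, is to map the unconstrained prediction to the lexicon word with the smallest edit distance; for large lexicons this search is often accelerated with metric structures such as the BK-tree [34].

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    dp = list(range(len(b) + 1))               # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # delete ca
                                     dp[j - 1] + 1,      # insert cb
                                     prev + (ca != cb))  # substitute
    return dp[-1]

def lexicon_decode(prediction: str, lexicon: list) -> str:
    """Snap an unconstrained prediction to the closest lexicon word."""
    return min(lexicon, key=lambda w: edit_distance(prediction.lower(), w.lower()))

# A noisy prediction snapped to a small (hypothetical) lexicon:
print(lexicon_decode("H0USE", ["house", "horse", "mouse", "hotel"]))   # -> house
```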
4.1. Network Training and Hyperparameters
- (i) In the first type of network learning, the proposed scene text recognition network was pre-trained on a small set of examples from the ICDAR2015, IIIT5K, and SVT datasets, and the model was then trained on the different evaluation datasets. The pre-training set consisted of 5% of the training images, selected at random, from these datasets. Pre-training was performed for 20 epochs with a slow learning rate of 0.0001 and a batch size of 16. This step initializes the network weight parameters with the domain data distribution and helps the network train on small datasets, where training from scratch with randomly initialized weights would not be effective. For the subsequent evaluations on the different datasets, we tuned the network learning rate between 0.0001 and 0.005. The final learning rate and the batch size for all experiments are given in Table 1. Further, the network learning rate was reduced by half every 5 epochs after crossing half of the total number of training epochs (see the learning-rate sketch after this list). Images with a height greater than their width were rotated clockwise by 90°.
- (ii) In the second type, the proposed network was trained on the Synth90k synthetic dataset [39] with an initial learning rate of 0.002 and a batch size of 16. The Synth90k dataset consists of 9 million synthetic word images generated from a dictionary of 90k English words by applying random transformations and backgrounds, with each image annotated with the corresponding word label. The network was trained for 40 epochs, with the learning rate halved every 5 epochs after the first 20 epochs. Again, these parameters were selected based on the discussions in [38].
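The learning-rate policy in (i) and (ii), halving the rate every 5 epochs once training crosses a fixed decay point, can be expressed as a small helper. This is an illustrative reading of the schedule only: the function name `decayed_lr` and the exact epoch at which the first halving occurs are assumptions, while the concrete initial rates, epoch counts, and decay points are those given in Table 1.

```python
def decayed_lr(initial_lr: float, epoch: int, decay_start: int, step: int = 5) -> float:
    """Return the learning rate for `epoch`, halving it every `step` epochs
    once `decay_start` (e.g., half the total number of epochs) is reached."""
    if epoch < decay_start:
        return initial_lr
    n_halvings = (epoch - decay_start) // step + 1
    return initial_lr * (0.5 ** n_halvings)

# ICDAR2013 settings from Table 1: initial rate 0.001, 50 epochs, decay from epoch 25.
for e in (0, 24, 25, 30, 45):
    print(e, decayed_lr(0.001, e, decay_start=25))
```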
4.2. Results and Discussion
4.3. Analysis of Attention Block Performance
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Yue, X.; Kuang, Z.; Lin, C.; Sun, H.; Zhang, W. Robustscanner: Dynamically enhancing positional clues for robust text recognition. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 135–151. [Google Scholar]
- Graves, A.; Fernández, S.; Gomez, F.; Schmidhuber, J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd ICML, Pittsburgh, PA, USA, 25–29 June 2006; pp. 369–376. [Google Scholar]
- Huang, Y.; Gu, C.; Wang, S.; Huang, Z.; Chen, K.; Region, H.A. Spatial Aggregation for Scene Text Recognition. In Proceedings of the 32nd British Machine Vision Conference 2021, BMVC 2021, Online, 22–25 November 2021. [Google Scholar]
- Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
- Neumann, L.; Matas, J. Real-time scene text localization and recognition. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012. [Google Scholar]
- Mishra, A.; Alahari, K.; Jawahar, C.V. Scene Text Recognition using Higher Order Language Priors. In Proceedings of the BMVC, Surrey, UK, 3–7 September 2012. [Google Scholar]
- Yao, C.; Bai, X.; Shi, B.; Liu, W. Strokelets: A learned multi-scale representation for scene text recognition. In Proceedings of the IEEE CVPR, Columbus, OH, USA, 23–28 June 2014; pp. 4042–4049. [Google Scholar]
- Yi, C.; Tian, Y. Scene text recognition in mobile applications by character descriptor and structure configuration. IEEE TIP 2014, 23, 2972–2982. [Google Scholar] [CrossRef]
- Lee, C.Y.; Bhardwaj, A.; Di, W.; Jagadeesh, V.; Piramuthu, R. Region-based discriminative feature pooling for scene text recognition. In Proceedings of the IEEE CVPR, Columbus, OH, USA, 24–27 June 2014; pp. 4050–4057. [Google Scholar]
- Liu, X.; Meng, G.; Pan, C. Scene text detection and recognition with advances in deep learning: A survey. Int. J. Doc. Anal. Recognit. (IJDAR) 2019, 22, 143–162. [Google Scholar] [CrossRef]
- Long, S.; He, X.; Yao, C. Scene text detection and recognition: The deep learning era. Int. J. Comput. Vis. 2021, 129, 161–184. [Google Scholar] [CrossRef]
- Bissacco, A.; Cummins, M.; Netzer, Y.; Neven, H. PhotoOCR: Reading Text in Uncontrolled Conditions. In Proceedings of the 2013 IEEE ICCV, Sydney, Australia, 1–8 December 2013; pp. 785–792. [Google Scholar] [CrossRef] [Green Version]
- Jaderberg, M.; Simonyan, K.; Vedaldi, A.; Zisserman, A. Reading Text in the Wild with Convolutional Neural Networks. IJCV 2015, 116, 1–20. [Google Scholar] [CrossRef] [Green Version]
- Cai, H.; Sun, J.; Xiong, Y. Revisiting classification perspective on scene text recognition. arXiv 2021, arXiv:2102.10884. [Google Scholar]
- Su, B.; Lu, S. Accurate scene text recognition based on recurrent neural network. In Proceedings of the Asian Conference on Computer Vision, Singapore, 1–5 November 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 35–48. [Google Scholar]
- Bai, F.; Cheng, Z.; Niu, Y.; Pu, S.; Zhou, S. Edit Probability for Scene Text Recognition. In Proceedings of the 2018 IEEE/CVF CVPR, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1508–1516. [Google Scholar] [CrossRef] [Green Version]
- Liu, Z.; Li, Y.; Ren, F.; Goh, W.L.; Yu, H. SqueezedText: A Real-Time Scene Text Recognition by Binary Convolutional Encoder-Decoder Network. In Proceedings of the AAAI, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
- Cheng, Z.; Bai, F.; Xu, Y.; Zheng, G.; Pu, S.; Zhou, S. Focusing Attention: Towards Accurate Text Recognition in Natural Images. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5086–5094. [Google Scholar]
- Shi, B.; Wang, X.; Lyu, P.; Yao, C.; Bai, X. Robust scene text recognition with automatic rectification. In Proceedings of the IEEE CVPR, Las Vegas, NV, USA, 27–30 June 2016; pp. 4168–4176. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
- Fang, S.; Xie, H.; Zha, Z.J.; Sun, N.; Tan, J.; Zhang, Y. Attention and Language Ensemble for Scene Text Recognition with Convolutional Sequence Modeling. In Proceedings of the 26th ACM International Conference on Multimedia, Association for Computing Machinery, New York, NY, USA, 22–26 October 2018; pp. 248–256. [Google Scholar] [CrossRef]
- Liao, M.; Zhang, J.; Wan, Z.; Xie, F.; Liang, J.; Lyu, P.; Yao, C.; Bai, X. Scene text recognition from two-dimensional perspective. AAAI Conf. Artif. Intell. 2019, 33, 8714–8721. [Google Scholar] [CrossRef] [Green Version]
- Xie, H.; Fang, S.; Zha, Z.J.; Yang, Y.; Li, Y.; Zhang, Y. Convolutional Attention Networks for Scene Text Recognition. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 2019, 15, 1–17. [Google Scholar] [CrossRef]
- Yin, F.; Wu, Y.C.; Zhang, X.Y.; Liu, C.L. Scene text recognition with sliding convolutional character models. arXiv 2017, arXiv:1709.01727. [Google Scholar]
- Wu, Y.C.; Yin, F.; Zhang, X.Y.; Liu, L.; Liu, C.L. SCAN: Sliding convolutional attention network for scene text recognition. arXiv 2018, arXiv:1806.00578. [Google Scholar]
- Yan, R.; Peng, L.; Xiao, S.; Yao, G. Primitive representation learning for scene text recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Online, 19–25 June 2021; pp. 284–293. [Google Scholar]
- Shi, B.; Bai, X.; Yao, C. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans. PAMI 2016, 39, 2298–2304. [Google Scholar] [CrossRef] [PubMed]
- Wang, K.; Babenko, B.; Belongie, S. End-to-end scene text recognition. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 1457–1464. [Google Scholar] [CrossRef]
- Busta, M.; Neumann, L.; Matas, J. Deep textspotter: An end-to-end trainable scene text localization and recognition framework. In Proceedings of the IEEE ICCV, Venice, Italy, 22–29 October 2017; pp. 2204–2212. [Google Scholar]
- Zhan, F.; Lu, S. Esir: End-to-end scene text recognition via iterative image rectification. In Proceedings of the IEEE/CVF CVPR, Long Beach, CA, USA, 15–20 June 2019; pp. 2059–2068. [Google Scholar]
- Bartz, C.; Yang, H.; Meinel, C. SEE: Towards semi-supervised end-to-end scene text recognition. In Proceedings of the AAAI, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
- Liu, W.; Chen, C.; Wong, K.Y.K.; Su, Z.; Han, J. Star-net: A spatial attention residue network for scene text recognition. In Proceedings of the BMVC, York, UK, 19–22 September 2016; Volume 2, p. 7. [Google Scholar]
- Burkhard, W.A.; Keller, R.M. Some approaches to best-match file searching. Commun. ACM 1973, 16, 230–236. [Google Scholar] [CrossRef]
- Karatzas, D.; Shafait, F.; Uchida, S.; Iwamura, M.; Bigorda, L.G.; Mestre, S.R.; Mas, J.; Mota, D.F.; Almazan, J.A.; De Las Heras, L.P. ICDAR 2013 robust reading competition. In Proceedings of the 2013 12th International Conference on Document Analysis and Recognition, Washington, DC, USA, 25–28 August 2013; pp. 1484–1493. [Google Scholar]
- Karatzas, D.; Gomez-Bigorda, L.; Nicolaou, A.; Ghosh, S.; Bagdanov, A.; Iwamura, M.; Matas, J.; Neumann, L.; Chandrasekhar, V.R.; Lu, S.; et al. ICDAR 2015 competition on robust reading. In Proceedings of the 2015 13th International Conference on Document Analysis and Recognition (ICDAR), Tunis, Tunisia, 23–26 August 2015; pp. 1156–1160. [Google Scholar]
- Tieleman, T.; Hinton, G. Lecture 6, COURSERA: Neural Networks for Machine Learning. 2012.
- Bengio, Y. Practical recommendations for gradient-based training of deep architectures. In Neural Networks: Tricks of the Trade; Springer: Berlin/Heidelberg, Germany, 2012; pp. 437–478. [Google Scholar]
- Jaderberg, M.; Simonyan, K.; Vedaldi, A.; Zisserman, A. Synthetic data and artificial neural networks for natural scene text recognition. arXiv 2014, arXiv:1406.2227. [Google Scholar]
- Liang, Q.; Xiang, S.; Wang, Y.; Sun, W.; Zhang, D. RNTR-Net: A Robust Natural Text Recognition Network. IEEE Access 2020, 8, 7719–7730. [Google Scholar] [CrossRef]
- Wan, Z.; He, M.; Chen, H.; Bai, X.; Yao, C. Textscanner: Reading characters in order for robust scene text recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12120–12127. [Google Scholar]
- Yang, M.; Guan, Y.; Liao, M.; He, X.; Bian, K.; Bai, S.; Yao, C.; Bai, X. Symmetry-Constrained Rectification Network for Scene Text Recognition. In Proceedings of the 2019 IEEE/CVF ICCV, Seoul, Korea, 27 October–2 November 2019; pp. 9146–9155. [Google Scholar] [CrossRef] [Green Version]
- Li, H.; Wang, P.; Shen, C.; Zhang, G. Show, attend and read: A simple and strong baseline for irregular text recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 8610–8617. [Google Scholar]
- Cheng, Z.; Xu, Y.; Bai, F.; Niu, Y.; Pu, S.; Zhou, S. AON: Towards Arbitrarily-Oriented Text Recognition. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; IEEE Computer Society: Los Alamitos, CA, USA, 2018; pp. 5571–5579. [Google Scholar] [CrossRef]
- Gupta, A.; Vedaldi, A.; Zisserman, A. Synthetic data for text localisation in natural images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2315–2324. [Google Scholar]
Dataset | Initial Learning Rate | # of Epochs | Batch Size | # of Epochs for Decay
---|---|---|---|---
ICDAR2013 | 0.001 | 50 | 16 | 25
ICDAR2015 | 0.0005 | 50 | 16 | 25
IIIT5K | 0.0005 | 50 | 16 | 25
SVT | 0.001 | 60 | 24 | 30
Method | Training Data | ICDAR2013 (None) | ICDAR2015 (None) | IIIT5K (50) | IIIT5K (1K) | IIIT5K (None) | SVT (50) | SVT (None)
---|---|---|---|---|---|---|---|---
SqueezeText * [17] | - | 92.9 | - | 97.0 | 94.1 | 87.0 | 95.2 | -
RARE [19] | Synth90k | 88.6 | - | 96.2 | 93.8 | 81.9 | 95.5 | 81.9
CRNN [27] | Synth90k | 86.7 | - | 97.6 | 94.4 | 78.2 | 96.4 | 80.8
Yin et al. [24] | Synth90k | 85.2 | - | 98.9 | 96.7 | 81.6 | 95.1 | 76.5
STAR-Net [33] | Synth90k | 89.1 | - | 97.7 | 94.5 | 83.3 | 95.5 | 83.6
RNTR-Net [40] | Synth90k | 90.1 | - | 98.7 | 96.4 | 84.7 | 95.7 | 80.0
Fang et al. [21] | Synth90k | 93.5 | 71.2 | 98.5 | 96.8 | 86.7 | 97.8 | 86.7
SCAN [25] | Synth90k | 90.4 | - | 99.1 | 97.2 | 84.9 | 95.7 | 85.0
CA-FCN [22] | SynthText | 91.5 | - | 99.8 | 98.8 | 91.9 | 98.8 | 86.4
ESIR [30] | Synth90k and SynthText | 91.3 | 76.9 | 99.6 | 98.8 | 93.3 | 97.4 | 90.2
AON [44] | Synth90k and SynthText | - | 68.2 | 99.6 | 98.1 | 87.0 | 96.0 | 82.8
Bai et al. [16] | Synth90k and SynthText | 94.4 | 73.9 | 99.5 | 97.9 | 88.3 | 96.6 | 87.5
FAN [18] | Synth90k and SynthText | 93.3 | 85.3 | 99.3 | 97.5 | 87.4 | 97.1 | 85.9
ScRN [42] | Synth90k and SynthText | 93.9 | 78.7 | 99.5 | 98.8 | 94.4 | 97.2 | 88.9
SAR [43] | Synth90k, SynthText, real data | 94.0 | 78.8 | 99.4 | 98.2 | 95.0 | 98.5 | 91.2
TextScanner [41] | Synth90k, SynthText, real data | 94.9 | 83.5 | 99.8 | 99.5 | 95.7 | 99.4 | 92.7
Ours | real data | 93.4 | 79.1 | 98.7 | 98.0 | 92.3 | 95.7 | 87.9
Ours | Synth90k | 93.7 | 79.3 | 98.9 | 97.8 | 92.4 | 96.1 | 88.1
Method | ICDAR2015 (None) | IIIT5K (50) | IIIT5K (1K) | IIIT5K (None) | SVT (50) | SVT (None) | Average Processing Time (s)
---|---|---|---|---|---|---|---
Without and | 76.3 | 97.5 | 96.2 | 90.8 | 92.6 | 85.5 | 0.131
With | 78.4 | 98.1 | 97.4 | 91.6 | 93.7 | 87.2 | 0.146
With | 78.7 | 98.4 | 97.2 | 91.9 | 94.0 | 87.1 | 0.147
[Qualitative comparison table: example images with recognition results produced without attention, with each attention configuration separately, and with both combined; the example images are not reproduced here.]