Gradual OCR: An Effective OCR Approach Based on Gradual Detection of Texts
Abstract
1. Introduction
- We introduce novel OCR datasets featuring Korean chart data. These datasets comprise (a) real-world noisy chart data sourced from the web, (b) bar charts constructed with various small fonts, and (c) bar charts featuring diverse small, rotated texts (Section 3).
- We unveil a novel and effective OCR approach built on a range of techniques. Specifically, we introduce the “gradual text detection” algorithm to reduce false negatives and a distinct “gradual filtering” process to minimize false positives (Section 4).
- We carry out detailed experiments to evaluate the performance of our approach. Through the systematic application of the techniques that we propose, we gauge the improvements in OCR performance. Our final model considerably outshines the best open-source multilingual OCR applications on our datasets, as evidenced by the Jaccard similarity-based error rate measure (Section 5).
- In particular, our experiments address the following research question: can our approach significantly enhance the final text recognition results by gradually altering the behaviors of the text detection and recognition models?
2. Related Work
3. Dataset Construction
- The artificial datasets (Datasets 2 and 3) comprise bar charts, with each chart showcasing 5 to 10 vertical bars.
- Each vertical bar is assigned a random value ranging from 0 to 100, rounded to two decimal places.
- The Okt package in KoNLPy [23] was employed to extract the top 500 most frequent nouns that formed all the labels in the charts.
- Every vertical bar has a corresponding label composed of 1 to 4 Korean nouns.
- The x and y labels are randomly assigned between 3 and 6 Korean nouns.
- Chart titles are formed from a random assortment of 5 to 10 Korean nouns.
- The y-axis labels are vertically placed, while the x-axis labels are horizontally positioned in Dataset 2 and randomly rotated in Dataset 3.
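Taken together, the rules above fully determine Datasets 2 and 3. The following is a minimal generation sketch, assuming matplotlib and a pre-extracted list of the top-500 nouns; the fonts, figure sizes, and rotation range are placeholders rather than the paper's exact settings.

```python
import random
import matplotlib.pyplot as plt

def generate_chart(nouns, path, rotate_x_labels=False):
    """Generate one synthetic bar chart following the rules listed above.

    `nouns` is the list of frequent Korean nouns extracted with KoNLPy's Okt;
    set `rotate_x_labels=True` to produce a Dataset 3-style chart.
    """
    n_bars = random.randint(5, 10)
    values = [round(random.uniform(0, 100), 2) for _ in range(n_bars)]
    bar_labels = [" ".join(random.sample(nouns, random.randint(1, 4)))
                  for _ in range(n_bars)]

    fig, ax = plt.subplots()
    ax.bar(range(n_bars), values)
    ax.set_xticks(range(n_bars))
    # In Dataset 2 the x labels stay horizontal; in Dataset 3 they are rotated
    # by a random angle (the exact angle range is an assumption of this sketch).
    rotation = random.uniform(10, 90) if rotate_x_labels else 0
    ax.set_xticklabels(bar_labels, rotation=rotation)
    ax.set_xlabel(" ".join(random.sample(nouns, random.randint(3, 6))))
    ax.set_ylabel(" ".join(random.sample(nouns, random.randint(3, 6))))  # vertical by default
    ax.set_title(" ".join(random.sample(nouns, random.randint(5, 10))))
    fig.savefig(path, bbox_inches="tight")
    plt.close(fig)
```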
4. Gradual OCR
4.1. Overview of Our Approach
- Image Preprocessor: This component, while optional, upscales the input image before the text detection or recognition steps. We anticipate an enhancement in recognizing diminutive text, as found in Datasets 2 and 3, given that the existing literature indicates a performance boost from super-resolution when dealing with small text [3].
- Text Detection Module: Before proceeding to text recognition, we employ a text detection module—a strategy commonly adopted in optical character recognition. Similar to EasyOCR [2], we utilize the CRAFT algorithm [12] for text detection. Differing from EasyOCR, we integrate a link refiner to optimize the results. Additionally, we introduce our “gradual text detection” method, which iteratively performs text detection and link refinement using varied thresholds.
- Text Region Processor: Following text detection, the conventional next step is text recognition. However, we incorporate supplementary techniques. Considering that text recognition modules are typically trained on unrotated images, our system transforms detected text regions into unrotated ones to improve the recognition accuracy. As an optional speed-enhancing step, we have designed a “batch decoding” method that merges all text regions into a single region.
- TrOCR (Text Recognition Module): For each identified region, the TrOCR [24] model, based on the Vision Transformer [14], is utilized for text recognition. This model is composed of both the TrOCR encoder and decoder. We discovered that a straightforward “autoregression” method yields optimal results in text recognition.
- Filtering Module: The output might contain false positives for various reasons. We introduce an innovative method to pinpoint these inaccuracies by assessing the decoding results’ quality. A basic filtering technique is employed to remove these errors. Concurrently, we introduce the “gradual low-quality filtering” strategy, which synergizes effectively with our “gradual text detection” method.
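To make the control flow concrete, the following is a minimal sketch of how the five components above compose; every argument name is a placeholder for the corresponding module, not the authors' actual code.

```python
def gradual_ocr(image, preprocess, detect_boxes, unrotate, recognize, is_high_quality):
    """End-to-end control flow of the pipeline described in the bullets above.

    Each argument is a callable standing in for one component: `preprocess`
    (optional upscaling), `detect_boxes` (gradual text detection plus link
    refinement), `unrotate` (text region processor), `recognize` (TrOCR
    autoregressive decoding), and `is_high_quality` (the filtering module).
    """
    image = preprocess(image)
    results = []
    for box in detect_boxes(image):
        region = unrotate(image, box)   # warp each region to an axis-aligned crop
        text = recognize(region)        # decode one region at a time
        if is_high_quality(text):       # drop likely false positives
            results.append((box, text))
    return results
```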
4.2. Image Preprocessor
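As summarized in Section 4.1, this optional component upscales the input image before detection and recognition. Below is a minimal sketch of the resize-with-antialiasing step evaluated in Section 5.2 (with multiplication factor M), assuming Pillow; the paper's exact resampling filter is not specified in this excerpt.

```python
from PIL import Image

def upscale_with_antialiasing(image: Image.Image, m: int = 2) -> Image.Image:
    """Upscale the image by factor m with a Lanczos (antialiasing) filter."""
    width, height = image.size
    return image.resize((width * m, height * m), Image.LANCZOS)
```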
4.3. Text Detection Module
The “gradual text detection” procedure operates as follows (a minimal code sketch is given after the list):
1. Initialize an empty list, L.
2. Execute text detection and link refinement with the initial thresholds and incorporate the resulting boxes into L.
3. For a parameter S (indicating the number of steps), repeat for S iterations:
   1. In the ith iteration, apply the thresholds designated for that step.
   2. Re-run text detection and the link refiner using the updated thresholds.
   3. For each newly identified text region r, if r has zero overlap with the boxes in L, incorporate r into L.
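A minimal Python sketch of this loop, assuming a `detect(image, thresholds)` callable that wraps CRAFT detection plus link refinement and returns axis-aligned boxes; the paper's actual threshold schedule and overlap test are not reproduced here.

```python
def boxes_overlap(a, b):
    """Overlap test for axis-aligned boxes given as (x1, y1, x2, y2)."""
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def gradual_detection(image, detect, threshold_schedule):
    """Accumulate text regions over progressively varied thresholds.

    threshold_schedule[0] holds the initial thresholds; each of the S
    remaining entries drives one extra detection pass, and a newly found
    region is kept only if it overlaps none of the boxes accepted so far,
    which limits duplicate detections while recovering missed text.
    """
    accepted = list(detect(image, threshold_schedule[0]))
    for thresholds in threshold_schedule[1:]:  # S additional steps
        for region in detect(image, thresholds):
            if not any(boxes_overlap(region, box) for box in accepted):
                accepted.append(region)
    return accepted
```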
4.4. Text Region Processor
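This component is summarized in Section 4.1: detected regions are warped into unrotated crops before recognition. Below is a minimal sketch of such a perspective transformation, assuming OpenCV and quadrilateral boxes (four corner points, clockwise from the top left) from the detector; the helper name and sizing logic are illustrative, not the paper's implementation.

```python
import cv2
import numpy as np

def unrotate_region(image, quad):
    """Warp a quadrilateral text region into an axis-aligned, unrotated crop."""
    quad = np.asarray(quad, dtype=np.float32)
    width = int(max(np.linalg.norm(quad[1] - quad[0]),
                    np.linalg.norm(quad[2] - quad[3])))
    height = int(max(np.linalg.norm(quad[3] - quad[0]),
                     np.linalg.norm(quad[2] - quad[1])))
    dst = np.array([[0, 0], [width, 0], [width, height], [0, height]],
                   dtype=np.float32)
    matrix = cv2.getPerspectiveTransform(quad, dst)
    return cv2.warpPerspective(image, matrix, (width, height))
```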
4.5. TrOCR
- TrOCR1: team-lucid/trocr-small-korean (https://huggingface.co/team-lucid/trocr-small-korean, accessed on 25 September 2023). This model’s TrOCR encoder is based on DeiT, while the TrOCR decoder derives from RoBERTa. The model was trained using six million images generated by SynthTIGER [25].
- TrOCR2: daekeun-ml/ko-trocr-base-nsmc-news-chatbot (https://huggingface.co/daekeun-ml/ko-trocr-base-nsmc-news-chatbot, accessed on 25 September 2023). In this setup, the “facebook/deit-base-distilled-patch16-384” model acts as the TrOCR encoder, while the “klue/roberta-base” model functions as the decoder. The training data encompass a range of sources, including a news summarization dataset.
- TrOCR3: ddobokki/ko-trocr (https://huggingface.co/ddobokki/ko-trocr, accessed on 25 September 2023). For this model, microsoft/trocr-base-stage1 is the chosen TrOCR encoder, and snunlp/KR-BERT-char16424 is the decoder. The OCR training data were sourced from AI-Hub (https://www.aihub.or.kr/, accessed on 25 September 2023).
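All three checkpoints can be driven through the Hugging Face `transformers` API. A minimal inference sketch follows, assuming each checkpoint ships compatible processor and tokenizer configurations; the `max_length` argument corresponds to the generation length limit L varied in Section 5.2.

```python
import torch
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("team-lucid/trocr-small-korean")
model = VisionEncoderDecoderModel.from_pretrained("team-lucid/trocr-small-korean")

def recognize(region, max_length=512):
    """Autoregressively decode the text in one cropped region (a PIL image)."""
    pixel_values = processor(images=region, return_tensors="pt").pixel_values
    with torch.no_grad():
        generated_ids = model.generate(pixel_values, max_length=max_length)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```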
4.6. Filtering Module
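Section 5.2 reports that the low-quality filter uses a threshold of 0.8. One plausible quality score, sketched below under the assumption that the score is the mean per-token probability of the generated sequence (the paper's exact score is not given in this excerpt), can be computed from the generation scores exposed by `transformers`.

```python
import torch

def recognize_with_filter(model, processor, region, max_length=512, tau=0.8):
    """Decode one region and reject the result if its quality score is low."""
    pixel_values = processor(images=region, return_tensors="pt").pixel_values
    with torch.no_grad():
        out = model.generate(
            pixel_values,
            max_length=max_length,
            output_scores=True,
            return_dict_in_generate=True,
        )
    # Log-probability of each generated token under the model.
    scores = model.compute_transition_scores(
        out.sequences, out.scores, normalize_logits=True
    )
    mean_prob = scores[0].exp().mean().item()
    text = processor.batch_decode(out.sequences, skip_special_tokens=True)[0]
    return text if mean_prob >= tau else None  # None marks a filtered result
```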
5. Experiments
5.1. Experimental Setup
5.2. Experimental Results
- Tesseract v5.3.1 Windows: While Tesseract is a recognized OCR solution for the Korean language, it does not deliver satisfactory results on our datasets. Specifically, for our first dataset, which features real-world noise, the Macro JER exceeds 0.5. Its performance is somewhat better on our second dataset, which uses a variety of small fonts. These results indicate Tesseract’s challenges with noisy datasets. Since our dataset is exclusively in Korean, the Korean-only recognition setting (“ko”) outperforms the combined Korean and English setting (“ko + en”).
- EasyOCR 1.7.0: EasyOCR is among the most popular open-source OCR packages available. Consistent with its reputation, it demonstrated commendable performance across all our datasets. In our tests, it consistently surpassed Tesseract, echoing previous studies that found EasyOCR superior to Tesseract, especially in recognizing Korean text [4,6]. This was observed when the settings were tailored for either exclusively Korean or both Korean and English. Nevertheless, much like Tesseract, EasyOCR encountered challenges with real-world noisy data (Dataset 1) and with datasets containing rotated text (Dataset 3).
- TrOCR Models: We employed the three TrOCR models introduced in Section 4.5 (TrOCR1, TrOCR2, and TrOCR3). Although TrOCR1 outperforms the other two models, its Macro/Micro JER still exceeds 0.9, indicating that most recognition results are incorrect. These results suggest that using TrOCR by itself, without additional techniques, is impractical on both real-world and artificial datasets.
- Text Detection() + TrOCR Models: Recall that TrOCR in isolation exhibits suboptimal performance on our datasets. However, when supplemented with the text detection techniques outlined in Section 4.3, even without further strategies such as link refinement or perspective transformation, its performance substantially improves. Specifically, while TrOCR1 alone demonstrates a Macro JER of 0.9487 in Dataset 1, its performance markedly improves to a Macro JER of 0.0552 when using the text detection technique, albeit with a tenfold increase in elapsed time. Through our analyses, TrOCR1 is found to be most effective on Dataset 1, while TrOCR3 excels on Datasets 2 and 3. Given that TrOCR3 demands significantly more elapsed time than TrOCR1 without outperforming it in subsequent experiments, we exclusively report results from the TrOCR1 model in ensuing experiments.
- Text Detection() + Link Refiner + Warp + Rotation + TrOCR1(L): In order to enhance the recognition quality, we employ the link refinement technique, as detailed in Section 4.3, and the perspective transformation and rotation techniques, introduced in Section 4.4. These methods prove particularly effective in Datasets 2 and 3, where images contain various types of text fragments, making accurate text detection crucial. Practically, the Macro JER in Dataset 3 witnessed a substantial decrease from 0.2237 to 0.0820. We also explored the impact of the text generation length limits, adjusting them from 20 (the original setting) to 512. A longer length limit resulted in a higher JER in Dataset 1, particularly increasing the Micro JER, suggesting that the recognized text for some entries did not terminate appropriately, thereby generating excessively long and inaccurate sequences. This issue can be mitigated using our gradual text detection and gradual low-quality filtering techniques, the results of which are demonstrated in subsequent experiments. Therefore, in the following experiments, we utilize a maximum length of 512.
- Antialiasing(M) + Text Detection() + Link Refiner + Warp + Rotation + TrOCR1(512): We additionally employ the image resizing and antialiasing techniques described in Section 4.2. Here, “M” denotes the multiplication factor for image resizing (e.g., M = 2 implies 2× width and 2× height, while M = 4 implies 4× width and 4× height). We first resize by factor M and then refine the resized image with the antialiasing technique. Experimental results indicate that when M = 2, the Macro JER in Dataset 1 improves from 0.0536 to 0.0489, and the Micro JER also improves, from 0.0794 to 0.0530. However, in the artificial datasets with minimal noisy text (Datasets 2 and 3), the JER does not improve. Despite the slight improvements on Dataset 1, we opt not to use this technique in subsequent experiments because the elapsed time increases significantly when the image size is already substantial (as in Datasets 2 and 3).
- Real ESRGAN(M) + Text Detection() + Link Refiner + Warp + Rotation + TrOCR1(512): We additionally apply image resizing and the Real ESRGAN technique introduced in Section 4.2. As previously described, “M” denotes the multiplication factor for image resizing. Surprisingly, Real ESRGAN does not outperform the antialiasing technique. This underperformance may stem from the fact that the TrOCR models were not trained on data refined by Real ESRGAN.
- Text Detection(T) + Link Refiner + Warp + Rotation + TrOCR1(512): In these experiments, we varied the text detection threshold T from its original setting to three progressively relaxed settings (rows (19)–(21) in Table 2). The results are somewhat interesting: in Dataset 1, relaxing the threshold significantly deteriorates both the Macro and Micro JER compared to the original setting, whereas in Dataset 3 it improves them. These outcomes suggest that the optimal parameters can vary depending on the dataset or images, inspiring our concepts of gradual text detection and gradual low-quality filtering.
- Gradual Detection(S) + Link Refiner + Warp + Rotation + TrOCR1(512): We use the gradual detection technique from Section 4.3 instead of a fixed threshold. The best JER is achieved when S = 1, which is better than when S = 2 or 3. Since increasing S does not consistently improve the recognition, selecting the correct S value remains a challenge.
- Text Detection() + Link Refiner + Warp + Rotation + TrOCR1(512) + Autoregression + Batch Decoding: In these experiments, we additionally applied the batch decoding and autoregression techniques introduced in Sections 4.4 and 4.5, respectively. The notation “AR Only” in Table 2 indicates that only the autoregression technique was used, without batch decoding, while “AR + BD” signifies the use of both. The experimental results reveal that utilizing autoregression substantially improves the recognition performance: without the autoregression technique, the Micro JER is 0.0794 in Dataset 1; with the technique, it markedly decreases to 0.0475. Unfortunately, the batch decoding technique did not demonstrate satisfactory performance. Although it significantly reduced the elapsed time, its recognition performance notably degraded, especially in Dataset 1, where texts typically have varying background colors and font sizes, factors that likely contributed to the diminished recognition performance.
- Gradual Detection(S) + Link Refiner + Warp + Rotation + TrOCR1(512) + Autoregression + Low-Quality Filter: Here, we additionally apply both the gradual detection and low-quality filtering techniques. As mentioned in Section 4.6, the low-quality filtering technique employs a threshold of 0.8. Experimental results reveal that optimal recognition performance is achieved when S = 2 or S = 3. These findings contrast with those obtained using only the gradual detection technique, where S = 1 delivers the best performance. This underscores the significance of filtering incorrectly recognized results when employing our gradual detection technique.
- Gradual Detection(S) + Link Refiner + Warp + Rotation + TrOCR1(512) + Autoregression + Gradual Low-Quality Filter(S): This marks our final experiment, utilizing both the gradual detection and gradual low-quality filtering techniques. As the parameter S is varied from 0 to 3, we observe a consistent improvement in recognition performance, alleviating concerns over the specific choice of S. The final results are somewhat surprising: when S = 3, all datasets yield Micro JER values below 0.05; in Dataset 2 in particular, the Micro JER is as low as 0.0226. Considering that the JER between “I am a boy.” and “I am a boy” is 0.0666, it is evident that our recognizer accurately identifies the majority of texts within our complex datasets.
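For reference, the Jaccard similarity-based error rate (JER) reported throughout Table 2 can be written as below; how the ground-truth and recognized texts are tokenized into the sets A and B (characters, n-grams, or words) is not spelled out in this excerpt, and Macro/Micro presumably denote averaging per image versus pooling over all texts.

$$\mathrm{JER}(A, B) = 1 - J(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}$$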
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Smith, R. An overview of the Tesseract OCR engine. In Proceedings of the Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), Curitiba, Brazil, 23–26 September 2007; Volume 2, pp. 629–633.
- EasyOCR. JaidedAI. 2023. Available online: https://github.com/JaidedAI/EasyOCR (accessed on 25 September 2023).
- Cho, W.; Kwon, J.; Kwon, S.; Yoo, J. A Comparative Study on OCR using Super-Resolution for Small Fonts. Int. J. Adv. Smart Converg. 2019, 8, 95–101.
- Kim, C.Y.; An, S.Y.; Jeon, E.J.; Cha, B.R. A Study on the Survey and Demonstration Test of OCR based on Open Source for AI OCR. In Proceedings of the Symposium of the Korean Institute of Communications and Information Sciences, Seoul, Republic of Korea, 23–25 August 2023; pp. 851–852.
- Hyeong, K.K.; Duke, C.W. Development of OCR-based algorithm for information extraction from Material Safety Data Sheets (MSDS). In Proceedings of the Symposium of the Korean Institute of Communications and Information Sciences, Seoul, Republic of Korea, 23–25 August 2023; pp. 986–987.
- Moon, J.; Kim, D.; Kim, E.; Oh, Y.; Jung, S.K.; Jang, J.; Kim, D. Development of OCR-based Spell check EduTech tool for Handwriting Education for children. In Proceedings of the Korea Computer Congress, Jeju, Republic of Korea, 25–28 June 2023; pp. 1640–1642.
- Luo, J.; Li, Z.; Wang, J.; Lin, C.Y. ChartOCR: Data extraction from chart images via a deep hybrid framework. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 5–9 January 2021; pp. 1917–1925.
- Li, S.; Lu, C.; Li, L.; Zhou, H. Chart-RCNN: Efficient Line Chart Data Extraction from Camera Images. arXiv 2022, arXiv:2211.14362.
- Balaji, A.; Ramanathan, T.; Sonathi, V. Chart-Text: A fully automated chart image descriptor. arXiv 2018, arXiv:1812.10636.
- Kantharaj, S.; Do, X.L.; Leong, R.T.K.; Tan, J.Q.; Hoque, E.; Joty, S. OpenCQA: Open-ended question answering with charts. arXiv 2022, arXiv:2210.06628.
- Kantharaj, S.; Leong, R.T.K.; Lin, X.; Masry, A.; Thakkar, M.; Hoque, E.; Joty, S. Chart-to-Text: A large-scale benchmark for chart summarization. arXiv 2022, arXiv:2203.06486.
- Baek, Y.; Lee, B.; Han, D.; Yun, S.; Lee, H. Character region awareness for text detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9365–9374.
- Shi, B.; Bai, X.; Yao, C. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 2298–2304.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2021, arXiv:2010.11929.
- Dong, C.; Loy, C.C.; He, K.; Tang, X. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 295–307.
- Kim, S.; Park, J.; Kim, S.; Na, Y.; Jang, Y. Multi-book Label Detection Model using Object Detection and OCR. J. Korean Inst. Inf. Technol. 2023, 21, 1–8.
- Wang, X.; Xie, L.; Dong, C.; Shan, Y. Real-ESRGAN: Training real-world blind super-resolution with pure synthetic data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 1905–1914.
- Wang, X.; Yu, K.; Wu, S.; Gu, J.; Liu, Y.; Dong, C.; Change Loy, C. ESRGAN: Enhanced super-resolution generative adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Munich, Germany, 8–14 September 2018.
- Huang, Y.; Li, S.; Wang, L.; Tan, T. Unfolding the alternating optimization for blind super resolution. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2020; Volume 33, pp. 5632–5643.
- Wei, P.; Xie, Z.; Lu, H.; Zhan, Z.; Ye, Q.; Zuo, W.; Lin, L. Component divide-and-conquer for real-world image super-resolution. In Computer Vision—ECCV 2020, Proceedings of the 16th European Conference, Glasgow, UK, 23–28 August 2020; Part VIII; Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 101–117.
- Ji, X.; Cao, Y.; Tai, Y.; Wang, C.; Li, J.; Huang, F. Real-world super-resolution via kernel estimation and noise injection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 466–467.
- Zhang, K.; Liang, J.; Van Gool, L.; Timofte, R. Designing a practical degradation model for deep blind image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 4791–4800.
- Park, E.L.; Cho, S. KoNLPy: Korean Natural Language Processing in Python. In Proceedings of the 26th Annual Conference on Human & Cognitive Language Technology, Chuncheon, Republic of Korea, 10–11 October 2014; Volume 6, pp. 133–136.
- Li, M.; Lv, T.; Chen, J.; Cui, L.; Lu, Y.; Florencio, D.; Wei, F. TrOCR: Transformer-based optical character recognition with pre-trained models. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 13094–13102.
- Yim, M.; Kim, Y.; Cho, H.-C.; Park, S. SynthTIGER: Synthetic Text Image GEneratoR Towards Better Text Recognition Models. In Proceedings of the International Conference on Document Analysis and Recognition, San Jose, CA, USA, 21–26 August 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 109–124.
- Rajaraman, A.; Ullman, J.D. Mining of Massive Datasets; Cambridge University Press: Cambridge, UK, 2011.
Dataset | # of Images | Description |
---|---|---|
Dataset 1 | 91 | Collected Charts (Bar/Pie/Line/Scatterplot) from the Web |
Dataset 2 | 500 | Artificially Generated Charts (Bar Charts) |
Dataset 3 | 500 | Artificially Generated Charts (Bar Charts with Rotated Labels) |
 | Dataset 1 | | | Dataset 2 | | | Dataset 3 | |
---|---|---|---|---|---|---|---|---|---|
(#) Model Name | Macro JER | Micro JER | Time (s) | Macro JER | Micro JER | Time (s) | Macro JER | Micro JER | Time (s) |
Existing Approaches | |||||||||
Tesseract v5.3.1 Windows | |||||||||
(1) en | 0.7662 | 0.7448 | 15.3270 | 0.5419 | 0.5373 | 135.0326 | 0.5349 | 0.5249 | 134.9901 |
(2) ko | 0.5227 | 0.4621 | 13.7473 | 0.2480 | 0.2304 | 126.7941 | 0.3577 | 0.3398 | 127.7314 |
(3) en + ko | 0.5609 | 0.5074 | 23.3183 | 0.2748 | 0.2609 | 169.1845 | 0.3910 | 0.3755 | 170.3775 |
EasyOCR 1.7.0 | |||||||||
(4) en | 0.5096 | 0.5189 | 15.2087 | 0.4924 | 0.4931 | 176.6375 | 0.4631 | 0.4647 | 178.3048 |
(5) ko | 0.1795 | 0.1839 | 15.2353 | 0.1039 | 0.1030 | 173.0024 | 0.1843 | 0.1823 | 184.9896 |
(6) en + ko | 0.1745 | 0.1794 | 15.7230 | 0.1050 | 0.1041 | 176.6355 | 0.1873 | 0.1854 | 184.6961 |
TrOCR Models | |||||||||
(7) TrOCR1 | 0.9487 | 0.9505 | 8.6719 | 0.9559 | 0.9540 | 47.7345 | 0.9674 | 0.9659 | 44.4016 |
(8) TrOCR2 | 0.9979 | 0.9975 | 17.9105 | 0.9959 | 0.9961 | 235.6769 | 0.9965 | 0.9968 | 242.9406 |
(9) TrOCR3 | 0.9982 | 0.9976 | 10.9051 | 0.9956 | 0.9950 | 44.2932 | 0.9943 | 0.9937 | 41.3378 |
Our Approaches | |||||||||
Text Detection() + TrOCR Models | |||||||||
(10) TrOCR1 | 0.0552 | 0.0606 | 80.3815 | 0.1162 | 0.1150 | 401.6401 | 0.2237 | 0.2242 | 477.5391 |
(11) TrOCR2 | 0.6668 | 0.6689 | 306.4827 | 0.5302 | 0.5282 | 1430.5761 | 0.6053 | 0.6058 | 1400.2860 |
(12) TrOCR3 | 0.0707 | 0.0757 | 222.1472 | 0.1125 | 0.1107 | 1024.3570 | 0.2150 | 0.2144 | 995.9418 |
Text Detection() + Link Refiner + Warp + Rotation + TrOCR1(L) | |||||||||
(13) L = 20 | 0.0496 | 0.0580 | 73.0700 | 0.0459 | 0.0449 | 379.3420 | 0.0820 | 0.0811 | 366.7101 |
(14) L = 512 | 0.0536 | 0.0794 | 74.6489 | 0.0459 | 0.0449 | 366.9069 | 0.0820 | 0.0811 | 362.8143 |
Antialiasing(M) + Text Detection() + Link Refiner + Warp + Rotation + TrOCR1(512) | |||||||||
(15) M = 2 | 0.0489 | 0.0530 | 81.8446 | 0.0459 | 0.0449 | 543.4094 | 0.0839 | 0.0831 | 516.9085 |
(16) M = 4 | 0.0523 | 0.0561 | 112.1780 | 0.0466 | 0.0456 | 1204.3357 | 0.0847 | 0.0839 | 1205.7066 |
Real ESRGAN(M) + Text Detection() + Link Refiner + Warp + Rotation + TrOCR1(512) | |||||||||
(17) M = 2 | 0.0597 | 0.0688 | 161.6895 | 0.0531 | 0.0519 | 2306.2874 | 0.0928 | 0.0915 | 4119.9295 |
(18) M = 4 | 0.0672 | 0.0734 | 348.6670 | 0.0522 | 0.0508 | 6734.3283 | 0.0895 | 0.0884 | 12333.8697 |
Text Detection(T) + Link Refiner + Warp + Rotation + TrOCR1(512) | |||||||||
(19) T = | 0.1027 | 0.0976 | 73.3372 | 0.0459 | 0.0449 | 380.7345 | 0.0604 | 0.0599 | 392.1491 |
(20) T = | 0.2469 | 0.2420 | 62.7877 | 0.0612 | 0.0606 | 378.5012 | 0.0721 | 0.0718 | 393.4250 |
(21) T = | 0.2910 | 0.2885 | 61.2715 | 0.0928 | 0.0933 | 380.1300 | 0.1102 | 0.1118 | 399.6759 |
Gradual Detection(S) + Link Refiner + Warp + Rotation + TrOCR1(512) | |||||||||
(22) S = 1 | 0.0494 | 0.0743 | 85.8151 | 0.0372 | 0.0366 | 454.0669 | 0.0705 | 0.0701 | 474.0587 |
(23) S = 2 | 0.0498 | 0.0749 | 99.8336 | 0.0401 | 0.0399 | 519.2830 | 0.0734 | 0.0736 | 525.2052 |
(24) S = 3 | 0.0600 | 0.0882 | 115.3246 | 0.0453 | 0.0467 | 599.2412 | 0.0836 | 0.0870 | 620.1675 |
Text Detection() + Link Refiner + Warp + Rotation + TrOCR1(512) + Autoregression + Batch Decoding | |||||||||
(25) AR Only | 0.0396 | 0.0475 | 122.7754 | 0.0360 | 0.0352 | 562.3832 | 0.0728 | 0.0721 | 595.6792 |
(26) AR+BD | 0.2294 | 0.2910 | 77.6673 | 0.0745 | 0.0753 | 394.0620 | 0.1148 | 0.1153 | 394.5043 |
Gradual Detection(S) + Link Refiner + Warp + Rotation + TrOCR1(512) + Autoregression + Low-Quality Filter | |||||||||
(27) S = 1 | 0.0432 | 0.0488 | 141.3093 | 0.0288 | 0.0282 | 697.1296 | 0.0575 | 0.0565 | 687.0164 |
(28) S = 2 | 0.0424 | 0.0481 | 156.3434 | 0.0283 | 0.0277 | 780.0101 | 0.0553 | 0.0543 | 795.0077 |
(29) S = 3 | 0.0424 | 0.0481 | 170.1998 | 0.0279 | 0.0275 | 856.4727 | 0.0544 | 0.0535 | 896.8290 |
Gradual Detection(S) + Link Refiner + Warp + Rotation + TrOCR1(512) + Autoregression + Gradual Low-Quality Filter(S) | |||||||||
(30) S = 0 | 0.0396 | 0.0474 | 119.2158 | 0.0354 | 0.0346 | 560.9463 | 0.0690 | 0.0680 | 520.0968 |
(31) S = 1 | 0.0352 | 0.0420 | 132.7426 | 0.0244 | 0.0239 | 659.6777 | 0.0538 | 0.0531 | 640.4872 |
(32) S = 2 | 0.0346 | 0.0414 | 144.3790 | 0.0235 | 0.0230 | 777.9616 | 0.0512 | 0.0505 | 744.6360 |
(33) S = 3 | 0.0344 | 0.0412 | 158.9390 | 0.0231 | 0.0226 | 828.3957 | 0.0504 | 0.0497 | 835.1660 |
Model | # of FN + FP | FN (False Negatives) | FP (False Positives) |
---|---|---|---|
Sample Image 97 of Dataset 2 | |||
Existing Approaches | |||
Model #2 | 78 | .......00000011111222223344444 556667789999결고과권데뒤란 람만문반부사사상새세약월은 잡제죽진짜큼태테표형화회 | 서저칸포 |
Model #5 | 14 | 0계권란부새약테형 | #관네데작 |
Model #7 | 101 | ........0000000001112222223334 4444555666778999999 가결계고과권나날데도뒤란람 만문반발버부사사상새세 아약용월은이작잡장제제죽지 진짜큼태테표하형화회회 | :::전 |
Our Approaches | |||
Model #10 | 14 | .006 고권반부새월은테형 | 2 |
Model #14 | 8 | 006계반월테 | 례 |
Model #19 | 6 | 0계고 | ’-례 |
Model #20 | 4 | 0 | ’-○ |
Model #21 | 12 | 006사잡회 | ’.†○가임 |
Model #33 | 5 | 0계반 | ○례 |
Sample Image 134 of Dataset 2 | |||
Existing Approaches | |||
Model #2 | 121 | .......22333344444455555666777 9999 결경과관길동동뒤듯뜻련로물 살상서스안알역우운위점정집 | (())))000001<<=======[[|||×ㅎ ㅠ고교끄내내너더딱때때때띠 메몰버브비시써아아애에여울 으으이임학호회 |
Model #5 | 12 | ...0서어위집 | ,,,꽃 |
Model #7 | 88 | .......00000000012223333344444 4555556667779999 결경과관길너동동뒤듯뜻련로 물바살상서스안알어역우운위 이이점정지집 | ””,:......나나넌는 |
Our Approaches | |||
Model #10 | 7 | 0서알어위집 | ’ |
Model #14 | 2 | 0알 | |
Model #19 | 5 | 0길뒤 | 11 |
Model #20 | 12 | 0길동뒤뜻살운이 | ○끄빛삼 |
Model #21 | 14 | 03길너동뒤뜻운이 | †권녀및튀 |
Model #33 | 2 | 0 | ○ |
 | Dataset 1 | | | Dataset 2 | | | Dataset 3 | |
---|---|---|---|---|---|---|---|---|---|
Model | JER | CER | WER | JER | CER | WER | JER | CER | WER |
Model #2 (Tesseract) | 0.5227 | 0.6450 | 0.8406 | 0.2480 | 0.3525 | 0.5332 | 0.3577 | 0.4855 | 0.6839 |
Model #5 (EasyOCR) | 0.1795 | 0.2795 | 0.5767 | 0.1039 | 0.1631 | 0.3367 | 0.1843 | 0.2755 | 0.4936 |
Model #33 (Gradual OCR) | 0.0344 | 0.0602 | 0.1803 | 0.0231 | 0.0411 | 0.0921 | 0.0504 | 0.0872 | 0.1698 |