The Design of a Script Identification Algorithm and Its Application in Constructing a Text Language Identification Dataset
Abstract
1. Background and Summary
2. Data Description
3. Methods
3.1. Script Identification Algorithm
- (1) Replace every symbol in the text that is neither a character nor a space with a space. Spaces themselves must not be filtered out, because doing so would lose some linguistic features and reduce the effectiveness of LI and other natural language processing tasks.
- (2) Replace consecutive spaces with a single space, and strip spaces from the beginning and end of the text.
- (3) Match the text against the regular expression of each script separately. In each match result, replace consecutive spaces with a single space and strip leading and trailing spaces. This is necessary because the regular expression of every script includes the code point of a space, so some match results contain runs of spaces. The purpose of matching is to determine whether the text contains content in a given script. If the match result is empty, the text contains no content in the corresponding script. If the match result is not empty, the text contains content in the corresponding script, and the match result and the script are stored in a dictionary.
- (4) After all regular expressions have been matched, the dictionary holds the content of each script found in the text.
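The four steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names `squeeze_spaces` and `identify_scripts` are hypothetical, only three abridged patterns from the regular-expression table are included, and "character" is assumed to mean a code point whose Unicode general category is a letter (L) or a combining mark (M).

```python
import re
import unicodedata

# Abridged patterns taken from the paper's regular-expression table.
# Only three scripts are listed here to keep the sketch short.
SCRIPT_PATTERNS = {
    "Latin": r"[\u0041-\u005A\u005F-\u007A\u00C0-\u00D6\u00D8-\u00F6"
             r"\u00F8-\u00FF\u0100-\u017F\u0180-\u024F\u0020]",
    "Cyrillic": r"[\u0400-\u04FF\u0500-\u052F\u2DE0-\u2DFF"
                r"\uA640-\uA69F\u1C80-\u1C8F\u0020]",
    "Greek": r"[\u0370-\u03FF\u1F00-\u1FFF\u0020]",
}


def squeeze_spaces(text: str) -> str:
    """Collapse runs of spaces and strip leading/trailing spaces."""
    return re.sub(r" +", " ", text).strip(" ")


def identify_scripts(text: str, patterns=SCRIPT_PATTERNS) -> dict:
    """Return {script name: content in that script} for the given text."""
    # Step 1: replace everything that is not a letter, a combining mark,
    # or a space with a space (assumed reading of "symbols").
    cleaned = "".join(
        ch if ch == " " or unicodedata.category(ch)[0] in "LM" else " "
        for ch in text
    )
    # Step 2: normalise spaces.
    cleaned = squeeze_spaces(cleaned)
    # Step 3: match each script's character class separately. Every class
    # contains the space, so the joined matches need normalising again.
    found = {}
    for script, pattern in patterns.items():
        content = squeeze_spaces("".join(re.findall(pattern, cleaned)))
        if content:  # an empty result means the script is absent
            found[script] = content
    # Step 4: the dictionary now holds the content of each script found.
    return found
```

Calling `identify_scripts` on a sentence that mixes Russian and English would return a dictionary with a `"Latin"` key and a `"Cyrillic"` key, while a string made only of digits and punctuation would return an empty dictionary.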
3.2. Toy Example
3.3. Script Identification Experiment
3.4. Script-Based LI Dataset
4. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Script | Language |
---|---|
Arabic | ara, arz, ckb, fas, glk, kur, pes, pnb, prs, pus, skr, snd, uig, urd, mzn |
CJK | wuu, zho, gan, cmn, jpn, kor |
Cyrillic | bak, bel, bew, bua, bul, che, chv, kaz, kbd, khk, kir, koi, kom, krc, mhr, mkd, mkw, mon, mrj, myv, oss, rue, rus, sah, srp, tat, tgk, tyv, udm, ukr, uzn-uz |
Devanagari | hin, mar, nep, new, san, bih |
Bengali | asm, ben, bpy |
Ethiopic | amh, tir |
Georgian | kat, xmf |
Greek | ell, pnt |
Hebrew | heb, ydd, yid |
Kannada | kan, tcy |
Latin | ace, ach, afr, aka, als, anw, arg, ast, aym, aze, azj, bam, ban, bar, bcl, bik, bjn, bos, bre, bug, cat, cdo, ceb, ces, cos, csb, cym, dan, deu, diq, dsb, dyu, ekk, emk, eml, eng, epo, est, eus, ewe, ext, fao, fin, fon, fra, frr, fry, fuc, ful, gle, glg, glv, gom, grn, gsw, hat, hau, hbs, hif, hil, hrv, hsb, hun, ibb, ibo, ido, ile, ilo, ina, ind, isl, ita, jav, kab, kal, kbp, kea, kik, kin, kon, ksh, lad, lat, lav, lij, lim, lin, lit, lmo, ltz, lug, lup, lus, lvs, mad_id, min, mlg, mlt, mri, msa, mwl, nan, nap-tara, nav, nbl, ndo, nds, ngl, nld, nob, nno, nor, nso, nya, nyn, oci-fr, orm, pag, pam, pap, pcm, pfl, plt, pms, pol, por, que, roh, rom, ron, run, scn, sco, she, sgs, slk, slv, sme, smi, sna-zw, snk, som, sot-za, spa, sqi, srd, ssw-za, suk, sun, sus, swa, swe, swh, szl, tem, tgl, tiv, tsn, tso, tuk, tum, tur, uzb, vec, ven-za, vie, vls, vol, vro, war, wln, wol, xho-za, yor, zea, zha, zsm, zul-za |
Script | Language | Sentence Number | Script | Language | Sentence Number |
---|---|---|---|---|---|
Armenian | hye | 10,000 | Sinhala | sin | 10,000 |
Gujarati | guj | 10,000 | Tamil | tam | 10,000 |
Gurmukhi | pan | 10,000 | Telugu | tel | 10,000 |
Khmer | khm | 1773 | Thaana | div | 10,000 |
Lao | lao | 10,000 | Thai | tha | 10,000 |
Odia | ori | 10,000 | Tibetan | bod | 7525 |
Script | Language Number | Sentence Number | Script | Language Number | Sentence Number |
---|---|---|---|---|---|
Arabic | 15 | 140,088 | Latin | 178 | 1,387,071 |
CJK | 6 | 59,898 | Georgian | 2 | 20,000 |
Cyrillic | 31 | 299,245 | Greek | 2 | 11,564 |
Devanagari | 6 | 60,000 | Hebrew | 3 | 30,000 |
Bengali | 3 | 30,000 | Kannada | 2 | 20,000 |
Ethiopic | 2 | 11,379 |
Script | Regular Expression |
---|---|
Arabic | Arabic = “[\\u0600-\\u06FF\\u0750-\\u077F\\u08A0-\\u08FF\\uFB50-\\uFDFF\\uFE70-\\uFEFF\\u0020]” |
Armenian | Armenian = “[\\u0530-\\u058F\\uFB00-\\uFB17\\u0020]” |
CJK | CJK = “[\\u4E00-\\u9FFF\\u3400-\\u4DBF\\uF900-\\uFAFF\\u2E80-\\u2EFF\\u31C0-\\u31EF\\u3000-\\u303F\\u2FF0-\\u2FFF\\u3300-\\u33FF\\uFE30-\\uFE4F\\uF900-\\uFAFF\\u3200-\\u32FF\\u2F00-\\u2FDF\\u4E00-\\u9FBF\\u3040-\\u309F\\u30A0-\\u30FF\\uAC00-\\uD7AF\\u1100-\\u11FF\\u3130-\\u318F\\u3200-\\u32FF\\uA960-\\uA97F\\uD7B0-\\uD7FF\\uFF00-\\uFFEF\\u0020]” |
Cyrillic | Cyrillic = “[\\u0400-\\u04FF\\u0500-\\u052F\\u2DE0-\\u2DFF\\uA640-\\uA69F\\u1C80-\\u1C8F\\u0020]” |
Devanagari | Devanagari = “[\\u0900-\\u097F\\uA8E0-\\uA8FF\\u1CD0-\\u1CFF\\u0020]” |
Bengali | Bengali = “[\\u0980-\\u09FF\\u0964\\u0020]” |
Ethiopic | Ethiopic = “[\\u1200-\\u137C\\u2D80-\\u2DDE\\uAB01-\\uAB2E\\u1380-\\u139E\\u0020]” |
Georgian | Georgian = “[\\u10A0-\\u10FF\\u2D00-\\u2D2F\\u0020]” |
Greek | Greek = “[\\u0370-\\u03FF\\u1F00-\\u1FFF\\u0020]” |
Gujarati | Gujarati = “[\\u0A80-\\u0AFF\\u0020]” |
Gurmukhi | Gurmukhi = “[\\u0A01-\\u0A75\\u0964\\u0020]” |
Hebrew | Hebrew = “[\\u0590-\\u05FF\\uFB1D-\\uFB4F\\u0020]” |
Kannada | Kannada = “[\\u0C80-\\u0CFF\\u0020]” |
Khmer | Khmer = “[\\u1780-\\u17FF\\u19E0-\\u19FF\\u0020]” |
Latin | Latin = “[\\u0041-\\u005A\\u005F-\\u007A\\u00C0-\\u00D6\\u00D8-\\u00F6\\u00F8-\\u00FF\\u0100-\\u017F\\u0180-\\u024F\\u0250-\\u02AF\\u02B0-\\u02FF\\u1D00-\\u1D7F\\u1D80-\\u1DBF\\u1E00-\\u1EFF\\u2070-\\u209F\\u2100-\\u214F\\u2150-\\u218F\\u2C60-\\u2C7F\\uA720-\\uA7FF\\uAB30-\\uAB6F\\uFF21-\\uFF3A\\uFF41-\\uFF5A\\u0020]” |
Lao | Lao = “[\\u0E80-\\u0EFF\\u0020]” |
Odia | Odia = “[\\u0B00-\\u0B7F\\u0964\\u0020]” |
Sinhala | Sinhala = “[\\u0D80-\\u0DFF\\u0020]” |
Tamil | Tamil = “[\\u0B80-\\u0BFF\\u0020]” |
Telugu | Telugu = “[\\u0C00-\\u0C7F\\u0020]” |
Thaana | Thaana = “[\\u0780-\\u07B1\\u0020]” |
Thai | Thai = “[\\u0E00-\\u0E7F\\u0020]” |
Tibetan | Tibetan = “[\\u0F00-\\u0FFF\\u0020]” |
SI Process | The Result of Each Step |
---|---|
Replace every symbol in the text that is neither a character nor a space with a space. | Horizon Forbidden West выйдет на PlayStation и PlayStation менее чем через месяц февраля |
Replace consecutive spaces with a single space. | Horizon Forbidden West выйдет на PlayStation и PlayStation менее чем через месяц февраля |
Match the text against the regular expression of each script separately. Replace consecutive spaces in each match result with a single space. If the match result is not empty, store the match result and the corresponding script. | The identified Latin-script content is “Horizon Forbidden West PlayStation PlayStation”. The identified Cyrillic-script content is “выйдет на и менее чем через месяц февраля”. |
SI algorithm decision | Number of texts that belong to the script | Number of texts that do not belong to the script |
---|---|---|
Text belongs to the script | true positive (TP) | false positive (FP) |
Text does not belong to the script | false negative (FN) | true negative (TN) |
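Assuming the standard definitions, the counts in this table yield the evaluation measures as follows. The symbols P, R, and F are notational assumptions introduced here for illustration.

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F = \frac{2PR}{P + R}
```

Under micro-averaging, TP, FP, and FN are summed over all scripts before the ratios are taken. If each text receives exactly one script label, every misclassification counts once as a false positive and once as a false negative, so P, R, and F coincide.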
Experiment | Precision | Recall | Micro F-Score |
---|---|---|---|
Train | 0.9930 | 0.9930 | 0.9930 |
Test | 0.9929 | 0.9929 | 0.9929 |
Script | Arabic | CJK | Cyrillic | Latin | Tibetan |
---|---|---|---|---|---|
Arabic | 27,701 | 0 | 1 | 136 | 0 |
CJK | 0 | 11,947 | 0 | 33 | 0 |
Cyrillic | 0 | 3 | 59,629 | 2213 | 0 |
Latin | 6 | 57 | 370 | 296,988 | 0 |
Tibetan | 1 | 116 | 1 | 10 | 1374 |
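As a rough illustration of how micro-averaged scores follow from such a matrix, the sketch below uses only the five scripts shown above, so its result is not expected to reproduce the scores reported for the full experiment. The function name `micro_scores` is hypothetical, and the columns are assumed to follow the same script order as the rows, which is what the dominant diagonal suggests.

```python
# Five-script excerpt of the confusion matrix, one inner list per table row.
# Columns are assumed to be in the same order as LABELS.
LABELS = ["Arabic", "CJK", "Cyrillic", "Latin", "Tibetan"]
CONFUSION = [
    [27701, 0, 1, 136, 0],
    [0, 11947, 0, 33, 0],
    [0, 3, 59629, 2213, 0],
    [6, 57, 370, 296988, 0],
    [1, 116, 1, 10, 1374],
]


def micro_scores(matrix):
    """Return micro-averaged (precision, recall, F-score) for a square matrix."""
    true_pos = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    # Each off-diagonal cell is a false positive for one script and a
    # false negative for another, so both sums equal total - true_pos.
    false_pos = total - true_pos
    false_neg = total - true_pos
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


precision, recall, f_score = micro_scores(CONFUSION)
print(f"P={precision:.4f} R={recall:.4f} F={f_score:.4f}")
```

Because every text carries exactly one label in this setting, the three values are equal, which matches the pattern of identical precision, recall, and F-score in the results table.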
Script | Total Sentences | Hybrid Sentences | Percentage |
---|---|---|---|
Arabic | 28,018 | 2168 | 7.74% |
Armenian | 2000 | 147 | 7.35% |
CJK | 11,980 | 2498 | 20.85% |
Cyrillic | 61,849 | 11,560 | 18.69% |
Devanagari | 12,000 | 434 | 3.62% |
Bengali | 6000 | 219 | 3.65% |
Ethiopic | 2276 | 154 | 6.77% |
Georgian | 4000 | 347 | 8.68% |
Greek | 2313 | 362 | 15.65% |
Gujarati | 2000 | 130 | 6.50% |
Gurmukhi | 2000 | 75 | 3.75% |
Hebrew | 6000 | 142 | 2.37% |
Kannada | 4000 | 207 | 5.18% |
Khmer | 355 | 71 | 20.00% |
Latin | 297,424 | 1580 | 0.53% |
Lao | 2000 | 53 | 2.65% |
Odia | 2000 | 146 | 7.30% |
Sinhala | 2000 | 186 | 9.30% |
Tamil | 2000 | 130 | 6.50% |
Telugu | 2000 | 61 | 3.05% |
Thaana | 2000 | 348 | 17.40% |
Thai | 2000 | 495 | 24.75% |
Tibetan | 1505 | 261 | 17.34% |
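The percentages in this table are the hybrid count divided by the total count for each script. The sketch below reproduces that arithmetic and shows one possible hybrid test, assuming that a "hybrid" sentence is one in which the SI algorithm finds content in more than one script. That reading, and the names `is_hybrid` and `hybrid_percentage`, are assumptions made for illustration.

```python
def is_hybrid(script_content: dict) -> bool:
    """True when the SI result dictionary holds more than one script
    (assumed meaning of a hybrid sentence)."""
    return len(script_content) > 1


def hybrid_percentage(total_sentences: int, hybrid_sentences: int) -> float:
    """Share of hybrid sentences, as a percentage rounded to two decimals."""
    return round(100 * hybrid_sentences / total_sentences, 2)


# Two rows of the table, recomputed from their counts.
print(hybrid_percentage(355, 71))      # Khmer
print(hybrid_percentage(28018, 2168))  # Arabic
```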
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Qasim, M.; Silamu, W.; Qiu, M. The Design of a Script Identification Algorithm and Its Application in Constructing a Text Language Identification Dataset. Data 2024, 9, 134. https://doi.org/10.3390/data9110134