Data Descriptor

The Design of a Script Identification Algorithm and Its Application in Constructing a Text Language Identification Dataset

1 School of Information Technology and Engineering, Guangzhou College of Commerce, Guangzhou 511363, China
2 School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
3 Key Multi-Lingual Laboratory of Xinjiang, Urumqi 830046, China
* Author to whom correspondence should be addressed.
Data 2024, 9(11), 134; https://doi.org/10.3390/data9110134
Submission received: 22 July 2024 / Revised: 11 October 2024 / Accepted: 8 November 2024 / Published: 11 November 2024
(This article belongs to the Section Information Systems and Data Management)

Abstract

Script identification (SI) is easier to implement than language identification (LI), and its identification rate is very high. The fewer languages an LI algorithm must distinguish, the higher its identification rate. However, no systematic study has examined SI across many languages or how to construct the corresponding LI datasets. In this paper, we therefore design an SI algorithm and describe the construction of an LI dataset organized by script group. The data sources comprise text corpora for 261 languages from the Leipzig Corpora Collection, grouped into 23 script groups. In the Unicode encoding scheme, different scripts are assigned to different code ranges. Based on this feature, we propose a written-script identification algorithm based on regular expression matching, whose micro F-score reaches 0.9929 in sentence-level SI experiments. To reduce noise when constructing the LI dataset for each script, the SI algorithm is used to filter out other-script content from each text.

1. Background and Summary

Text language identification (LI) is a text classification task. LI matches the language of a text to one in a pre-defined set of known languages [1]. Most text processing tools are developed for a specific language. LI is the first step in many text processing tasks [2,3], such as machine translation, semantic understanding, classification, storage, information retrieval, and email interception [4]. The LI of short strings and similar languages is known to be challenging for existing LI techniques [5,6,7,8]. The construction of LI datasets is, therefore, of great significance for the research and development of LI algorithms.
Languages are written in different scripts, and each script has a uniquely defined code range in Unicode. This helps to identify the different script parts within a document [9]. Script identification (SI) is easier than LI, and increasing the number of languages in an LI system degrades its performance [2,3]. Taking both observations into account, researchers have designed LI algorithms that first identify the script of a text and then identify the languages that use that script [7,10,11]. The authors of [7] mention that mixing scripts in a single classifier leads to more confusion; they also note that SI is easy to implement, and they apply separate LI algorithms to nine languages written in Latin script and three languages written in Chinese ideographs. In our previous studies [10,11], we designed a hierarchical LI algorithm that first identifies the script and then further identifies the languages that use that script. In a sentence-level LI experiment with 97 languages, this hierarchical approach outperformed the open-source LI tool langid [10]. However, none of the above three studies discussed SI, or presented the relevant data, in enough detail to reproduce it.
When constructing an LI dataset, text data are classified by language. Manually annotating an LI dataset requires experts who are familiar with each language, and the cost of this process is very high [5]. Typically, text data are downloaded from Wikipedia, Twitter, news sites, and Bible translations to build LI datasets. Majlis constructed an LI dataset by downloading data from Wikipedia in 124 languages [12]. Brown used Wikipedia and Bible translations to construct an LI dataset for 1100 languages [13]. Lui and Baldwin used Twitter to construct an LI dataset for 65 languages [14]. Blodgett et al. also used Twitter to construct an LI dataset for 70 languages [15]. Baldwin and Lui used government documents, news texts, and Wikipedia to construct an LI dataset for 81 languages [16]. Tan et al. used news texts to construct an LI dataset for 81 languages [17]. Vatanen et al. used the Universal Declaration of Human Rights to construct an LI dataset for 281 languages [18]. There are also publicly available evaluation corpora for similar-language and dialect identification tasks [6,19].
However, no systematic study has examined SI across multiple languages or how to construct the corresponding LI datasets. Mixing scripts in a single classifier leads to more confusion when designing LI algorithms [7]. Therefore, in this paper, we study an SI algorithm for 261 languages belonging to 23 script groups and construct an LI dataset based on these script groups.

2. Data Description

The Leipzig Corpora Collection is well suited to constructing LI datasets, as it provides corpora in different languages with the same format and comparable sources [20]. The data for each language are structured in sentence units. The sentences in each language were obtained from relevant news sites or other websites, and sentences in other languages were removed. Datasets of different sizes are available for each language [21]. Therefore, in this study, the Leipzig Corpora Collection was selected for studying the SI algorithm and constructing an LI dataset. Data are currently available in more than 270 languages from the Leipzig Corpora Collection’s website [22].
To construct an LI dataset, we first downloaded all 273 linguistic data sources (corpora) that could be downloaded from the Leipzig Corpora Collection. When downloading the data, we tried to download as many corpora as possible that contained 10,000 sentences per language. When analyzing the downloaded data, it was found that in the relevant files, most of the linguistic data were organized according to the language’s ISO code, and some data were organized with a country code after the language ISO code. Table 1 and Table 2 also use codes to list the languages to which this study relates. Through language encoding, we were able to search for the corresponding language name and country on the relevant website [23,24].
In order to test the impact of SI algorithms when constructing an LI dataset, we selected 261 language corpora that were written in only one script. Among these 261 corpora, 208 were classified as 10 K (containing 10,000 sentences each), while the other 53 languages contained fewer than 10,000 sentences.
According to the character scripts used in each text, the data for each language were assigned to the corresponding script group, as shown in Table 1 and Table 2. The 261 languages discussed in this study were written using 23 scripts. Of these, the 11 scripts used for multiple languages are shown in Table 1; the other 12 scripts, used in only one language, are shown in Table 2.
Table 3 shows the statistics of scripts used in multiple languages in this study. The scripts that were used in the most languages were Latin, Cyrillic, and Arabic. Latin script was used by 171 languages, Cyrillic script was used by 31 languages, and Arabic script was used by 15 languages.
Chinese ideographs are used in the writing systems of Chinese and Japanese, are occasionally used in Korean, and were historically used in Vietnamese [25]. In the Unicode encoding scheme, the characters of Chinese, Japanese, and Korean are arranged in the CJK section, where CJK is a commonly used abbreviation for “Chinese, Japanese, and Korean”. In this study, text in the Chinese, Japanese, and Korean languages was therefore assigned to the CJK script group.
The Leipzig Corpora Collection covers a wide range of linguistic data, including varieties of the same language written in the same script in different regions; for example, the codes wuu, gan, and cmn in Table 1 all belong to the CJK group.

3. Methods

3.1. Script Identification Algorithm

Languages are written in scripts, and different scripts occupy different code ranges in Unicode. Therefore, by analyzing the Unicode code points of a text, we can identify the scripts it contains. The encoding ranges of different scripts can be found on the official Unicode website [26,27], and the regular expression for each script is constructed on this basis. Table 4 lists these regular expressions.
When analyzing the encoding range of the Latin script section, only the encodings of Latin letters are retained; the encoding ranges of punctuation marks and digits are not considered, because Western digits and punctuation are also used in text written in non-Latin scripts. Words are separated by a space in all languages considered here except Chinese and Japanese, which are encoded in the Unicode CJK region; Korean, also encoded in the CJK region, does separate words with spaces. Therefore, when constructing each script's regular expression, we added the space character, Unicode U+0020, so as to preserve the spaces between words in the text.
We also found that the Devanagari danda character (U+0964) appeared in most sentences of the Eastern Nagari (Bengali), Odia, and Gurmukhi script groups. Therefore, in this article, U+0964 is also included when constructing the regular expressions for the Eastern Nagari, Odia, and Gurmukhi scripts.
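To make the construction concrete, the sketch below (a minimal Python illustration, not the authors' code) builds per-script patterns from Unicode block ranges in the style of Table 4; only three scripts and a simplified subset of their ranges are included, and U+0020 is kept in every character class so that word boundaries survive matching.

```python
import re

# Sketch: build per-script regular expressions from Unicode block ranges.
# Only three scripts and a simplified subset of the Table 4 ranges are shown.
SCRIPT_RANGES = {
    "Latin":    "\u0041-\u005A\u0061-\u007A\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F",
    "Cyrillic": "\u0400-\u04FF\u0500-\u052F\u2DE0-\u2DFF\uA640-\uA69F\u1C80-\u1C8F",
    "Greek":    "\u0370-\u03FF\u1F00-\u1FFF",
}

# The space (U+0020) is added to every character class so that the spaces
# between words are preserved in the matching results.
SCRIPT_PATTERNS = {
    name: re.compile("[" + ranges + "\u0020]+")
    for name, ranges in SCRIPT_RANGES.items()
}
```

For example, matching "αβγ abc" against the Greek pattern returns only the Greek run (with its trailing space), while the Latin pattern returns only " abc".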
Using the Unicode encoding ranges of different scripts, an SI algorithm based on regular expression matching is proposed. The workflow of the algorithm is shown in Figure 1 and is described below:
(1)
Remove all symbols in the text, except characters and spaces, and replace them with a space. Spaces cannot be filtered directly, as this will lead to the loss of some linguistic features, which affects the efficiency of LI or other natural language processing tasks.
(2)
Replace consecutive spaces with a single space. Filter out space at the beginning and end of the text.
(3)
The text is matched against the regular expressions of all scripts separately. Consecutive spaces in each matching result are replaced with a single space, and spaces at the beginning and end are stripped; this is necessary because every script's regular expression contains the encoding of a space, so some matching results contain consecutive spaces. The purpose of matching is to determine whether the text contains content in the relevant script. If the matching result is empty, the text does not contain content in the corresponding script. If the matching result is not empty, the text does contain such content, and the matching result and the corresponding script are stored in a dictionary.
The encoding U+0964 is included in the regular expressions of the Devanagari, Eastern Nagari, Odia, and Gurmukhi script groups. If a text containing the character U+0964 is matched against the regular expressions of these four groups, one matching result will contain multiple types of characters, while the remaining three will contain only the character U+0964. Therefore, when analyzing the matching results of these four script groups, it is necessary to check the types of characters in each result: a script is selected for the text only if its matching result contains more than one type of character.
(4)
After all regex matching operations have been performed, the dictionary content consists of the relevant scripted content in the text.
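The four steps above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: it covers only three scripts, and it omits the U+0964 character-type check described in step (3).

```python
import re

# Per-script patterns of the form "[<ranges>\u0020]" described in Table 4
# (three illustrative scripts only; ranges are simplified).
PATTERNS = {
    "Latin":    re.compile(r"[A-Za-z\u00C0-\u024F\u0020]"),
    "Cyrillic": re.compile(r"[\u0400-\u04FF\u0500-\u052F\u0020]"),
    "Greek":    re.compile(r"[\u0370-\u03FF\u1F00-\u1FFF\u0020]"),
}

def identify_scripts(text):
    # Step 1: replace every character that is neither a letter nor a space
    # with a space, so word boundaries are preserved.
    text = "".join(ch if ch.isalpha() or ch == " " else " " for ch in text)
    # Step 2: collapse consecutive spaces; strip leading/trailing spaces.
    text = re.sub(r" +", " ", text).strip()
    # Step 3: match the text against every script's pattern; keep non-empty
    # matches, again collapsing the consecutive spaces the patterns leave in.
    result = {}
    for script, pattern in PATTERNS.items():
        matched = "".join(pattern.findall(text))
        matched = re.sub(r" +", " ", matched).strip()
        if matched:
            result[script] = matched
    # Step 4: the dictionary now holds each script's content in the text.
    return result
```

Called on a mixed-script string such as "Horizon: выйдет!", the function returns a dictionary with one entry per script actually present.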
Most natural language processing tools are developed for a specific language, so it is necessary to determine the language before processing a text. The SI algorithm proposed in this paper can identify content written in different scripts within a text. SI is easier to implement than LI, and reducing the number of languages in an LI system improves its efficiency. In a typical application, the content written in each script is first identified; then, for each script's content, the language is identified; finally, the text is processed by the text processing tool for that language.

3.2. Toy Example

To illustrate the SI process, we chose a toy example: a sentence containing multi-script content was selected from the corpus. The sentence is as follows: “Horizon Forbidden West выйдет на PlayStation 4 и PlayStation 5 менее чем через месяц – 18 февраля”. This sentence contains content written in both Latin and Cyrillic script. After the sentence was matched against the regular expressions of the 23 scripts and consecutive spaces were replaced with a single space, only the Latin and Cyrillic scripts had a non-empty matching result. Therefore, two different scripts were identified in the sentence; the SI process is shown in Table 5.

3.3. Script Identification Experiment

SI is a type of text classification; testing its performance requires the labeling of data. The data source for this study was the Leipzig Corpora Collection. The constituent unit of each language corpus was a sentence. Based on the script groups in Table 1 and Table 2, each sentence in a particular language was classified into a script group. According to the above labeling specification, an SI dataset was constructed, and an SI experiment was carried out using the SI dataset. The number of sentences in each script group is shown in Table 2 and Table 3.
Some of the sentences in the SI dataset contained more than one script. The sentences in each language in the Leipzig Corpora Collection were obtained from relevant news sites or other websites, and some sentences in the corpus contained other scripted content. This means that a small number of sentences in the preliminarily constructed SI dataset analyzed in the previous paragraph contained content written in other scripts.
SI is a text classification task: each text in the corpus must have a label, and the classification model must return a label when predicting a text. When evaluating the performance of the SI algorithm, the algorithm determines which scripts are contained in a text, the amount of content in each script is counted separately, and the script with the most content is selected as the main script of the text. Languages outside the CJK group are written with a space between words, so for matches in those script groups, the number of space-separated words is counted. The Chinese and Japanese languages in the CJK group are written without spaces between words, and although Korean is written with spaces, Korean words are mostly one or two syllables long, and even its polysyllabic words are not very long. Therefore, for the CJK group, the length of the matching result excluding spaces is counted instead.
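A minimal sketch of this main-script decision, assuming the SI step has already produced a dictionary mapping each script to its matched content (the function name main_script is illustrative, not from the paper):

```python
# Sketch of the main-script decision described above. Non-CJK scores count
# space-separated words; the CJK score counts non-space characters, since
# Chinese and Japanese do not separate words with spaces.
def main_script(matches):
    scores = {}
    for script, text in matches.items():
        if script == "CJK":
            scores[script] = len(text.replace(" ", ""))
        else:
            scores[script] = len(text.split())
    # The script with the highest score is taken as the text's main script.
    return max(scores, key=scores.get)
```

For the toy sentence of Section 3.2, the Cyrillic match has eight words against two Latin words, so Cyrillic is chosen as the main script.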
The performance of text classification is evaluated using precision, recall, and F-score. In this study, these three indicators were used to evaluate the performance of SI. In multi-class text classification tasks, macro-averaged and micro-averaged values are used to evaluate a classifier's performance over the whole text set. Since SI is a multi-class classification task, this paper uses MicroP, MicroR, and MicroF1 to evaluate the performance of the algorithm. The corresponding formulas are shown in Equations (1)–(3), where C is the set of all scripts in the SI dataset, |C| denotes the number of scripts in the SI dataset, and the meanings of TP, FP, and FN are shown in Table 6.
MicroP = \frac{\sum_{i=1}^{|C|} \mathrm{TP}_i}{\sum_{i=1}^{|C|} \mathrm{TP}_i + \sum_{i=1}^{|C|} \mathrm{FP}_i}   (1)
MicroR = \frac{\sum_{i=1}^{|C|} \mathrm{TP}_i}{\sum_{i=1}^{|C|} \mathrm{TP}_i + \sum_{i=1}^{|C|} \mathrm{FN}_i}   (2)
MicroF_1 = \frac{2 \times \mathrm{MicroP} \times \mathrm{MicroR}}{\mathrm{MicroP} + \mathrm{MicroR}}   (3)
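As a worked illustration of Equations (1)–(3), the snippet below computes the micro-averaged metrics from per-script TP/FP/FN counts; the counts are invented for demonstration and are not the paper's results.

```python
# Worked example of Equations (1)-(3) with illustrative per-script counts
# (these numbers are invented for demonstration, not the paper's results).
counts = {
    "Latin":    {"TP": 95, "FP": 3, "FN": 5},
    "Cyrillic": {"TP": 48, "FP": 2, "FN": 2},
}

# Micro-averaging pools the counts over all scripts before dividing.
tp = sum(c["TP"] for c in counts.values())   # 143
fp = sum(c["FP"] for c in counts.values())   # 5
fn = sum(c["FN"] for c in counts.values())   # 7

micro_p = tp / (tp + fp)                     # Equation (1): 143 / 148
micro_r = tp / (tp + fn)                     # Equation (2): 143 / 150
micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r)   # Equation (3)
```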
To evaluate the performance of the SI algorithm, the data for each language in Table 2 and Table 3 were first randomly shuffled; then, 80% of each language's data were used for training and 20% for testing. Table 7 shows the results of the SI experiment; the performance of the SI algorithm was close to ideal. The SI errors arose because some sentences in the corpus contained more content written in other scripts than in their own script. A partial confusion matrix from the SI test is shown in Table 8; it presents the identification errors between the scripts used by many languages. Most SI errors occurred between Latin script and other scripts.
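The shuffling and 80/20 split can be sketched as follows (an illustrative implementation; the fixed seed is an assumption for reproducibility, not from the paper):

```python
import random

# Sketch of the evaluation protocol above: shuffle each language's sentences,
# then use 80% for training and 20% for testing.
def split_corpus(sentences, seed=0):
    shuffled = sentences[:]                  # leave the caller's list intact
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]
```

Applied per language, this keeps the 80/20 proportion within every corpus rather than only globally.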

3.4. Script-Based LI Dataset

In order to improve the performance of LI, researchers have proposed first identifying the script of a text and then identifying the languages that use that script [7,10,11]. Comparative experiments in the relevant literature show that this method can effectively improve the efficiency of LI. However, no systematic study has examined SI across multiple languages or how to construct the corresponding LI datasets.
To construct the above-mentioned hierarchical LI model, an LI dataset organized by script group was constructed first. The data in each script's dataset contain only content written in that script, so an LI algorithm can be trained on the languages that use it. Therefore, when constructing a given script's dataset, an SI algorithm is used to find content written in other scripts and filter it out of each text.
It is also common for some text in certain languages to contain content written in other scripts. Mixing scripts in a single classifier leads to more confusion when designing LI algorithms [7]. Therefore, when constructing an LI dataset, it is necessary to filter out other-script content that appears in texts written in certain scripts. In order to filter out the other-script content that appears in a text, the SI is first used to identify this other-script content. Content written in other scripts will then be deleted. Since it is not possible to determine the language of the text at this stage, it is not yet possible to add it to a certain language corpus in the script data.
Using the language corpora from the test data, experiments were performed to filter out other-script content in the texts. Table 9 shows the statistics of sentences containing multiple types of scripts in each script group. Analysis of Table 9 shows that each script's dataset contained a certain number of sentences with other-script content, which justifies the need to filter out the additional scripted content appearing in the corpus. Filtering out other-script content reduces noise, improving the efficiency of LI and the performance of other natural language processing tasks.
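A minimal sketch of this filtering step, with two illustrative patterns simplified from Table 4 (the function name keep_only is hypothetical): runs of characters belonging to any other script are deleted before a sentence is added to a script's dataset.

```python
import re

# Sketch of the filtering step: before a sentence joins a given script's LI
# dataset, runs of characters matched by any other script's pattern are
# deleted. Only two illustrative, simplified patterns are shown (cf. Table 4).
PATTERNS = {
    "Latin":    re.compile(r"[A-Za-z\u00C0-\u024F]+"),
    "Cyrillic": re.compile(r"[\u0400-\u04FF\u0500-\u052F]+"),
}

def keep_only(text, script):
    for other, pattern in PATTERNS.items():
        if other != script:
            text = pattern.sub("", text)     # delete other-script runs
    return re.sub(r" +", " ", text).strip()  # tidy the leftover spaces
```

Note that digits, which are not specific to any script, are left in place by these patterns.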

4. Conclusions

There is a lack of SI studies and related datasets covering multiple languages, and mixing scripts in a single classifier leads to more confusion when designing LI algorithms. Different scripts are arranged in different encoding ranges in the Unicode encoding scheme, and based on these ranges, it is possible to identify the content written in different scripts within a text. SI is therefore easier to achieve than LI, and reducing the number of languages in an LI system improves its identification efficiency. To improve the performance of LI, we downloaded 261 linguistic data sources from the Leipzig Corpora Collection and, according to the scripts used by the different languages, assigned the data to 23 script groups. Based on the Unicode encoding ranges of the different scripts, a regular expression was constructed for each script, and on this basis an SI algorithm based on regular expression matching was designed. The experimental results showed that the proposed algorithm performed well and could identify the content written in different scripts within a text.
Other-script content appearing in a text creates noise during the training and testing of text processing models, degrading their performance. Therefore, when constructing a text corpus, it is necessary to remove other-script content from the texts. In this study, the SI algorithm was used to analyze and count the other-script content in texts across the 23 script groups. The analysis showed that a certain percentage of sentences in each script group contained content written in other scripts, which points to the need to remove other-script content when constructing an LI dataset or other corpora. The results of this study are helpful for researchers who need to delete other-script content when constructing text corpora.
Text processing algorithms are developed for specific languages. Text can also appear in several languages using the same script. Further research investigating how to identify content in short texts that use the same script but in different languages is needed.

Author Contributions

M.Q. (Mamtimin Qasim) designed the algorithm, analyzed the data, and wrote the paper. W.S. supervised the writing and reviewed the paper. M.Q. (Minghui Qiu) participated in the literature survey and reviewed the paper. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China (62137002), the Shenzhen Science and Technology Innovation Commission project (GJGJZD20210408092806017), and the 14th Five-Year Plan of Guangdong Association of Higher Education 2022 Higher Education Research Project (22GYB159).

Data Availability Statement

All experimental data in this study are from the Leipzig Corpora Collection: https://cls.corpora.uni-leipzig.de/en (accessed on 2 June 2024). If the data information and code for this paper are required, please contact us, and we will provide the relevant data.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Choong, C.Y.; Mikami, Y.; Marasinghe, C.A.; Nandasara, S.T. Optimizing n-gram order of an n-gram based language identification algorithm for 68 written languages. Int. J. Adv. ICT Emerg. Reg. 2009, 2, 21–28. [Google Scholar] [CrossRef]
  2. Botha, G.R.; Barnard, E. Factors that affect the accuracy of text-based language identification. Comput. Speech Lang. 2012, 26, 307–320. [Google Scholar] [CrossRef]
  3. Abainia, K.; Ouamour, S.; Sayoud, H. Effective language identification of forum texts based on statistical approaches. Inf. Process. Manag. Int. J. 2016, 52, 491–512. [Google Scholar] [CrossRef]
  4. Selamat, A.; Akosu, N. Word-length algorithm for language identification of under-resourced languages. J. King Saud Univ. Comput. Inf. Sci. 2015, 28, 457–469. [Google Scholar] [CrossRef]
  5. Jauhiainen, T.; Lui, M.; Zampieri, M.; Baldwin, T.; Linden, K. Automatic language identification in texts: A survey. J. Artif. Intell. Res. 2019, 65, 675–782. [Google Scholar] [CrossRef]
  6. Zampieri, M.; Malmasi, S.; Ljubešić, N.; Nakov, P.; Ali, A.; Tiedemann, J.; Scherrer, Y.; Aepli, N. Findings of the VarDial Evaluation Campaign 2017. In Proceedings of the VarDial Workshop, Valencia, Spain, 3 April 2017. [Google Scholar]
  7. Apple. Language Identification from Very Short Strings. 2019. Available online: https://machinelearning.apple.com/research/language-identification-from-very-short-strings (accessed on 10 February 2021).
  8. Toftrup, M.; Sørensen, S.A.; Ciosici, M.R.; Assent, I. A Reproduction of Apple’s Bi-Directional LSTM Models for Language Identification in Short Strings. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, Virtual, 19–23 April 2021; pp. 36–42. [Google Scholar]
  9. Hanif, F.; Latif, F.; Khiyal, M.S.H. Unicode Aided Language Identification across Multiple Scripts and Heterogeneous Data. Inf. Technol. J. 2007, 6, 534–540. [Google Scholar] [CrossRef]
  10. Maimaitiyiming, H.; Wushour, S. On hierarchical text language-identification algorithms. Algorithms 2018, 11, 39. [Google Scholar] [CrossRef]
  11. Hasimu, M.; Silamu, W. Three-stage short text language identification algorithm. J. Digit. Inf. Manag. 2017, 15, 354–371. [Google Scholar]
  12. Majlis, M. Yet another language identifier. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics, Avignon, France, 26 April 2012; Association for Computational Linguistics: Kerrville, TX, USA, 2012. [Google Scholar]
  13. Brown, R.D. Selecting and Weighting N-Grams to Identify 1100 Languages; Springer: Berlin/Heidelberg, Germany, 2013. [Google Scholar]
  14. Lui, M.; Baldwin, T. Accurate Language Identification of Twitter Messages. In Proceedings of the 5th Workshop on Language Analysis for Social Media (LASM), Gothenburg, Sweden, 26–30 April 2014. [Google Scholar]
  15. Blodgett, S.L.; Wei, J.; O’Connor, B. A Dataset and Classifier for Recognizing Social Media English. In Proceedings of the 3rd Workshop on Noisy User-Generated Text, Copenhagen, Denmark, 7 September 2017. [Google Scholar]
  16. Baldwin, T.; Lui, M. Language Identification: The Long and the Short of the Matter. In Proceedings of the Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the ACL, Los Angeles, CA, USA, 1–6 June 2010; pp. 229–237. [Google Scholar]
  17. Tan, L.; Zampieri, M.; Ljubešic, N.; Tiedemann, J. Merging Comparable Data Sources for the Discrimination of Similar Languages: The DSL Corpus Collection. In Proceedings of the 7th Workshop on Building and Using Comparable Corpora (BUCC), Reykjavik, Iceland, 27 May 2014. [Google Scholar]
  18. Vatanen, T.; Väyrynen, J.J.; Virpioja, S. Language identification of short text segments with n-gram models. In Proceedings of the International Conference on Language Resources & Evaluation (LREC), Valletta, Malta, 17–23 May 2010. [Google Scholar]
  19. Scherrer, Y.; Jauhiainen, T.; Ljubešić, N.; Nakov, P.; Tiedemann, J.; Zampieri, M. Findings of the VarDial Evaluation Campaign 2023. In Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023); Association for Computational Linguistics: Dubrovnik, Croatia, 2023; pp. 251–261. [Google Scholar]
  20. Goldhahn, D.; Eckart, T.; Quasthoff, U. Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages. In Proceedings of the 8th International Language Resources and Evaluation (LREC’12), Istanbul, Turkey, 23–25 May 2012. [Google Scholar]
  21. Leipzig Corpora Collection. Available online: https://cls.corpora.uni-leipzig.de/en (accessed on 28 June 2024).
  22. Leipzig Corpora Collection Download Page. Available online: https://wortschatz-leipzig.de/en/download (accessed on 5 July 2024).
  23. ISO 639-2 Code. Available online: https://www.loc.gov/standards/iso639-2/php/code_list.php (accessed on 5 July 2024).
  24. ISO 639-3 Code. Available online: https://iso639-3.sil.org/code_tables/639/data (accessed on 5 July 2024).
  25. Chinese and Japanese. Available online: https://unicode.org/faq/han_cjk.html (accessed on 8 July 2024).
  26. List of Unicode Groups and Block Ranges. Available online: https://www.unicodepedia.com/groups/ (accessed on 5 July 2024).
  27. Updated Proposal to Encode the Tulu-Tigalari Script in Unicode. Available online: https://www.unicode.org/L2/L2022/22031-tulu-tigalari-prop.pdf (accessed on 8 July 2024).
Figure 1. Flowchart of the SI algorithm.
Table 1. Scripts that are used in multiple languages.
Script: Languages
Arabic: ara, arz, ckb, fas, glk, kur, pes, pnb, prs, pus, skr, snd, uig, urd, mzn
CJK: wuu, zho, gan, cmn, jpn, kor
Cyrillic: bak, bel, bew, bua, bul, che, chv, kaz, kbd, khk, kir, koi, kom, krc, mhr, mkd, mkw, mon, mrj, myv, oss, rue, rus, sah, srp, tat, tgk, tyv, udm, ukr, uzn-uz
Devanagari: hin, mar, nep, new, san, bih
Bengali: asm, ben, bpy
Ethiopic: amh, tir
Georgian: kat, xmf
Greek: ell, pnt
Hebrew: heb, ydd, yid
Kannada: kan, tcy
Latin: ace, ach, afr, aka, als, anw, arg, ast, aym, aze, azj, bam, ban, bar, bcl, bik, bjn, bos, bre, bug, cat, cdo, ceb, ces, cos, csb, cym, dan, deu, diq, dsb, dyu, ekk, emk, eml, eng, epo, est, eus, ewe, ext, fao, fin, fon, fra, frr, fry, fuc, ful, gle, glg, glv, gom, grn, gsw, hat, hau, hbs, hif, hil, hrv, hsb, hun, ibb, ibo, ido, ile, ilo, ina, ind, isl, ita, jav, kab, kal, kbp, kea, kik, kin, kon, ksh, lad, lat, lav, lij, lim, lin, lit, lmo, ltz, lug, lup, lus, lvs, mad_id, min, mlg, mlt, mri, msa, mwl, nan, nap-tara, nav, nbl, ndo, nds, ngl, nld, nob, nno, nor, nso, nya, nyn, oci-fr, orm, pag, pam, pap, pcm, pfl, plt, pms, pol, por, que, roh, rom, ron, run, scn, sco, she, sgs, slk, slv, sme, smi, sna-zw, snk, som, sot-za, spa, sqi, srd, ssw-za, suk, sun, sus, swa, swe, swh, szl, tem, tgl, tiv, tsn, tso, tuk, tum, tur, uzb, vec, ven-za, vie, vls, vol, vro, war, wln, wol, xho-za, yor, zea, zha, zsm, zul-za
Table 2. Scripts that are used in only one language.
Script: Language (Sentence Number)
Armenian: hye (10,000)
Gujarati: guj (10,000)
Gurmukhi: pan (10,000)
Khmer: khm (1773)
Lao: lao (10,000)
Odia: ori (10,000)
Sinhala: sin (10,000)
Tamil: tam (10,000)
Telugu: tel (10,000)
Thaana: div (10,000)
Thai: tha (10,000)
Tibetan: bod (7525)
Table 3. The statistics of scripts used in multiple languages.
ScriptLanguage NumberSentence NumberScriptLanguage NumberSentence Number
Arabic15140,088Latin1781,387,071
CJK659,898Georgian220,000
Cyrillic31299,245Greek211,564
Devanagari660,000Hebrew230,000
Bengali330,000Kannada220,000
Ethiopic211,379
Table 4. Unicode encoding area for each script.
ScriptRegular Expression
ArabicArabic = “[\\u0600-\\u06FF\\u0750-\\u077F\\u08A0-\\u08FF\\uFB50-\\uFDFF\\uFE70-\\uFEFF\\u0020]”
ArmenianArmenian = “[\\u0530-\\u058F\\uFB00-\\uFB17\\u0020]”
CJKCJK = “[\\u4E00-\\u9FFF\\u3400-\\u4DBF\\uF900-\\uFAFF\\u2E80-\\u2EFF\\u31C0-\\u31EF\\u3000-\\u303F\\u2FF0-\\u2FFF\\u3300-\\u33FF\\uFE30-\\uFE4F\\uF900-\\uFAFF\\u3200-\\u32FF\\u2F00-\\u2FDF\\u4E00-\\u9FBF\\u3040-\\u309F\\u30A0-\\u30FF\\uAC00-\\uD7AF\\u1100-\\u11FF\\u3130-\\u318F\\u3200-\\u32FF\\uA960-\\uA97F\\uD7B0-\\uD7FF\\uFF00-\\uFFEF\\u0020]”
CyrillicCyrillic = “[\\u0400-\\u04FF\\u0500-\\u052F\\u2DE0-\\u2DFF\\uA640-\\uA69F\\u1C80-\\u1C8F\\u0020]”
DevanagariDevanagari = “[\\u0900-\\u097F\\uA8E0-\\uA8FF\\u1CD0-\\u1CFF\\u0020]”
BengaliBengali = “[\\u0980-\\u09FF\\u0964\\u0020]”
EthiopicEthiopic = “[\\u1200-\\u137C\\u2D80-\\u2DDE\\uAB01-\\uAB2E\\u1380-\\u139E\\u0020]”
GeorgianGeorgian = “[\\u10A0-\\u10FF\\u2D00-\\u2D2F\\u0020]”
GreekGreek = “[\\u0370-\\u03FF\\u1F00-\\u1FFF\\u0020]”
GujaratiGujarati = “[\\u0A80-\\u0AFF\\u0020]”
GurmukhiGurmukhi = “[\\u0A01-\\u0A75\\ u0964\\u0020]”
HebrewHebrew = “[\\u0590-\\u05FF\\uFB1D-\\uFB4F\\u0020]”
KannadaKannada = “[\\u0C80-\\u0CFF\\u0020]”
KhmerKhmer = “[\\u1780-\\u17FF\\u19E0-\\u19FF\\u0020]”
LatinLatin = “[\\u0041-\\u005A\\u005F-\\u007A\\u00C0-\\u00D6\\u00D8-\\u00F6\\u00F8-\\u00FF\\u0100-\\u017F\\u0180-\\u024F\\u0250-\\u02AF\\u02B0-\\u02FF\\u1D00-\\u1D7F\\u1D80-\\u1DBF\\u1E00-\\u1EFF\\u2070-\\u209F\\u2100-\\u214F\\u2150-\\u218F\\u2C60-\\u2C7F\\uA720-\\uA7FF\\uAB30-\\uAB6F\\uFF21-\\uFF3A\\uFF41-\\uFF5A\\u0020]”
LaoLao = “[\\u0E80-\\u0EFF\\u0020]”
OriaOdia = “[\\u0B00-\\u0B7F\\u0964\\u0020]”
SinhalaSinhala = “[\\u0D80-\\u0DFF\\u0020]”
TamilTamil = “[\\u0B80-\\u0BFF\\u0020]”
TeluguTelugu = “[\\u0C00-\\u0C7F\\u0020]”
ThaanaThaana = “[\\u0780-\\u07B1\\u0020]”
ThaiThai = “[\\u0E00-\\u0E7F\\u0020]”
TibetanTibetan = “[\\u0F00-\\u0FFF\\u0020]”
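Because each script occupies fixed Unicode ranges, the character classes in Table 4 can be applied with any standard regular-expression engine. The following Python sketch illustrates this; the function and dictionary names are our own, only two scripts are shown, and the Latin class is an abridged subset of the ranges in Table 4:

```python
import re

# Unicode character classes adapted (abridged) from Table 4; \u0020 keeps
# spaces so that words of the same script stay separated in a match.
SCRIPT_CLASSES = {
    "Latin": re.compile(r"[\u0041-\u005A\u0061-\u007A\u00C0-\u00D6"
                        r"\u00D8-\u00F6\u00F8-\u024F\u0020]+"),
    "Cyrillic": re.compile(r"[\u0400-\u04FF\u0500-\u052F\u0020]+"),
}

def extract_script_content(text: str, script: str) -> str:
    """Return the parts of `text` written in `script`, space-separated."""
    fragments = SCRIPT_CLASSES[script].findall(text)
    # Joining and collapsing spaces removes the gaps left by other scripts.
    return re.sub(r" +", " ", " ".join(fragments)).strip()
```

For example, `extract_script_content("Horizon вышла на PlayStation", "Cyrillic")` returns `"вышла на"`, while the same call with `"Latin"` returns `"Horizon PlayStation"`.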
Table 5. Example of the script identification process.

| SI Process | Result of Each Step |
|---|---|
| Remove all symbols in the text except letters and spaces, replacing each removed symbol with a space. | Horizon Forbidden West выйдет на PlayStation  и PlayStation  менее чем через месяц  февраля (the removed digits and punctuation leave runs of consecutive spaces) |
| Replace consecutive spaces with a single space. | Horizon Forbidden West выйдет на PlayStation и PlayStation менее чем через месяц февраля |
| Match the text against the regular expression of each script, replacing consecutive spaces in each match with a single space; if a match is non-empty, keep the matched content and the corresponding script. | The identified Latin-script content is "Horizon Forbidden West PlayStation PlayStation". The identified Cyrillic-script content is "выйдет на и менее чем через месяц февраля". |
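The three steps in Table 5 can be sketched end-to-end in Python as follows. This is a minimal illustration under our own names, with only two of the 23 script classes and abridged character ranges; the authors' released code may differ:

```python
import re

# Abridged character classes from Table 4 (illustrative subset).
SCRIPT_CLASSES = {
    "Latin": re.compile(r"[A-Za-z\u00C0-\u024F ]+"),
    "Cyrillic": re.compile(r"[\u0400-\u052F ]+"),
}

def identify_scripts(sentence: str) -> dict:
    # Step 1: replace every symbol other than letters and spaces with a space.
    cleaned = "".join(c if c.isalpha() or c == " " else " " for c in sentence)
    # Step 2: collapse consecutive spaces into one.
    cleaned = re.sub(r" +", " ", cleaned).strip()
    # Step 3: match each script's class; keep only the non-empty results.
    results = {}
    for script, pattern in SCRIPT_CLASSES.items():
        content = re.sub(r" +", " ", " ".join(pattern.findall(cleaned))).strip()
        if content:
            results[script] = content
    return results
```

Running `identify_scripts("Horizon Forbidden West выйдет 18 февраля!")` yields the Latin content `"Horizon Forbidden West"` and the Cyrillic content `"выйдет февраля"`.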
Table 6. Definitions related to evaluation indicators.

| | Texts That Belong to the Script | Texts That Do Not Belong to the Script |
|---|---|---|
| Identified by the SI algorithm as belonging to the script | true positive (TP) | false positive (FP) |
| Identified by the SI algorithm as not belonging to the script | false negative (FN) | true negative (TN) |
Table 7. Script identification results.

| Experiment | Precision | Recall | Micro F-Score |
|---|---|---|---|
| Train | 0.9930 | 0.9930 | 0.9930 |
| Test | 0.9929 | 0.9929 | 0.9929 |
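With the definitions of Table 6, micro-averaged scores are computed by pooling TP, FP, and FN over all scripts. Since every sentence receives exactly one predicted script, each error is simultaneously an FP for the predicted script and an FN for the true script, which is why precision, recall, and micro F-score coincide in Table 7. A small sketch of this computation (our own helper, not the paper's code):

```python
def micro_scores(confusion: dict) -> tuple:
    """Micro-averaged precision, recall and F-score from a confusion
    matrix stored as confusion[true_script][predicted_script] = count."""
    tp = sum(row.get(script, 0) for script, row in confusion.items())
    errors = sum(count for script, row in confusion.items()
                 for predicted, count in row.items() if predicted != script)
    # Pooled over all scripts, every error is both one FP and one FN,
    # so micro precision and micro recall are identical.
    precision = tp / (tp + errors)
    recall = tp / (tp + errors)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score
```

For instance, `micro_scores({"A": {"A": 9, "B": 1}, "B": {"B": 10}})` gives precision = recall = micro F = 0.95.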
Table 8. The part of the confusion matrix with the most identification errors in the SI test (rows: true script; columns: identified script).

| | Arabic | CJK | Cyrillic | Latin | Tibetan |
|---|---|---|---|---|---|
| Arabic | 27,701 | 0 | 1 | 136 | 0 |
| CJK | 0 | 11,947 | 0 | 33 | 0 |
| Cyrillic | 0 | 3 | 59,629 | 2213 | 0 |
| Latin | 6 | 57 | 370 | 296,988 | 0 |
| Tibetan | 1 | 1 | 16 | 110 | 1374 |
Table 9. Statistics of other-script content that appears in texts in various languages in the test data.

| Script | Total Sentences | Hybrid Sentences | Percentage |
|---|---|---|---|
| Arabic | 28,018 | 2168 | 7.74% |
| Armenian | 2000 | 147 | 7.35% |
| CJK | 11,980 | 2498 | 20.85% |
| Cyrillic | 61,849 | 11,560 | 18.69% |
| Devanagari | 12,000 | 434 | 3.62% |
| Bengali | 6000 | 219 | 3.65% |
| Ethiopic | 2276 | 154 | 6.77% |
| Georgian | 4000 | 347 | 8.68% |
| Greek | 2313 | 362 | 15.65% |
| Gujarati | 2000 | 130 | 6.50% |
| Gurmukhi | 2000 | 75 | 3.75% |
| Hebrew | 6000 | 142 | 2.37% |
| Kannada | 4000 | 207 | 5.18% |
| Khmer | 355 | 71 | 20.00% |
| Latin | 297,424 | 1580 | 0.53% |
| Lao | 2000 | 53 | 2.65% |
| Oria | 2000 | 146 | 7.30% |
| Sinhala | 2000 | 186 | 9.30% |
| Tamil | 2000 | 130 | 6.50% |
| Telugu | 2000 | 61 | 3.05% |
| Thaana | 2000 | 348 | 17.40% |
| Thai | 2000 | 495 | 24.75% |
| Tibetan | 1505 | 261 | 17.34% |
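The hybrid sentences counted in Table 9 are the reason the SI algorithm is applied as a noise filter before a sentence enters its script's language identification dataset. A minimal sketch of that filtering step, assuming the Table 4 character classes (the helper names and the two-script subset are our own):

```python
import re

# Abridged character classes from Table 4 (illustrative subset).
SCRIPT_CLASSES = {
    "Latin": re.compile(r"[A-Za-z\u00C0-\u024F ]+"),
    "Cyrillic": re.compile(r"[\u0400-\u052F ]+"),
}

def is_hybrid(sentence: str, script: str) -> bool:
    """True if the sentence contains letters from any other known script."""
    stripped = sentence.replace(" ", "")  # spaces belong to every class
    return any(pattern.search(stripped)
               for other, pattern in SCRIPT_CLASSES.items() if other != script)

def keep_own_script(sentence: str, script: str) -> str:
    """Drop other-script content before the sentence is added to the
    language identification data of its own script."""
    fragments = SCRIPT_CLASSES[script].findall(sentence)
    return re.sub(r" +", " ", " ".join(fragments)).strip()
```

For example, `is_hybrid("Игра Horizon вышла", "Cyrillic")` is true, and `keep_own_script` on the same sentence returns `"Игра вышла"`, so only the sentence's own-script content is retained.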
Qasim, M.; Silamu, W.; Qiu, M. The Design of a Script Identification Algorithm and Its Application in Constructing a Text Language Identification Dataset. Data 2024, 9, 134. https://doi.org/10.3390/data9110134
