1. Introduction
The task of automatic speech recognition (ASR) refers to converting any acoustic signal containing human speech into the corresponding word sequence [
1]. The development of graphics processing units (GPUs) and deep neural networks (DNNs) [
2], the availability of transcribed speech corpora in the public domain [
3,
4,
5], and the wide use of voice interaction services that support hundreds of languages (e.g., Alexa, Google Voice Assistant, and Siri) have led to ASR solutions achieving—and even exceeding—human performance [
6]. That said, most ASR efforts have been directed towards developing models for languages for which large corpora exist (i.e., higher-resourced languages (The terms
lower- and
higher-resourced languages are used throughout the paper to emphasize the continuum existing across languages in terms of resources available for speech technology development.)), such as English, Mandarin, and Japanese (see, e.g., [
4,
7,
8]). ASR models built for lower-resourced languages, in turn, rarely achieve comparable robustness and reliability because the available training data are insufficient.
To address the problem of the inadequacy of training data for lower-resourced languages, such techniques as transfer learning [
9], data augmentation [
10], and high resource transliteration [
11], to name the most notable few, have been proposed. Special attention has also been devoted to the development of multilingual models, which enables the use of common linguistic features across languages, thus alleviating challenging data requirements [
12]. Most work on multilingual ASR for lower-resourced languages focuses on combining the data of similar languages and performing cross-language optimization, by utilizing positive transfer from higher-resourced languages during training [
13,
14,
15]. Research in transfer learning, too, has shown that linguistic similarity and relatedness generally lead to improved robustness of ASR models, particularly in resource-constrained settings [
16]. For example, linguistic relatedness and similarities have been made use of to build multilingual ASR models for lower-resourced Indian [
17,
18] and Ethiopian [
19] languages and Arabic dialects [
The use of unrelated languages, however, generally results in a trade-off between quality and quantity, with models yielding performance only comparable to that of monolingual models [
21] and with no significant improvement due to minimal linguistic overlap.
This work aims to make a contribution to the development of multilingual ASR for lower-resourced Turkic languages. To date, there have been studies conducted to develop multilingual models recognizing Turkic languages [
21,
22], but these models either covered only a few Turkic languages or recognized them alongside languages from other language families (e.g., English, Persian, Russian, and Swahili). In contrast, in this study, we focus exclusively on ten Turkic languages, namely Azerbaijani, Bashkir, Chuvash, Kazakh, Kyrgyz, Sakha, Tatar, Turkish, Uyghur, and Uzbek.
According to various sources, the ten languages under consideration are at present spoken by 125–150 million speakers [
23,
24]. Spread over the vast area of Eurasia, the languages fall into several branches (see
Table 1). With the exception of Chuvash and Sakha, which have peculiarities stemming from the early detachment from Common Turkic of the former and the influence of the Tungusic languages on the latter [
23], the languages are, on the whole, remarkably similar in terms of lexis, phonology, and morphology. This is reflected in a certain degree of mutual intelligibility across the languages, with some of the most frequent words in the Turkic languages being exactly alike [
We therefore hypothesize that exploiting the features common to the ten languages is more likely to result in a robust multilingual ASR model than combining unrelated languages, with some of the lower-resourced Turkic languages (e.g., Azerbaijani, Chuvash, and Sakha) benefiting from other Turkic languages for which more training resources are available (e.g., Bashkir, Kazakh, and Uzbek).
Our contributions to the development of multilingual ASR for Turkic languages are as follows:
We compare the results of multilingual models trained on the data of the ten Turkic languages with the results of monolingual models trained for each of the languages;
We compare the results of the multilingual models with the results of models trained on the data of the ten Turkic languages and two non-Turkic languages (English and Russian);
We create the largest open-source speech corpus for the Turkish language, containing 218.2 h of transcribed speech.
The remainder of the paper is organized as follows:
Section 2 provides an overview of existing work on multilingual ASR, focusing on both related and unrelated languages. In
Section 3, we provide a description of the datasets used in the study and the procedures adopted to pre-process and split the data, as well as the details of the experimental setup.
Section 4 presents the results obtained and discusses them.
Section 5 concludes the paper.
2. Related Work
The proliferation of studies in the field of ASR in recent years can be attributed to several factors, including a reduction in training time thanks to the use of GPUs in deep learning [
2], publicly available datasets (e.g., LibriSpeech [
4] and DiDiSpeech [
7]), and regular speech recognition competitions (e.g., CHiME-6 Challenge [
26]). Demonstrating a significant performance boost [
27,
28,
29,
30,
31], reporting a word error rate (WER) as low as 2–3% on popular datasets [
32], and even achieving human parity [
6], ASR research may create the false impression that the task is almost solved. However, the vast majority of the research focuses on mainstream languages for which extensive resources (e.g., recorded speech and human-labeled speech corpora) are available. For example, the whole Corpus of Spontaneous Japanese [
8] contains about 661 h of speech; the DiDiSpeech corpus of Mandarin [
7] and the LibriSpeech corpus of read English speech [
4] consist of about 800 and 1000 h of data, respectively. Consequently, it remains difficult to develop high-quality ASR systems for languages with limited data availability.
One of the proposed ways to get around this problem is the application of transfer-learning techniques [
33]. Although the original idea is to reuse the weights of a previously trained DNN for a new task, it can also be applied to the problem of data insufficiency. In [
9], the use of transfer learning in adapting a neural network originally trained for English ASR to German resulted in faster training, lower resource requirements, and reduced costs. Some studies propose similar methods, where the core idea is to train an ASR model jointly on multiple languages with the expectation that the system will perform better than systems trained on a single specific language. This approach is commonly referred to as multilingual ASR [
34].
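As a rough illustration of this idea (our own sketch, not the setup used in [9] or [34]; the architecture, dimensions, and checkpoint path are hypothetical), a pretrained acoustic model can be reused for a new language by replacing only its output layer and fine-tuning on the target-language data:

```python
import torch
import torch.nn as nn

class AcousticModel(nn.Module):
    """Toy acoustic model: a recurrent encoder plus a grapheme output layer."""

    def __init__(self, feat_dim=80, hidden_dim=512, vocab_size=32):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, num_layers=4, batch_first=True)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats):
        hidden, _ = self.encoder(feats)
        return self.output(hidden)

# 1. Load a model pretrained on a higher-resourced source language (hypothetical checkpoint).
model = AcousticModel(vocab_size=32)
model.load_state_dict(torch.load("source_language_asr.pt"))

# 2. Re-initialise only the output layer for the target-language grapheme set,
#    keeping the encoder weights learned on the source language.
model.output = nn.Linear(512, 40)

# 3. Optionally freeze the lower encoder layers, then fine-tune the remaining
#    parameters on the (much smaller) target-language data.
for name, param in model.encoder.named_parameters():
    if name.endswith("l0") or name.endswith("l1"):
        param.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```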
Earlier experiments with multilingual ASR [35,36] mostly explored cases involving only a few languages at a time and did not produce meaningful results except in language identification (LID) tasks. Language identifiers (IDs) are used as an additional input signal when multiple languages are involved, proving useful in both code-switching [
37,
38] and multilingual ASR [
12,
39]. There are two common ways to incorporate language IDs: (1) using special LID tokens at the beginning of output [
37,
38], or providing a one-hot vector representation as an additional input feature [
12], or (2) using an auxiliary classifier in a multi-task setting [
39].
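As a simple illustration of the first strategy (our own sketch; the language codes, token format, and feature dimensions are assumptions rather than details of the cited systems), the language ID can be prepended to the label sequence as a special token or attached to the acoustic features as a one-hot vector:

```python
import numpy as np

LANGUAGES = ["az", "ba", "cv", "kk", "ky", "sah", "tr", "tt", "ug", "uz"]
LID_TOKENS = {lang: f"<{lang}>" for lang in LANGUAGES}  # e.g., "<kk>" for Kazakh

def prepend_lid_token(label_tokens, lang):
    """Variant (a): the decoder is trained to emit the language ID token first."""
    return [LID_TOKENS[lang]] + label_tokens

def concat_onehot_lid(features, lang):
    """Variant (b): a one-hot language vector is appended to every feature frame."""
    onehot = np.zeros(len(LANGUAGES), dtype=features.dtype)
    onehot[LANGUAGES.index(lang)] = 1.0
    onehot = np.tile(onehot, (features.shape[0], 1))  # repeat for each frame
    return np.concatenate([features, onehot], axis=1)

# Dummy usage: 200 frames of 80-dimensional filterbank features.
frames = np.random.randn(200, 80).astype(np.float32)
print(prepend_lid_token(["sa", "lem"], "kk"))   # ['<kk>', 'sa', 'lem']
print(concat_onehot_lid(frames, "kk").shape)    # (200, 90)
```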
Some recent advances in multilingual ASR assume that the presence of higher-resourced languages in the training set positively affects the performance of a model for lower-resourced languages [
14,
15,
17,
18,
19,
20,
40]. In [
14], the scholars showed that it is possible to train a single massive ASR architecture for 51 different languages and more than 16,000 h of speech across them, which, in practice, is significantly less time-consuming to tune than developing 51 individual monolingual baselines. It was also reported that training multilingual ASR models can improve recognition performance for all the languages involved, with the lower-resourced languages observing a more significant reduction in WER and, for East Asian languages, in character error rate (CER). In another study [
19] exploring ASR for lower-resourced languages, multilingual systems for four Ethiopian languages were developed. One of the models trained with speech data from 22 languages other than the target languages achieved a WER of 15.79%. Furthermore, the inclusion of the speech of a closely related language (in terms of phonetic overlap) in multilingual model training resulted in a relative WER reduction of 51.41%.
Most of the studies on multilingual ASR conclude that the average increase in performance produced by multilingual models, as opposed to monolingual ones, is higher for languages with greater linguistic overlap. Moreover, the development of a unified end-to-end (E2E) solution for a large number of languages that can potentially outperform monolingual models has become one of the focal points of multilingual ASR. However, research shows that a model trained on a random set of languages does not consistently outperform monolingual models, even at a very large scale, where more than 40 languages are used in the training set [
12,
14,
15]. The authors of [
14] demonstrated that this is the case for higher-resourced languages in particular, as the multilingual model failed to beat the baseline WER and CER scores in all of the higher-resourced settings.
This has led to the realization that using a dataset of languages with high linguistic overlap between them might yield better results. One of the ways to select these languages is to draw upon the language families to which they belong, as it is clear that the linguistic overlap between these languages is much greater than for languages with no inherent linguistic connections [
19]. As a result, several recent studies into multilingual ASR have been carried out at the level of language families [
17,
18,
19,
20,
40].
The authors of [
18,
20] developed E2E ASR systems for Indian languages and Arabic dialects, respectively. Both papers report average performance improvements over monolingual models but still fail to outperform them for several individual languages. The findings were also consistent in the case of Ethiopian languages [
19], where the scholars were able to obtain comparable results without having a target language in the training set. It is also important to note that the quality of training data may hinder the transfer learning capacity of the model, as was shown in [
17]. The scholars were not able to achieve a significant improvement over monolingual experiments while using a dataset that contained systematic linguistic errors.
Most of the Turkic languages in our study are lower-resourced, with few studies and datasets available. As can be seen from
Table 1, these languages can be divided into five branches. Apart from Chuvash and Sakha, each belonging to a distinct subfamily, there are three major branches: Karluk, Kipchak, and Oghuz. To the best of our knowledge, while there are large open-source corpora for some of the languages belonging to the Karluk and Kipchak branches (e.g., the Bashkir set in Common Voice Corpus 10.0 (CVC) [
3], Kazakh Speech Corpus (KSC) [
41], and Uzbek Speech Corpus (USC) [
42]), there are no similar or sufficiently large publicly available datasets for most of the languages under consideration. For example, in [
43], a high-accuracy Tatar speech recognition system was trained on a proprietary dataset and the Tatar portion of the CVC. Specifically, the model was trained on 328 h of unlabeled data and then fine-tuned on 129 h of annotated data, achieving a WER of 5.37% on the CVC test set. It should be noted that in this work, the ASR model was trained on the full Tatar CVC training set (28 h), which has 100% text overlap with the corresponding test set. Similarly, the authors of [
44] developed an Uzbek ASR system trained on the Uzbek CVC (127 h) and Speechocean (
https://en.speechocean.com/datacenter/details/1907.html (accessed on 22 January 2023)) (80 h) datasets and obtained a CER score of 5.41% on the Uzbek CVC test split. However, it is unclear whether the authors used part of the invalidated Uzbek CVC for training, and the paper does not mention utterance overlap. In [
45], different language models and acoustic training methodologies for the Azerbaijani language were investigated using 80 h of speech collected from emergency calls. However, the data remain confidential, as they contain sensitive information about emergency cases.
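Given the train/test text-overlap issues noted above for the Tatar and Uzbek CVC splits, such overlap can be checked directly. Below is a minimal sketch for a Common Voice-style TSV release (our own illustration; the file paths are hypothetical, and the `sentence` column follows the Common Voice format):

```python
import csv

def load_sentences(tsv_path):
    """Collect normalised transcripts from a Common Voice-style TSV split."""
    with open(tsv_path, newline="", encoding="utf-8") as f:
        return {row["sentence"].strip().lower() for row in csv.DictReader(f, delimiter="\t")}

train_sents = load_sentences("cv-corpus/tt/train.tsv")  # hypothetical paths
test_sents = load_sentences("cv-corpus/tt/test.tsv")

overlap = train_sents & test_sents
print(f"{len(overlap)}/{len(test_sents)} test sentences "
      f"({100 * len(overlap) / len(test_sents):.1f}%) also occur in the training set.")
```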
As for the Turkish language, the corpus prepared by the Middle East Technical University (METU) [
46,
47] contains speech from 193 speakers (89 female and 104 male). Each speaker read 40 sentences that were selected randomly from a 2462-sentence set. Another Turkish speech corpus, containing broadcast news, was developed by Boğaziçi University [
48] and has a total length of 194 h. The largest Turkish dataset [
49] contains 350.27 h of validated speech. However, the data, which come from films and crowdsourcing, are not publicly available. A detailed comparison between the existing Turkish ASR corpora and the Turkish Speech Corpus (TSC) can be found in
Table 2.
4. Results and Discussion
The performance of the models on the test sets is given in
Table 6 and
Table 7. Given the incomparable distribution of data across the training, development, and test sets for some of the languages for which more than one dataset was available (i.e., Kazakh, Turkish, and Uzbek), we considered it fair and reasonable to evaluate the developed multilingual models separately on the CVC test sets and on the KSC, TSC, and USC test sets. While
Table 6 provides the results obtained by the models on the CVC test sets exclusively,
Table 7 contains the CER and WER scores for the models evaluated on the KSC, TSC, and USC test sets only. For readability, the dashed line separates the monolingual baselines from the multilingual models, and the green shading indicates the best results.
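For reference, the CER and WER figures discussed below are standard edit-distance error rates; the following is a minimal illustrative implementation (our own sketch, not the evaluation code used in the experiments):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (single-row DP)."""
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, row[0] = row[0], i
        for j, h in enumerate(hyp, start=1):
            cur = min(row[j] + 1,        # deletion
                      row[j - 1] + 1,    # insertion
                      prev + (r != h))   # substitution (0 if tokens match)
            prev, row[j] = row[j], cur
    return row[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate: character-level edit distance divided by reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

print(round(wer("the cat sat down", "the cat sit down"), 3))  # 0.25
print(round(cer("abcd", "abed"), 3))                          # 0.25
```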
As can be seen from
Table 6, for the CVC test sets, the
all_turkic model, trained on the datasets of the Turkic languages, performed best, achieving the lowest CER and WER scores for six out of the ten target languages. The
all_languages model, trained on all the 15 datasets in the study (with the addition of English and Russian), produced the lowest CER and WER scores for Tatar, Turkish, and Uzbek. Of note is Kazakh, for which the lowest CER score was achieved by
all_turkic, while the lowest WER score was obtained by
all_languages. However, the difference between the scores was negligibly small.
What stands out in
Table 7 is that, when evaluated on the KSC, TSC, and USC test sets, the
all_turkic and
all_languages models mostly produced second best CER/WER scores, yielding to
ksc_turkic and
tsc_turkic, although not considerably. Nevertheless,
all_turkic was able to achieve even lower CER/WER scores for Uzbek than
all_languages, evaluated on the corresponding CVC test set.
4.1. Monolingual versus Multilingual Models
In
Table 6, it is noticeable that all the monolingual models were outperformed by the multilingual models. To better illustrate how the ten Turkic languages were recognized by the monolingual models and the best-performing
all_turkic model, we present some of the decoded samples in
Table 8.
From
Table 6, we can see that the improvement was greatest for the lowest-resourced language in the study, Azerbaijani. With only a 0.13-hour-long dataset available, a substantial reduction in CER from 107.6% to 26.7% and in WER from 325.7% to 75.9% was observed for this language. In
Table 8, the monolingual
az_cvc model appears to have output the same sequence
dә based on the likelihood model and thus failed to produce a correct transcription. In comparison, the
all_turkic model generated text that was both intelligible and comprehensible with respect to the reference, although it systematically failed to correctly predict words containing the character ə, which represents the /e/ sound, in the Azerbaijani utterance. Presumably due to the lack of Azerbaijani training data, the model instead proposed similar-sounding words that differed only slightly in spelling, originating from Turkish (
illerde,
faaliyetine) and Uzbek (
muxtalif).
The CER/WER reduction trend held for another two lower-resourced languages in the study. The multilingual models for Chuvash and Sakha, the only representatives of their respective branches, notably decreased CER/WER for both languages, despite their considerable deviation from standard Turkic forms. The all_turkic model produced scores of 4.9%/17.2% and 15.7%/45.0% for Chuvash and Sakha, respectively, less than half of the scores obtained by the corresponding monolingual models.
With respect to the three Kipchak Turkic languages, namely Bashkir, Kyrgyz, and Tatar, a reduction in CER/WER was also observed, although to varying degrees and thanks to different models. While the scores of 13.6%/37.9% by the Tatar monolingual model were reduced to 5.5%/16.5% by
all_languages, it was
all_turkic again that took the Kyrgyz baseline scores down to 4.9%/13.1%. That said, the monolingual model for Bashkir, the Turkic language whose CVC data exceeded 230 h in length, yielded CER/WER scores that were not considerably higher than the lowest scores produced by
all_turkic, at 1.7%/5.5% and 1.5%/4.9%, respectively. These observations seem to suggest that CER/WER reduction is more notable for languages with lower amounts of (CVC) data (e.g., Azerbaijani, Chuvash, Kazakh, and Sakha) and less evident for languages with more resources (e.g., Bashkir and Uzbek). Despite the less remarkable CER/WER improvement for Bashkir than for the lower-resourced languages, it can be clearly seen in
Table 8 that, in contrast to the monolingual
ba_cvc model, the
all_turkic model was successful in recognizing loanwords, especially those taken from Russian and instantly familiar to most people in the former Soviet countries (
лoтерея,
билеты). Similarly, the
all_turkic model outperformed the monolingual Chuvash model in predicting loanwords, recognizing some completely correctly (
хoлoдильник) and others to varying degrees (
заведующий →
сoвету***,
диван →
тиван).
ASR for Uyghur, a language of the Karluk branch, also seems to have benefited notably from the development of multilingual models. One can see a steady decrease in CER/WER as the data of other Turkic languages were added to the training set. The joint use of the data of all the Turkic languages in the all_turkic model resulted in scores of 4.1%/11.0%.
In the case of Kazakh, Turkish, and Uzbek, the three languages in the study for which an additional speech corpus besides the CVC was used for model development, the data in
Table 6 and
Table 7 appear to suggest that the results may vary depending on the training and test sets used. To begin with, the Kazakh and Turkish monolingual models trained on the CVC data produced notably higher CER/WER results than the monolingual models trained on the KSC and the TSC. This can probably be attributed to the marked difference in the size of the training data. This is especially the case for Kazakh, for which the total amount of CVC data was as little as 1.60 h, as opposed to the hefty 332.60 h in the KSC. Thus, it is hardly surprising that
kk_ksc and
tr_tsc achieved the remarkable 2.0%/6.8% and 3.8%/12.6%, respectively, as compared to the 69.9%/101.2% of
kk_cvc and the 7.3%/20.1% of
tr_cvc. In contrast, the scores of
uz_cvc and
uz_usc were quite similar, although slightly lower for the former (4.2%/14.6% and 5.0%/16.8%, respectively), as the two Uzbek datasets were comparable in size.
As regards the multilingual models for Kazakh, Turkish, and Uzbek, when evaluated on the CVC test sets, the best performance was achieved by the all_languages model. While the CER scores for Turkish and Uzbek were approximately 2.9%, the WER scores ranged from 8.7% to 10.2%. For Kazakh, the model produced the lowest WER score (28.6%) but achieved the second-best CER score of 11.9%, yielding to all_turkic with 11.7%.
On the other hand, in the evaluation of the multilingual models on the KSC, TSC, and USC test sets, the best CER and WER results of 1.5% and 5.7%, respectively, in Kazakh ASR were produced by the
ksc_turkic model. Such low scores are likely to have been achieved owing to the sufficient amount of data in the training and test sets for the model to learn from and test its hypotheses on. The CER scores produced by
all_turkic and
all_languages were identical to that of
ksc_turkic, with the WER scores being only negligibly higher. For Turkish, the lowest scores were achieved by
tsc_turkic, 2.9%/9.6%. Although the model exhibited a CER result lower than that obtained on the CVC test set, the WER score was still slightly higher. Looking at the scores for Turkish ASR in
Table 6 and
Table 7, it is apparent that the multilingual models evaluated both on the CVC and TSC test sets produced somewhat similar results. In the case of the Uzbek language, the result of 2.7%/9.5% achieved by
all_turkic was the lowest in the evaluation of the multilingual models on both test sets.