Under-Represented Speech Dataset from Open Data: Case Study on the Romanian Language
Abstract
1. Introduction
2. Related Work
3. Methodology
4. Challenges
5. Dataset
Indicator | Category | Duration | # Segments | Avg. Seg. Duration (s) |
---|---|---|---|---|
Gender | F | 2 h 8 m 42 s | 1506 | 5.13 |
Gender | M | 2 h 10 m 13 s | 1131 | 6.91 |
Age | <14 | 10 m 44 s | 175 | 3.68 |
Age | 14–19 | 15 m 20 s | 168 | 5.48 |
Age | 19–29 | 1 h 5 m 34 s | 676 | 5.82 |
Age | 30–50 | 45 m 41 s | 457 | 6.00 |
Age | 50–70 | 1 h 1 m 22 s | 674 | 5.46 |
Age | >70 | 1 h 0 m 14 s | 487 | 7.42 |
MOS | 5 | 2 h 20 m 34 s | 1187 | 7.11 |
MOS | 4 | 1 h 56 m 44 s | 1435 | 4.88 |
MOS | 3 | 1 m 37 s | 15 | 6.47 |
Platform | YouTube | 3 h 32 m 48 s | 2014 | 6.34 |
Platform | Vimeo | 13 m 53 s | 194 | 4.29 |
Platform | SoundCloud | 32 m 14 s | 429 | 4.51 |
License | CC BY | 3 h 44 m 9 s | 2169 | 6.20 |
License | CC BY NC | 2 m 32 s | 39 | 3.90 |
License | CC BY NC SA | 32 m 14 s | 429 | 4.51 |
Type | Read | 33 m 32 s | 467 | 4.31 |
Type | Spontaneous | 3 h 45 m 23 s | 2170 | 6.23 |
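
The average segment durations in the table above follow from each row's total duration divided by its segment count. A minimal Python sketch of that arithmetic, assuming durations are written in the table's `h`/`m`/`s` notation; the helper names `parse_duration` and `avg_segment_seconds` are illustrative and are not part of the dataset tooling:

```python
import re

# Seconds per unit used in the table's duration notation.
UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}

def parse_duration(text: str) -> int:
    """Convert a string such as '2 h 8 m 42 s' or '54 m' to whole seconds."""
    total = 0
    for value, unit in re.findall(r"(\d+)\s*([hms])", text):
        total += int(value) * UNIT_SECONDS[unit]
    return total

def avg_segment_seconds(duration: str, segments: int) -> float:
    """Average segment duration in seconds, rounded to two decimals."""
    if segments <= 0:
        raise ValueError("segment count must be positive")
    return round(parse_duration(duration) / segments, 2)

# Values taken from the Gender rows of the table.
print(avg_segment_seconds("2 h 8 m 42 s", 1506))   # F row
print(avg_segment_seconds("2 h 10 m 13 s", 1131))  # M row
```

Because the published durations are rounded to whole seconds, a recomputed average can differ from the published one in the last decimal place.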
6. Evaluation
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
ASR | Automatic Speech Recognition |
CoRoLa | The Representative Corpus of Contemporary Romanian Language |
CC BY | Creative Commons Attribution |
CC BY NC | Creative Commons Attribution Non-Commercial |
CC BY NC SA | Creative Commons Attribution Non-Commercial Share Alike |
CV | Common Voice |
MaSS | Multilingual corpus of Sentence-aligned Spoken utterances |
MOS | Mean Opinion Score |
RASC | Romanian Anonymous Speech Corpus |
RTASC | Romanian Technical Acquisition Speech Corpus |
Corpus | # Hours | # Utterances | # Speakers |
---|---|---|---|
RSC | 100 | 136.1 k | 164 |
RoDigits | 37.5 | 15.4 k | 154 |
SWARA | 21 | 19 k | 17 |
RO-GRID | 6.6 | 4.8 k | 12 |
RSS | 5.5 | 5.7 k | 3 |
RASC | 4.8 | 3 k | - |
RTASC | 6.5 | 3.8 k | 6 |
CV | 9 | 8 k | 130 |
VoxPopuli | 83 | 27 k | 164 |
MaSS | 23 | 8.1 k | 1 |
FLEURS | 12 | - | - |
Keywords | Target Group |
---|---|
elevii te învață (“the students teach you”); probleme adolescenți (“problems teenagers”) | Young people |
sfaturi duhovnicești (“spiritual advice”); viața la pensie (“life when retired”) | Older people |
emisiune pentru femei (“women show”); feminism și literatură (“feminism and literature”) | Women |
editura (“publishing house”); antropologie (“anthropology”) | Generic |
Age | F Duration | F # Segments | F Avg. Duration (s) | M Duration | M # Segments | M Avg. Duration (s) |
---|---|---|---|---|---|---|
<14 | 1 m 38 s | 27 | 3.63 | 9 m 6 s | 148 | 3.69 |
14–19 | - | - | - | 15 m 20 s | 168 | 5.48 |
19–29 | 54 m | 557 | 5.82 | 11 m 34 s | 119 | 5.83 |
30–50 | 33 m 51 s | 352 | 5.77 | 11 m 51 s | 105 | 6.77 |
50–70 | 26 m 26 s | 431 | 3.68 | 34 m 56 s | 243 | 8.63 |
>70 | 12 m 48 s | 139 | 5.53 | 47 m 26 s | 348 | 8.18 |
Indicator | Value | Indicator | Value |
---|---|---|---|
Text files | 2637 | UPOS Noun | 8471 |
Sentences | 6652 | UPOS Verb | 5793 |
Tokens | 48,530 | UPOS Adp | 4009 |
Unique tokens | 8221 | UPOS Adv | 3717 |
Unique lemmas | 5509 | UPOS Adj | 1952 |
Hapax legomena | 5055 | UPOS Num | 615 |
Avg. Sentence Length | 7.30 | UPOS PropN | 851 |
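
Counts such as tokens, unique tokens, hapax legomena, and average sentence length can be derived from tokenized text. The sketch below is only an illustration under stated assumptions: the input is already split into sentences and tokens, tokens are lowercased before counting, and hapax legomena are counted over token forms rather than lemmas. The source does not say which of these conventions the authors used, and the function name `lexical_stats` is hypothetical.

```python
from collections import Counter

def lexical_stats(sentences: list[list[str]]) -> dict[str, float]:
    """Basic lexical statistics over pre-tokenized sentences."""
    if not sentences:
        raise ValueError("at least one sentence is required")
    # Assumption: case-insensitive counting of surface forms.
    tokens = [tok.lower() for sent in sentences for tok in sent]
    counts = Counter(tokens)
    return {
        "sentences": len(sentences),
        "tokens": len(tokens),
        "unique_tokens": len(counts),
        # A hapax legomenon is a form that occurs exactly once.
        "hapax_legomena": sum(1 for c in counts.values() if c == 1),
        "avg_sentence_length": round(len(tokens) / len(sentences), 2),
    }

# Toy input, not taken from the dataset.
toy = [["Ana", "are", "mere"], ["ana", "are", "pere"]]
print(lexical_stats(toy))
```

With the table's own figures, 48,530 tokens over 6652 sentences gives the reported average sentence length of 7.30.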
System | Baseline WER | Baseline CER | USPDATRO WER | USPDATRO CER |
---|---|---|---|---|
RO-DS2 | 0.0991 | 0.0280 | 0.5714 | 0.3638 |
RO-DS2-ROBIN | 0.0991 | - | 0.6491 | 0.3381 |
RO-WAV2VEC2 | 0.1393 | 0.0983 | 0.9115 | 0.6675 |
RO-Whisper medium | 0.1379 | - | 0.2800 | 0.1319 |
RO-Whisper large-v2 | 0.1261 | - | 0.4330 | 0.2875 |
Model | Parameters | Beam | CoRoLa WER | CoRoLa CER | USPDATRO WER | USPDATRO CER |
---|---|---|---|---|---|---|
tiny | 39M | N | 1.2218 | 0.8135 | 1.1502 | 0.6169 |
tiny | 39M | Y | 0.7903 | 0.3439 | 0.9115 | 0.4771 |
base | 74M | N | 0.6079 | 0.2534 | 0.9347 | 0.5086 |
base | 74M | Y | 0.6433 | 0.2625 | 0.7275 | 0.3391 |
small | 244M | N | 0.5027 | 0.2144 | 0.6789 | 0.3169 |
small | 244M | Y | 0.5005 | 0.2143 | 0.5794 | 0.2380 |
medium | 769M | N | 0.5562 | 0.2768 | 0.5566 | 0.2516 |
medium | 769M | Y | 0.4347 | 0.1887 | 0.5051 | 0.2064 |
large-v2 | 1550M | N | 0.4561 | 0.2142 | 0.5104 | 0.2189 |
large-v2 | 1550M | Y | 0.4052 | 0.1777 | 0.4874 | 0.1952 |
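
WER and CER in the evaluation tables are edit-distance ratios: the number of substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length in words or characters. The following is a minimal sketch of the standard definitions, assuming whitespace tokenization, no text normalization, and spaces counted as characters for CER; the authors' exact scoring setup is not given here, and the function names are illustrative.

```python
from typing import Sequence

def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two sequences (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)

# Toy strings, not taken from the dataset: three inserted words
# against a two-word reference.
print(wer("bună ziua", "bună ziua tuturor celor prezenți"))
```

Because insertions are counted against the reference length, the ratio is not bounded by 1, which is why values above 1 (for example, the `tiny` model without beam search) are possible.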
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Păiș, V.; Barbu Mititelu, V.; Irimia, E.; Ion, R.; Tufiș, D. Under-Represented Speech Dataset from Open Data: Case Study on the Romanian Language. Appl. Sci. 2024, 14, 9043. https://doi.org/10.3390/app14199043