Evaluating Novel Speech Transcription Architectures on the Spanish RTVE2020 Database
Abstract
1. Introduction
2. Recent Advances in Speech Recognition
3. Corpora Description
3.1. Training Corpora
3.1.1. Acoustic Corpus
3.1.2. Text Corpus
3.2. Evaluation Data
RTVE2020 Database
4. Systems Description and Configuration
4.1. Novel ASR Architectures
4.1.1. Multistream CNN Based System
4.1.2. Quartznet Q15×5 Based System
4.1.3. Wav2vec2.0 Based System
4.2. Baseline ASR Architectures
4.2.1. DNN-HMM Based System
4.2.2. Quartznet Q5×5 Based System
5. Results and Resources
Processing Time and Resources
6. Conclusions and Future Work
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
Abbreviations
Abbreviation | Meaning
---|---
RTVE | Corporación de Radio y Televisión Española, S. A.
ASR | Automatic Speech Recognition
SoA | State of the Art
DNN | Deep Neural Network
E2E | End-To-End
CNN | Convolutional Neural Network
HMM | Hidden Markov Model
TDNN-F | Factorised Time-Delay Neural Network
SRU | Self-attentive Simple Recurrent Unit
WER | Word Error Rate
LM | Language Model
AM | Acoustic Model
HW | Hardware
LSTM | Long Short-Term Memory
RNN | Recurrent Neural Network
EiTB | Euskal Irrati Telebista
MFCC | Mel-Frequency Cepstral Coefficients
GPU | Graphics Processing Unit
SAD | Speech Activity Detection
RTF | Real-Time Factor
References
- Georgescu, A.L.; Pappalardo, A.; Cucu, H.; Blott, M. Performance vs. hardware requirements in state-of-the-art automatic speech recognition. EURASIP J. Audio Speech Music Process. 2021, 2021, 1–30.
- Baevski, A.; Zhou, H.; Mohamed, A.; Auli, M. Wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv 2020, arXiv:2006.11477.
- Hsu, W.N.; Tsai, Y.H.H.; Bolte, B.; Salakhutdinov, R.; Mohamed, A. HuBERT: How much can a bad teacher benefit ASR pre-training? In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–12 June 2021; pp. 6533–6537.
- Lleida, E.; Ortega, A.; Miguel, A.; Bazán-Gil, V.; Pérez, C.; Gómez, M.; de Prada, A. RTVE2020 Database Description. 2020. Available online: http://catedrartve.unizar.es/reto2020/RTVE2020DB.pdf (accessed on 28 December 2021).
- Povey, D.; Ghoshal, A.; Boulianne, G.; Burget, L.; Glembek, O.; Goel, N.; Hannemann, M.; Motlicek, P.; Qian, Y.; Schwarz, P.; et al. The Kaldi speech recognition toolkit. In Proceedings of the 2011 IEEE Workshop on Automatic Speech Recognition and Understanding, Waikoloa, HI, USA, 11–15 December 2011.
- Kriman, S.; Beliaev, S.; Ginsburg, B.; Huang, J.; Kuchaiev, O.; Lavrukhin, V.; Leary, R.; Li, J.; Zhang, Y. QuartzNet: Deep automatic speech recognition with 1D time-channel separable convolutions. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6124–6128.
- Alvarez, A.; Arzelus, H.; Torre, I.G.; González-Docasal, A. The Vicomtech Speech Transcription Systems for the Albayzín-RTVE 2020 Speech to Text Transcription Challenge. In Proceedings of the IberSPEECH 2021, Valladolid, Spain, 24–25 March 2021; pp. 104–107.
- Hinton, G.; Deng, L.; Yu, D.; Dahl, G.E.; Mohamed, A.R.; Jaitly, N.; Senior, A.; Vanhoucke, V.; Nguyen, P.; Sainath, T.N.; et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Process. Mag. 2012, 29, 82–97.
- Amodei, D.; Ananthanarayanan, S.; Anubhai, R.; Bai, J.; Battenberg, E.; Case, C.; Casper, J.; Catanzaro, B.; Cheng, Q.; Chen, G.; et al. Deep Speech 2: End-to-end speech recognition in English and Mandarin. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 20–22 June 2016; pp. 173–182.
- Graves, A.; Jaitly, N. Towards end-to-end speech recognition with recurrent neural networks. In Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 21–26 June 2014; Volume 32, pp. 1764–1772.
- Chan, W.; Jaitly, N.; Le, Q.; Vinyals, O. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 4960–4964.
- Chorowski, J.K.; Bahdanau, D.; Serdyuk, D.; Cho, K.; Bengio, Y. Attention-based models for speech recognition. Adv. Neural Inf. Process. Syst. 2015, 28, 577–585.
- Lu, L.; Zhang, X.; Renals, S. On training the recurrent neural network encoder-decoder for large vocabulary end-to-end speech recognition. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 5060–5064.
- Yao, Z.; Wu, D.; Wang, X.; Zhang, B.; Yu, F.; Yang, C.; Peng, Z.; Chen, X.; Xie, L.; Lei, X. WeNet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit. arXiv 2021, arXiv:2102.01547.
- Chang, X.; Maekaku, T.; Guo, P.; Shi, J.; Lu, Y.J.; Subramanian, A.S.; Wang, T.; Yang, S.W.; Tsao, Y.; Lee, H.Y.; et al. An exploration of self-supervised pretrained representations for end-to-end speech recognition. arXiv 2021, arXiv:2110.04590.
- Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. Librispeech: An ASR corpus based on public domain audio books. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 19–24 April 2015; pp. 5206–5210.
- Han, K.J.; Pan, J.; Tadala, V.K.N.; Ma, T.; Povey, D. Multistream CNN for robust acoustic modeling. In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–12 June 2021; pp. 6873–6877.
- Han, K.J.; Prieto, R.; Ma, T. State-of-the-art speech recognition using multi-stream self-attention with dilated 1D convolutions. In Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Sentosa, Singapore, 14–18 December 2019; pp. 54–61.
- Pan, J.; Shapiro, J.; Wohlwend, J.; Han, K.J.; Lei, T.; Ma, T. ASAPP-ASR: Multistream CNN and self-attentive SRU for SOTA speech recognition. arXiv 2020, arXiv:2005.10469.
- Li, J.; Lavrukhin, V.; Ginsburg, B.; Leary, R.; Kuchaiev, O.; Cohen, J.M.; Nguyen, H.; Gadde, R.T. Jasper: An end-to-end convolutional neural acoustic model. arXiv 2019, arXiv:1904.03288.
- Hsu, W.N.; Bolte, B.; Tsai, Y.H.H.; Lakhotia, K.; Salakhutdinov, R.; Mohamed, A. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. arXiv 2021, arXiv:2106.07447.
- Wang, Y.; Mohamed, A.; Le, D.; Liu, C.; Xiao, A.; Mahadeokar, J.; Huang, H.; Tjandra, A.; Zhang, X.; Zhang, F.; et al. Transformer-based acoustic modeling for hybrid speech recognition. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6874–6878.
- Lleida, E.; Ortega, A.; Miguel, A.; Bazán-Gil, V.; Pérez, C.; Gómez, M.; de Prada, A. Albayzin 2018 evaluation: The IberSpeech-RTVE challenge on speech technologies for Spanish broadcast media. Appl. Sci. 2019, 9, 5412.
- del Pozo, A.; Aliprandi, C.; Álvarez, A.; Mendes, C.; Neto, J.P.; Paulo, S.; Piccinini, N.; Raffaelli, M. SAVAS: Collecting, annotating and sharing audiovisual language resources for automatic subtitling. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014), Reykjavik, Iceland, 26–31 May 2014; pp. 432–436.
- Ardila, R.; Branson, M.; Davis, K.; Henretty, M.; Kohler, M.; Meyer, J.; Morais, R.; Saunders, L.; Tyers, F.M.; Weber, G. Common Voice: A massively-multilingual speech corpus. arXiv 2019, arXiv:1912.06670.
- Casacuberta, F.; Garcia, R.; Llisterri, J.; Nadeu, C.; Pardo, J.; Rubio, A. Development of Spanish corpora for speech research (Albayzin). In Proceedings of the Workshop on International Cooperation and Standardization of Speech Databases and Speech I/O Assessment Methods, Chiavari, Italy, 26 September 1991; pp. 26–28.
- Campione, E.; Véronis, J. A multilingual prosodic database. In Proceedings of the Fifth International Conference on Spoken Language Processing, Sydney, Australia, 30 November–4 December 1998.
- Park, D.S.; Chan, W.; Zhang, Y.; Chiu, C.C.; Zoph, B.; Cubuk, E.D.; Le, Q.V. SpecAugment: A simple data augmentation method for automatic speech recognition. arXiv 2019, arXiv:1904.08779.
- Ko, T.; Peddinti, V.; Povey, D.; Khudanpur, S. Audio augmentation for speech recognition. In Proceedings of the Sixteenth Annual Conference of the International Speech Communication Association, Dresden, Germany, 6–10 September 2015.
- Peddinti, V.; Povey, D.; Khudanpur, S. A time delay neural network architecture for efficient modeling of long temporal contexts. In Proceedings of the Sixteenth Annual Conference of the International Speech Communication Association, Dresden, Germany, 6–10 September 2015.
- Xu, H.; Chen, T.; Gao, D.; Wang, Y.; Li, K.; Goel, N.; Carmiel, Y.; Povey, D.; Khudanpur, S. A pruned RNNLM lattice-rescoring algorithm for automatic speech recognition. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5929–5933.
- Smith, L.N. Cyclical learning rates for training neural networks. In Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), Santa Rosa, CA, USA, 24–31 March 2017; pp. 464–472.
- Kucsko, G. pyctcdecode. Available online: https://github.com/kensho-technologies/pyctcdecode (accessed on 28 December 2021).
- Conneau, A.; Baevski, A.; Collobert, R.; Mohamed, A.; Auli, M. Unsupervised cross-lingual representation learning for speech recognition. arXiv 2020, arXiv:2006.13979.
- Wang, C.; Rivière, M.; Lee, A.; Wu, A.; Talnikar, C.; Haziza, D.; Williamson, M.; Pino, J.; Dupoux, E. VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. arXiv 2021, arXiv:2101.00390.
- Pratap, V.; Xu, Q.; Sriram, A.; Synnaeve, G.; Collobert, R. MLS: A large-scale multilingual dataset for speech research. arXiv 2020, arXiv:2012.03411.
- Valk, J.; Alumäe, T. VoxLingua107: A dataset for spoken language recognition. In Proceedings of the 2021 IEEE Spoken Language Technology Workshop (SLT), Shenzhen, China, 19–22 January 2021; pp. 652–658.
- Babu, A.; Wang, C.; Tjandra, A.; Lakhotia, K.; Xu, Q.; Goyal, N.; Singh, K.; von Platen, P.; Saraf, Y.; Pino, J.; et al. XLS-R: Self-supervised cross-lingual speech representation learning at scale. arXiv 2021, arXiv:2111.09296.
- Povey, D.; Cheng, G.; Wang, Y.; Li, K.; Xu, H.; Yarmohammadi, M.; Khudanpur, S. Semi-orthogonal low-rank matrix factorization for deep neural networks. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018; pp. 3743–3747.
- Lleida, E.; Ortega, A.; Miguel, A.; Bazán, V.; Pérez, C.; Zotano, M.; de Prada, A. RTVE2018 Database Description; Vivolab and Corporación Radiotelevisión Española: Zaragoza, Spain, 2018.
- Yang, S.W.; Chi, P.H.; Chuang, Y.S.; Lai, C.I.J.; Lakhotia, K.; Lin, Y.Y.; Liu, A.T.; Shi, J.; Chang, X.; Lin, G.T.; et al. SUPERB: Speech processing Universal PERformance Benchmark. arXiv 2021, arXiv:2105.01051.
Dataset | Duration | License
---|---|---
RTVE2018 | 112 h 30 min | Non-Commercial
SAVAS | 160 h 58 min | Commercial/Research
IDAZLE | 137 h 8 min | Non-Commercial
A la Carta | 168 h 29 min | Non-Commercial
Common Voice | 158 h 9 min | Mozilla Public License 2.0
Albayzin | 5 h 33 min | Commercial/Academic
Multext | 0 h 47 min | Commercial/Academic
Total | 743 h 35 min |
Corpus | #Words
---|---
Transcriptions | 7,946,991
RTVE2018 | 56,628,710
A la Carta | 106,716,060
Wikipedia | 489,633,255
Total | 660,925,016
TV Program | Duration | Description
---|---|---
Ese programa del que Ud. habla | 01:58:36 | A TV program that reviews daily political, cultural, social and sports news from the perspective of comedy.
Los desayunos de RTVE | 10:58:34 | The daily news, politics, interviews and debate program.
Neverfilms | 00:11:41 | A web series that humorously parodies trailers of series and movies well known to the public.
Si fueras tú | 00:51:14 | Interactive series that tells the story of a young girl.
Bajo la red | 00:59:01 | A youth fiction series whose plot revolves around a chain of favours on the internet.
Comando actualidad | 04:01:31 | A show that presents a current topic through the choral gaze of several street reporters.
Boca norte | 01:00:46 | A story of young people who dance to the rhythm of trap.
Wake-up | 00:57:28 | A story that combines science fiction, a post-apocalyptic Madrid and lots of action inspired by video games.
Versión española | 02:29:12 | Program dedicated to the promotion of Spanish and Latin American cinema.
Aquí la tierra | 10:26:02 | A magazine that deals with the influence of climatology and meteorology both personally and globally.
Mercado central | 08:39:47 | A Spanish soap opera set in a present-day Madrid market.
Vaya crack | 05:06:00 | A contest where contestants take multiple quizzes designed to test their abilities in several disciplines.
Cómo nos reímos | 02:51:42 | A program dedicated to the great comedians and their work on RTVE programs.
Imprescindibles | 03:12:31 | A documentary series on the most outstanding figures of Spanish culture in the 20th century.
Millennium | 01:56:11 | Debate show for the spectators of today, accompanying them in the analysis of everyday events.
Total duration | 55:40:16 |
AM Architecture | Model | Type | WER (%)
---|---|---|---
DNN-HMM | Multistream CNN | novel | 17.60
DNN-HMM | CNN-TDNN-F | baseline | 19.27
Quartznet | Q15×5 | novel | 22.95
Quartznet | Q5×5 | baseline | 28.42
Wav2vec2.0 | Wav2Vec2-XLS-R-300M (fine-tuned) | novel | 20.68
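The WER figures here and in the per-program breakdown below are standard word error rates: the minimum number of word substitutions, deletions and insertions needed to turn a hypothesis into the reference, divided by the number of reference words. As a reference point only (this is a minimal sketch, not the official scoring tool of the Albayzin-RTVE evaluation):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (S + D + I) / N, via word-level Levenshtein distance.

    Assumes a non-empty, already-normalised reference transcript.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution over four reference words -> 25.00% WER.
print(f"{wer('el tiempo en madrid', 'el tiempo de madrid') * 100:.2f}")
```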
TV Program | Multistream CNN | CNN-TDNN-F | Q15×5 | Q5×5 | Wav2vec2.0
---|---|---|---|---|---
Ese programa del que Ud. habla | 23.64 | 25.67 | 29.65 | 36.15 | 26.81
Los desayunos de RTVE | 9.26 | 10.11 | 12.14 | 14.68 | 11.08
Neverfilms | 19.81 | 24.21 | 29.03 | 37.82 | 28.05
Si fueras tú | 24.57 | 29.31 | 36.76 | 46.73 | 36.43
Bajo la red | 22.41 | 33.31 | 32.99 | 41.06 | 32.33
Comando actualidad | 22.58 | 24.68 | 27.34 | 32.70 | 25.60
Boca norte | 32.07 | 37.94 | 43.16 | 52.92 | 40.37
Wake-up | 30.87 | 33.96 | 40.81 | 47.71 | 38.19
Versión española | 16.14 | 18.10 | 19.15 | 25.66 | 18.06
Aquí la tierra | 14.90 | 16.48 | 19.69 | 24.68 | 17.67
Mercado central | 16.44 | 17.83 | 25.43 | 34.05 | 21.91
Vaya crack | 19.22 | 19.96 | 28.43 | 30.16 | 20.80
Cómo nos reímos | 46.17 | 48.53 | 54.33 | 61.41 | 53.20
Imprescindibles | 30.44 | 34.45 | 37.12 | 44.94 | 29.52
Millennium | 16.02 | 15.98 | 17.30 | 18.82 | 17.57
Global | 17.60 | 19.27 | 22.96 | 28.42 | 20.68
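From the Global row, the relative gap between each novel architecture and its baseline counterpart can be checked in a few lines (WER values copied from the table above; the percentages are plain arithmetic on those values, not figures from the paper):

```python
# Global WER (%) from the table above.
wer_global = {
    "Multistream CNN": 17.60,
    "CNN-TDNN-F": 19.27,
    "Quartznet Q15x5": 22.96,
    "Quartznet Q5x5": 28.42,
}

# Relative WER reduction of each novel system over its baseline.
for novel, baseline in [("Multistream CNN", "CNN-TDNN-F"),
                        ("Quartznet Q15x5", "Quartznet Q5x5")]:
    reduction = (wer_global[baseline] - wer_global[novel]) / wer_global[baseline]
    print(f"{novel} vs. {baseline}: {reduction:.1%} relative WER reduction")
# Multistream CNN vs. CNN-TDNN-F: 8.7% relative WER reduction
# Quartznet Q15x5 vs. Quartznet Q5x5: 19.2% relative WER reduction
```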
ASR System | RAM (GB) | CPU Cores | GPU Memory (GB) | Processing Time | RTF
---|---|---|---|---|---
Multistream CNN | 6.7 | 1 | 12 | 4.9 h | 0.09
CNN-TDNN-F | 6.7 | 1 | 12 | 4.9 h | 0.09
Quartznet Q15×5 | 6 | 1 | 12 | 8 h | 0.14
Quartznet Q5×5 | 6 | 1 | 12 | 7.5 h | 0.13
Wav2vec2.0 | 30 | 40 | 18 | 8 h | 0.14
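The RTF column is consistent with processing time divided by the 55:40:16 of RTVE2020 test audio listed above; a quick check of that standard definition (hours copied from the table, rounding to two decimals):

```python
# Real-Time Factor = processing time / audio duration.
audio_hours = 55 + 40 / 60 + 16 / 3600  # 55:40:16 of test audio

processing_hours = {
    "Multistream CNN": 4.9,
    "CNN-TDNN-F": 4.9,
    "Quartznet Q15x5": 8.0,
    "Quartznet Q5x5": 7.5,
    "Wav2vec2.0": 8.0,
}

for system, hours in processing_hours.items():
    print(f"{system}: RTF = {hours / audio_hours:.2f}")
# Matches the table: 0.09, 0.09, 0.14, 0.13, 0.14
```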
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Álvarez, A.; Arzelus, H.; Torre, I.G.; González-Docasal, A. Evaluating Novel Speech Transcription Architectures on the Spanish RTVE2020 Database. Appl. Sci. 2022, 12, 1889. https://doi.org/10.3390/app12041889