Improvement of Acoustic Models Fused with Lip Visual Information for Low-Resource Speech
Abstract
1. Introduction
- (1) Focusing on a low-resource, endangered language, the paper proposes audiovisual speech recognition (AVSR) to tackle the difficulties that hamper single-modality recognition: scarce speech data and labels, noise, and the small number of available speakers. Beyond acoustic modeling alone, AVSR draws on lip movements as complementary information to achieve multi-modal learning.
- (2) The paper proposes an end-to-end model based on an LSTM-Transformer architecture. During training, the model learns not only the contextual relations within each sequence but also the temporal and spatial correlations between the modalities, even when the fused information is diverse and complex (see the sketch after this list).
- (3) Speaker-related experiments are designed to verify that the proposed method alleviates the speaker-dependence problem, by comparing recognition accuracy when the test-set speakers are included in the training set and when they are not.
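The fusion architecture itself is detailed in Section 2. As a rough orientation only, the following is a minimal sketch, assuming a PyTorch implementation, of how per-modality LSTM encoders, a Transformer encoder over the concatenated streams, and a CTC output layer could be combined. All module names, layer sizes, and the assumption of time-aligned audio and lip-region features are illustrative, not the authors' exact implementation.

```python
# Minimal sketch (assumption): LSTM encoders per modality + Transformer fusion + CTC head.
# Dimensions, layer counts, and the concatenation-based fusion are illustrative only.
import torch
import torch.nn as nn

class AVFusionCTC(nn.Module):
    def __init__(self, audio_dim=80, video_dim=512, hidden=256, vocab_size=60):
        super().__init__()
        # Modality-specific LSTMs capture the temporal context of each stream.
        self.audio_lstm = nn.LSTM(audio_dim, hidden, num_layers=2,
                                  batch_first=True, bidirectional=True)
        self.video_lstm = nn.LSTM(video_dim, hidden, num_layers=2,
                                  batch_first=True, bidirectional=True)
        # Transformer encoder models correlations across the fused sequence.
        layer = nn.TransformerEncoderLayer(d_model=4 * hidden, nhead=8,
                                           batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=4)
        # Linear projection to CTC output units (e.g., initials/finals + blank).
        self.ctc_head = nn.Linear(4 * hidden, vocab_size)

    def forward(self, audio, video):
        # audio: (B, T, audio_dim); video: (B, T, video_dim), pre-aligned in time.
        a, _ = self.audio_lstm(audio)
        v, _ = self.video_lstm(video)
        fused = torch.cat([a, v], dim=-1)             # feature-level fusion
        fused = self.fusion(fused)
        return self.ctc_head(fused).log_softmax(-1)   # (B, T, vocab) log-probs for CTC

model = AVFusionCTC()
logp = model(torch.randn(2, 100, 80), torch.randn(2, 100, 512))
```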
2. Audiovisual Fusion Model for Tujia Speech
2.1. Feature Extraction
2.2. Audiovisual Fusion Block
2.3. TM-CTC Decoder
3. Experiments
3.1. Dataset
3.2. Setup Conditions for Initials and Finals
3.3. Evaluation Metrics
3.4. Experimental Setup
3.4.1. Visual Features
3.4.2. Acoustic Features
3.4.3. LSTM-Transformer
3.4.4. Sequences Alignment
3.4.5. Speakers Preparation
4. Results and Analysis
4.1. Results of Acoustic Model
4.2. Results of AVSR
4.3. Results of Speaker-Independent Experiments
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
Name | Dates | Speakers | Sentences | Duration |
---|---|---|---|---|
Tima Legend | 20 June 2017 | Lu Bangzhu | 349 | 12 m 59 s |
Origins of Lu Family | 19 June 2017 | Lu Bangzhu | 186 | 7 m 53 s |
Festival Customs of Tujia People | 17 June 2017 | Lu Bangzhu | 329 | 12 m 5 s |
Marital Customs of Tujia People | 17 June 2017 | Lu Bangzhu | 228 | 8 m 27 s |
Construction of Stilted Buildings | 17 June 2017 | Lu Bangzhu | 348 | 12 m |
Experience of Being an Official | 17 June 2017 | Lu Chengcheng | 155 | 6 m 23 s |
Life of Lu Longwen | 15 June 2017 | Lu Longwen | 122 | 3 m 55 s |
Life of Lu Kaibai | 14 June 2017 | Lu Kaibai | 43 | 2 m 23 s |
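For the speaker-independent experiments outlined in Section 3.4.5, one natural data preparation is to hold out all sentences from selected speakers so that they never appear in training. The following is a hypothetical sketch using the speaker labels and sentence counts from the table above; it is not the authors' actual partition.

```python
# Hypothetical speaker-held-out split (not the paper's actual partition):
# sentences from held-out speakers never appear in the training set.
def split_by_speaker(utterances, test_speakers):
    """utterances: list of (speaker, sentence_id) pairs."""
    train, test = [], []
    for spk, utt in utterances:
        (test if spk in test_speakers else train).append((spk, utt))
    return train, test

# Counts taken from the dataset table: 1440 + 155 + 122 + 43 = 1760 sentences.
utts = [("Lu Bangzhu", f"bangzhu_{i}") for i in range(1440)] + \
       [("Lu Chengcheng", f"chengcheng_{i}") for i in range(155)] + \
       [("Lu Longwen", f"longwen_{i}") for i in range(122)] + \
       [("Lu Kaibai", f"kaibai_{i}") for i in range(43)]

train, test = split_by_speaker(utts, test_speakers={"Lu Longwen", "Lu Kaibai"})
print(len(train), len(test))  # 1595 165
```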
Manner | Voicing | Aspiration | Bilabial | Supra-Dental | Blade-Alveolar | Frontal | Velar |
---|---|---|---|---|---|---|---|
Stops | voiceless | unaspirated | p | t | | | k |
Stops | voiceless | aspirated | ph | th | | | kh |
Affricates | voiceless | unaspirated | | | ts | tɕ | |
Affricates | voiceless | aspirated | | | tsh | tɕh | |
Nasals | | | m | n | | | ŋ |
Laterals | | | | l | | | |
Fricatives | voiceless | | | | s | ɕ | x |
Fricatives | voiced | | | | z | | ɣ |
Semivowels | | | w | | | j | |
Categorization | Finals |
---|---|
Simple Finals | i, e, a, ɨ, o, u |
Compound Finals | ie, ei, uei, ai, uai, ia, ua, iu, ou, iau, au |
Nasal Finals | ĩ, ẽ, ã, ũ, uẽ, iã, uã, iũ |
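To illustrate how the inventories above could serve as modeling units, here is a hypothetical sketch that splits a transcribed Tujia syllable into an initial and a final by longest-prefix matching against the consonant inventory. The helper and its matching rule are illustrative assumptions; the paper's actual lexicon preparation may differ.

```python
# Hypothetical helper (not from the paper): split an IPA-transcribed syllable
# into initial + final using longest-prefix matching over the initial inventory.
INITIALS = ["p", "ph", "m", "w", "t", "th", "n", "l", "ts", "tsh", "s", "z",
            "tɕ", "tɕh", "ɕ", "j", "k", "kh", "ŋ", "x", "ɣ"]

def split_syllable(syllable: str):
    # Try the longest matching initial first; anything left over is the final.
    for ini in sorted(INITIALS, key=len, reverse=True):
        if syllable.startswith(ini):
            return ini, syllable[len(ini):]
    # No initial matched: the syllable consists of a final only (zero initial).
    return "", syllable

print(split_syllable("tsha"))  # ('tsh', 'a')
print(split_syllable("ie"))    # ('', 'ie')
```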
Model | 2000 Sentences | 3000 Sentences | 4000 Sentences | 5000 Sentences | 6000 Sentences |
---|---|---|---|---|---|
HMM | 91.03 | 79.27 | 76.28 | 73.19 | 69.97 |
GMM/HMM | 90.02 | 82.42 | 77.41 | 73.01 | 67.74 |
GMM/HMM + LDA | 91.10 | 82.43 | 77.62 | 72.86 | 67.52 |
GMM/HMM + LDA + MLLT | 93.64 | 85.63 | 81.86 | 75.13 | 68.62 |
GMM/HMM + LDA + MLLT + SAT | 97.67 | 83.60 | 79.83 | 76.66 | 69.32 |
HMM/LSTM | - | - | - | - | 73.94 |
Transformer/CTC | 66.0 | 61.7 | 58.6 | 53.3 | 48.2 |
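The table above and the two that follow are most naturally read as recognition error rates in percent (lower is better), consistent with the downward trend as the number of training sentences grows and as modalities are fused. As a reading aid only, here is a minimal sketch of a Levenshtein-based error rate over label units (e.g., initials and finals); it is an assumption about the scoring, not the paper's own evaluation script.

```python
# Minimal sketch (assumption): edit-distance error rate over unit sequences,
# i.e., (substitutions + deletions + insertions) / reference length * 100.
def error_rate(ref, hyp):
    # Dynamic-programming edit distance between two token lists.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return 100.0 * d[len(ref)][len(hyp)] / max(len(ref), 1)

ref = ["ts", "a", "kh", "o"]
hyp = ["ts", "a", "k", "o"]
print(f"{error_rate(ref, hyp):.1f}%")  # 25.0%
```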
Model | 1000 Sentences | 1400 Sentences | 1800 Sentences |
---|---|---|---|
Video-only | 60.2% | 59.2% | 58.7% |
Audio-only | 68.0% | 67.2% | 63.8% |
AV (TM-CTC) | 60.1% | 55.8% | 55.6% |
AV (LSTM/TM-CTC) | 50.9% | 50.2% | 46.9% |
AV (feature fusion) | - | 61.7% | 61.4% |
Model | 1000 Sentences | 1400 Sentences | 1800 Sentences |
---|---|---|---|
Video-only | 61.4% | 60.9% | 58.8% |
Audio-only | 70.3% | 69.7% | 69.2% |
AV (TM-CTC) | 59.0% | 58.1% | 56.6% |
AV (LSTM/TM-CTC) | 63.7% | 56.0% | 52.0% |
AV (feature fusion) | - | - | - |