An Investigation into Audio–Visual Speech Recognition under a Realistic Home–TV Scenario
Abstract
1. Introduction
- (1) We explore different pre-training strategies for the AVSR task, which can offer practical guidance for deploying AVSR systems in real scenarios where only a limited amount of audio–visual data is available (a minimal initialization sketch follows this list);
- (2) Building on (1), we examine the impact of different audio–visual embedding extractor architectures on AVSR performance;
- (3) We compare the performance of different audio–visual fusion methods.
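To make contribution (1) concrete, the following is a minimal, hedged PyTorch sketch of one such pre-training strategy: the audio branch of an end-to-end AVSR model is initialized from an encoder pre-trained on audio-only data and then fine-tuned on the smaller audio–visual set. All module names, dimensions, checkpoint paths, and key prefixes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AVSRModel(nn.Module):
    """Toy end-to-end AVSR model with a simple splicing fusion and a CTC head."""
    def __init__(self, feat_dim=80, vid_dim=512, d_model=256, vocab=4000):
        super().__init__()
        self.audio_proj = nn.Linear(feat_dim, d_model)
        self.audio_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=6)
        self.video_proj = nn.Linear(vid_dim, d_model)
        self.fusion = nn.Linear(2 * d_model, d_model)
        self.ctc_head = nn.Linear(d_model, vocab)

    def forward(self, fbank, lip_emb):
        a = self.audio_encoder(self.audio_proj(fbank))  # (B, T, d_model)
        v = self.video_proj(lip_emb)                    # (B, T, d_model), assumed frame-aligned
        h = self.fusion(torch.cat([a, v], dim=-1))      # splicing fusion
        return self.ctc_head(h)                         # frame-level token logits

model = AVSRModel()

# Copy only the audio-encoder weights from an audio-only pre-trained checkpoint
# (file name, "model" key, and "encoder." prefix are assumptions for illustration).
ckpt = torch.load("asr_pretrained.pt", map_location="cpu")
audio_state = {k[len("encoder."):]: v
               for k, v in ckpt["model"].items() if k.startswith("encoder.")}
missing, unexpected = model.audio_encoder.load_state_dict(audio_state, strict=False)
# The whole model (audio and visual branches) is then fine-tuned on the
# limited audio-visual training data.
```

The same pattern applies symmetrically to the visual branch when a pre-trained lipreading front-end is available.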
2. End-to-End AVSR Framework
3. Methods
3.1. Different Pre-Training Strategies
3.2. Model Architectures
3.3. Audio–Visual Fusion
4. Experiments and Analysis
4.1. Experimental Setups
4.2. Experimental Results
4.2.1. Results of the Pre-Training Strategies
4.2.2. Results of Different Model Architectures
4.2.3. Results of Audio–Visual Fusion
4.2.4. Results of System Fusion
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Baker, J.M.; Deng, L.; Glass, J.; Khudanpur, S.; Lee, C.H.; Morgan, N.; O’Shaughnessy, D. Developments and directions in speech recognition and understanding, Part 1 [DSP Education]. IEEE Signal Process. Mag. 2009, 26, 75–80.
- Deng, L.; Hinton, G.; Kingsbury, B. New types of deep neural network learning for speech recognition and related applications: An overview. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 8599–8603.
- Deng, L.; Li, X. Machine learning paradigms for speech recognition: An overview. IEEE Trans. Audio Speech Lang. Process. 2013, 21, 1060–1089.
- Li, J. Recent advances in end-to-end automatic speech recognition. APSIPA Trans. Signal Inf. Process. 2022, 11, e8.
- Hinton, G.; Deng, L.; Yu, D.; Dahl, G.E.; Mohamed, A.R.; Jaitly, N.; Senior, A.; Vanhoucke, V.; Nguyen, P.; Sainath, T.N.; et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Process. Mag. 2012, 29, 82–97.
- Yu, D.; Deng, L. Automatic Speech Recognition; Springer: Berlin/Heidelberg, Germany, 2016; Volume 1.
- Graves, A.; Jaitly, N. Towards end-to-end speech recognition with recurrent neural networks. In International Conference on Machine Learning; PMLR: Beijing, China, 2014; pp. 1764–1772.
- Hannun, A.; Case, C.; Casper, J.; Catanzaro, B.; Diamos, G.; Elsen, E.; Prenger, R.; Satheesh, S.; Sengupta, S.; Coates, A.; et al. Deep speech: Scaling up end-to-end speech recognition. arXiv 2014, arXiv:1412.5567.
- Chorowski, J.; Bahdanau, D.; Cho, K.; Bengio, Y. End-to-end continuous speech recognition using attention-based recurrent NN: First results. arXiv 2014, arXiv:1412.1602.
- Miao, Y.; Gowayyed, M.; Metze, F. EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding. In Proceedings of the 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Scottsdale, AZ, USA, 13–17 December 2015; pp. 167–174.
- Bahdanau, D.; Chorowski, J.; Serdyuk, D.; Brakel, P.; Bengio, Y. End-to-end attention-based large vocabulary speech recognition. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 25 March 2016; pp. 4945–4949.
- Chan, W.; Jaitly, N.; Le, Q.; Vinyals, O. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 25 March 2016; pp. 4960–4964.
- Watanabe, S.; Mandel, M.; Barker, J.; Vincent, E.; Arora, A.; Chang, X.; Khudanpur, S.; Manohar, V.; Povey, D.; Raj, D.; et al. CHiME-6 challenge: Tackling multispeaker speech recognition for unsegmented recordings. arXiv 2020, arXiv:2004.09249.
- Yu, F.; Zhang, S.; Fu, Y.; Xie, L.; Zheng, S.; Du, Z.; Huang, W.; Guo, P.; Yan, Z.; Ma, B.; et al. M2MeT: The ICASSP 2022 multi-channel multi-party meeting transcription challenge. In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 6167–6171.
- McGurk, H.; MacDonald, J. Hearing lips and seeing voices. Nature 1976, 264, 746–748.
- Rosenblum, L.D. Speech perception as a multimodal phenomenon. Curr. Dir. Psychol. Sci. 2008, 17, 405–409.
- Massaro, D.W.; Simpson, J.A. Speech Perception by Ear and Eye: A Paradigm for Psychological Inquiry; Psychology Press: London, UK, 2014.
- Tao, F.; Busso, C. Gating neural network for large vocabulary audiovisual speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2018, 26, 1290–1302.
- Son Chung, J.; Senior, A.; Vinyals, O.; Zisserman, A. Lip reading sentences in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6447–6456.
- Petridis, S.; Stafylakis, T.; Ma, P.; Tzimiropoulos, G.; Pantic, M. Audio-Visual Speech Recognition with a Hybrid CTC/Attention Architecture. In Proceedings of the 2018 IEEE Spoken Language Technology Workshop (SLT), Athens, Greece, 18–21 December 2018; pp. 513–520.
- Xu, B.; Lu, C.; Guo, Y.; Wang, J. Discriminative multi-modality speech recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 14433–14442.
- Makino, T.; Liao, H.; Assael, Y.; Shillingford, B.; Garcia, B.; Braga, O.; Siohan, O. Recurrent neural network transducer for audio-visual speech recognition. In Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Singapore, 14–18 December 2019; pp. 905–912.
- Braga, O.; Makino, T.; Siohan, O.; Liao, H. End-to-End Multi-Person Audio/Visual Automatic Speech Recognition. In Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6994–6998.
- Ma, P.; Petridis, S.; Pantic, M. End-to-end audio-visual speech recognition with conformers. In Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 7613–7617.
- Sterpu, G.; Saam, C.; Harte, N. Attention-based audio-visual fusion for robust automatic speech recognition. In Proceedings of the 20th ACM International Conference on Multimodal Interaction, Boulder, CO, USA, 19–20 October 2018; pp. 111–115.
- Sterpu, G.; Saam, C.; Harte, N. How to teach DNNs to pay attention to the visual modality in speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 1052–1064.
- Cooke, M.; Barker, J.; Cunningham, S.; Shao, X. An audio-visual corpus for speech perception and automatic speech recognition. J. Acoust. Soc. Am. 2006, 120, 2421–2424.
- Zhao, G.; Barnard, M.; Pietikainen, M. Lipreading with local spatiotemporal descriptors. IEEE Trans. Multimed. 2009, 11, 1254–1265.
- Anina, I.; Zhou, Z.; Zhao, G.; Pietikäinen, M. OuluVS2: A multi-view audiovisual database for non-rigid mouth motion analysis. In Proceedings of the 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), Ljubljana, Slovenia, 4–8 May 2015; Volume 1, pp. 1–5.
- Harte, N.; Gillen, E. TCD-TIMIT: An audio-visual corpus of continuous speech. IEEE Trans. Multimed. 2015, 17, 603–615.
- Chung, J.S.; Zisserman, A. Lip reading in the wild. In Computer Vision–ACCV 2016: 13th Asian Conference on Computer Vision, Taipei, Taiwan, 20–24 November 2016, Revised Selected Papers, Part II; Springer: Berlin/Heidelberg, Germany, 2017; pp. 87–103.
- Ephrat, A.; Mosseri, I.; Lang, O.; Dekel, T.; Wilson, K.; Hassidim, A.; Freeman, W.T.; Rubinstein, M. Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation. arXiv 2018, arXiv:1804.03619.
- Yu, J.; Su, R.; Wang, L.; Zhou, W. A multi-channel/multi-speaker interactive 3D audio-visual speech corpus in Mandarin. In Proceedings of the 2016 10th International Symposium on Chinese Spoken Language Processing (ISCSLP), Tianjin, China, 17–20 October 2016; pp. 1–5.
- Liu, H.; Chen, Z.; Shi, W. Robust Audio-Visual Mandarin Speech Recognition Based on Adaptive Decision Fusion and Tone Features. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 25–28 October 2020; pp. 1381–1385.
- Chen, H.; Zhou, H.; Du, J.; Lee, C.H.; Chen, J.; Watanabe, S.; Siniscalchi, S.M.; Scharenborg, O.; Liu, D.Y.; Yin, B.C.; et al. The First Multimodal Information Based Speech Processing (MISP) Challenge: Data, tasks, baselines and results. In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 9266–9270.
- Chen, H.; Du, J.; Dai, Y.; Lee, C.H.; Siniscalchi, S.M.; Watanabe, S.; Scharenborg, O.; Chen, J.; Yin, B.C.; Pan, J. Audio-Visual Speech Recognition in MISP 2021 Challenge: Dataset Release and Deep Analysis. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Incheon, Korea, 18–22 September 2022; Volume 2022, pp. 1766–1770.
- Xu, G.; Yang, S.; Li, W.; Wang, S.; Wei, G.; Yuan, J.; Gao, J. Channel-Wise AV-Fusion Attention for Multi-Channel Audio-Visual Speech Recognition. In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 9251–9255.
- Gulati, A.; Qin, J.; Chiu, C.C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y.; et al. Conformer: Convolution-augmented transformer for speech recognition. arXiv 2020, arXiv:2005.08100.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30.
- Watanabe, S.; Hori, T.; Kim, S.; Hershey, J.R.; Hayashi, T. Hybrid CTC/attention architecture for end-to-end speech recognition. IEEE J. Sel. Top. Signal Process. 2017, 11, 1240–1253.
- Shi, B.; Hsu, W.N.; Lakhotia, K.; Mohamed, A. Learning audio-visual speech representation by masked multimodal cluster prediction. arXiv 2022, arXiv:2201.02184.
- Zhang, J.X.; Wan, G.; Pan, J. Is Lip Region-of-Interest Sufficient for Lipreading? In Proceedings of the 2022 International Conference on Multimodal Interaction, Bengaluru, India, 21–22 January 2022; pp. 368–372.
- Yuan, J.; Xiong, H.C.; Xiao, Y.; Guan, W.; Wang, M.; Hong, R.; Li, Z.Y. Gated CNN: Integrating multi-scale feature layers for object detection. Pattern Recognit. 2020, 105, 107131.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022.
- Zhang, W.; Ye, Z.; Tang, H.; Li, X.; Zhou, X.; Yang, J.; Cui, J.; Deng, P.; Shi, M.; Song, Y.; et al. The USTC-NELSLIP Offline Speech Translation Systems for IWSLT 2022. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), Dublin, Ireland, 26–27 May 2022; pp. 198–207.
- Ott, M.; Edunov, S.; Baevski, A.; Fan, A.; Gross, S.; Ng, N.; Grangier, D.; Auli, M. fairseq: A fast, extensible toolkit for sequence modeling. arXiv 2019, arXiv:1904.01038.
- Park, D.S.; Chan, W.; Zhang, Y.; Chiu, C.C.; Zoph, B.; Cubuk, E.D.; Le, Q.V. SpecAugment: A simple data augmentation method for automatic speech recognition. arXiv 2019, arXiv:1904.08779.
- Wang, W.; Gong, X.; Wu, Y.; Zhou, Z.; Li, C.; Zhang, W.; Han, B.; Qian, Y. The SJTU System for Multimodal Information Based Speech Processing Challenge 2021. In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 9261–9265.
| Task | Model | Pre-Training | Data Processing | Eval |
|---|---|---|---|---|
| ASR | Hybrid | N | N | 45.92 |
| ASR | Hybrid | N | CAT | 41.69 |
| AVSR | Hybrid | N | N | 44.12 |
| AVSR | E2E [36] | N | N | 52.26 |
| AVSR | E2E | HS (CAT) | N | 35.93 |
| AVSR | E2E | HS (CAT) | CAT | 33.78 |
| Task | Model | Pre-Training | Data Processing | Eval |
|---|---|---|---|---|
| ASR | E2E | HS (CAT) | CAT | 36.81 |
| ASR | E2E | HS (SP) | SP | 37.44 |
| ASR | E2E | HS (SP + CAT) | SP + CAT | 35.90 |
| AVSR | E2E | ES (CAT) | CAT | 33.16 |
| AVSR | E2E | ES (SP) | CAT | 34.00 |
| AVSR | E2E | ES (SP + CAT) | CAT | 32.21 |
| Model | Pre-Training | Extractor | Data Processing | Eval |
|---|---|---|---|---|
| E2E (M = 6, N = 3) | HS (CAT) | VGG/Resnet | CAT | 38.27 |
| E2E (M = 6, N = 3) | AV-HuBERT-V1 | AV-HuBERT-V1 | CAT | 42.78 |
| E2E (M = 10, N = 4) | AV-HuBERT-V1 | AV-HuBERT-V1 | CAT | 41.85 |
| E2E (M = 12, N = 6) | AV-HuBERT-V1 | AV-HuBERT-V1 | CAT | 96.98 |
| Task | Model | Pre-Training | A/V Extractor | Eval |
|---|---|---|---|---|
| AVSR | E2E | ES (SP + CAT) | VGG/Resnet | 32.21 |
| AVSR | E2E | ES (SP + CAT) | Gate CNN/Resnet | 29.88 |
| AVSR | E2E | ES (SP + CAT) | VGG/Swin Transformer | 36.13 |
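As a rough illustration of what the audio–visual embedding extractors compared above might look like, the sketch below pairs a VGG-style 2-D CNN audio front-end with a 3-D-convolution + ResNet-18 visual front-end over cropped lip regions. Layer sizes, input shapes, and the use of torchvision's ResNet-18 are assumptions made for illustration; they do not reproduce the authors' exact architectures.

```python
import torch
import torch.nn as nn
import torchvision

class AudioFrontEnd(nn.Module):
    """VGG-style 2-D CNN over log-fbank features with 4x temporal downsampling."""
    def __init__(self, out_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.proj = nn.Linear(128 * 20, out_dim)   # 80 fbank bins -> 20 after two stride-2 convs

    def forward(self, fbank):                      # fbank: (B, T, 80)
        x = self.conv(fbank.unsqueeze(1))          # (B, 128, T/4, 20)
        x = x.permute(0, 2, 1, 3).flatten(2)       # (B, T/4, 128 * 20)
        return self.proj(x)                        # (B, T/4, out_dim)

class VisualFrontEnd(nn.Module):
    """3-D conv stem + ResNet-18 trunk producing one embedding per lip-ROI frame."""
    def __init__(self, out_dim=256):
        super().__init__()
        self.stem = nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3))
        trunk = torchvision.models.resnet18(weights=None)
        trunk.conv1 = nn.Conv2d(64, 64, 3, padding=1, bias=False)  # accept the stem's 64 channels
        trunk.fc = nn.Linear(trunk.fc.in_features, out_dim)
        self.trunk = trunk

    def forward(self, lips):                       # lips: (B, T, 1, 88, 88) gray-scale lip crops
        x = self.stem(lips.transpose(1, 2))        # (B, 64, T, 44, 44)
        b, c, t, h, w = x.shape
        x = x.transpose(1, 2).reshape(b * t, c, h, w)
        return self.trunk(x).reshape(b, t, -1)     # (B, T, out_dim)

# Example: frame-level audio and visual embeddings for a short utterance.
a_emb = AudioFrontEnd()(torch.randn(2, 400, 80))          # (2, 100, 256)
v_emb = VisualFrontEnd()(torch.randn(2, 100, 1, 88, 88))  # (2, 100, 256)
```

The Gate CNN and Swin Transformer rows in the table presumably replace one of these branches with the corresponding architecture while keeping the same frame-level embedding interface.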
| Task | Model | ALL | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 |
|---|---|---|---|---|---|---|---|---|---|---|
| ASR | E2E | 41.15 | 46.99 | 45.45 | 48.02 | 37.84 | 41.80 | 47.43 | 32.23 | 33.14 |
| AVSR | E2E | 38.62 | 42.44 | 39.80 | 45.61 | 35.13 | 37.42 | 49.60 | 30.71 | 31.98 |
| Task | Model | Pre-Training | Fusion | Data Processing | Eval |
|---|---|---|---|---|---|
| AVSR | E2E | HS (CAT) | Splicing | CAT | 33.78 |
| AVSR | E2E | HS (CAT) | SA | CAT | 38.16 |
| AVSR | E2E | HS (CAT) | SA | CAT + SP | 37.43 |
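For readers unfamiliar with the two fusion schemes compared above, the sketch below contrasts splicing (channel-wise concatenation of frame-aligned audio and visual embeddings) with an attention-based fusion in which the audio stream attends to the visual stream; the latter is only one plausible reading of "SA". Dimensions, module names, and the residual design are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SplicingFusion(nn.Module):
    """Concatenate frame-aligned audio and visual embeddings, then project back."""
    def __init__(self, d_model=256):
        super().__init__()
        self.proj = nn.Linear(2 * d_model, d_model)

    def forward(self, audio, video):               # both (B, T, d_model)
        return self.proj(torch.cat([audio, video], dim=-1))

class AttentionFusion(nn.Module):
    """Audio frames query the visual stream; the attended context is added residually."""
    def __init__(self, d_model=256, nhead=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, audio, video):
        attended, _ = self.cross_attn(query=audio, key=video, value=video)
        return self.norm(audio + attended)

# Both variants keep the audio frame rate, so either output can feed the same
# CTC/attention decoder.
a, v = torch.randn(2, 100, 256), torch.randn(2, 100, 256)
fused_cat = SplicingFusion()(a, v)   # (2, 100, 256)
fused_att = AttentionFusion()(a, v)  # (2, 100, 256)
```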
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).