A Pronunciation Prior Assisted Vowel Reduction Detection Framework with Multi-Stream Attention Method
Abstract
1. Introduction
- To the best of our knowledge, this is the first end-to-end (E2E) framework for this task, and it is easy to integrate into widely used ASR-based CAPT systems;
- An auxiliary encoder is introduced to exploit prior information about pronunciation, and several types of pronunciation priors are designed;
- A multi-stream mechanism is proposed: attention models the association between the speech signal and the pronunciation prior knowledge and generates a fused information stream, which is passed to the back-end together with the original encoded streams.
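The fused-stream idea in the last bullet can be read as hierarchical attention: a decoder query first attends within each encoded stream, then a second attention weighs the resulting per-stream context vectors. The following is a minimal NumPy sketch under assumed names and dimensions, not the authors' exact architecture:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, keys):
    """Dot-product attention within one stream: context vector + weights."""
    scores = keys @ query            # (T,) one score per encoded frame/symbol
    w = softmax(scores)              # normalized attention weights over time
    return w @ keys, w               # context (D,), weights (T,)

def fuse_streams(query, speech_enc, prior_enc):
    """Hierarchical fusion: per-stream contexts, then stream-level attention."""
    c_speech, _ = attend(query, speech_enc)  # context from the acoustic stream
    c_prior, _ = attend(query, prior_enc)    # context from the prior stream
    # Second-level attention decides how much each stream contributes.
    stream_scores = np.array([c_speech @ query, c_prior @ query])
    beta = softmax(stream_scores)
    fused = beta[0] * c_speech + beta[1] * c_prior
    return fused, beta

rng = np.random.default_rng(0)
D = 8
fused, beta = fuse_streams(rng.normal(size=D),
                           rng.normal(size=(20, D)),  # 20 acoustic frames
                           rng.normal(size=(5, D)))   # 5 prior symbols
```

The stream-level weights `beta` always sum to one, so the fused stream stays on the same scale as the per-stream contexts regardless of how many streams are added.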
2. Method
2.1. CTC-Attention Based Multi-Task Learning
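In the hybrid CTC/attention recipe of Kim et al. and Watanabe et al. cited below, which this section builds on, the multi-task objective interpolates the two losses with a tunable weight (the specific weight used in this work is a training hyperparameter, given in the experimental settings):

```latex
\mathcal{L}_{\mathrm{MTL}} = \lambda\,\mathcal{L}_{\mathrm{CTC}} + (1-\lambda)\,\mathcal{L}_{\mathrm{Attention}}, \qquad 0 \le \lambda \le 1
```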
2.2. Automatic Auxiliary Input Sequence Generation
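As a sketch of the two auxiliary input types compared later (G2P phone sequences from a CMUdict-style lexicon versus raw letter sequences), generation might look like the following; the three-word lexicon is a hypothetical stand-in for the full CMU Pronouncing Dictionary:

```python
# Toy excerpt of a CMUdict-style lexicon (phones carry stress digits);
# the real system would load the full CMU Pronouncing Dictionary.
LEXICON = {
    "the": ["DH", "AH0"],
    "bedroom": ["B", "EH1", "D", "R", "UW2", "M"],
    "wall": ["W", "AO1", "L"],
}

def g2p_sequence(text):
    """G2P auxiliary input: concatenated phone sequences of each word."""
    phones = []
    for word in text.lower().split():
        phones.extend(LEXICON[word])
    return phones

def letter_sequence(text):
    """Character-level auxiliary input: letters only, no spaces."""
    return [c for c in text.lower() if c.isalpha()]

print(g2p_sequence("The Bedroom Wall"))
# ['DH', 'AH0', 'B', 'EH1', 'D', 'R', 'UW2', 'M', 'W', 'AO1', 'L']
```

Both sequences are derived automatically from the reference text, so no manual phone-level annotation is needed at inference time.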
2.3. Pronunciation Prior Knowledge Assisted Multi-Encoder Structure
2.4. Multi-Stream Information Fusion
2.5. Multi-Stream Expansion Based on Location-Aware Attention Mechanism
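A minimal single-channel sketch of the location-aware attention this expansion builds on (Chorowski et al., cited below): convolutional features of the previous alignment are folded into the additive-attention score, so each step knows where it attended last. Shapes and parameter names here are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def location_aware_attention(query, enc, prev_align, params):
    """One step of location-aware additive attention.

    query      : decoder state, shape (D,)
    enc        : encoder outputs, shape (T, D)
    prev_align : previous attention weights, shape (T,)
    params     : (W, V, U, w, conv) with a single-channel conv kernel
    """
    W, V, U, w, conv = params
    T = enc.shape[0]
    K = conv.shape[0]                       # 1-D location-kernel width (odd)
    padded = np.pad(prev_align, (K // 2, K // 2))
    # f[t]: convolution of the previous alignment around position t
    f = np.array([padded[t:t + K] @ conv for t in range(T)])
    scores = np.array([
        w @ np.tanh(W @ query + V @ enc[t] + U * f[t]) for t in range(T)
    ])
    align = softmax(scores)                 # new alignment over time
    context = align @ enc                   # attended context vector
    return context, align

rng = np.random.default_rng(0)
T, D, A, K = 12, 6, 4, 5
params = (rng.normal(size=(A, D)),          # W: projects the decoder state
          rng.normal(size=(A, D)),          # V: projects encoder outputs
          rng.normal(size=A),               # U: projects the location feature
          rng.normal(size=A),               # w: score vector
          rng.normal(size=K))               # conv: location kernel
prev = np.full(T, 1.0 / T)                  # uniform initial alignment
context, align = location_aware_attention(
    rng.normal(size=D), rng.normal(size=(T, D)), prev, params)
```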
3. Experimental Setup
3.1. Dataset Preprocessing and Acoustic Features
3.2. HMM-DNN Hybrid Baseline
3.3. Settings in Our Proposed Method
3.4. Evaluation Metrics
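Concretely, the per-class detection metrics reduce to the standard precision/recall/F1 formulas. The count values below are made-up inputs for illustration, and PCR is assumed here to be the fraction of correctly recognized phones, in line with common usage:

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class detection metrics from raw true/false positive/negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def phone_correct_rate(correct, total):
    """PCR: percentage of phones recognized correctly."""
    return 100.0 * correct / total

p, r, f1 = precision_recall_f1(tp=64, fp=36, fn=33)   # illustrative counts
pcr = phone_correct_rate(773, 1000)                   # illustrative counts
```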
4. Experimental Results and Discussion
4.1. Comparison with HMM-DNN Hybrid Baseline System and Related Work
4.2. Analysis of Auxiliary Input
4.3. Analysis of Multi-Stream Expansion Method
4.4. Comparison of Different Auxiliary Input Types
5. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
- Li, K.; Mao, S.; Li, X.; Wu, Z.; Meng, H. Automatic lexical stress and pitch accent detection for L2 English speech using multi-distribution deep neural networks. Speech Commun. 2018, 96, 28–36.
- Lee, G.G.; Lee, H.Y.; Song, J.; Kim, B.; Kang, S.; Lee, J.; Hwang, H. Automatic sentence stress feedback for non-native English learners. Comput. Speech Lang. 2017, 41, 29–42.
- Li, K.; Wu, X.; Meng, H. Intonation classification for L2 English speech using multi-distribution deep neural networks. Comput. Speech Lang. 2017, 43, 18–33.
- Bang, J.; Lee, K.; Ryu, S.; Lee, G.G. Vowel-reduction feedback system for non-native learners of English. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 935–939.
- Lindblom, B. Spectrographic study of vowel reduction. J. Acoust. Soc. Am. 1963, 35, 1773–1781.
- Zhang, Y.; Nissen, S.L.; Francis, A.L. Acoustic characteristics of English lexical stress produced by native Mandarin speakers. J. Acoust. Soc. Am. 2008, 123, 4498–4513.
- van Bergem, D.R. Acoustic and lexical vowel reduction. In Proceedings of the Phonetics and Phonology of Speaking Styles: Reduction and Elaboration in Speech Communication, Barcelona, Spain, 30 September–2 October 1991.
- Flemming, E. A Phonetically-Based Model of Phonological Vowel Reduction; MIT Press: Cambridge, MA, USA, 2005.
- Burzio, L. Phonology and phonetics of English stress and vowel reduction. Lang. Sci. 2007, 29, 154–176.
- Kuo, C.; Weismer, G. Vowel reduction across tasks for male speakers of American English. J. Acoust. Soc. Am. 2016, 140, 369–383.
- Byers, E.; Yavas, M. Vowel reduction in word-final position by early and late Spanish-English bilinguals. PLoS ONE 2017, 12, e0175226.
- Fourakis, M. Tempo, stress, and vowel reduction in American English. J. Acoust. Soc. Am. 1991, 90, 1816–1827.
- Gonzalez-Dominguez, J.; Eustis, D.; Lopez-Moreno, I.; Senior, A.; Beaufays, F.; Moreno, P.J. A real-time end-to-end multilingual speech recognition architecture. IEEE J. Sel. Top. Signal Process. 2014, 9, 749–759.
- Dhakal, P.; Damacharla, P.; Javaid, A.Y.; Devabhaktuni, V. A near real-time automatic speaker recognition architecture for voice-based user interface. Mach. Learn. Knowl. Extract. 2019, 1, 504–520.
- Yang, C.H.H.; Qi, J.; Chen, S.Y.C.; Chen, P.Y.; Siniscalchi, S.M.; Ma, X.; Lee, C.H. Decentralizing feature extraction with quantum convolutional neural network for automatic speech recognition. In Proceedings of the ICASSP 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 6523–6527.
- Watanabe, S.; Hori, T.; Kim, S.; Hershey, J.R.; Hayashi, T. Hybrid CTC/attention architecture for end-to-end speech recognition. IEEE J. Sel. Top. Signal Process. 2017, 11, 1240–1253.
- Lo, T.H.; Weng, S.Y.; Chang, H.J.; Chen, B. An Effective End-to-End Modeling Approach for Mispronunciation Detection. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Shanghai, China, 25–29 October 2020.
- Feng, Y.; Fu, G.; Chen, Q.; Chen, K. SED-MDD: Towards sentence dependent end-to-end mispronunciation detection and diagnosis. In Proceedings of the ICASSP 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 3492–3496.
- Zhang, Z.; Wang, Y.; Yang, J. Text-conditioned Transformer for automatic pronunciation error detection. Speech Commun. 2021, 130, 55–63.
- Wang, X.; Li, R.; Mallidi, S.H.; Hori, T.; Watanabe, S.; Hermansky, H. Stream attention-based multi-array end-to-end speech recognition. In Proceedings of the ICASSP 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 7105–7109.
- Li, R.; Wang, X.; Mallidi, S.H.; Watanabe, S.; Hori, T.; Hermansky, H. Multi-stream end-to-end speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 28, 646–655.
- Leung, W.K.; Liu, X.; Meng, H. CNN-RNN-CTC based end-to-end mispronunciation detection and diagnosis. In Proceedings of the ICASSP 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 8132–8136.
- Kim, S.; Hori, T.; Watanabe, S. Joint CTC-attention based end-to-end speech recognition using multi-task learning. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 4835–4839.
- Weide, R. The CMU Pronunciation Dictionary; Release 0.6; Carnegie Mellon University: Pittsburgh, PA, USA, 1998.
- Chorowski, J.; Bahdanau, D.; Serdyuk, D.; Cho, K.; Bengio, Y. Attention-Based Models for Speech Recognition. In Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS’15), Montreal, QC, Canada, 7–12 December 2015; MIT Press: Cambridge, MA, USA, 2015; Volume 1, pp. 577–585.
- Garofolo, J.S.; Lamel, L.F.; Fisher, W.M.; Fiscus, J.G.; Pallett, D.S. DARPA TIMIT Acoustic-Phonetic Continous Speech Corpus CD-ROM. NIST Speech Disc 1-1.1; NASA STI/Recon Technical Report; NASA: Washington, DC, USA, 1993; Volume 93, p. 27403.
- Watanabe, S.; Hori, T.; Karita, S.; Hayashi, T.; Nishitoba, J.; Unno, Y.; Yalta Soplin, N.E.; Heymann, J.; Wiesner, M.; Chen, N.; et al. ESPnet: End-to-End Speech Processing Toolkit. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018; pp. 2207–2211.
- Povey, D.; Ghoshal, A.; Boulianne, G.; Burget, L.; Glembek, O.; Goel, N.; Hannemann, M.; Motlicek, P.; Qian, Y.; Schwarz, P.; et al. The Kaldi speech recognition toolkit. In Proceedings of the IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, Big Island, HI, USA, 11–15 December 2011; pp. 1–4.
- Yan, B.C.; Chen, B. End-to-End Mispronunciation Detection and Diagnosis From Raw Waveforms. arXiv 2021, arXiv:2103.03023.
- Fu, K.; Lin, J.; Ke, D.; Xie, Y.; Zhang, J.; Lin, B. A Full Text-Dependent End to End Mispronunciation Detection and Diagnosis with Easy Data Augmentation Techniques. arXiv 2021, arXiv:2104.08428.
- Jiang, S.W.F.; Yan, B.C.; Lo, T.H.; Chao, F.A.; Chen, B. Towards Robust Mispronunciation Detection and Diagnosis for L2 English Learners with Accent-Modulating Methods. arXiv 2021, arXiv:2108.11627.
Input Type | Example
---|---
Text | The Bedroom Wall
G2P | DH AH0 B EH1 D R UW2 M W AO1 L
Letter | t h e b e d r o o m w a l l
Subset | Sentences | ah Schwa | ah Normal | ih Schwa | ih Normal
---|---|---|---|---|---
Train | 3696 | 3892 | 2266 | 7370 | 4248
Dev | 400 | 403 | 248 | 757 | 444
Test | 192 | 186 | 135 | 377 | 203
Model | ah Pre/Rec | ah F1 | ih Pre/Rec | ih F1 | PCR (%)
---|---|---|---|---|---
Hybrid | 0.58/0.64 | 0.61 | 0.54/0.53 | 0.53 | 77.3
CNN-RNN-CTC [22] | 0.64/0.67 | 0.65 | 0.57/0.56 | 0.56 | 74.7
E2E | 0.59/0.68 | 0.63 | 0.53/0.54 | 0.53 | 77.9
E2E + prior | 0.64/0.67 | 0.66 | 0.62/0.55 | 0.58 | 81.8
Method | ah Pre/Rec | ah F1 | ih Pre/Rec | ih F1 | PCR (%) | Time (s)
---|---|---|---|---|---|---
original-streams + G2P | 0.64/0.67 | 0.66 | 0.62/0.55 | 0.58 | 81.8 | 244
original-streams + char | 0.58/0.72 | 0.64 | 0.60/0.57 | 0.59 | 82.9 | 232
multi-streams + G2P | 0.55/0.63 | 0.59 | 0.55/0.52 | 0.54 | 80.6 | 248
multi-streams + char | 0.58/0.68 | 0.63 | 0.58/0.53 | 0.56 | 86.3 | 241
Method | ah Pre/Rec | ah F1 | ih Pre/Rec | ih F1
---|---|---|---|---
Human transcription | 0.80/0.82 | 0.810 | 0.82/0.82 | 0.820
G2P | 0.64/0.67 | 0.655 | 0.62/0.55 | 0.583
Letter | 0.58/0.72 | 0.642 | 0.60/0.57 | 0.585
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Liu, Z.; Huang, Z.; Wang, L.; Zhang, P. A Pronunciation Prior Assisted Vowel Reduction Detection Framework with Multi-Stream Attention Method. Appl. Sci. 2021, 11, 8321. https://doi.org/10.3390/app11188321