Electroglottograph-Based Speech Emotion Recognition via Cross-Modal Distillation
Abstract
1. Introduction
2. Materials and Methods
2.1. Methods
2.1.1. Teacher Model
- Selecting and extracting acoustic features from the raw speech signals;
- Passing these features through a deep neural network to obtain deep emotion features;
- Using a simple classifier to map the deep features to the predicted emotion label (a minimal sketch follows this list).
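The three steps above can be summarized in code. The following is a minimal PyTorch sketch, assuming torchvision's ResNet18 with its first convolution adapted to single-channel log-Mel input and a single linear classifier head; the layer sizes are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class TeacherModel(nn.Module):
    """Speech teacher: log-Mel-spectrogram -> deep emotion features -> emotion logits."""
    def __init__(self, num_emotions: int = 7):
        super().__init__()
        backbone = resnet18(weights=None)
        # Adapt the first convolution to single-channel spectrogram input.
        backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2,
                                   padding=3, bias=False)
        backbone.fc = nn.Identity()  # expose the 512-d deep emotion features
        self.backbone = backbone
        self.classifier = nn.Linear(512, num_emotions)  # simple classifier head

    def forward(self, log_mel: torch.Tensor):
        feats = self.backbone(log_mel)   # (batch, 512) deep emotion features
        logits = self.classifier(feats)  # (batch, num_emotions) predictions
        return feats, logits
```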
2.1.2. Student Model
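A minimal sketch of the 4-layer Bi-LSTM student operating on frame-level EGG features, under the same hedged assumptions as the teacher sketch above; the input dimensionality and hidden size are assumptions, with the hidden size chosen so that the pooled feature matches the teacher's 512-d feature for distillation.

```python
import torch
import torch.nn as nn

class StudentModel(nn.Module):
    """EGG student: frame-level EGG features -> 4-layer Bi-LSTM -> emotion logits."""
    def __init__(self, input_dim: int = 40, hidden_dim: int = 256,
                 num_emotions: int = 7):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers=4,
                            batch_first=True, bidirectional=True)
        # 2 * 256 = 512-d, sized to match the teacher feature for distillation.
        self.classifier = nn.Linear(2 * hidden_dim, num_emotions)

    def forward(self, egg: torch.Tensor):
        out, _ = self.lstm(egg)  # (batch, time, 2 * hidden_dim)
        feats = out.mean(dim=1)  # temporal average pooling -> utterance feature
        logits = self.classifier(feats)
        return feats, logits
```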
2.1.3. Cross-Modal Emotion Distillation
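One feature-level distillation step might look as follows. This is a sketch assuming a frozen pre-trained teacher, an MSE feature-matching loss between the two modalities' deep features, and a cross-entropy term on the student's predictions, combined with a hypothetical weight `alpha`; the paper's exact loss composition and weighting may differ.

```python
import torch
import torch.nn.functional as F

def cmed_step(teacher, student, log_mel, egg, labels, optimizer, alpha=0.5):
    """One feature-level cross-modal distillation step (alpha is hypothetical)."""
    teacher.eval()
    with torch.no_grad():  # the teacher is pre-trained and kept frozen
        t_feats, _ = teacher(log_mel)
    s_feats, s_logits = student(egg)
    loss_kd = F.mse_loss(s_feats, t_feats)       # match features across modalities
    loss_ce = F.cross_entropy(s_logits, labels)  # supervised classification term
    loss = alpha * loss_kd + (1.0 - alpha) * loss_ce
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```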
2.2. Materials
3. Experiments and Results
3.1. Implementation Details
3.2. Experiments on the Teacher Model
- The choice of acoustic feature input: log-Mel-spectrograms or the traditional Mel-spectrograms utilized in [39] (a minimal extraction sketch follows);
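For reference, a log-Mel-spectrogram of the kind compared here can be computed as in the sketch below (using librosa; the sampling rate, frame, and filter-bank parameters are illustrative assumptions, not the paper's settings).

```python
import librosa
import numpy as np

def log_mel_spectrogram(path: str, n_mels: int = 64) -> np.ndarray:
    """Load a speech file and return its log-Mel-spectrogram (dB scale)."""
    y, sr = librosa.load(path, sr=16000)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512,
                                         hop_length=256, n_mels=n_mels)
    return librosa.power_to_db(mel)  # log compression of the Mel energies
```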
3.3. Experiments on the Student Model
3.4. Experiments on Cross-Modal Emotion Distillation
- A pre-trained teacher model based on ResNet18 with log-Mel-spectrograms as input;
- A 4-layer Bi-LSTM student model with EGG signals as input;
- The cross-modal emotion distillation was conducted on the feature level (a wiring sketch follows this list).
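Continuing from the sketches in Section 2.1, the three settings above could be wired together as follows; the checkpoint path, data loader, and learning rate are hypothetical placeholders.

```python
import torch

# Hypothetical wiring of the setup described above.
teacher = TeacherModel(num_emotions=7)
teacher.load_state_dict(torch.load("teacher_resnet18.pt"))  # pre-trained teacher
student = StudentModel(num_emotions=7)
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)

for log_mel, egg, labels in train_loader:  # paired speech/EGG batches
    loss = cmed_step(teacher, student, log_mel, egg, labels, optimizer)
```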
3.5. Evaluations and Results
3.6. Experiments on EMO-DB
- Choosing log-Mel-spectrograms as the teacher model input and ResNet18 as the teacher model;
- Choosing a 4-layer Bi-LSTM as the student model;
- Setting the loss-weighting hyper-parameters in the model with CMED (see the equation after this list).
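Assuming a standard two-term distillation objective, the hyper-parameter setting above can be written as below; the symbols α and β are placeholder names, not necessarily the paper's notation:

$$\mathcal{L}_{\mathrm{CMED}} = \alpha\,\mathcal{L}_{\mathrm{KD}} + \beta\,\mathcal{L}_{\mathrm{CE}},$$

where $\mathcal{L}_{\mathrm{KD}}$ is the feature-level distillation loss and $\mathcal{L}_{\mathrm{CE}}$ is the classification cross-entropy.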
4. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Acknowledgments
Conflicts of Interest
References
- Cowie, R.; Douglas-Cowie, E.; Tsapatsoulis, N.; Votsis, G.; Kollias, S.; Fellenz, W.; Taylor, J.G. Emotion recognition in human-computer interaction. IEEE Signal Process. Mag. 2001, 18, 32–80. [Google Scholar] [CrossRef]
- Ringeval, F.; Michaud, A.; Çiftçi, E.; Güleç, H.; Lalanne, D. AVEC 2018 Workshop and Challenge: Bipolar Disorder and Cross-Cultural Affect Recognition. In Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop, Seoul, Korea, 22 October 2018. [Google Scholar]
- Trigeorgis, G.; Ringeval, F.; Brueckner, R.; Marchi, E.; Nicolaou, M.A.; Schuller, B.; Zafeiriou, S. Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 5200–5204. [Google Scholar] [CrossRef] [Green Version]
- Neumann, M.; Vu, N.T. Attentive Convolutional Neural Network Based Speech Emotion Recognition: A Study on the Impact of Input Features, Signal Length, and Acted Speech. In Proceedings of the Interspeech 2017, 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, 20–24 August 2017; pp. 1263–1267. [Google Scholar]
- Kim, J.; Englebienne, G.; Truong, K.P.; Evers, V. Deep Temporal Models using Identity Skip-Connections for Speech Emotion Recognition. In Proceedings of the 2017 ACM on Multimedia Conference, Mountain View, CA, USA, 23–27 October 2017; pp. 1006–1013. [Google Scholar] [CrossRef]
- Han, W.; Ruan, H.; Chen, X.; Wang, Z.; Li, H.; Schuller, B. Towards Temporal Modelling of Categorical Speech Emotion Recognition. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018. [Google Scholar] [CrossRef] [Green Version]
- Atmaja, B.T.; Akagi, M. Speech Emotion Recognition Based on Speech Segment Using LSTM with Attention Model. In Proceedings of the 2019 IEEE International Conference on Signals and Systems (ICSigSys), Bandung, Indonesia, 16–18 July 2019; pp. 40–44. [Google Scholar] [CrossRef]
- Rajamani, S.T.; Rajamani, K.T.; Mallol-Ragolta, A.; Liu, S.; Schuller, B. A Novel Attention-Based Gated Recurrent Unit and its Efficacy in Speech Emotion Recognition. In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 6294–6298. [Google Scholar] [CrossRef]
- Peng, Z.; Lu, Y.; Pan, S.; Liu, Y. Efficient Speech Emotion Recognition Using Multi-Scale CNN and Attention. In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 3020–3024. [Google Scholar] [CrossRef]
- Helmiyah, S.; Riadi, I.; Umar, R.; Hanif, A. Speech Classification to Recognize Emotion Using Artificial Neural Network. Khazanah Inform. J. Ilmu Komput. Dan Inform. 2021, 7, 11913. [Google Scholar] [CrossRef]
- Tzirakis, P.; Zhang, J.; Schuller, B.W. End-to-End Speech Emotion Recognition Using Deep Neural Networks. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5089–5093. [Google Scholar] [CrossRef]
- Sarma, M.; Ghahremani, P.; Povey, D.; Goel, N.K.; Dehak, N. Emotion Identification from Raw Speech Signals Using DNNs. In Proceedings of the Interspeech 2018, 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, 2–6 September 2018; pp. 3097–3101. [Google Scholar] [CrossRef] [Green Version]
- Yu, Y.; Kim, Y.J. Attention-LSTM-Attention Model for Speech Emotion Recognition and Analysis of IEMOCAP Database. Electronics 2020, 9, 713. [Google Scholar] [CrossRef]
- Issa, D.; Demirci, M.F.; Yazici, A. Speech emotion recognition with deep convolutional neural networks. Biomed. Signal Process. Control 2020, 59, 101894. [Google Scholar] [CrossRef]
- Muppidi, A.; Radfar, M. Speech Emotion Recognition Using Quaternion Convolutional Neural Networks. In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 6309–6313. [Google Scholar] [CrossRef]
- Bandela, S.R.; Kumar, T.K. Unsupervised feature selection and NMF de-noising for robust Speech Emotion Recognition. Appl. Acoust. 2021, 172, 107645. [Google Scholar] [CrossRef]
- Tronchin, L.; Kob, M.; Guarnaccia, C. Spatial Information on Voice Generation from a Multi-Channel Electroglottograph. Appl. Sci. 2018, 8, 1560. [Google Scholar] [CrossRef] [Green Version]
- Fant, G. Acoustic Theory of Speech Production; De Gruyter Mouton: Berlin, Germany, 1971. [Google Scholar] [CrossRef]
- Kumar, S.S.; Mandal, T.; Rao, K.S. Robust glottal activity detection using the phase of an electroglottographic signal. Biomed. Signal Process. Control 2017, 36, 27–38. [Google Scholar] [CrossRef]
- Chen, L.; Mao, X.; Yan, H. Text-Independent Phoneme Segmentation Combining EGG and Speech Data. IEEE/ACM Trans. Audio Speech Lang. Process. 2016, 24, 1029–1037. [Google Scholar] [CrossRef]
- Paul, N.; Kumar, S.; Chatterjee, I.; Mukherjee, B. Electroglottographic Parameterization of the Effects of Gender, Vowel and Phonatory Registers on Vocal Fold Vibratory Patterns: An Indian Perspective. Indian J. Otolaryngol. Head Neck Surg. 2011, 63, 27–31. [Google Scholar] [CrossRef] [Green Version]
- Macerata, A.; Nacci, A.; Manti, M.; Cianchetti, M.; Matteucci, J.; Romeo, S.O.; Fattori, B.; Berrettini, S.; Laschi, C.; Ursino, F. Evaluation of the Electroglottographic signal variability by amplitude-speed combined analysis. Biomed. Signal Process. Control 2017, 37, 61–68. [Google Scholar] [CrossRef]
- Borsky, M.; Mehta, D.D.; Stan, J.H.V.; Gudnason, J. Modal and Nonmodal Voice Quality Classification Using Acoustic and Electroglottographic Features. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 25, 2281–2291. [Google Scholar] [CrossRef] [PubMed]
- Liu, D.; Kankare, E.; Laukkanen, A.M.; Alku, P. Comparison of parametrization methods of electroglottographic and inverse filtered acoustic speech pressure signals in distinguishing between phonation types. Biomed. Signal Process. Control 2017, 36, 183–193. [Google Scholar] [CrossRef] [Green Version]
- Lebacq, J.; Dejonckere, P.H. The dynamics of vocal onset. Biomed. Signal Process. Control 2019, 49, 528–539. [Google Scholar] [CrossRef]
- Filipa, M.; Ternström, S. Flow ball-assisted voice training: Immediate effects on vocal fold contacting. Biomed. Signal Process. Control 2020, 62. [Google Scholar] [CrossRef]
- Chen, L.; Ren, J.; Chen, P.; Mao, X.; Zhao, Q. Limited text speech synthesis with electroglottograph based on Bi-LSTM and modified Tacotron-2. Appl. Intell. 2022. [Google Scholar] [CrossRef]
- Hui, L.; Ting, L.; See, S.; Chan, P. Use of Electroglottograph (EGG) to Find a Relationship between Pitch, Emotion and Personality. Procedia Manuf. 2015, 3, 1926–1931. [Google Scholar] [CrossRef] [Green Version]
- Chen, L.; Mao, X.; Wei, P.; Compare, A. Speech emotional features extraction based on electroglottograph. Neural Comput. 2013, 25, 3294–3317. [Google Scholar] [CrossRef] [PubMed]
- Prasanna, S.R.M.; Govind, D. Analysis of excitation source information in emotional speech. In Proceedings of the 11th Annual Conference of the International Speech Communication Association, Chiba, Japan, 26–30 September 2010; pp. 781–784. [Google Scholar] [CrossRef]
- Pravena, D.; Govind, D. Significance of incorporating excitation source parameters for improved emotion recognition from speech and electroglottographic signals. Int. J. Speech Technol. 2017, 20, 787–797. [Google Scholar] [CrossRef]
- Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531. [Google Scholar]
- Afouras, T.; Chung, J.S.; Zisserman, A. ASR is All You Need: Cross-Modal Distillation for Lip Reading. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 2143–2147. [Google Scholar] [CrossRef] [Green Version]
- Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2019, arXiv:1910.01108. [Google Scholar]
- Romero, A.; Ballas, N.; Kahou, S.E.; Chassang, A.; Gatta, C.; Bengio, Y. FitNets: Hints for Thin Deep Nets. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
- Heo, B.; Kim, J.; Yun, S.; Park, H.; Kwak, N.; Choi, J.Y. A Comprehensive Overhaul of Feature Distillation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 1921–1930. [Google Scholar] [CrossRef] [Green Version]
- Albanie, S.; Nagrani, A.; Vedaldi, A.; Zisserman, A. Emotion Recognition in Speech Using Cross-Modal Transfer in the Wild. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Korea, 22–26 October 2018; pp. 292–301. [Google Scholar] [CrossRef] [Green Version]
- Li, R.; Zhao, J.; Jin, Q. Speech Emotion Recognition via Multi-Level Cross-Modal Distillation. In Proceedings of the Interspeech 2021, Brno, Czech Republic, 30 August–3 September 2021; pp. 4488–4492. [Google Scholar] [CrossRef]
- Zhao, J.; Mao, X.; Chen, L. Speech emotion recognition using deep 1D & 2D CNN LSTM networks. Biomed. Signal Process. Control 2019, 47, 312–323. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
- Rao, K.S.; Yegnanarayana, B. Prosody modification using instants of significant excitation. IEEE Trans. Audio Speech Lang. Process. 2006, 14, 972–980. [Google Scholar] [CrossRef] [Green Version]
- Chen, L.; Mao, X.; Compare, A. A new method for speech synthesis combined with EGG. In Proceedings of the National Conference on Man-Machine Speech Communication 2013, Lianyungang, China, 11–13 October 2013. [Google Scholar]
- Prukkanon, N.; Chamnongthai, K.; Miyanaga, Y. F0 contour approximation model for a one-stream tonal word recognition system. AEUE Int. J. Electron. Commun. 2016, 70, 681–688. [Google Scholar] [CrossRef]
- Chen, P.; Chen, L.; Mao, X. Content Classification With Electroglottograph. J. Phys. Conf. Ser. 2020, 1544, 012191. [Google Scholar] [CrossRef]
- Xiao, Z. An Approach of Fundamental Frequencies Smoothing for Chinese Tone Recognition. J. Chin. Inf. Process. 2001, 15, 45–50. [Google Scholar] [CrossRef]
- Ma, T.; Tian, W.; Xie, Y. Multi-level knowledge distillation for low-resolution object detection and facial expression recognition. Knowl.-Based Syst. 2022, 240, 108136. [Google Scholar] [CrossRef]
- Wu, J.; Hua, Y.; Yang, S.; Qin, H.; Qin, H. Speech Enhancement Using Generative Adversarial Network by Distilling Knowledge from Statistical Method. Appl. Sci. 2019, 9, 3396. [Google Scholar] [CrossRef] [Green Version]
- Chen, H.; Pei, Y.; Zhao, H.; Huang, Y. Super-resolution guided knowledge distillation for low-resolution image classification. Pattern Recognit. Lett. 2022, 155, 62–68. [Google Scholar] [CrossRef]
- Wang, J.; Zhang, P.; He, Q.; Li, Y.; Hu, Y. Revisiting Label Smoothing Regularization with Knowledge Distillation. Appl. Sci. 2021, 11, 4699. [Google Scholar] [CrossRef]
- Kullback, S.; Leibler, R.A. On Information and Sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [Google Scholar] [CrossRef]
- Jing, S.; Mao, X.; Chen, L.; Zhang, N. Annotations and consistency detection for Chinese dual-mode emotional speech database. J. Beijing Univ. Aeronaut. Astronaut. 2015, 41, 1925–1934. [Google Scholar] [CrossRef]
- Kingma, D.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
- Russell, J.A.; Barrett, L.F. Core Affect, Prototypical Emotional Episodes, and Other Things Called Emotion: Dissecting the Elephant. J. Pers. Soc. Psychol. 1999, 76, 805–819. [Google Scholar] [CrossRef]
- Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
- Burkhardt, F.; Paeschke, A.; Rolfes, M.; Sendlmeier, W.F.; Weiss, B. A database of German emotional speech. In Proceedings of the Interspeech 2005—Eurospeech, 9th European Conference on Speech Communication and Technology, Lisbon, Portugal, 4–8 September 2005. [Google Scholar]
Emotion | Sadness | Anger | Surprise | Fear | Happiness | Disgust | Neutrality | Total |
---|---|---|---|---|---|---|---|---|
Training Set | 156 | 84 | 294 | 28 | 152 | 67 | 280 | 1061 |
Validation Set | 38 | 21 | 73 | 7 | 37 | 16 | 70 | 262 |
Model | Unweighted Validation Accuracy (%) |
---|---|
CRNN [39] + Mel-spectrograms | 69.53 |
CRNN + log-Mel-spectrograms | 71.45 |
ResNet18 + Mel-spectrograms | 80.86 |
ResNet18 + log-Mel-spectrograms | 81.25 |
Model | Unweighted Validation Accuracy (%) |
---|---|
Bi-LSTM (3-layer) | 53.52 |
Bi-LSTM (4-layer) | 58.98 |
Bi-LSTM (>4-layer) | Did not converge |
Model | Unweighted Training Accuracy (%) | Unweighted Validation Accuracy (%) |
---|---|---|
Teacher Model | 99.22 | 81.25 |
Student Model | 66.57 | 58.98 |
Student Model with CMED | 89.77 | 66.80 |
Ground Truth 1 | 70.00 | 70.00 |
Emotion | Sadness | Anger | Boredom | Fear | Happiness | Disgust | Neutrality | Total |
---|---|---|---|---|---|---|---|---|
Training Set | 96 | 109 | 89 | 97 | 92 | 84 | 83 | 650 |
Validation Set | 24 | 27 | 22 | 24 | 22 | 20 | 20 | 159 |
Model | Unweighted Training Accuracy (%) | Unweighted Validation Accuracy (%) |
---|---|---|
Teacher Model | 98.12 | 75.00 |
Student Model | 33.89 | 32.29 |
Student Model with CMED | 84.13 | 42.71 |