Speech Emotion Recognition Incorporating Relative Difficulty and Labeling Reliability
Abstract
1. Introduction
2. Methods
2.1. Relative Difficulty-Aware Loss
2.2. Training Target Considering Labeling Reliability
2.3. Speech Emotion Recognition Incorporating Relative Difficulty and Labeling Reliability
3. Experiments
3.1. Experimental Design
3.2. Input Features and Model Configuration
4. Results
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Cai, X.; Dai, D.; Wu, Z.; Li, X.; Li, J.; Meng, H. Emotion controllable speech synthesis using emotion-unlabeled dataset with the assistance of cross-domain speech emotion recognition. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 5734–5738.
- Hovy, E. Generating natural language under pragmatic constraints. J. Pragmat. 1987, 11, 689–719.
- Marsh, P.J.; Polito, V.; Singh, S.; Coltheart, M.; Langdon, R.; Harris, A.W. A quasi-randomized feasibility pilot study of specific treatments to improve emotion recognition and mental-state reasoning impairments in schizophrenia. BMC Psychiatry 2016, 16, 360.
- Milner, R.; Jalal, M.A.; Ng, R.W.; Hain, T. A cross-corpus study on speech emotion recognition. In Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Singapore, 14–18 December 2019; pp. 304–311.
- Parry, J.; Palaz, D.; Clarke, G.; Lecomte, P.; Mead, R.; Berger, M.; Hofer, G. Analysis of Deep Learning Architectures for Cross-corpus Speech Emotion Recognition. In Proceedings of the INTERSPEECH, Graz, Austria, 15–19 September 2019; pp. 1656–1660.
- Braunschweiler, N.; Doddipatla, R.; Keizer, S.; Stoyanchev, S. A study on cross-corpus speech emotion recognition and data augmentation. In Proceedings of the 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Cartagena, Colombia, 13–17 December 2021; pp. 24–30.
- Lee, S.-W. Domain generalization with triplet network for cross-corpus speech emotion recognition. In Proceedings of the 2021 IEEE Spoken Language Technology Workshop (SLT), Shenzhen, China, 19–22 January 2021; pp. 389–396.
- Kim, J.; Englebienne, G.; Truong, K.P.; Evers, V. Towards speech emotion recognition “in the wild” using aggregated corpora and deep multi-task learning. In Proceedings of the INTERSPEECH, Stockholm, Sweden, 20–24 August 2017; pp. 1113–1117.
- Goron, E.; Asai, L.; Rut, E.; Dinov, M. Improving Domain Generalization in Speech Emotion Recognition with Whisper. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 11631–11635.
- Ahn, Y.; Lee, S.J.; Shin, J.W. Cross-Corpus Speech Emotion Recognition Based on Few-shot Learning and Domain Adaptation. IEEE Signal Process. Lett. 2021, 28, 1190–1194.
- Ahn, Y.; Lee, S.J.; Shin, J.W. Multi-Corpus Speech Emotion Recognition for Unseen Corpus Using Corpus-Wise Weights in Classification Loss. In Proceedings of the INTERSPEECH, Incheon, Republic of Korea, 18–22 September 2022; pp. 131–135.
- Braunschweiler, N.; Doddipatla, R.; Keizer, S.; Stoyanchev, S. Factors in Emotion Recognition with Deep Learning Models Using Speech and Text on Multiple Corpora. IEEE Signal Process. Lett. 2022, 29, 722–726.
- Schuller, B.; Zhang, Z.; Weninger, F.; Rigoll, G. Using Multiple Databases for Training in Emotion Recognition: To Unite or to Vote? In Proceedings of the INTERSPEECH, Florence, Italy, 27–31 August 2011; pp. 1553–1556.
- Latif, S.; Rana, R.; Khalifa, S.; Jurdak, R.; Epps, J.; Schuller, B.W. Multi-task semi-supervised adversarial autoencoding for speech emotion recognition. IEEE Trans. Affect. Comput. 2022, 13, 992–1004.
- Feng, K.; Chaspari, T. Few-shot learning in emotion recognition of spontaneous speech using a siamese neural network with adaptive sample pair formation. IEEE Trans. Affect. Comput. 2023, 14, 1627–1633.
- Li, J.-L.; Lee, C.-C. An Enroll-to-Verify Approach for Cross-Task Unseen Emotion Class Recognition. IEEE Trans. Affect. Comput. 2022; early access.
- Steidl, S.; Levit, M.; Batliner, A.; Nöth, E.; Niemann, H. “Of all things the measure is man”: Automatic classification of emotions and inter-labeler consistency. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Philadelphia, PA, USA, 18–23 March 2005; pp. 317–320.
- Huang, J.; Tao, J.; Liu, B.; Lian, Z. Learning Utterance-level Representations with Label Smoothing for Speech Emotion Recognition. In Proceedings of the INTERSPEECH, Shanghai, China, 25–29 October 2020; pp. 4079–4083.
- Zhong, Y.; Hu, Y.; Huang, H.; Silamu, W. A lightweight model based on separable convolution for speech emotion recognition. In Proceedings of the INTERSPEECH, Shanghai, China, 25–29 October 2020; pp. 3331–3335.
- Neumann, M.; Vu, N.T. Improving speech emotion recognition with unsupervised representation learning on unlabeled speech. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 7390–7394.
- Eskimez, S.E.; Duan, Z.; Heinzelman, W. Unsupervised learning approach to feature analysis for automatic speech emotion recognition. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5099–5103.
- Latif, S.; Rana, R.; Khalifa, S.; Jurdak, R.; Epps, J.; Schuller, B.W. Multitask Learning from Augmented Auxiliary Data for Improving Speech Emotion Recognition. IEEE Trans. Affect. Comput. 2022; early access.
- Dissanayake, V.; Zhang, H.; Billinghurst, M.; Nanayakkara, S. Speech Emotion Recognition ‘in the wild’ Using an Autoencoder. In Proceedings of the INTERSPEECH, Shanghai, China, 25–29 October 2020; pp. 526–530.
- Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
- Chopra, S.; Hadsell, R.; LeCun, Y. Learning a similarity metric discriminatively, with application to face verification. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–26 June 2005; pp. 539–546.
- Kim, S.; Kim, D.; Cho, M.; Kwak, S. Proxy Anchor Loss for Metric Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 3238–3247.
- Cauldwell, R.T. Where did the anger go? The role of context in interpreting emotion in speech. In Proceedings of the ISCA Tutorial and Research Workshop (ITRW) on Speech and Emotion, Newcastle, UK, 5–7 September 2000; pp. 127–131.
- Movshovitz-Attias, Y.; Toshev, A.; Leung, T.K.; Ioffe, S.; Singh, S. No fuss distance metric learning using proxies. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 360–368.
- Busso, C.; Bulut, M.; Lee, C.-C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J.N.; Lee, S.; Narayanan, S.S. IEMOCAP: Interactive emotional dyadic motion capture database. Lang. Resour. Eval. 2008, 42, 335–359.
- Cao, H.; Cooper, D.G.; Keutmann, M.K.; Gur, R.C.; Nenkova, A.; Verma, R. CREMA-D: Crowd-sourced emotional multimodal actors dataset. IEEE Trans. Affect. Comput. 2014, 5, 377–390.
- Busso, C.; Parthasarathy, S.; Burmania, A.; AbdelWahab, M.; Sadoughi, N.; Provost, E.M. MSP-IMPROV: An acted corpus of dyadic interactions to study emotion perception. IEEE Trans. Affect. Comput. 2016, 8, 67–80.
- Lotfian, R.; Busso, C. Building Naturalistic Emotionally Balanced Speech Corpus by Retrieving Emotional Speech From Existing Podcast Recordings. IEEE Trans. Affect. Comput. 2019, 10, 471–483.
- Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. Librispeech: An ASR corpus based on public domain audio books. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 19–24 April 2015; pp. 5206–5210.
- Xu, T.-B.; Liu, C.-L. Data-distortion guided self-distillation for deep neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 29–31 January 2019; pp. 5565–5572.
- Yun, S.; Park, J.; Lee, K.; Shin, J. Regularizing class-wise predictions via self-knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13876–13885.
- Schuller, B.; Steidl, S.; Batliner, A.; Burkhardt, F.; Devillers, L.; Müller, C.; Narayanan, S.S. The INTERSPEECH 2010 paralinguistic challenge. In Proceedings of the 11th Annual Conference of the International Speech Communication Association, Chiba, Japan, 26–30 September 2010.
- Schneider, S.; Baevski, A.; Collobert, R.; Auli, M. wav2vec: Unsupervised pre-training for speech recognition. arXiv 2019, arXiv:1904.05862.
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
- Eyben, F.; Wöllmer, M.; Schuller, B. Opensmile: The Munich versatile and fast open-source audio feature extractor. In Proceedings of the 18th ACM International Conference on Multimedia, Firenze, Italy, 25–29 October 2010; pp. 1459–1462.
- Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust speech recognition via large-scale weak supervision. In Proceedings of the International Conference on Machine Learning (ICML), Honolulu, HI, USA, 23–29 July 2023.
- Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Proceedings of the NIPS’19: 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 8024–8035.
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
- Scherer, K.R.; Banse, R.; Wallbott, H.G. Emotion inferences from vocal expression correlate across languages and cultures. J. Cross-Cult. Psychol. 2001, 32, 76–92.
Corpus | #Speakers | Neutral | Happy | Sad | Angry |
---|---|---|---|---|---|
CRE [30] | 91 | 1087 | 1271 | 1270 | 1271 |
IEM [29] | 10 | 1708 | 1636 | 1084 | 1103 |
IMP [31] | 12 | 3477 | 2644 | 885 | 792 |
POD [32] | 1285+ | 26,009 | 14,285 | 2649 | 3218 |
Method | CRE | IEM | IMP | POD | Avg |
---|---|---|---|---|---|
Within-single-corpus (CE) | 66.0 | 60.1 | 49.5 | 46.0 | 55.4 |
Out-of-corpus (CE) | 51.6 | 50.1 | 38.9 | 31.9 | 43.1 |
Soft label [17] | 52.7 | 50.2 | 40.2 | 31.6 | 43.7 |
Label smoothing (LS) [18] | 53.5 | 51.5 | 39.4 | 32.4 | 44.2 |
Unigram smoothing [18] | 55.0 | 52.6 | 39.0 | 32.7 | 44.8 |
Focal loss [19] | 51.4 | 49.6 | 40.5 | 32.9 | 43.6 |
AE [20] | 55.2 | 48.9 | 42.8 | 31.3 | 44.6 |
CWW [11] | 53.8 | 52.3 | 42.7 | 33.1 | 45.5 |
Contrastive loss [25] | 52.2 | 51.4 | 42.8 | 32.7 | 44.8 |
Proxy-Anchor [26] | 52.5 | 51.5 | 43.2 | 33.4 | 44.9 |
CWW + PEL [11] | 53.5 | 51.2 | 40.7 | 38.7 | 46.0 |
AE + LS + PEL (†) | 53.1 | 52.5 | 42.5 | 36.1 | 46.0 |
CWW + † | 53.8 | 53.4 | 42.0 | 36.1 | 46.3 |
Proxy-Anchor + † | 53.1 | 52.8 | 43.8 | 35.2 | 46.2 |
CWW + Proxy-Anchor + † | 54.0 | 53.1 | 43.4 | 35.8 | 46.6 |
Self-KD | 30.1 | 33.8 | 35.2 | 26.3 | 31.4 |
Relative difficulty (RD) | 53.3 | 53.0 | 43.6 | 32.2 | 45.5 |
CWW + RD | 53.5 | 53.6 | 45.2 | 32.8 | 46.3 |
Labeling reliability (LR) | 53.1 | 52.9 | 42.9 | 32.7 | 45.4 |
LS + LR | 53.1 | 53.0 | 42.9 | 32.6 | 45.4 |
RD + CWW + † | 54.0 | 54.1 | 43.9 | 37.2 | 47.3 |
LR + CWW + AE + PEL | 54.3 | 53.6 | 43.9 | 37.4 | 47.3 |
LR + CWW + † | 54.2 | 53.6 | 43.9 | 37.4 | 47.3 |
RD + LR (RDLR) + † | 54.2 | 54.0 | 45.2 | 37.7 | 47.7 |
RDLR + CWW + AE + PEL | 56.3 | 54.1 | 44.2 | 38.7 | 48.3 |
RDLR + CWW + † (proposed) | 56.4 | 54.1 | 44.3 | 38.8 | 48.4 |
Method | CRE | IEM | IMP | POD | Avg |
---|---|---|---|---|---|
Within-single-corpus (CE) | 66.0 | 60.1 | 49.5 | 46.0 | 55.4 |
# + W2V | 70.9 | 62.6 | 53.1 | 49.0 | 58.9 |
# + W2V and BERT | 71.3 | 68.7 | 60.3 | 56.2 | 64.1 |
RDLR + CWW + † (proposed) | 56.4 | 54.1 | 44.3 | 38.8 | 48.4 |
CE # + W2V | 52.6 | 55.9 | 48.6 | 36.5 | 48.4 |
AE + LS + PEL (†) | 53.5 | 56.2 | 48.9 | 39.1 | 49.6 |
CWW + † | 53.7 | 56.8 | 49.9 | 39.3 | 49.9 |
Proxy-Anchor + † | 53.6 | 56.9 | 50.5 | 39.1 | 50.0 |
CWW + Proxy-Anchor + † | 53.7 | 56.9 | 50.6 | 39.5 | 50.2 |
RD + CWW + † | 53.4 | 57.1 | 50.6 | 41.3 | 50.6 |
LR + CWW + † | 53.8 | 57.1 | 50.0 | 40.5 | 50.4 |
RDLR + † | 55.5 | 57.1 | 49.8 | 40.7 | 50.8 |
RDLR + CWW + † (proposed) | 55.5 | 57.4 | 50.8 | 41.7 | 51.4 |
CE # + W2V and BERT | 52.0 | 60.4 | 53.1 | 42.9 | 52.1 |
AE + LS + PEL (†) | 52.6 | 61.3 | 53.6 | 44.1 | 52.9 |
CWW + † | 54.9 | 60.7 | 52.8 | 44.2 | 53.2 |
Proxy-Anchor + † | 53.5 | 60.8 | 54.0 | 44.0 | 53.1 |
CWW + Proxy-Anchor + † | 54.1 | 60.9 | 54.2 | 44.3 | 53.4 |
RD + CWW + † | 53.2 | 61.3 | 54.6 | 45.9 | 53.8 |
LR + CWW + † | 53.4 | 61.7 | 54.8 | 46.0 | 54.0 |
RDLR + † | 55.3 | 61.8 | 54.3 | 45.9 | 54.3 |
RDLR + CWW + † (proposed) | 56.3 | 61.8 | 54.8 | 46.0 | 54.7 |
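The Avg column in the two results tables is consistent with a simple unweighted mean of the four per-corpus scores (CRE, IEM, IMP, POD) rounded to one decimal place. Below is a minimal Python sketch, under that assumption, that checks the proposed rows from both tables; it is only a consistency check on the reported numbers.

```python
# Check that "Avg" matches the plain mean of the four per-corpus scores.
# Rows are the "proposed" configuration (RDLR + CWW + †) copied from the
# tables above: openSMILE features, + W2V, and + W2V and BERT.
rows = {
    "openSMILE": ([56.4, 54.1, 44.3, 38.8], 48.4),
    "+ W2V": ([55.5, 57.4, 50.8, 41.7], 51.4),
    "+ W2V and BERT": ([56.3, 61.8, 54.8, 46.0], 54.7),
}

for name, (per_corpus, reported_avg) in rows.items():
    mean = sum(per_corpus) / len(per_corpus)
    # Allow half of the last reported decimal place as rounding slack.
    assert abs(mean - reported_avg) <= 0.05 + 1e-9, (name, mean, reported_avg)
    print(f"{name}: computed mean {mean:.3f}, reported Avg {reported_avg}")
```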