Cross-Corpus Training Strategy for Speech Emotion Recognition Using Self-Supervised Representations
Abstract
1. Introduction
- It assesses the effectiveness of the cross-corpus strategy for SER systems.
- It evaluates the incorporation of multiple languages within the cross-corpus strategy.
- It investigates the feasibility of training SER systems using predominantly out-domain data with a limited amount of in-domain data.
2. Previous Work
3. Materials: Experimental Setup
3.1. Databases
3.1.1. Speech Emotion Collection
3.1.2. Special Case: IEMOCAP Database
3.2. Performance Metrics
4. Methods: Speech Emotion Recognition System
4.1. Feature Extraction
4.1.1. HuBERT
4.1.2. WavLM
4.2. Classification
4.2.1. Pooling + Linear Layer
4.2.2. CNN Self Attention
4.2.3. Class Token Transformer
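Sections 4.2.1–4.2.3 name the three classification heads applied on top of the HuBERT/WavLM representations. Purely as a hypothetical illustration of the simplest head (Section 4.2.1, temporal pooling followed by a linear layer), here is a minimal PyTorch sketch; the layer sizes, the assumption of frozen pre-extracted frame embeddings, and all names are ours, not the authors' implementation:

```python
import torch
import torch.nn as nn

class PoolingLinearHead(nn.Module):
    """Hypothetical sketch: average self-supervised frame embeddings over time,
    then map the pooled vector to emotion logits with a single linear layer."""

    def __init__(self, feat_dim: int = 768, num_emotions: int = 4):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_emotions)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, feat_dim) embeddings from HuBERT or WavLM
        pooled = frames.mean(dim=1)     # temporal average pooling
        return self.classifier(pooled)  # (batch, num_emotions) logits

# Toy usage: a batch of 2 utterances, 250 frames each, 768-dim base-model features
head = PoolingLinearHead()
logits = head(torch.randn(2, 250, 768))
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 3]))  # CE loss (see Abbreviations)
```

The CNN self-attention and class-token Transformer heads of Sections 4.2.2 and 4.2.3 would consume the same frame-level embeddings and differ mainly in how they aggregate them over time.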
5. Experiments
5.1. Experiments Description
5.1.1. Experiment 1: Matched vs. Cross-Corpus Training with EmoDb, RAVDESS, and IEMOCAP Databases
5.1.2. Experiment 2: Matched vs. Cross-Corpus Training with EmoDb, RAVDESS, IEMOCAP, and CREMA-D Databases
5.1.3. Validity of Out-Domain Training
5.2. Experimental Framework Description
6. Results and Discussion
6.1. Experiment 1: Cross-Corpus Strategy for Training
6.1.1. Analysis of the Performance among Representations and Classifiers
Analysis of the System Errors among Emotions
6.2. Experiment 2: Effect of Dataset Size and Diversity in Cross-Corpus Strategy for Training
6.3. Experiment 3: Role of In-Domain Data in the Training Set
Analysis of the System Errors among Emotions
7. Conclusions
Author Contributions
Funding
Conflicts of Interest
Abbreviations
Abbreviation | Meaning
---|---
SER | Speech Emotion Recognition
DNN | Deep Neural Network
SS | Self-Supervised
TP | True Positives
FP | False Positives
TN | True Negatives
FN | False Negatives
UAR | Unweighted Average Recall
MFCC | Mel Frequency Cepstral Coefficient
CNN | Convolutional Neural Network
CT-Transformer | Class Token Transformer
CE | Cross Entropy
Database | Time (m) | Lang. | Spk | #Neutral | #Happy | #Anger | #Sad | Text |
---|---|---|---|---|---|---|---|---|
EmoDb | 16 | Ger. | 10 | 79 | 71 | 127 | 62 | Read |
RAVDESS | 42 | Eng. | 24 | 96 | 192 | 192 | 192 | Read |
CREMA-D | 203 | Eng. | 91 | 1087 | 1271 | 1271 | 1271 | Read |
IEMOCAP | 420 | Eng. | 10 | 1708 | 1636 | 1103 | 1084 | Improv. |
MATCHED TRAIN | EXTENDED TRAIN | TEST
---|---|---
EmoDb | EmoDb + RAVDESS + IEMOCAP | EmoDb
RAVDESS | EmoDb + RAVDESS + IEMOCAP | RAVDESS
IEMOCAP | EmoDb + RAVDESS + IEMOCAP | IEMOCAP
MATCHED TRAIN | EXTENDED TRAIN | TEST
---|---|---
IEMOCAP | EmoDb + RAVDESS + CREMA-D + IEMOCAP | IEMOCAP
CREMA-D | EmoDb + RAVDESS + CREMA-D + IEMOCAP | CREMA-D
STEP | EmoDb TRAIN | EmoDb TEST | RAVDESS TRAIN | RAVDESS TEST
---|---|---|---|---
0 | IEMOCAP | 2 speakers of EmoDb | IEMOCAP | 4 speakers of RAVDESS
1 | IEMOCAP + 3 min EmoDb (2 speakers) | 2 speakers of EmoDb | IEMOCAP + 7 min RAVDESS (4 speakers) | 4 speakers of RAVDESS
2 | IEMOCAP + 6 min EmoDb (4 speakers) | 2 speakers of EmoDb | IEMOCAP + 14 min RAVDESS (8 speakers) | 4 speakers of RAVDESS
3 | IEMOCAP + 9 min EmoDb (8 speakers) | 2 speakers of EmoDb | IEMOCAP + 21 min RAVDESS (12 speakers) | 4 speakers of RAVDESS
4 | IEMOCAP + 12 min EmoDb (10 speakers) | 2 speakers of EmoDb | IEMOCAP + 28 min RAVDESS (16 speakers) | 4 speakers of RAVDESS
5 | – | – | IEMOCAP + 35 min RAVDESS (20 speakers) | 4 speakers of RAVDESS
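The table encodes the incremental protocol of Experiment 3: training always starts from the out-domain corpus (IEMOCAP) and, step by step, gains the utterances of a few more in-domain speakers, while the test speakers stay held out. A minimal sketch of how such step-wise training lists could be assembled; the helper and loader names are illustrative assumptions, not the authors' code:

```python
import random

def build_step_training_list(out_domain_utts, in_domain_by_speaker, added_speakers, seed=0):
    """Assemble one step of the incremental protocol: all out-domain utterances
    (IEMOCAP) plus every utterance from the in-domain speakers added so far
    (EmoDb or RAVDESS). Held-out test speakers never enter the pool."""
    train = list(out_domain_utts)
    for speaker in added_speakers:
        train.extend(in_domain_by_speaker[speaker])
    random.Random(seed).shuffle(train)
    return train

# Hypothetical usage for EmoDb, step 2 of the table (4 in-domain speakers added):
# iemocap_utts = load_utterances("iemocap")               # out-domain pool
# emodb_by_speaker = load_utterances_by_speaker("emodb")  # excludes the 2 test speakers
# train_step2 = build_step_training_list(iemocap_utts, emodb_by_speaker,
#                                         ["spk03", "spk08", "spk09", "spk10"])
```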
Classifier | MATCHED TRAIN (HuBERT) | MATCHED TRAIN (WavLM) | EXTENDED TRAIN (HuBERT) | EXTENDED TRAIN (WavLM)
---|---|---|---|---
Evaluation dataset: EmoDb | | | |
MLP | 87.86% | 85.64% | 84.64% | 84.64%
CNNSelfAtt | 87.34% | 89.70% | 87.26% | 87.02%
Transformer | 90.60% | 90.64% | 81.16% | 79.44%
Evaluation dataset: RAVDESS | | | |
MLP | 68.34% | 70.82% | 64.14% | 65.70%
CNNSelfAtt | 62.44% | 67.04% | 68.88% | 70.92%
Transformer | 64.78% | 66.58% | 69.26% | 69.14%
Evaluation dataset: IEMOCAP | | | |
MLP | 63.52% | 63.38% | 63.38% | 63.21%
CNNSelfAtt | 64.94% | 65.90% | 65.75% | 66.90%
Transformer | 60.82% | 62.08% | 61.90% | 62.98%
Real Value \ Predicted Value | Neutral | Happiness | Anger | Sadness
---|---|---|---|---
Neutral | 976 | 345 | 136 | 251
Happiness | 338 | 981 | 167 | 150
Anger | 142 | 155 | 771 | 35
Sadness | 220 | 116 | 9 | 709
Emotion | Neutral | Happiness | Anger | Sadness |
---|---|---|---|---
Recall | 59.37% | 65.06% | 71.81% | 66.72% |
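UAR, listed in the abbreviations and commonly used as the headline metric for imbalanced SER test sets, is the unweighted mean of the per-class recalls, so every emotion counts equally regardless of its number of test utterances. A minimal NumPy sketch of the computation, reusing the confusion matrix above purely as example input (the recall table directly above reports the authors' own values for their evaluation condition and need not coincide with what this toy computation prints):

```python
import numpy as np

# Confusion matrix from the table above: rows are the true emotion,
# columns the predicted one, in the order neutral, happiness, anger, sadness.
confusion = np.array([
    [976, 345, 136, 251],   # neutral
    [338, 981, 167, 150],   # happiness
    [142, 155, 771,  35],   # anger
    [220, 116,   9, 709],   # sadness
])

# Recall per emotion: TP / (TP + FN), i.e. the diagonal divided by the row sum.
recall_per_emotion = np.diag(confusion) / confusion.sum(axis=1)

# Unweighted Average Recall: the plain mean, so every class weighs the same.
uar = recall_per_emotion.mean()

for name, r in zip(["neutral", "happiness", "anger", "sadness"], recall_per_emotion):
    print(f"{name:>9s} recall: {r:.2%}")
print(f"      UAR: {uar:.2%}")
```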
Classifier | MATCHED TRAIN (HuBERT) | MATCHED TRAIN (WavLM) | EXTENDED TRAIN (HuBERT) | EXTENDED TRAIN (WavLM)
---|---|---|---|---
Evaluation dataset: IEMOCAP | | | |
MLP | 63.52% | 63.38% | 62.56% | 63.28%
CNNSelfAtt | 64.94% | 65.90% | 64.72% | 65.06%
Transformer | 60.82% | 62.08% | 62.56% | 62.85%
Evaluation dataset: CREMA-D | | | |
MLP | 81.64% | 80.85% | 80.20% | 79.19%
CNNSelfAtt | 80.75% | 78.51% | 79.80% | 78.09%
Transformer | 83.04% | 79.57% | 72.99% | 78.83%
Test Database | Step 0 | Step 1 | Step 2 | Step 3 | Step 4 | Step 5
---|---|---|---|---|---|---
EmoDb | Baseline | 28.82% | 36.59% | 38.38% | 43.99% | –
RAVDESS | Baseline | 15.31% | 22.18% | 27.03% | 29.68% | 35.15%
Real Value \ Predicted Value | Neutral | Happiness | Anger | Sadness
---|---|---|---|---
Neutral | 77 | 0 | 0 | 2
Happiness | 2 | 51 | 18 | 0
Anger | 0 | 20 | 107 | 0
Sadness | 4 | 0 | 0 | 58
Real Value \ Predicted Value | Neutral | Happiness | Anger | Sadness
---|---|---|---|---
Neutral | 52 | 7 | 1 | 20
Happiness | 10 | 102 | 15 | 33
Anger | 3 | 20 | 124 | 13
Sadness | 13 | 24 | 7 | 116
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).