Multi-Path and Group-Loss-Based Network for Speech Emotion Recognition in Multi-Domain Datasets
Abstract
1. Introduction
2. Related Works
3. Ensemble Learning Model for SER in Multi-Domain Datasets
3.1. Multi-Path Embedding Features
3.2. Group Loss
4. Evaluation
4.1. Datasets
4.2. Evaluation of the BLSTM-Based Baseline SER
4.3. Evaluation of Multi-Domain Adaptation
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Akçay, M.B.; Oğuz, K. Speech Emotion Recognition: Emotional Models, Databases, Features, Preprocessing Methods, Supporting Modalities, and Classifiers. Speech Commun. 2020, 116, 56–76.
- Hazer-Rau, D.; Meudt, S.; Daucher, A.; Spohrs, J.; Hoffmann, H.; Schwenker, F.; Traue, H.C. The UulmMAC Database—A Multimodal Affective Corpus for Affective Computing in Human-Computer Interaction. Sensors 2020, 20, 2308.
- Marín-Morales, J.; Llinares, C.; Guixeres, J.; Alcañiz, M. Emotion Recognition in Immersive Virtual Reality: From Statistics to Affective Computing. Sensors 2020, 20, 5163.
- Haq, S.; Jackson, P.J.; Edge, J. Speaker-Dependent Audio-Visual Emotion Recognition. In Proceedings of the International Conference on Auditory-Visual Speech Processing (AVSP), Norwich, UK, 10–13 September 2009; pp. 53–58.
- Vryzas, N.; Kotsakis, R.; Liatsou, A.; Dimoulas, C.A.; Kalliris, G. Speech Emotion Recognition for Performance Interaction. J. Audio Eng. Soc. 2018, 66, 457–467.
- Livingstone, S.R.; Russo, F.A. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A Dynamic, Multimodal Set of Facial and Vocal Expressions in North American English. PLoS ONE 2018, 13, e0196391.
- Busso, C.; Bulut, M.; Lee, C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J.N.; Lee, S.; Narayanan, S.S. IEMOCAP: Interactive Emotional Dyadic Motion Capture Database. Lang. Resour. Eval. 2008, 42, 335–359.
- Abdelwahab, M.; Busso, C. Supervised Domain Adaptation for Emotion Recognition from Speech. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia, 19–24 April 2015; pp. 5058–5062.
- Liang, J.; Chen, S.; Zhao, J.; Jin, Q.; Liu, H.; Lu, L. Cross-Culture Multimodal Emotion Recognition with Adversarial Learning. In Proceedings of the ICASSP 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 4000–4004.
- Schuller, B.; Vlasenko, B.; Eyben, F.; Wöllmer, M.; Stuhlsatz, A.; Wendemuth, A.; Rigoll, G. Cross-Corpus Acoustic Emotion Recognition: Variances and Strategies. IEEE Trans. Affect. Comput. 2010, 1, 119–131.
- Park, D.S.; Chan, W.; Zhang, Y.; Chiu, C.-C.; Zoph, B.; Cubuk, E.D.; Le, Q.V. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. In Proceedings of the INTERSPEECH, Graz, Austria, 15–19 September 2019.
- Bang, J.; Hur, T.; Kim, D.; Lee, J.; Han, Y.; Banos, O.; Kim, J.-I.; Lee, S. Adaptive Data Boosting Technique for Robust Personalized Speech Emotion in Emotionally-Imbalanced Small-Sample Environments. Sensors 2018, 18, 3744.
- Huang, Z.; Xue, W.; Mao, Q.; Zhan, Y. Unsupervised Domain Adaptation for Speech Emotion Recognition Using PCANet. Multimed. Tools Appl. 2017, 76, 6785–6799.
- Neumann, M. Cross-Lingual and Multilingual Speech Emotion Recognition on English and French. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5769–5773.
- Li, Y.; Yang, T.; Yang, L.; Xia, X.; Jiang, D.; Sahli, H. A Multimodal Framework for State of Mind Assessment with Sentiment Pre-Classification. In Proceedings of the 9th International Audio/Visual Emotion Challenge and Workshop, Nice, France, 21 October 2019; The Association for Computing Machinery: New York, NY, USA, 2019; pp. 13–18.
- Lee, S. The Generalization Effect for Multilingual Speech Emotion Recognition across Heterogeneous Languages. In Proceedings of the ICASSP 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 5881–5885.
- Hershey, S.; Chaudhuri, S.; Ellis, D.P.; Gemmeke, J.F.; Jansen, A.; Moore, R.C.; Plakal, M.; Platt, D.; Saurous, R.A.; Seybold, B. CNN Architectures for Large-Scale Audio Classification. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 131–135.
- Motiian, S.; Piccirilli, M.; Adjeroh, D.A.; Doretto, G. Unified Deep Supervised Domain Adaptation and Generalization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5715–5725.
- Mirsamadi, S.; Barsoum, E.; Zhang, C. Automatic Speech Emotion Recognition Using Recurrent Neural Networks with Local Attention. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 2227–2231.
- Chen, M.; He, X.; Yang, J.; Zhang, H. 3-D Convolutional Recurrent Neural Networks with Attention Model for Speech Emotion Recognition. IEEE Signal Process. Lett. 2018, 25, 1440–1444.
- Liu, Z.-T.; Wu, M.; Cao, W.-H.; Mao, J.-W.; Xu, J.-P.; Tan, G.-Z. Speech Emotion Recognition Based on Feature Selection and Extreme Learning Machine Decision Tree. Neurocomputing 2018, 273, 271–280.
- Huang, C.-W.; Narayanan, S.S. Attention Assisted Discovery of Sub-Utterance Structure in Speech Emotion Recognition. In Proceedings of the INTERSPEECH, San Francisco, CA, USA, 8–12 September 2016; pp. 1387–1391.
- Chorowski, J.K.; Bahdanau, D.; Serdyuk, D.; Cho, K.; Bengio, Y. Attention-Based Models for Speech Recognition. Adv. Neural Inf. Process. Syst. 2015, 28, 577–585.
- Anvarjon, T.; Kwon, S. Deep-Net: A Lightweight CNN-Based Speech Emotion Recognition System Using Deep Frequency Features. Sensors 2020, 20, 5212.
- Yeh, S.-L.; Lin, Y.-S.; Lee, C.-C. An Interaction-Aware Attention Network for Speech Emotion Recognition in Spoken Dialogs. In Proceedings of the ICASSP 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 6685–6689.
- Mu, Y.; Gómez, L.A.H.; Montes, A.C.; Martínez, C.A.; Wang, X.; Gao, H. Speech Emotion Recognition Using Convolutional-Recurrent Neural Networks with Attention Model. In Proceedings of the International Conference on Computer Engineering, Information Science and Internet Technology (CII), Sanya, China, 11–12 November 2017; pp. 341–350.
- Yao, Z.; Wang, Z.; Liu, W.; Liu, Y.; Pan, J. Speech Emotion Recognition Using Fusion of Three Multi-Task Learning-Based Classifiers: HSF-DNN, MS-CNN and LLD-RNN. Speech Commun. 2020, 120, 11–19.
- Jin, Q.; Li, C.; Chen, S.; Wu, H. Speech Emotion Recognition with Acoustic and Lexical Features. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia, 19–24 April 2015; pp. 4749–4753.
- Glodek, M.; Tschechne, S.; Layher, G.; Schels, M.; Brosch, T.; Scherer, S.; Kächele, M.; Schmidt, M.; Neumann, H.; Palm, G. Multiple Classifier Systems for the Classification of Audio-Visual Emotional States. In Proceedings of the International Conference on Affective Computing and Intelligent Interaction, Memphis, TN, USA, 9–12 October 2011; Springer: Berlin/Heidelberg, Germany, 2011; pp. 359–368.
- Hong, I.S.; Ko, Y.J.; Shin, H.S.; Kim, Y.J. Emotion Recognition from Korean Language Using MFCC HMM and Speech Speed. In Proceedings of the 12th International Conference on Multimedia Information Technology and Applications (MITA2016), Luang Prabang, Laos, 4–6 July 2016; pp. 12–15.
- Ntalampiras, S.; Fakotakis, N. Modeling the Temporal Evolution of Acoustic Parameters for Speech Emotion Recognition. IEEE Trans. Affect. Comput. 2011, 3, 116–125.
- Vrysis, L.; Tsipas, N.; Thoidis, I.; Dimoulas, C. 1D/2D Deep CNNs vs. Temporal Feature Integration for General Audio Classification. J. Audio Eng. Soc. 2020, 68, 66–77.
- Sandhya, P.; Spoorthy, V.; Koolagudi, S.G.; Sobhana, N.V. Spectral Features for Emotional Speaker Recognition. In Proceedings of the Third International Conference on Advances in Electronics, Computers and Communications (ICAECC), Bengaluru, India, 11–12 December 2020; pp. 1–6.
- Eyben, F.; Scherer, K.R.; Schuller, B.W.; Sundberg, J.; André, E.; Busso, C.; Devillers, L.Y.; Epps, J.; Laukka, P.; Narayanan, S.S. The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing. IEEE Trans. Affect. Comput. 2015, 7, 190–202.
- Schuller, B.; Steidl, S.; Batliner, A.; Burkhardt, F.; Devillers, L.; Müller, C.; Narayanan, S.S. The INTERSPEECH 2010 Paralinguistic Challenge. In Proceedings of the Eleventh Annual Conference of the International Speech Communication Association, Makuhari, Japan, 26–30 September 2010.
- Eyben, F.; Wöllmer, M.; Schuller, B. OpenSMILE - The Munich Versatile and Fast Open-Source Audio Feature Extractor. In Proceedings of the ACM International Conference on Multimedia (MM), Firenze, Italy, 25–29 October 2010; pp. 1459–1462.
- Jing, S.; Mao, X.; Chen, L. Prominence Features: Effective Emotional Features for Speech Emotion Recognition. Digit. Signal Process. 2018, 72, 216–231.
- Sahoo, S.; Kumar, P.; Raman, B.; Roy, P.P. A Segment Level Approach to Speech Emotion Recognition Using Transfer Learning. In Proceedings of the Asian Conference on Pattern Recognition, Auckland, New Zealand, 26–29 November 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 435–448.
- Jiang, W.; Wang, Z.; Jin, J.S.; Han, X.; Li, C. Speech Emotion Recognition with Heterogeneous Feature Unification of Deep Neural Network. Sensors 2019, 19, 2730.
- Chatziagapi, A.; Paraskevopoulos, G.; Sgouropoulos, D.; Pantazopoulos, G.; Nikandrou, M.; Giannakopoulos, T.; Katsamanis, A.; Potamianos, A.; Narayanan, S. Data Augmentation Using GANs for Speech Emotion Recognition. In Proceedings of the INTERSPEECH, Graz, Austria, 15–19 September 2019; pp. 171–175.
- Salamon, J.; Bello, J.P. Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification. IEEE Signal Process. Lett. 2017, 24, 279–283.
- Vryzas, N.; Vrysis, L.; Matsiola, M.; Kotsakis, R.; Dimoulas, C.; Kalliris, G. Continuous Speech Emotion Recognition with Convolutional Neural Networks. J. Audio Eng. Soc. 2020, 68, 14–24.
- Abdelwahab, M.; Busso, C. Active Learning for Speech Emotion Recognition Using Deep Neural Network. In Proceedings of the 8th International Conference on Affective Computing and Intelligent Interaction (ACII), Cambridge, UK, 3–6 September 2019; pp. 1–7.
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–12 December 2014; pp. 2672–2680.
- Kang, G.; Jiang, L.; Yang, Y.; Hauptmann, A.G. Contrastive Adaptation Network for Unsupervised Domain Adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 4893–4902.
- Gao, W.; McDonnell, M. Acoustic Scene Classification Using Deep Residual Networks with Focal Loss and Mild Domain Adaptation; Technical Report; Detection and Classification of Acoustic Scenes and Events (DCASE): Mawson, Australia, 2020.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
- Gemmeke, J.F.; Ellis, D.P.; Freedman, D.; Jansen, A.; Lawrence, W.; Moore, R.C.; Plakal, M.; Ritter, M. Audio Set: An Ontology and Human-Labeled Dataset for Audio Events. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 776–780.
- Dou, Q.; Coelho de Castro, D.; Kamnitsas, K.; Glocker, B. Domain Generalization via Model-Agnostic Learning of Semantic Features. Adv. Neural Inf. Process. Syst. 2019, 32, 6450–6461.
- Ekman, P.; Friesen, W.V.; Ellsworth, P. Emotion in the Human Face: Guidelines for Research and an Integration of Findings; Elsevier: Amsterdam, The Netherlands, 2013; Volume 11.
- Povolny, F.; Matejka, P.; Hradis, M.; Popková, A.; Otrusina, L.; Smrz, P.; Wood, I.; Robin, C.; Lamel, L. Multimodal Emotion Recognition for AVEC 2016 Challenge. In Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, Amsterdam, The Netherlands, 15–19 October 2016; pp. 75–82.
- Verykios, V.S.; Elmagarmid, A.K.; Bertino, E.; Saygin, Y.; Dasseni, E. Association Rule Hiding. IEEE Trans. Knowl. Data Eng. 2004, 16, 434–447.
- Kumar, S. Real-Time Implementation and Performance Evaluation of Speech Classifiers in Speech Analysis-Synthesis. ETRI J. 2020, 43, 82–94.
- Zheng, W.Q.; Yu, J.S.; Zou, Y.X. An Experimental Study of Speech Emotion Recognition Based on Deep Convolutional Neural Networks. In Proceedings of the 2015 International Conference on Affective Computing and Intelligent Interaction (ACII), Xi’an, China, 21–24 September 2015; pp. 827–831.
| Index | Association Property | | IEMOCAP | KESDy18 | KESDy19 |
|---|---|---|---|---|---|
| (a) | Valence (mean ± variation) | angry | 1.89 ± 0.52 | 2.11 ± 0.21 | 1.78 ± 0.37 |
| | | happy | 3.94 ± 0.47 | 4.42 ± 0.34 | 4.33 ± 0.36 |
| | | neutral | 2.95 ± 0.49 | 3.23 ± 0.53 | 2.94 ± 0.60 |
| | | sad | 2.24 ± 0.57 | 2.00 ± 0.33 | 1.89 ± 0.52 |
| | Arousal (mean ± variation) | angry | 3.69 ± 0.66 | 3.93 ± 0.46 | 3.81 ± 0.58 |
| | | happy | 3.16 ± 0.61 | 3.92 ± 0.36 | 3.90 ± 0.53 |
| | | neutral | 2.79 ± 0.53 | 3.08 ± 0.38 | 2.99 ± 0.33 |
| | | sad | 2.61 ± 0.61 | 2.60 ± 0.44 | 2.63 ± 0.64 |
| (b) | Confidence | Conf.({}→{}) | 0.8 | 0.95 | 0.95 |
| | | Conf.({}→{}) | 0.58 | 0.9 | 0.86 |
| | | Conf.({}→{}) | 0.85 | 0.83 | 0.71 |
| | | Conf.({}→{}) | 0.77 | 0.93 | 0.86 |
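Row (b) above appears to report association-rule confidence values, i.e., Conf(X → Y) = support(X ∪ Y) / support(X); the antecedent and consequent sets inside the braces did not survive extraction. The sketch below is not the authors' code; it only illustrates, with hypothetical emotion/arousal items, how such confidences would be computed.

```python
# Hedged sketch: association-rule confidence, Conf(X -> Y) = support(X and Y) / support(X).
# The item names ("angry", "high_arousal", ...) are placeholders for illustration only.
def rule_confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing `antecedent` that also contain `consequent`."""
    with_antecedent = [t for t in transactions if antecedent <= t]  # subset test on sets
    if not with_antecedent:
        return 0.0
    with_both = [t for t in with_antecedent if consequent <= t]
    return len(with_both) / len(with_antecedent)

# Toy "transactions": each utterance tagged with its emotion label and a coarse arousal level.
transactions = [
    {"angry", "high_arousal"},
    {"angry", "high_arousal"},
    {"sad", "low_arousal"},
    {"happy", "high_arousal"},
]
print(rule_confidence(transactions, {"angry"}, {"high_arousal"}))  # -> 1.0
```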
| Property | IEMOCAP | KESDy18 | KESDy19 ² |
|---|---|---|---|
| Language | English | Korean | Korean |
| Speakers | 10 (5 male, 5 female) | 30 (15 male, 15 female) | 40 (20 male, 20 female) |
| Utterance type | Acted (Scripted/Improvised) | Acted (Scripted) | Acted (Scripted/Improvised) |
| Datasets (Mic.) | IEMOCAP (2 Mic. of the same type) | KESDy18_PM (Galaxy S6), KESDy18_EM ¹ (Shure S35) | KESDy19_PM (Galaxy S8), KESDy19_EM (AKG C414) |
| angry | 947 | 431 | 1628 |
| happy | 507 | 157 | 1121 |
| neutral | 1320 | 1193 | 2859 |
| sad | 966 | 467 | 694 |
| Total | 3740 | 2248 | 6302 |
| Model | Dataset | Input LLDs | WA | UA | PR | F1 |
|---|---|---|---|---|---|---|
| Our baseline (SPSL: single-path-single-loss) | IEMOCAP | MFCC | 0.616 | 0.588 | 0.576 | 0.559 |
| | | Mel-spec | 0.534 | 0.525 | 0.504 | 0.491 |
| | | MFCC + Mel-spec | 0.608 | 0.58 | 0.574 | 0.562 |
| | | MFCC + Mel-spec + TimeSpectral | 0.611 | 0.59 | 0.58 | 0.575 |
| | KESDy18_EM | MFCC | 0.742 | 0.712 | 0.715 | 0.71 |
| | | Mel-spec | 0.62 | 0.57 | 0.553 | 0.556 |
| | | MFCC + Mel-spec | 0.762 | 0.736 | 0.719 | 0.724 |
| | | MFCC + Mel-spec + TimeSpectral | 0.774 | 0.738 | 0.737 | 0.734 |
| | KESDy19_EM | MFCC | 0.613 | 0.563 | 0.581 | 0.567 |
| | | Mel-spec | 0.56 | 0.483 | 0.518 | 0.491 |
| | | MFCC + Mel-spec | 0.617 | 0.562 | 0.579 | 0.568 |
| | | MFCC + Mel-spec + TimeSpectral | 0.643 | 0.595 | 0.608 | 0.599 |
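For reference, the metric columns used in the result tables are WA (weighted accuracy, i.e., overall accuracy), UA (unweighted accuracy, i.e., the mean of per-class recalls), PR (precision), and F1. A minimal scikit-learn sketch is shown below; the macro averaging for PR and F1 is an assumption, not taken from the paper.

```python
# Minimal metric sketch (assumes macro averaging for PR/F1; WA = overall accuracy and
# UA = mean per-class recall, as is conventional in speech emotion recognition).
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

def ser_metrics(y_true, y_pred):
    """Return (WA, UA, PR, F1) for lists of emotion labels."""
    wa = accuracy_score(y_true, y_pred)                                    # weighted accuracy
    ua = recall_score(y_true, y_pred, average="macro")                     # unweighted accuracy
    pr = precision_score(y_true, y_pred, average="macro", zero_division=0)
    f1 = f1_score(y_true, y_pred, average="macro")
    return wa, ua, pr, f1

# Toy example with the four emotion classes used in the paper.
y_true = ["angry", "happy", "neutral", "sad", "neutral", "angry"]
y_pred = ["angry", "neutral", "neutral", "sad", "neutral", "happy"]
print(["%.3f" % m for m in ser_metrics(y_true, y_pred)])
```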
| Model | Dataset | WA | UA | PR | F1 |
|---|---|---|---|---|---|
| Our baseline (SPSL) | IEMOCAP | 0.611 | 0.59 | 0.58 | 0.575 |
| | KESDy18_PM | 0.776 | 0.739 | 0.739 | 0.736 |
| | KESDy18_EM | 0.774 | 0.738 | 0.737 | 0.734 |
| | KESDy19_PM | 0.624 | 0.574 | 0.589 | 0.58 |
| | KESDy19_EM | 0.643 | 0.595 | 0.608 | 0.599 |
| Study | Features | Network | UA | Emotions |
|---|---|---|---|---|
| Mirsamadi [19] | 32 LLDs | RNN | 0.585 | 4 |
| Chen ¹ [20] | logMel | CRNN | 0.647 ± 0.054 | 4 |
| Mu [26] | Spectrogram | CRNN | 0.564 | 4 |
| Our baseline (SPSL) | 74 LLDs | RNN | 0.59 ± 0.08 | 4 |
| Index | Domain | Model | WA | UA | PR | F1 |
|---|---|---|---|---|---|---|
| (a) | IEMOCAP | SPSL | 0.611 | 0.59 | 0.58 | 0.575 |
| | | MPSL | 0.611 | 0.606 | 0.576 | 0.583 |
| | | MPGL | 0.619 | 0.607 | 0.582 | 0.588 |
| (b) | KESDy18_PM | SPSL | 0.776 | 0.739 | 0.739 | 0.736 |
| | | MPSL | 0.781 | 0.753 | 0.747 | 0.746 |
| | | MPGL | 0.814 | 0.778 | 0.771 | 0.773 |
| (c) | KESDy18_EM | SPSL | 0.774 | 0.738 | 0.737 | 0.734 |
| | | MPSL | 0.788 | 0.756 | 0.732 | 0.741 |
| | | MPGL | 0.797 | 0.768 | 0.761 | 0.762 |
| (d) | KESDy19_PM | SPSL | 0.624 | 0.574 | 0.589 | 0.58 |
| | | MPSL | 0.625 | 0.581 | 0.594 | 0.586 |
| | | MPGL | 0.637 | 0.586 | 0.607 | 0.594 |
| (e) | KESDy19_EM | SPSL | 0.643 | 0.595 | 0.608 | 0.599 |
| | | MPSL | 0.629 | 0.581 | 0.591 | 0.584 |
| | | MPGL | 0.642 | 0.592 | 0.608 | 0.598 |
| Index | Multi-Domain | Model | WA | UA | PR | F1 |
|---|---|---|---|---|---|---|
| (a) | KESDy18_PM, KESDy18_EM | SPSL | 0.774 | 0.749 | 0.722 | 0.731 |
| | | MPSL | 0.799 | 0.764 | 0.753 | 0.756 |
| | | MPGL | 0.806 | 0.773 | 0.766 | 0.768 |
| (b) | KESDy19_PM, KESDy19_EM | SPSL | 0.618 | 0.581 | 0.584 | 0.581 |
| | | MPSL | 0.626 | 0.58 | 0.589 | 0.584 |
| | | MPGL | 0.631 | 0.585 | 0.595 | 0.589 |
| (c) | KESDy18_PM, KESDy18_EM, KESDy19_PM, KESDy19_EM | SPSL | 0.653 | 0.628 | 0.63 | 0.628 |
| | | MPSL | 0.664 | 0.639 | 0.642 | 0.639 |
| | | MPGL | 0.663 | 0.63 | 0.639 | 0.634 |
| (d) | KESDy18_PM, KESDy18_EM, IEMOCAP | SPSL | 0.683 | 0.649 | 0.63 | 0.637 |
| | | MPSL | 0.706 | 0.675 | 0.654 | 0.66 |
| | | MPGL | 0.713 | 0.677 | 0.656 | 0.664 |
| (e) | KESDy19_PM, KESDy19_EM, IEMOCAP | SPSL | 0.599 | 0.577 | 0.575 | 0.573 |
| | | MPSL | 0.602 | 0.583 | 0.576 | 0.578 |
| | | MPGL | 0.616 | 0.587 | 0.59 | 0.588 |
| Index | Source Domain | Target Domain | Model | WA | UA | PR | F1 |
|---|---|---|---|---|---|---|---|
| (a) | KESDy18_PM, KESDy18_EM, KESDy19_EM | KESDy19_PM | SPSL | 0.594 | 0.532 | 0.563 | 0.539 |
| | | | MPSL | 0.592 | 0.53 | 0.559 | 0.536 |
| | | | MPGL | 0.606 | 0.543 | 0.573 | 0.551 |
| (b) | KESDy18_EM, IEMOCAP | KESDy18_PM | SPSL | 0.682 | 0.69 | 0.652 | 0.658 |
| | | | MPSL | 0.688 | 0.704 | 0.643 | 0.658 |
| | | | MPGL | 0.718 | 0.74 | 0.677 | 0.693 |
| (c) | KESDy19_EM, IEMOCAP | KESDy19_PM | SPSL | 0.572 | 0.55 | 0.538 | 0.538 |
| | | | MPSL | 0.577 | 0.552 | 0.545 | 0.542 |
| | | | MPGL | 0.596 | 0.555 | 0.561 | 0.554 |
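The source/target split above follows a leave-one-domain-out style of evaluation: models are trained only on the listed source-domain corpora and tested on the held-out target corpus. A hypothetical sketch of such a split is given below; `load_domain` and the sample representation are placeholders, not the authors' pipeline.

```python
# Hypothetical leave-one-domain-out split; load_domain(name) is assumed to return
# an iterable of (utterance_features, emotion_label) pairs for the named corpus.
DOMAINS = ["IEMOCAP", "KESDy18_PM", "KESDy18_EM", "KESDy19_PM", "KESDy19_EM"]

def leave_one_domain_out(load_domain, source_names, target_name):
    """Pool the source-domain corpora for training; hold out the target corpus for testing."""
    train_set = [sample for name in source_names for sample in load_domain(name)]
    test_set = list(load_domain(target_name))
    return train_set, test_set

# e.g., row (b) above: train on KESDy18_EM and IEMOCAP, evaluate on KESDy18_PM
# train_set, test_set = leave_one_domain_out(load_domain, ["KESDy18_EM", "IEMOCAP"], "KESDy18_PM")
```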
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Share and Cite
Noh, K.J.; Jeong, C.Y.; Lim, J.; Chung, S.; Kim, G.; Lim, J.M.; Jeong, H. Multi-Path and Group-Loss-Based Network for Speech Emotion Recognition in Multi-Domain Datasets. Sensors 2021, 21, 1579. https://doi.org/10.3390/s21051579