HyFusER: Hybrid Multimodal Transformer for Emotion Recognition Using Dual Cross Modal Attention
Abstract
1. Introduction
- Integration of Complementary Information: By employing Dual Cross Modal Attention, the model effectively combines information from text and speech, enabling recognition of complex emotional signals that are difficult to capture using a single modality.
- High Learning Efficiency: The design allows each modality to learn features within the context of the other, ensuring that critical cues essential for emotion recognition are not missed.
- Improved Prediction Accuracy and Reliability: By fully leveraging the strengths of both text and speech, the model significantly enhances the accuracy and reliability of emotion prediction.
- Applicability to Various Domains: The model demonstrates high performance in practical applications requiring emotion recognition, including emotion-based conversational AI, healthcare, remote education, and emotional labor support.
2. Related Work
3. Background
3.1. Text Feature Extraction: The KoELECTRA Model
3.2. Speech Feature Extraction: The HuBERT Model
3.3. Cross Modal Attention in Multimodal Environments
4. Hybrid Multimodal Transformer for Emotion Recognition
4.1. Overall Process
4.2. Extract Text and Speech Features
4.3. Hybrid Multimodal Transformer for Emotion Recognition Model
4.3.1. Intermediate Layer Fusion
- Step 1: The text embeddings are set as the Query, and the speech embeddings as the Key and Value. Through the learning process described in Equation (7), the speech features are complemented with contextual information from the text, which enhances emotion recognition performance by focusing on text-centric features.
- Step 2: The speech embeddings are set as the Query, and the text embeddings as the Key and Value. Through the learning process described in Equation (8), the temporal and acoustic characteristics of speech are used to attend to the semantic information in the text, which improves emotion recognition performance by focusing on speech-centric features. A minimal illustrative sketch of both steps follows this list.
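The two steps can be pictured with a short sketch. The fragment below is a minimal illustration, not the authors' implementation: it assumes the settings listed in the hyperparameter table (input shape (126, 768), 1 attention head, feed-forward dimension 768, ReLU, layer-normalization epsilon 1e-6, 3 layers) and uses standard Keras layers; how the three layers are stacked is likewise an assumption.

```python
import tensorflow as tf

def cross_modal_block(query_seq, context_seq, num_heads=1, ff_dim=768, eps=1e-6):
    """One cross modal Transformer block: query_seq attends to context_seq (Key/Value)."""
    attn = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=768)(
        query=query_seq, value=context_seq, key=context_seq)
    x = tf.keras.layers.LayerNormalization(epsilon=eps)(query_seq + attn)
    ff = tf.keras.layers.Dense(ff_dim, activation="relu")(x)
    ff = tf.keras.layers.Dense(query_seq.shape[-1])(ff)
    return tf.keras.layers.LayerNormalization(epsilon=eps)(x + ff)

text_in = tf.keras.Input(shape=(126, 768))    # KoELECTRA text embeddings
speech_in = tf.keras.Input(shape=(126, 768))  # HuBERT speech embeddings

text_centric, speech_centric = text_in, speech_in
for _ in range(3):  # three stacked layers, per the hyperparameter settings
    # Step 1: text as Query, speech as Key and Value (Equation (7))
    text_centric = cross_modal_block(text_centric, speech_in)
    # Step 2: speech as Query, text as Key and Value (Equation (8))
    speech_centric = cross_modal_block(speech_centric, text_in)
```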
4.3.2. Last Fusion
Algorithm 1. Hybrid Multimodal Transformer Emotion Recognition Model
Input: text data, speech data
Output: predicted emotion label
// Modal definition: load the pre-trained KoELECTRA (text) and HuBERT (speech) encoders
// Feature extraction: extract the text embeddings and the speech embeddings
// Step 1: Intermediate Layer Fusion (text embeddings as Query, speech embeddings as Key and Value)
// Step 2: Intermediate Layer Fusion (speech embeddings as Query, text embeddings as Key and Value)
// Last Fusion: average the Step 1 and Step 2 outputs and classify the final emotion
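Carrying the sketch above through the Last Fusion step of Algorithm 1: each branch is pooled and classified over the seven emotion classes, and the two softmax outputs are averaged into the final prediction. The pooling layer and head structure are assumptions; text_centric, speech_centric, text_in, and speech_in are the tensors defined in the previous sketch.

```python
import tensorflow as tf

def emotion_head(sequence, num_classes=7):
    """Pool a fused sequence and map it to emotion class probabilities."""
    pooled = tf.keras.layers.GlobalAveragePooling1D()(sequence)   # pooling choice is an assumption
    return tf.keras.layers.Dense(num_classes, activation="softmax")(pooled)

probs_step1 = emotion_head(text_centric)     # text-centric branch (Step 1)
probs_step2 = emotion_head(speech_centric)   # speech-centric branch (Step 2)

# Last Fusion: average the two probability vectors; the argmax is the predicted emotion.
final_probs = tf.keras.layers.Average()([probs_step1, probs_step2])
model = tf.keras.Model(inputs=[text_in, speech_in], outputs=final_probs)
```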
4.3.3. Hyperparameter Settings
5. Experimental Evaluation
5.1. Dataset
5.2. Experimental Results
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Choi, S.Y. A Development Study of Instructional Design Principles for Multi-Sensory Distance Learning in Elementary School. Master’s Thesis, Seoul National University, Seoul, Republic of Korea, February 2022.
- Lee, O.; Yoo, M.; Kim, D. Changes of Teachers’ Perception after Online Distance Learning Experience Due to the COVID-19 Pandemic. J. Educ. Technol. 2021, 37, 429–458.
- Sim, J.-Y.; Seo, H.-G. Remote Medical Smart Healthcare System for IoT-Based Multi-Biometric Information Measurement. J. Korea Converg. Soc. 2020, 11, 53–61.
- Cho, M.-G. A Study on Wellbeing Support System for the Elderly Using AI. J. Converg. Inf. Technol. 2021, 11, 16–24.
- Ha, T.; Lee, H. Implementation of Application for Smart Healthcare Exercise Management Based on Artificial Intelligence. J. Inst. Electron. Inf. Eng. 2020, 57, 44–53.
- Lee, N.-K.; Lee, J.-O. A Study on Mobile Personalized Healthcare Management System. KIPS Trans. Comp. Commun. Syst. 2015, 4, 197–204.
- Yoon, S.; Byun, S.; Jung, K. Multimodal Speech Emotion Recognition Using Audio and Text. In Proceedings of the IEEE Spoken Language Technology Workshop (SLT), Athens, Greece, 18–21 December 2018; IEEE: New York, NY, USA, 2018; pp. 1–6.
- Kim, Y.-J.; Roh, K.; Chae, D. Feature-Based Emotion Recognition Model Using Multimodal Data. In Proceedings of the 2023 Korean Computer Congress (KCC), Seoul, Republic of Korea, 21–23 June 2023; Korean Institute of Information Scientists and Engineers: Seoul, Republic of Korea, 2023; pp. 2169–2171.
- Park, H. Enhancement of Multimodal Emotion Recognition Classification Model through Weighted Average Ensemble of KoBART and CNN Models. In Proceedings of the 2023 Korean Computer Congress (KCC), Seoul, Republic of Korea, 21–23 June 2023; Korean Institute of Information Scientists and Engineers: Seoul, Republic of Korea, 2023; pp. 2157–2159.
- Kim, S.-S.; Yang, J.-H.; Choi, H.-S.; Go, J.-H.; Moon, N. The Research on Emotion Recognition through Multimodal Feature Combination. In Proceedings of the 2024 ASK Conference, Seoul, Republic of Korea, 15–17 May 2024; ASK Conference Proceedings: Seoul, Republic of Korea, 2024; pp. 739–740.
- Byun, Y.-C. A Study on Multimodal Korean Emotion Recognition Using Speech and Text. Master’s Thesis, Graduate School of Information and Communication, Sogang University, Seoul, Republic of Korea, 2024.
- Agarkhed, J.; Vishalakshmi. Machine Learning-Based Integrated Audio and Text Modalities for Enhanced Emotional Analysis. In Proceedings of the 5th International Conference on Inventive Research in Computing Applications (ICIRCA 2023), Bengaluru, India, 20–22 July 2023; IEEE: New York, NY, USA, 2023; pp. 989–993.
- Makhmudov, F.; Kultimuratov, A.; Cho, Y.-I. Enhancing Multimodal Emotion Recognition through Attention Mechanisms in BERT and CNN Architectures. Appl. Sci. 2024, 14, 4199.
- Feng, L.; Liu, L.-Y.; Liu, S.-L.; Zhou, J.; Yang, H.-Q.; Yang, J. Multimodal Speech Emotion Recognition Based on Multi-Scale MFCCs and Multi-View Attention Mechanism. Multimed. Tools Appl. 2023, 82, 28917–28935.
- Luo, J.; Phan, H.; Reiss, J. Cross-Modal Fusion Techniques for Utterance-Level Emotion Recognition from Text and Speech. arXiv 2023, arXiv:2302.02447.
- Patamia, R.A.; Santos, P.E.; Acheampong, K.N.; Ekong, F.; Sarpong, K.; Kun, S. Multimodal Speech Emotion Recognition Using Modality-Specific Self-Supervised Frameworks. arXiv 2023, arXiv:2312.01568.
- Wang, P.; Zeng, S.; Chen, J.; Fan, L.; Chen, M.; Wu, Y.; He, X. Leveraging Label Information for Multimodal Emotion Recognition. arXiv 2023, arXiv:2309.02106.
- Wu, Z.; Lu, Y.; Dai, X. An Empirical Study and Improvement for Speech Emotion Recognition. arXiv 2023, arXiv:2304.03899.
- Clark, K.; Luong, M.-T.; Le, Q.V.; Manning, C.D. ELECTRA: Pre-Training Text Encoders as Discriminators Rather than Generators. In Proceedings of the 8th International Conference on Learning Representations (ICLR 2020), Addis Ababa, Ethiopia, 26–30 April 2020.
- Hsu, W.-N.; Bolte, B.; Tsai, Y.-H.H.; Lakhotia, K.; Salakhutdinov, R.; Mohamed, A. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3459–3473.
- Tan, H.; Bansal, M. LXMERT: Learning Cross-Modality Encoder Representations from Transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), Hong Kong, China, 3–7 November 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019.
- Tsai, Y.-H.H.; Bai, S.; Liang, P.P.; Kolter, J.Z.; Morency, L.-P.; Salakhutdinov, R. Multimodal Transformer for Unaligned Multimodal Language Sequences. arXiv 2019, arXiv:1906.00295.
- KEMDy19 Dataset. Available online: https://nanum.etri.re.kr/share/kjnoh/KEMDy19?lang=ko_KR (accessed on 30 December 2024).
- KEMDy20 Dataset. Available online: https://nanum.etri.re.kr/share/kjnoh/KEMDy20?lang=ko_KR (accessed on 30 December 2024).
- AI-HUB Korean Emotion Recognition Dataset. Available online: https://aihub.or.kr/aihubdata/data/view.do?dataSetSn=263 (accessed on 30 December 2024).
- Yi, M.-H.; Kwak, K.-C.; Shin, J.-H. KoHMT: A Multimodal Emotion Recognition Model Integrating KoELECTRA, HuBERT with Multimodal Transformer. Electronics 2024, 13, 4674.
Model | NSMC (acc) | Naver NER (F1) | PAWS (acc) | KorNLI (acc) | KorSTS (spearman) | Question Pair (acc) | KorQuAD (Dev) (EM/F1) | Korean-Hate-Speech
---|---|---|---|---|---|---|---|---
KoBERT | 89.59 | 87.92 | 81.25 | 79.62 | 81.59 | 94.85 | 51.75 / 79.15 | 66.21
XLM-Roberta-Base | 89.03 | 86.65 | 82.80 | 80.23 | 78.45 | 93.80 | 64.70 / 88.94 | 64.06
HanBERT | 90.06 | 87.70 | 82.95 | 80.32 | 82.73 | 94.72 | 78.74 / 92.02 | 68.32
KoELECTRA-Base | 90.33 | 87.18 | 81.70 | 80.64 | 82.00 | 93.54 | 60.86 / 89.28 | 66.09
KoELECTRA-Base-V2 | 89.56 | 87.16 | 80.70 | 80.72 | 82.30 | 94.85 | 84.01 / 92.40 | 67.45
KoELECTRA-Base-V3 | 90.63 | 88.11 | 84.45 | 82.24 | 85.53 | 95.25 | 84.83 / 93.45 | 67.61
Emotion | Step 1 Model | Step 2 Model | Average Ensemble | Final Emotion
---|---|---|---|---
Angry | 0.0110 | 0.0052 | 0.0081 |
Disgust | 0.0012 | 0.0015 | 0.0013 |
Fear | 0.0079 | 0.0111 | 0.0095 |
Happy | 0.2358 | 0.3191 | 0.2775 |
Neutral | 0.5105 | 0.6250 | 0.5677 | Neutral
Sad | 0.0196 | 0.0088 | 0.0142 |
Surprise | 0.2137 | 0.0290 | 0.1213 |
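The fusion shown in this table is a plain element-wise average of the two branches' class probabilities followed by an argmax; the snippet below reproduces the numbers listed above (up to rounding).

```python
import numpy as np

emotions = ["Angry", "Disgust", "Fear", "Happy", "Neutral", "Sad", "Surprise"]
step1 = np.array([0.0110, 0.0012, 0.0079, 0.2358, 0.5105, 0.0196, 0.2137])  # Step 1 (text-centric) output
step2 = np.array([0.0052, 0.0015, 0.0111, 0.3191, 0.6250, 0.0088, 0.0290])  # Step 2 (speech-centric) output

ensemble = (step1 + step2) / 2                 # "Average Ensemble" column
print(emotions[int(np.argmax(ensemble))])      # Neutral (average probability ~0.5677)
```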
Parameter | Value |
---|---|
Input Shape | (126, 768) |
Feed-Forward Dimension | 768 |
Number of Layers | 3 |
Number of Attention Heads | 1 |
Layer Normalization Epsilon | 0.000001 |
Activation Function | ReLU |
Optimizer | Adam (Learning Rate 0.0001) |
Early Stopping Monitor | Val Sparse Categorical Accuracy |
Early Stopping Patience | 3 |
Loss | Sparse Categorical Crossentropy |
Model Checkpoint Monitor | Val Sparse Categorical Accuracy |
Batch Size | 32 |
Epochs | 30 |
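The monitor names and loss in the table read as TensorFlow/Keras settings; under that assumption, they map onto a compile/fit configuration along the following lines. This is an illustrative sketch rather than the authors' released training script, and the data variables and checkpoint filename are placeholders.

```python
import tensorflow as tf

# model is the two-branch network sketched earlier; placeholders below stand for the
# pre-extracted embedding arrays and integer emotion labels.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss="sparse_categorical_crossentropy",
    metrics=["sparse_categorical_accuracy"],
)

callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor="val_sparse_categorical_accuracy", patience=3),
    tf.keras.callbacks.ModelCheckpoint("hyfuser_best.keras",
                                       monitor="val_sparse_categorical_accuracy",
                                       save_best_only=True),
]

model.fit(
    [text_train, speech_train], y_train,
    validation_data=([text_val, speech_val], y_val),
    batch_size=32,
    epochs=30,
    callbacks=callbacks,
)
```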
Emotion | Collected Data | Final Data Used |
---|---|---|
Angry | 1352 | 1352 |
Disgust | 446 | 446 |
Fear | 840 | 840 |
Happy | 2643 | 2643 |
Neutral | 15,082 | 2643 |
Sad | 1028 | 1028 |
Surprise | 1429 | 1429 |
Total | 22,820 | 10,381 |
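The table indicates that only the Neutral class was reduced (from 15,082 collected samples to 2643 used, the same count as Happy), giving 10,381 samples in total. A hypothetical down-sampling step along these lines would yield that split; the DataFrame, column name, and random seed are assumptions, not details given in the paper.

```python
import pandas as pd

# df is assumed to hold one row per utterance with an "emotion" label column.
neutral = df[df["emotion"] == "Neutral"].sample(n=2643, random_state=42)   # down-sample Neutral only
others = df[df["emotion"] != "Neutral"]                                    # other classes kept as collected
balanced = pd.concat([others, neutral]).reset_index(drop=True)             # 10,381 samples in total
```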
Model | Accuracy (%) | Recall (Weighted) | Precision (Weighted) | F1-Score (Weighted) |
---|---|---|---|---|
LSTM (Text) | 59.22 | 0.5922 | 0.6278 | 0.6046 |
CNN (Speech) | 51.85 | 0.5185 | 0.6243 | 0.5584 |
Single modality average ensemble | 60.85 | 0.6085 | 0.6660 | 0.6292 |
Single modality weighted average ensemble | 61.00 | 0.6100 | 0.6638 | 0.6292 |
Bidirectional Cross Modal Attention | 64.08 | 0.6408 | 0.6604 | 0.6419 |
KoHMT | 65.62 | 0.6562 | 0.6739 | 0.6628 |
HyFusER | 67.83 | 0.6783 | 0.6890 | 0.6823 |
Model | Accuracy (%) | Recall (Weighted) | Precision (Weighted) | F1-Score (Weighted) |
---|---|---|---|---|
LSTM (Text) | 70.38 | 0.7038 | 0.7086 | 0.7039 |
CNN (Speech) | 56.38 | 0.5638 | 0.5958 | 0.5703 |
Single modality average ensemble | 70.52 | 0.7052 | 0.7232 | 0.7075 |
Single modality weighted average ensemble | 71.52 | 0.7152 | 0.7247 | 0.7164 |
Bidirectional Cross Modal Attention | 77.63 | 0.7763 | 0.7827 | 0.7770 |
KoHMT | 77.45 | 0.7745 | 0.7780 | 0.7744 |
HyFusER | 79.77 | 0.7977 | 0.7975 | 0.7975 |
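The recall, precision, and F1 values reported in both result tables are averages weighted by class frequency. A typical way to obtain them, though not necessarily the authors' exact evaluation script, is scikit-learn's weighted averaging:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# y_true and y_pred are placeholder arrays of integer emotion labels and model predictions.
accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="weighted")
print(f"Accuracy {accuracy:.4f} | Precision {precision:.4f} | Recall {recall:.4f} | F1 {f1:.4f}")
```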