Heterogeneous Hierarchical Fusion Network for Multimodal Sentiment Analysis in Real-World Environments
Abstract
1. Introduction
- We propose the MSWOHHF model, which enhances the robustness of sentiment prediction by using a Multimodal Sentiment Word Optimization Module to dynamically restore the emotional semantics of the error-corrupted ASR textual modality.
- We develop a novel heterogeneous hierarchical multimodal fusion network that enables effective interaction among the three modalities, which differ in information density. The network incorporates a fusion module built on attention aggregation and cross-modal attention mechanisms, enabling balanced and efficient complementary learning across modalities.
- To assess each modality’s contribution during fusion, we design a feature-based attention module that dynamically adjusts the weight assigned to each modality’s representation (the cross-modal attention and feature-based weighting mentioned above are illustrated in the sketch following this list).
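To make these fusion ideas concrete, the sketch below shows a minimal PyTorch-style implementation of the two attention mechanisms named in the contributions: a cross-modal attention block in which one modality's sequence queries another, and a feature-based attention module that learns per-modality weights before the final fusion. The module names, dimensions, text-centred pairing, and pooling choices are illustrative assumptions for exposition, not the exact MSWOHHF architecture described in Sections 3.3–3.5.

```python
# Illustrative sketch only: layer sizes, module names, and wiring are assumptions
# for exposition, not the exact MSWOHHF architecture.
import torch
import torch.nn as nn


class CrossModalAttention(nn.Module):
    """One modality's sequence (query) attends over another modality (key/value)."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_seq: torch.Tensor, context_seq: torch.Tensor) -> torch.Tensor:
        # query_seq: (B, Lq, dim); context_seq: (B, Lk, dim)
        attended, _ = self.attn(query_seq, context_seq, context_seq)
        return self.norm(query_seq + attended)  # residual connection


class FeatureBasedAttentionFusion(nn.Module):
    """Scores each modality's pooled representation and fuses with learned weights."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, modality_feats: list) -> torch.Tensor:
        # modality_feats: list of (B, dim) vectors, one per modality
        stacked = torch.stack(modality_feats, dim=1)         # (B, M, dim)
        weights = torch.softmax(self.score(stacked), dim=1)  # (B, M, 1), per-modality weight
        return (weights * stacked).sum(dim=1)                # (B, dim) fused representation


if __name__ == "__main__":
    B, dim = 8, 128
    text = torch.randn(B, 50, dim)     # token-level text features (e.g., from BERT)
    audio = torch.randn(B, 375, dim)   # frame-level audio features
    visual = torch.randn(B, 500, dim)  # frame-level visual features

    cma = CrossModalAttention(dim)
    # Text queries the audio/visual context (one possible pairing; others exist).
    text_av = cma(text, torch.cat([audio, visual], dim=1))

    fuse = FeatureBasedAttentionFusion(dim)
    fused = fuse([text_av.mean(dim=1), audio.mean(dim=1), visual.mean(dim=1)])
    sentiment = nn.Linear(dim, 1)(fused)  # regression head predicting sentiment intensity
    print(sentiment.shape)                # torch.Size([8, 1])
```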
2. Related Work
3. Methodology
3.1. Multimodal Sentiment Word Optimization Module (MSWOM)
3.2. Unimodal Features Extraction
3.3. Transformer Aggregation Module (TAM)
3.4. Cross-Attention Fusion Module (CAFM)
3.5. Feature-Based Attention Fusion Module (FBAFM)
3.6. Regression Analysis
4. Experiments
4.1. Dataset
4.2. Implementation Details
4.3. Baseline
4.4. Results and Analysis
4.5. Ablation Study
4.5.1. Modal Ablation
4.5.2. Model Ablation
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Shi, Q.; Fan, J.; Wang, Z.; Zhang, Z. Multimodal channel-wise attention transformer inspired by multisensory integration mechanisms of the brain. Pattern Recognit. 2022, 130, 108837. [Google Scholar] [CrossRef]
- Baltrušaitis, T.; Ahuja, C.; Morency, L.-P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 423–443. [Google Scholar] [CrossRef] [PubMed]
- Wu, T.; Peng, J.; Zhang, W.; Zhang, H.; Tan, S.; Yi, F.; Ma, C.; Huang, Y. Video sentiment analysis with bimodal information-augmented multi-head attention. Knowl.-Based Syst. 2022, 235, 107676. [Google Scholar] [CrossRef]
- Han, W.; Chen, H.; Poria, S. Improving multimodal fusion with hierarchical mutual information maximization for multimodal sentiment analysis. arXiv 2021, arXiv:2109.00412. [Google Scholar]
- Hazarika, D.; Zimmermann, R.; Poria, S. MISA: Modality-invariant and -specific representations for multimodal sentiment analysis. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1122–1131. [Google Scholar]
- Liu, Y.; Liu, L.; Guo, Y.; Lew, M.S. Learning visual and textual representations for multimodal matching and classification. Pattern Recognit. 2018, 84, 51–67. [Google Scholar] [CrossRef]
- Poria, S.; Cambria, E.; Hazarika, D.; Mazumder, N.; Zadeh, A.; Morency, L.-P. Multi-level multiple attentions for contextual multimodal sentiment analysis. In Proceedings of the 2017 IEEE International Conference on Data Mining (ICDM), New Orleans, LA, USA, 18–21 November 2017; pp. 1033–1038. [Google Scholar]
- Tang, J.; Li, K.; Jin, X.; Cichocki, A.; Zhao, Q.; Kong, W. CTFN: Hierarchical learning for multimodal sentiment analysis using coupled-translation fusion network. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Virtual Event, 1–6 August 2021; pp. 5301–5311. [Google Scholar]
- Tsai, Y.-H.H.; Bai, S.; Liang, P.P.; Kolter, J.Z.; Morency, L.-P.; Salakhutdinov, R. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; p. 6558. [Google Scholar]
- Yu, W.; Xu, H.; Yuan, Z.; Wu, J. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; pp. 10790–10797. [Google Scholar]
- Chen, M.; Wang, S.; Liang, P.P.; Baltrušaitis, T.; Zadeh, A.; Morency, L.-P. Multimodal sentiment analysis with word-level fusion and reinforcement learning. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, Glasgow, UK, 13–17 November 2017; pp. 163–171. [Google Scholar]
- Wu, Y.; Lin, Z.; Zhao, Y.; Qin, B.; Zhu, L.-N. A text-centered shared-private framework via cross-modal prediction for multimodal sentiment analysis. In Proceedings of the Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online Event, 1–6 August 2021; pp. 4730–4738. [Google Scholar]
- Prabowo, R.; Thelwall, M. Sentiment analysis: A combined approach. J. Informetr. 2009, 3, 143–157. [Google Scholar] [CrossRef]
- Goldberg, Y.; Levy, O. word2vec Explained: Deriving Mikolov et al.’s negative-sampling word-embedding method. arXiv 2014, arXiv:1402.3722. [Google Scholar]
- Yang, H.-J.; Lee, G.-S.; Kim, S.-H. End-to-end learning for multimodal emotion recognition in video with adaptive loss. IEEE Multimed. 2021, 28, 59–66. [Google Scholar]
- Pennington, J.; Socher, R.; Manning, C.D. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543. [Google Scholar]
- Peters, M.E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep contextualized word representations. arXiv 2018, arXiv:1802.05365. [Google Scholar]
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
- Chen, F.; Sun, Z.; Ouyang, D.; Liu, X.; Shao, J. Learning what and when to drop: Adaptive multimodal and contextual dynamics for emotion recognition in conversation. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event, 20–24 October 2021; pp. 1064–1073. [Google Scholar]
- Baltrušaitis, T.; Robinson, P.; Morency, L.-P. Openface: An open source facial behavior analysis toolkit. In Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, USA, 7–10 March 2016; pp. 1–10. [Google Scholar]
- Chen, R.; Zhou, W.; Li, Y.; Zhou, H. Video-based cross-modal auxiliary network for multimodal sentiment analysis. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 8703–8716. [Google Scholar] [CrossRef]
- Ma, Y.; Hao, Y.; Chen, M.; Chen, J.; Lu, P.; Košir, A. Audio-visual emotion fusion (AVEF): A deep efficient weighted approach. Inf. Fusion 2019, 46, 184–192. [Google Scholar] [CrossRef]
- Tsai, Y.-H.H.; Liang, P.P.; Zadeh, A.; Morency, L.-P.; Salakhutdinov, R. Learning factorized multimodal representations. arXiv 2018, arXiv:1806.06176. [Google Scholar]
- Zhu, T.; Li, L.; Yang, J.; Zhao, S.; Liu, H.; Qian, J. Multimodal sentiment analysis with image-text interaction network. IEEE Trans. Multimed. 2022, 25, 3375–3385. [Google Scholar] [CrossRef]
- Xu, N.; Mao, W. Multisentinet: A deep semantic network for multimodal sentiment analysis. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, Singapore, 6–10 November 2017; pp. 2399–2402. [Google Scholar]
- Mazloom, M.; Rietveld, R.; Rudinac, S.; Worring, M.; Van Dolen, W. Multimodal popularity prediction of brand-related social media posts. In Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; pp. 197–201. [Google Scholar]
- Pérez-Rosas, V.; Mihalcea, R.; Morency, L.-P. Utterance-level multimodal sentiment analysis. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria, 4–9 August 2013. [Google Scholar]
- Atrey, P.K.; Hossain, M.A.; El Saddik, A.; Kankanhalli, M.S. Multimodal fusion for multimedia analysis: A survey. Multimed. Syst. 2010, 16, 345–379. [Google Scholar] [CrossRef]
- Zadeh, A.; Chen, M.; Poria, S.; Cambria, E.; Morency, L.-P. Tensor fusion network for multimodal sentiment analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 9–11 September 2017. [Google Scholar]
- Liu, Z.; Shen, Y.; Lakshminarasimhan, V.B.; Liang, P.P.; Zadeh, A.; Morency, L.-P. Efficient low-rank multimodal fusion with modality-specific factors. arXiv 2018, arXiv:1806.00064. [Google Scholar]
- Yang, X.; Molchanov, P.; Kautz, J. Multilayer and multimodal fusion of deep neural networks for video classification. In Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; pp. 978–987. [Google Scholar]
- Agarwal, A.; Yadav, A.; Vishwakarma, D.K. Multimodal sentiment analysis via RNN variants. In Proceedings of the 2019 IEEE International Conference on Big Data, Cloud Computing, Data Science & Engineering (BCD), Honolulu, HI, USA, 29–31 May 2019; pp. 19–23. [Google Scholar]
- Poria, S.; Cambria, E.; Hazarika, D.; Majumder, N.; Zadeh, A.; Morency, L.-P. Context-dependent sentiment analysis in user-generated videos. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, BC, Canada, 30 July–4 August 2017; pp. 873–883. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
- Fu, Z.; Liu, F.; Xu, Q.; Qi, J.; Fu, X.; Zhou, A.; Li, Z. NHFNET: A non-homogeneous fusion network for multimodal sentiment analysis. In Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 18–22 July 2022; pp. 1–6. [Google Scholar]
- Huddar, M.G.; Sannakki, S.S.; Rajpurohit, V.S. Multi-level context extraction and attention-based contextual inter-modal fusion for multimodal sentiment analysis and emotion classification. Int. J. Multimed. Inf. Retr. 2020, 9, 103–112. [Google Scholar] [CrossRef]
- Pham, H.; Liang, P.P.; Manzini, T.; Morency, L.-P.; Póczos, B. Found in translation: Learning robust joint representations by cyclic translations between modalities. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 6892–6899. [Google Scholar]
- Lei, Y.; Yang, D.; Li, M.; Wang, S.; Chen, J.; Zhang, L. Text-oriented modality reinforcement network for multimodal sentiment analysis from unaligned multimodal sequences. In Proceedings of the CAAI International Conference on Artificial Intelligence, Fuzhou, China, 22–23 July 2023; pp. 189–200. [Google Scholar]
- Zhu, C.; Chen, M.; Zhang, S.; Sun, C.; Liang, H.; Liu, Y.; Chen, J. SKEAFN: Sentiment Knowledge Enhanced Attention Fusion Network for multimodal sentiment analysis. Inf. Fusion 2023, 100. [Google Scholar] [CrossRef]
- Wu, Y.; Zhao, Y.; Yang, H.; Chen, S.; Qin, B.; Cao, X.; Zhao, W. Sentiment word aware multimodal refinement for multimodal sentiment analysis with ASR errors. arXiv 2022, arXiv:2203.00257. [Google Scholar]
- Zadeh, A.; Zellers, R.; Pincus, E.; Morency, L.-P. Mosi: Multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. arXiv 2016, arXiv:1606.06259. [Google Scholar]
- Ravanelli, M.; Parcollet, T.; Plantinga, P.; Rouhe, A.; Cornell, S.; Lugosch, L.; Subakan, C.; Dawalatabad, N.; Heba, A.; Zhong, J.; et al. SpeechBrain: A general-purpose speech toolkit. arXiv 2021, arXiv:2106.04624. [Google Scholar]
- Zeng, J.; Liu, T.; Zhou, J. Tag-assisted multimodal sentiment analysis under uncertain missing modalities. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, 11–15 July 2022; pp. 1545–1554. [Google Scholar]
No. | Dataset | Score |
---|---|---|
Ds 1 | MOSI-SpeechBrain | 73.5 |
Ds 2 | MOSI-IBM | 82.4 |
Ds 3 | MOSI-iFlytek | 89.4 |
Dataset No. | Model | Has0-Acc ↑ | Has0-F1 ↑ | Non0-Acc ↑ | Non0-F1 ↑ | MAE ↓ | Corr ↑
---|---|---|---|---|---|---|---
Ds 1 | MulT | 71.78 | 71.70 | 72.74 | 72.75 | 109.00 | 54.69
Ds 1 | Self-MM | 73.67 | 73.72 | 74.85 | 74.98 | 90.95 | 67.23
Ds 1 | MMIM | 73.81 | 73.93 | 75.02 | 75.11 | 90.83 | 67.43
Ds 1 | TATE | 74.41 | 74.48 | 75.63 | 75.69 | 90.48 | 67.52
Ds 1 | SWRM | 74.58 | 74.62 | 75.70 | 75.82 | 90.56 | 67.47
Ds 1 | Ours | 76.82 | 76.74 | 78.20 | 78.18 | 87.23 | 68.35
Ds 2 | MulT | 75.57 | 75.54 | 76.74 | 76.79 | 100.32 | 64.34
Ds 2 | Self-MM | 77.32 | 77.37 | 78.60 | 78.72 | 85.65 | 73.23
Ds 2 | MMIM | 78.28 | 78.30 | 79.02 | 79.08 | 83.04 | 73.68
Ds 2 | TATE | 78.51 | 78.63 | 79.64 | 79.72 | 83.22 | 73.84
Ds 2 | SWRM | 78.43 | 78.47 | 79.70 | 79.80 | 82.91 | 73.91
Ds 2 | Ours | 80.52 | 80.63 | 82.05 | 82.31 | 79.12 | 75.78
Ds 3 | MulT | 77.32 | 77.05 | 78.75 | 78.56 | 89.84 | 68.14
Ds 3 | Self-MM | 80.26 | 80.26 | 81.16 | 81.20 | 78.79 | 75.83
Ds 3 | MMIM | 79.24 | 79.36 | 79.89 | 80.08 | 78.50 | 75.62
Ds 3 | TATE | 81.32 | 81.38 | 82.06 | 82.10 | 76.98 | 76.60
Ds 3 | SWRM | 80.47 | 80.47 | 81.28 | 81.34 | 78.39 | 75.97
Ds 3 | Ours | 82.31 | 82.25 | 83.73 | 83.78 | 74.74 | 76.89
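For reference, Has0-Acc/Has0-F1 denote the binary accuracy and F1 when samples labeled exactly zero are grouped with the non-negative class, Non0-Acc/Non0-F1 exclude zero-labeled samples and split negative vs. positive, MAE is the mean absolute error, and Corr is the Pearson correlation; MAE and Corr appear to be reported scaled by 100 in the tables. The sketch below follows the metric definitions commonly used for CMU-MOSI regression outputs; the authors' exact evaluation script may differ.

```python
# Hedged sketch of the evaluation metrics as they are commonly defined for
# CMU-MOSI-style sentiment regression; the authors' exact script may differ.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import accuracy_score, f1_score


def mosi_metrics(preds: np.ndarray, labels: np.ndarray) -> dict:
    """preds, labels: continuous sentiment scores, typically in [-3, 3]."""
    mae = float(np.mean(np.abs(preds - labels)))
    corr = float(pearsonr(preds, labels)[0])

    # Has0: zero labels count as non-negative (negative vs. non-negative split).
    has0_true, has0_pred = labels >= 0, preds >= 0
    has0_acc = accuracy_score(has0_true, has0_pred)
    has0_f1 = f1_score(has0_true, has0_pred, average="weighted")

    # Non0: drop samples labeled exactly zero, then negative vs. positive split.
    mask = labels != 0
    non0_true, non0_pred = labels[mask] > 0, preds[mask] > 0
    non0_acc = accuracy_score(non0_true, non0_pred)
    non0_f1 = f1_score(non0_true, non0_pred, average="weighted")

    return {"Has0-Acc": has0_acc, "Has0-F1": has0_f1,
            "Non0-Acc": non0_acc, "Non0-F1": non0_f1,
            "MAE": mae, "Corr": corr}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = rng.uniform(-3, 3, size=200)
    preds = labels + rng.normal(0, 0.8, size=200)  # noisy predictions for demonstration
    print(mosi_metrics(preds, labels))
```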
Task | Has0-Acc ↑ | Non0-Acc ↑ | MAE ↓
---|---|---|---
Textual | 73.32 | 75.09 | 94.22
Audio | 44.75 | 45.10 | 144.32
Visual | 55.24 | 56.30 | 140.30
Textual, Visual | 73.92 | 75.36 | 91.43
Textual, Audio | 73.56 | 74.70 | 92.46
Visual, Audio | 58.23 | 59.86 | 136.75
Textual, Visual, Audio | 74.20 | 75.12 | 89.10
MSWOHHF | 76.82 | 78.20 | 87.23
Dataset No. | Model | Has0-Acc ↑ | Non0-Acc ↑ | MAE ↓
---|---|---|---|---
Ds 1 | MSWOHHF | 76.82 | 78.20 | 87.23
Ds 1 | w/o MSWOM | 75.88 | 77.49 | 89.52
Ds 1 | w/o TAM | 74.46 | 75.87 | 90.23
Ds 1 | w/o CAFM | 75.64 | 77.56 | 88.80
Ds 1 | w/o FBAFM | 75.37 | 76.62 | 89.69
Ds 2 | MSWOHHF | 80.52 | 82.05 | 79.12
Ds 2 | w/o MSWOM | 79.72 | 81.16 | 80.76
Ds 2 | w/o TAM | 78.23 | 79.88 | 82.02
Ds 2 | w/o CAFM | 80.12 | 81.64 | 80.48
Ds 2 | w/o FBAFM | 78.82 | 80.86 | 81.72
Ds 3 | MSWOHHF | 82.31 | 83.73 | 74.74
Ds 3 | w/o MSWOM | 81.02 | 81.94 | 76.12
Ds 3 | w/o TAM | 80.46 | 81.22 | 77.13
Ds 3 | w/o CAFM | 81.56 | 82.74 | 75.98
Ds 3 | w/o FBAFM | 80.32 | 81.40 | 77.32
Share and Cite
Huang, J.; Chen, W.; Wang, F.; Zhang, H. Heterogeneous Hierarchical Fusion Network for Multimodal Sentiment Analysis in Real-World Environments. Electronics 2024, 13, 4137. https://doi.org/10.3390/electronics13204137