Multimodal Sentiment Analysis in Realistic Environments Based on Cross-Modal Hierarchical Fusion Network
Abstract
1. Introduction
2. Related Work
3. Methodology
3.1. Multimodal Word-Refinement Module (MWRM)
3.2. Unimodal Feature Extraction
3.3. Cross-Modal Attention (CMA)
3.4. Trimodal Fusion
3.5. Regression Analysis
4. Experiments
4.1. Dataset
4.2. Implementation Details
4.3. Baseline
4.4. Results and Analysis
4.5. Ablation Study
4.5.1. Modal Ablation
4.5.2. Model Ablation
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
Datasets | Percentages |
---|---|
MOSI-SpeechBrain | 26.5% |
MOSI-IBM | 17.6% |
MOSI-iFlytek | 10.6% |
Datasets | Models | Has0-Acc ↑ | Has0-F1 ↑ | Non0-Acc ↑ | Non0-F1 ↑ | MAE ↓ | Corr ↑ |
---|---|---|---|---|---|---|---|
MOSI-SpeechBrain | TFN | 68.98 | 68.95 | 69.51 | 69.57 | 115.55 | 48.54 |
MOSI-SpeechBrain | LMF | 68.86 | 68.88 | 69.36 | 69.48 | 117.42 | 48.66 |
MOSI-SpeechBrain | MulT | 71.78 | 71.70 | 72.74 | 72.75 | 109.00 | 54.69 |
MOSI-SpeechBrain | MISA | 73.79 | 73.85 | 74.75 | 74.66 | 98.52 | 65.37 |
MOSI-SpeechBrain | Self-MM | 73.67 | 73.72 | 74.85 | 74.98 | 90.95 | 67.23 |
MOSI-SpeechBrain | SWRM | 74.58 | 74.62 | 75.70 | 75.82 | 90.56 | 67.47 |
MOSI-SpeechBrain | Ours | 76.53 | 76.43 | 78.05 | 78.03 | 88.98 | 68.27 |
MOSI-IBM | TFN | 71.81 | 71.78 | 72.13 | 73.21 | 109.42 | 58.19 |
MOSI-IBM | LMF | 73.06 | 73.09 | 74.30 | 74.41 | 104.70 | 59.07 |
MOSI-IBM | MulT | 75.57 | 75.54 | 76.74 | 76.79 | 100.32 | 64.34 |
MOSI-IBM | MISA | 76.97 | 76.99 | 78.08 | 78.17 | 91.23 | 71.30 |
MOSI-IBM | Self-MM | 77.32 | 77.37 | 78.60 | 78.72 | 85.65 | 73.23 |
MOSI-IBM | SWRM | 78.43 | 78.47 | 79.70 | 79.80 | 82.91 | 73.91 |
MOSI-IBM | Ours | 80.17 | 80.15 | 81.86 | 81.90 | 79.61 | 75.32 |
MOSI-iFlytek | TFN | 71.95 | 72.01 | 72.62 | 72.76 | 107.01 | 56.52 |
MOSI-iFlytek | LMF | 71.98 | 72.03 | 72.35 | 72.49 | 106.63 | 59.48 |
MOSI-iFlytek | MulT | 77.32 | 77.05 | 78.75 | 78.56 | 89.84 | 68.14 |
MOSI-iFlytek | MISA | 79.59 | 79.62 | 79.82 | 79.91 | 85.63 | 74.53 |
MOSI-iFlytek | Self-MM | 80.26 | 80.26 | 81.16 | 81.20 | 78.79 | 75.83 |
MOSI-iFlytek | SWRM | 80.47 | 80.47 | 81.28 | 81.34 | 78.39 | 75.97 |
MOSI-iFlytek | Ours | 81.92 | 81.93 | 82.93 | 82.98 | 75.25 | 76.31 |
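As a reading aid, the Python sketch below shows one common way the metrics reported above are computed for CMU-MOSI-style sentiment regression outputs in [-3, 3]. It follows the evaluation convention popularized by the Self-MM/MMSA codebase rather than the authors' released implementation, so the Has0/Non0 class splits, the weighted F1, and the ×100 scaling of MAE and Corr should be read as assumptions inferred from the reported values.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import f1_score


def mosi_metrics(preds, labels):
    """Evaluation metrics for sentiment regression scores in [-3, 3].

    Assumed convention (not the authors' released code):
      Has0-*: negative (< 0) vs. non-negative (>= 0), zero-labelled samples kept.
      Non0-*: zero-labelled samples dropped, strictly positive vs. strictly negative.
    """
    preds = np.asarray(preds, dtype=float)
    labels = np.asarray(labels, dtype=float)

    # Binary accuracy / weighted F1 with zero labels included (Has0-*).
    has0_acc = np.mean((preds >= 0) == (labels >= 0))
    has0_f1 = f1_score(labels >= 0, preds >= 0, average="weighted")

    # Binary accuracy / weighted F1 with zero labels excluded (Non0-*).
    nz = labels != 0
    non0_acc = np.mean((preds[nz] > 0) == (labels[nz] > 0))
    non0_f1 = f1_score(labels[nz] > 0, preds[nz] > 0, average="weighted")

    # Mean absolute error and Pearson correlation on the raw regression scores.
    mae = np.mean(np.abs(preds - labels))
    corr, _ = pearsonr(preds, labels)

    # The tables above appear to report every value scaled by 100.
    return {k: 100 * v for k, v in {
        "Has0-Acc": has0_acc, "Has0-F1": has0_f1,
        "Non0-Acc": non0_acc, "Non0-F1": non0_f1,
        "MAE": mae, "Corr": corr,
    }.items()}
```

Passing a model's test-set predictions together with the gold labels to this function would reproduce one row of the table under the stated assumptions.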
Tasks | Has0-Acc ↑ | Non0-Acc ↑ | MAE ↓ |
---|---|---|---|
T | 73.32 | 75.09 | 91.22 |
A | 55.69 | 56.10 | 137.88 |
V | 55.25 | 53.66 | 142.20 |
V + T | 73.91 | 75.30 | 89.03 |
T + A | 73.82 | 74.70 | 90.41 |
V + A | 56.04 | 56.89 | 136.34 |
T + V + A | 75.07 | 76.37 | 91.10 |
TVA + T + V + A | 76.53 | 78.05 | 88.89 |
Datasets | Models | Has0-Acc ↑ | Non0-Acc ↑ | MAE ↓ |
---|---|---|---|---|
MOSI-SpeechBrain | MWRCMH | 76.53 | 78.05 | 88.98 |
MOSI-SpeechBrain | w/o Fusion | 75.66 | 77.29 | 89.76 |
MOSI-SpeechBrain | w/o MWRM | 75.37 | 76.62 | 89.69 |
MOSI-IBM | MWRCMH | 80.17 | 81.86 | 79.61 |
MOSI-IBM | w/o Fusion | 79.74 | 81.11 | 80.75 |
MOSI-IBM | w/o MWRM | 79.61 | 81.01 | 81.02 |
MOSI-iFlytek | MWRCMH | 81.92 | 82.93 | 75.25 |
MOSI-iFlytek | w/o Fusion | 80.61 | 81.86 | 76.16 |
MOSI-iFlytek | w/o MWRM | 80.66 | 81.25 | 76.81 |