Tree-Based Mix-Order Polynomial Fusion Network for Multimodal Sentiment Analysis
Abstract
1. Introduction
- We propose the mixed-order polynomial tensor pooling (MOPTP) block, which adaptively activates the most discriminative sentiment properties across representation subspaces of varying orders. Compared with existing fixed-order methods, the mixed-order model effectively integrates multiple local sentiment properties into a more discriminative joint representation, yielding performance closer to the global optimum.
- We propose a novel tree-based multimodal sentiment learning architecture (TMOPFN) that allows multiple sentiment analysis strategies to be applied in parallel within the same network layer. Compared with existing sequential sentiment analysis models, this parallel framework naturally captures multi-level sentiment properties.
- We conduct extensive experiments on three public multimodal benchmarks to evaluate the proposed model. The empirical results demonstrate the effectiveness of both the mixed-order technique and the tree-based learning architecture. Note that sentiment can be regarded as a component of emotion, i.e., emotion recognition also covers sentiment analysis; we therefore evaluate the model on two sentiment analysis benchmarks and one emotion recognition benchmark.
2. Related Work
3. Preliminaries
4. Methodology
4.1. Mix-Order Polynomial Tensor Pooling (MOPTP)
Algorithm 1 Mix-Order Polynomial Tensor Pooling (MOPTP)
Input: text feature z_t, video feature z_v, audio feature z_a, the number of subspaces N, tensor rank R.
Output: the multimodal fusion representation x.
1: z ← concat(z_t, z_v, z_a)
2: for n = 1 to N do
3:   for p = 1 to n do
4:     for r = 1 to R do
5:       u_r ← W_{n,p,r} z
6:     end for
7:     u ← (u_1 + ⋯ + u_R)
8:     x_n ← x_n · u
9:   end for
10:   h_n ← x_n
11: end for
12: x ← (h_1 + ⋯ + h_N)
13: return x
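To make the pooling step concrete, the following is a minimal PyTorch sketch of a mixed-order polynomial pooling block in the spirit of Algorithm 1. The class name, tensor shapes, and the realization of the rank-R terms as factorized linear projections are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of mixed-order polynomial tensor pooling (illustrative only).
# Assumptions: per-sample 1-D modality features; subspace n uses polynomial
# order n, realized as a Hadamard product of rank-R projected terms.
import torch
import torch.nn as nn


class MOPTP(nn.Module):
    def __init__(self, in_dims, out_dim, num_subspaces=2, rank=4):
        super().__init__()
        self.num_subspaces = num_subspaces
        self.rank = rank
        self.out_dim = out_dim
        total_in = sum(in_dims)  # concatenated audio/video/text feature size
        # One rank-R projection per (subspace n, order p) pair.
        self.proj = nn.ModuleList([
            nn.ModuleList([nn.Linear(total_in, out_dim * rank) for _ in range(n + 1)])
            for n in range(num_subspaces)
        ])

    def forward(self, audio, video, text):
        z = torch.cat([audio, video, text], dim=-1)           # step 1: concat
        fused = 0.0
        for n in range(self.num_subspaces):                   # step 2: subspaces
            x_n = 1.0
            for p in range(n + 1):                            # step 3: orders
                u = self.proj[n][p](z)                        # steps 4-7: sum of rank-R terms
                u = u.view(-1, self.rank, self.out_dim).sum(dim=1)
                x_n = x_n * u                                 # step 8: Hadamard accumulation
            fused = fused + x_n                               # step 12: sum over subspaces
        return fused


# Usage with toy feature sizes (batch of 8 samples).
if __name__ == "__main__":
    block = MOPTP(in_dims=[5, 20, 300], out_dim=32, num_subspaces=3, rank=4)
    a, v, t = torch.randn(8, 5), torch.randn(8, 20), torch.randn(8, 300)
    print(block(a, v, t).shape)  # torch.Size([8, 32])
```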
4.2. Tree-Based Mix-Order Polynomial Fusion Network (TMOPFN)
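The tree-based network arranges several MOPTP nodes in parallel on the same layer and feeds their outputs to a parent node on the next layer. Below is a minimal structural sketch of that idea, reusing the illustrative MOPTP class from Section 4.1; the node grouping, the per-node subspace/rank settings, and the prediction head are assumptions for illustration, not the authors' exact architecture.

```python
# Structural sketch of a tree of fusion nodes (illustrative, not the released model).
# Leaf-layer nodes apply the MOPTP block above with different mixed-order settings
# in parallel; a parent node then fuses the children's outputs like three modalities.
import torch
import torch.nn as nn


class TMOPFNSketch(nn.Module):
    def __init__(self, in_dims, hidden_dim=32, out_dim=1):
        super().__init__()
        # Parallel children on the same layer (assumed settings).
        self.child_a = MOPTP(in_dims, hidden_dim, num_subspaces=1, rank=4)
        self.child_b = MOPTP(in_dims, hidden_dim, num_subspaces=2, rank=4)
        self.child_c = MOPTP(in_dims, hidden_dim, num_subspaces=3, rank=4)
        # Parent node fuses the three child representations.
        self.parent = MOPTP([hidden_dim] * 3, hidden_dim, num_subspaces=2, rank=4)
        self.head = nn.Linear(hidden_dim, out_dim)  # e.g., sentiment intensity

    def forward(self, audio, video, text):
        h_a = self.child_a(audio, video, text)
        h_b = self.child_b(audio, video, text)
        h_c = self.child_c(audio, video, text)
        return self.head(self.parent(h_a, h_b, h_c))


# Usage with the same toy feature sizes as before.
if __name__ == "__main__":
    model = TMOPFNSketch(in_dims=[5, 20, 300])
    a, v, t = torch.randn(8, 5), torch.randn(8, 20), torch.randn(8, 300)
    print(model(a, v, t).shape)  # torch.Size([8, 1])
```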
5. Experiments
5.1. Experiment Setups
5.2. Experimental Results
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Shoumy, N.J.; Ang, L.M.; Seng, K.P.; Rahaman, D.M.; Zia, T. Multimodal big data affective analytics: A comprehensive survey using text, audio, visual and physiological signals. J. Netw. Comput. Appl. 2020, 149, 102447. [Google Scholar] [CrossRef]
- Yu, Y.; Kim, Y.J. Attention-LSTM-attention model for speech emotion recognition and analysis of IEMOCAP database. Electronics 2020, 9, 713. [Google Scholar] [CrossRef]
- Rahman, W.; Hasan, M.K.; Lee, S.; Zadeh, A.; Mao, C.; Morency, L.P.; Hoque, E. Integrating multimodal information in large pretrained transformers. In Proceedings of the Conference of the Association for Computational Linguistics, Online, 6–8 July 2020; Volume 2020, p. 2359. Available online: https://aclanthology.org/2020.acl-main.214/ (accessed on 20 December 2022).
- Yadav, A.; Vishwakarma, D.K. Sentiment analysis using deep learning architectures: A review. Artif. Intell. Rev. 2020, 53, 4335–4385. [Google Scholar] [CrossRef]
- Peng, Y.; Qin, F.; Kong, W.; Ge, Y.; Nie, F.; Cichocki, A. GFIL: A unified framework for the importance analysis of features, frequency bands and channels in EEG-based emotion recognition. IEEE Trans. Cogn. Dev. Syst. 2021, 14, 935–947. [Google Scholar] [CrossRef]
- Lai, Z.; Wang, Y.; Feng, R.; Hu, X.; Xu, H. Multi-Feature Fusion Based Deepfake Face Forgery Video Detection. Systems 2022, 10, 31. [Google Scholar] [CrossRef]
- Shen, F.; Peng, Y.; Dai, G.; Lu, B.; Kong, W. Coupled Projection Transfer Metric Learning for Cross-Session Emotion Recognition from EEG. Systems 2022, 10, 47. [Google Scholar] [CrossRef]
- Zadeh, A.; Liang, P.P.; Mazumder, N.; Poria, S.; Cambria, E.; Morency, L.P. Memory fusion network for multi-view sequential learning. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
- Chandrasekaran, G.; Antoanela, N.; Andrei, G.; Monica, C.; Hemanth, J. Visual Sentiment Analysis Using Deep Learning Models with Social Media Data. Appl. Sci. 2022, 12, 1030. [Google Scholar] [CrossRef]
- Atmaja, B.T.; Sasou, A. Sentiment Analysis and Emotion Recognition from Speech Using Universal Speech Representations. Sensors 2022, 22, 6369. [Google Scholar] [CrossRef]
- Ma, F.; Zhang, W.; Li, Y.; Huang, S.L.; Zhang, L. Learning better representations for audio-visual emotion recognition with common information. Appl. Sci. 2020, 10, 7239. [Google Scholar] [CrossRef]
- Atmaja, B.T.; Sasou, A.; Akagi, M. Survey on bimodal speech emotion recognition from acoustic and linguistic information fusion. Speech Commun. 2022, 140, 11–28. [Google Scholar] [CrossRef]
- Liang, P.P.; Liu, Z.; Zadeh, A.B.; Morency, L.P. Multimodal language Analysis with recurrent multistage fusion. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 150–161. [Google Scholar]
- Boehm, K.M.; Khosravi, P.; Vanguri, R.; Gao, J.; Shah, S.P. Harnessing multimodal data integration to advance precision oncology. Nat. Rev. Cancer 2022, 22, 114–126. [Google Scholar] [CrossRef]
- Liang, P.P.; Lim, Y.C.; Tsai, Y.H.H.; Salakhutdinov, R.; Morency, L.P. Strong and simple baselines for multimodal utterance embeddings. arXiv 2019, arXiv:1906.02125. [Google Scholar]
- Bayoudh, K.; Knani, R.; Hamdaoui, F.; Mtibaa, A. A survey on deep multimodal learning for computer vision: Advances, trends, applications, and datasets. Vis. Comput. 2022, 38, 2939–2970. [Google Scholar] [CrossRef]
- Poria, S.; Cambria, E.; Bajpai, R.; Hussain, A. A review of affective computing: From unimodal analysis to multimodal fusion. Inf. Fusion 2017, 37, 98–125. [Google Scholar] [CrossRef]
- Sharma, K.; Giannakos, M. Multimodal data capabilities for learning: What can multimodal data tell us about learning? Br. J. Educ. Technol. 2020, 51, 1450–1484. [Google Scholar] [CrossRef]
- Zhang, J.; Yin, Z.; Chen, P.; Nichele, S. Emotion recognition using multi-modal data and machine learning techniques: A tutorial and review. Inf. Fusion 2020, 59, 103–126. [Google Scholar] [CrossRef]
- Mai, S.; Hu, H.; Xu, J.; Xing, S. Multi-fusion residual memory network for multimodal human sentiment comprehension. IEEE Trans. Affect. Comput. 2020, 13, 320–334. [Google Scholar] [CrossRef]
- Li, Q.; Gkoumas, D.; Lioma, C.; Melucci, M. Quantum-inspired multimodal fusion for video sentiment analysis. Inf. Fusion 2021, 65, 58–71. [Google Scholar] [CrossRef]
- Li, W.; Zhu, L.; Shi, Y.; Guo, K.; Cambria, E. User reviews: Sentiment analysis using lexicon integrated two-channel CNN–LSTM family models. Appl. Soft Comput. 2020, 94, 106435. [Google Scholar] [CrossRef]
- Chen, L.; Li, S.; Bai, Q.; Yang, J.; Jiang, S.; Miao, Y. Review of image classification algorithms based on convolutional neural networks. Remote Sens. 2021, 13, 4712. [Google Scholar] [CrossRef]
- Zadeh, A.; Chen, M.; Poria, S.; Cambria, E.; Morency, L.P. Tensor fusion network for multimodal sentiment analysis. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 9–11 September 2017; pp. 1103–1114. [Google Scholar]
- Zhang, Y.; Cheng, C.; Wang, S.; Xia, T. Emotion recognition using heterogeneous convolutional neural networks combined with multimodal factorized bilinear pooling. Biomed. Signal Process. Control 2022, 77, 103877. [Google Scholar] [CrossRef]
- Wang, J.; Ji, Y.; Sun, J.; Yang, Y.; Sakai, T. MIRTT: Learning Multimodal Interaction Representations from Trilinear Transformers for Visual Question Answering. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2021, Punta Cana, Dominican Republic, 16–20 November 2021; pp. 2280–2292. [Google Scholar]
- Liu, Z.; Shen, Y.; Lakshminarasimhan, V.B.; Liang, P.P.; Zadeh, A.B.; Morency, L.P. Efficient low-rank multimodal fusion with modality-specific factors. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 15–20 July 2018; pp. 2247–2256. [Google Scholar]
- Choi, D.Y.; Kim, D.H.; Song, B.C. Multimodal attention network for continuous-time emotion recognition using video and EEG signals. IEEE Access 2020, 8, 203814–203826. [Google Scholar] [CrossRef]
- Hou, M.; Tang, J.; Zhang, J.; Kong, W.; Zhao, Q. Deep multimodal multilinear fusion with high-order polynomial pooling. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 12136–12145. [Google Scholar]
- Huan, R.H.; Shu, J.; Bao, S.L.; Liang, R.H.; Chen, P.; Chi, K.K. Video multimodal emotion recognition based on Bi-GRU and attention fusion. Multimed. Tools Appl. 2021, 80, 8213–8240. [Google Scholar] [CrossRef]
- Van Houdt, G.; Mosquera, C.; Nápoles, G. A review on the long short-term memory model. Artif. Intell. Rev. 2020, 53, 5929–5955. [Google Scholar] [CrossRef]
- Khalid, H.; Gorji, A.; Bourdoux, A.; Pollin, S.; Sahli, H. Multi-view CNN-LSTM architecture for radar-based human activity recognition. IEEE Access 2022, 10, 24509–24519. [Google Scholar] [CrossRef]
- Poria, S.; Cambria, E.; Hazarika, D.; Majumder, N.; Zadeh, A.; Morency, L.P. Context-dependent sentiment analysis in user-generated videos. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017; pp. 873–883. [Google Scholar]
- Zadeh, A.; Liang, P.P.; Poria, S.; Vij, P.; Cambria, E.; Morency, L.P. Multi-attention recurrent network for human communication comprehension. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
- Tsai, Y.H.; Liang, P.P.; Zadeh, A.; Morency, L.; Salakhutdinov, R. Learning Factorized Multimodal Representations. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
- Sahay, S.; Okur, E.; Kumar, S.H.; Nachman, L. Low Rank Fusion based Transformers for Multimodal Sequences. arXiv 2020, arXiv:2007.02038. [Google Scholar]
- Huang, F.; Wei, K.; Weng, J.; Li, Z. Attention-based modality-gated networks for image-text sentiment analysis. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 2020, 16, 1–19. [Google Scholar] [CrossRef]
- Mai, S.; Xing, S.; He, J.; Zeng, Y.; Hu, H. Analyzing Unaligned Multimodal Sequence via Graph Convolution and Graph Pooling Fusion. arXiv 2020, arXiv:2011.13572. [Google Scholar]
- Yang, J.; Wang, Y.; Yi, R.; Zhu, Y.; Rehman, A.; Zadeh, A.; Poria, S.; Morency, L.P. MTGAT: Multimodal Temporal Graph Attention Networks for Unaligned Human Multimodal Language Sequences. arXiv 2020, arXiv:2010.11985. [Google Scholar]
- Chen, J.; Zhang, A. HGMF: Heterogeneous Graph-based Fusion for Multimodal Data with Incompleteness. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual Event, CA, USA, 6–10 July 2020; pp. 1295–1305. [Google Scholar]
- Hong, D.; Kolda, T.G.; Duersch, J.A. Generalized canonical polyadic tensor decomposition. SIAM Rev. 2020, 62, 133–163. [Google Scholar] [CrossRef]
- Little, A.; Xie, Y.; Sun, Q. An analysis of classical multidimensional scaling with applications to clustering. Inf. Inference J. IMA 2022. [Google Scholar] [CrossRef]
- Reyes, J.A.; Stoudenmire, E.M. Multi-scale tensor network architecture for machine learning. Mach. Learn. Sci. Technol. 2021, 2, 035036. [Google Scholar] [CrossRef]
- Phan, A.H.; Cichocki, A.; Uschmajew, A.; Tichavskỳ, P.; Luta, G.; Mandic, D.P. Tensor networks for latent variable analysis: Novel algorithms for tensor train approximation. IEEE Trans. Neural Netw. Learn. Syst. 2020, 31, 4622–4636. [Google Scholar] [CrossRef] [PubMed]
- Asante-Mensah, M.G.; Ahmadi-Asl, S.; Cichocki, A. Matrix and tensor completion using tensor ring decomposition with sparse representation. Mach. Learn. Sci. Technol. 2021, 2, 035008. [Google Scholar] [CrossRef]
- Zhao, M.; Li, W.; Li, L.; Ma, P.; Cai, Z.; Tao, R. Three-order tensor creation and tucker decomposition for infrared small-target detection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–16. [Google Scholar] [CrossRef]
- Zadeh, A.; Zellers, R.; Pincus, E.; Morency, L.P. MOSI: Multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. arXiv 2016, arXiv:1606.06259. [Google Scholar]
- Zadeh, A.B.; Liang, P.P.; Poria, S.; Cambria, E.; Morency, L.P. Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 15–20 July 2018; Volume 1, Long Papers. pp. 2236–2246. [Google Scholar]
- Zhang, H. The Prosody of Fluent Repetitions in Spontaneous Speech. In Proceedings of the 10th International Conference on Speech Prosody 2020, Hong Kong, China, 25–28 May 2020; pp. 759–763. [Google Scholar]
- Kamyab, M.; Liu, G.; Adjeisah, M. Attention-based CNN and Bi-LSTM model based on TF-IDF and glove word embedding for sentiment analysis. Appl. Sci. 2021, 11, 11255. [Google Scholar] [CrossRef]
- Khalane, A.; Shaikh, T. Context-Aware Multimodal Emotion Recognition. In Proceedings of the International Conference on Information Technology and Applications; Springer: Berlin/Heidelberg, Germany, 2022; pp. 51–61. [Google Scholar]
- Melinte, D.O.; Vladareanu, L. Facial expressions recognition for human–robot interaction using deep convolutional neural networks with rectified adam optimizer. Sensors 2020, 20, 2393. [Google Scholar] [CrossRef]
- Hashemi, A.; Dowlatshahi, M.B. MLCR: A fast multi-label feature selection method based on K-means and L2-norm. In Proceedings of the 2020 25th International Computer Conference, Computer Society of Iran (CSICC), Tehran, Iran, 1–2 January 2020; pp. 1–7. [Google Scholar]
- Xia, H.; Yang, Y.; Pan, X.; Zhang, Z.; An, W. Sentiment analysis for online reviews using conditional random fields and support vector machines. Electron. Commer. Res. 2020, 20, 343–360. [Google Scholar] [CrossRef]
- Zhang, C.; Yang, Z.; He, X.; Deng, L. Multimodal intelligence: Representation learning, information fusion, and applications. IEEE J. Sel. Top. Signal Process. 2020, 14, 478–493. [Google Scholar] [CrossRef]
- Lian, Z.; Liu, B.; Tao, J. CTNet: Conversational Transformer Network for Emotion Recognition. IEEE/ACM Trans. Audio, Speech, Lang. Process. 2021, 29, 985–1000. [Google Scholar] [CrossRef]
Model | Description of Layer-Wise Configuration |
---|---|
MOPFN-L1 | [] |
MOPFN-L2 | [, , ] – [] |
MOPFN-L2-S1 | [] – [] |
MOPFN-L2-S2 | [, , ] – [] |
MOPFN-L3 | [, , ] – [, , ] – [] |
TMOPFN-L3-S1 | [] – [, , , ] – [] |
TMOPFN-L3-S2 | [, , ] – [] – [] |
TMOPFN-L3-S3 | [, , ] – [, , ] – [] |
MOPFN-L4 | [, , ] – [, , ] – [, , ] – [] |
Models | Param |
---|---|
TFN [non-temporal] | |
LMF [non-temporal] | |
MOPTP [temporal] | |
TMOPFN (L layers) [temporal] |
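For a rough comparison with the parameter counts reported in the tables, the trainable parameters of the illustrative sketches above can be counted directly. The snippet below assumes the MOPTP class (and its torch imports) from Section 4.1 is in scope; the printed figure describes the sketch only, not the published model.

```python
# Count trainable parameters of the illustrative MOPTP sketch (not the paper's model).
def count_parameters(module):
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

block = MOPTP(in_dims=[5, 20, 300], out_dim=32, num_subspaces=3, rank=4)
print(f"MOPTP sketch: {count_parameters(block) / 1e6:.3f} M trainable parameters")
```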
Models | Input | Parameters (M) | CMU-MOSI: MAE | CMU-MOSI: Corr | CMU-MOSI: Acc-2 | CMU-MOSI: F1 | CMU-MOSI: Acc-7 | IEMOCAP: F1-Happy | IEMOCAP: F1-Sad | IEMOCAP: F1-Angry | IEMOCAP: F1-Neutral | CMU-MOSEI: MAE | CMU-MOSEI: Corr | CMU-MOSEI: Acc-2 | CMU-MOSEI: F1 | CMU-MOSEI: Acc-7 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SVM [54] | (A+V+T) | - | 1.864 | 0.057 | 50.2 | 50.1 | 17.5 | 81.5 | 78.8 | 82.4 | 64.9 | 0.77 | 0.46 | 73.9 | 73.6 | 39.9 |
DF [55] | (A+V+T) | - | 1.143 | 0.518 | 72.3 | 72.1 | 26.8 | 81.0 | 81.2 | 65.4 | 44.0 | 0.72 | 0.51 | 74.0 | 72.5 | 43.5 |
BC-LSTM [33] | (A+V+T) | 1.7 M | 1.079 | 0.581 | 73.9 | 73.9 | 28.7 | 81.7 | 81.7 | 84.2 | 64.1 | 0.72 | 0.51 | 75.8 | 75.5 | 44.6 |
MV-LSTM [32] | (A+V+T) | - | 1.019 | 0.601 | 73.9 | 74.0 | 33.2 | 81.3 | 74.0 | 84.3 | 66.7 | 0.72 | 0.52 | 76.4 | 76.4 | 43.5 |
MARN [34] | (A+V+T) | - | 0.968 | 0.625 | 77.1 | 77.0 | 34.7 | 83.6 | 81.2 | 84.2 | 65.9 | 0.73 | 0.51 | 75.9 | 75.8 | 43.2 |
MFN [8] | (A+V+T) | 0.5 M | 0.965 | 0.632 | 77.4 | 77.3 | 34.1 | 84.0 | 82.1 | 83.7 | 69.2 | 0.72 | 0.52 | 76.0 | 76.0 | 43.2 |
TFN [24] | (A+V+T) | 12.5 M | 0.970 | 0.633 | 73.9 | 73.4 | 32.1 | 83.6 | 82.8 | 84.2 | 65.4 | 0.72 | 0.52 | 74.8 | 75.4 | 44.7 |
LMF [27] | (A+V+T) | 1.1 M | 0.912 | 0.668 | 76.4 | 75.7 | 32.8 | 85.8 | 85.9 | 89.0 | 71.7 | - | - | - | - | - |
MFM [35] | (A+V+T) | - | 0.951 | 0.662 | 78.1 | 78.1 | 36.2 | 85.8 | 86.1 | 86.7 | 68.1 | - | - | - | - | - |
LMF-MulT [36] | (A+V+T) | - | 1.016 | 0.647 | 77.9 | 77.9 | 32.4 | 84.1 | 83.4 | 86.2 | 70.8 | - | - | - | - | - |
HPFN-L1, 1 subspace (P = [8]) | (A+V+T) | 0.09 M | 0.968 | 0.648 | 77.2 | 77.2 | 36.9 | 85.7 | 86.5 | 87.9 | 71.8 | 0.71 | 0.53 | 75 | 75 | 45.2 |
TMOPFN-L1, 2 subspaces (P = [1, 2]) | (A+V+T) | 0.09 M | 0.938 | 0.678 | 79.6 | 79.6 | 37.9 | 86.0 | 86.6 | 88.6 | 72.5 | 0.71 | 0.55 | 75.3 | 75.5 | 45.4 |
TMOPFN-L2, 2 subspaces (P = [1, 2]) | (A+V+T) | 0.11 M | 0.943 | 0.659 | 78.6 | 78.7 | 38.3 | 87.4 | 86.8 | 90.2 | 72.6 | 0.71 | 0.55 | 75.9 | 75.9 | 45.3 |
TMOPFN-L3-S3, 2 subspaces (P = [1, 2]) | (A+V+T) | 0.12 M | 0.949 | 0.652 | 77.6 | 77.7 | 37.3 | 85.8 | 87.4 | 88.8 | 73.1 | 0.71 | 0.55 | 75.9 | 75.7 | 45.3 |
HPFN (previous-version) [29] | (A+V+T) | 0.11 M | 0.945 | 0.672 | 77.5 | 77.4 | 36.9 | 86.2 | 86.6 | 88.8 | 72.5 | - | - | - | - | - |
TMOPFN | (A+V+T) | 0.12 M | 0.908 | 0.678 | 79.6 | 79.6 | 39.4 | 88.2 | 87.4 | 90.2 | 73.9 | 0.70 | 0.55 | 76.1 | 76.1 | 45.6 |
Models | Input | CMU-MOSI: MAE | CMU-MOSI: Corr | CMU-MOSI: Acc-2 | CMU-MOSI: F1 | CMU-MOSI: Acc-7 | IEMOCAP: F1-Happy | IEMOCAP: F1-Sad | IEMOCAP: F1-Angry | IEMOCAP: F1-Neutral | CMU-MOSEI: MAE | CMU-MOSEI: Corr | CMU-MOSEI: Acc-2 | CMU-MOSEI: F1 | CMU-MOSEI: Acc-7 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TMOPFN | (A) | 1.373 | 0.296 | 59.5 | 59.5 | 20.6 | 68.2 | 67.9 | 71.2 | 64.4 | 0.99 | 0.30 | 61.2 | 61.2 | 22.6 |
TMOPFN | (V) | 1.348 | 0.227 | 60.3 | 60.1 | 22.0 | 67.7 | 66.5 | 70.3 | 63.9 | 1.09 | 0.28 | 60.1 | 60.4 | 22.2 |
TMOPFN | (T) | 0.996 | 0.594 | 73.9 | 73.6 | 33.5 | 78.1 | 79.3 | 81.9 | 68.7 | 0.81 | 0.42 | 71.8 | 71.9 | 36.6 |
TMOPFN | (A+V) | 1.169 | 0.305 | 60.9 | 60.6 | 23.7 | 69.9 | 68.8 | 72.7 | 66.3 | 0.91 | 0.33 | 63.4 | 63.9 | 25.6 |
TMOPFN | (A+T) | 0.982 | 0.612 | 74.4 | 74.8 | 34.6 | 80.7 | 81.2 | 82.4 | 70.7 | 0.78 | 0.45 | 72.6 | 72.8 | 38.2 |
TMOPFN | (V+T) | 0.990 | 0.606 | 74.1 | 74.2 | 34.0 | 80.1 | 81.0 | 82.2 | 69.8 | 0.80 | 0.44 | 72.3 | 72.1 | 37.9 |
TMOPFN | (A+V+T) | 0.908 | 0.678 | 79.6 | 79.6 | 39.4 | 88.2 | 87.4 | 90.2 | 73.9 | 0.70 | 0.55 | 76.1 | 76.1 | 45.6 |
Models | Input | CMU-MOSI: MAE | CMU-MOSI: Corr | CMU-MOSI: Acc-2 | CMU-MOSI: F1 | CMU-MOSI: Acc-7 |
---|---|---|---|---|---|---|
Multimodal-Graph [38] | (A+V+T) | 0.923 | 0.680 | 80.1 | 80.0 | 31.9 |
LMF-MulT [36] | (A+V+T) | 0.941 | 0.671 | 78.1 | 78.1 | 34.2 |
TMOPFN | (A+V+T) | 0.908 | 0.678 | 79.6 | 79.6 | 39.4 |
Models | Input | IEMOCAP: F1-Happy | IEMOCAP: F1-Sad | IEMOCAP: F1-Angry | IEMOCAP: F1-Neutral |
---|---|---|---|---|---|
HGMF [40] | (A+V+T) | 88.1 | 84.7 | 87.9 | - |
MTGAT [39] | (A+V+T) | 87.8 | 86.5 | 87.1 | 72.9 |
CTNet [56] | (A+V+T) | 83.0 | 86.3 | 80.2 | 83.9 |
LMF-MulT [36] | (A+V+T) | 84.2 | 83.8 | 86.1 | 71.2 |
TMOPFN | (A+V+T) | 88.2 | 87.4 | 90.2 | 73.9 |
Metric | CMU-MOSI | IEMOCAP: Happy | IEMOCAP: Sad | IEMOCAP: Angry | IEMOCAP: Neutral | CMU-MOSEI |
---|---|---|---|---|---|---|
UA | 79.9 | 88.13 | 87.32 | 90.45 | 73.18 | 75.6 |
WA | 77.4 | 87.29 | 86.14 | 88.84 | 71.63 | 73.2 |
Models | Value | CMU-MOSI: MAE | CMU-MOSI: Corr | CMU-MOSI: Acc-2 | CMU-MOSI: F1 | CMU-MOSI: Acc-7 | IEMOCAP: F1-Happy | IEMOCAP: F1-Sad | IEMOCAP: F1-Angry | IEMOCAP: F1-Neutral | CMU-MOSEI: MAE | CMU-MOSEI: Corr | CMU-MOSEI: Acc-2 | CMU-MOSEI: F1 | CMU-MOSEI: Acc-7 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TMOPFN | mean | 0.908 | 0.678 | 79.6 | 79.6 | 39.4 | 88.2 | 87.4 | 90.2 | 73.9 | 0.70 | 0.55 | 76.1 | 76.1 | 45.6 |
TMOPFN | variance | ||||||||||||||
TMOPFN | standard deviation |
Models | IEMOCAP: F1-Happy | IEMOCAP: F1-Sad | IEMOCAP: F1-Angry | IEMOCAP: F1-Neutral | CMU-MOSI: MAE | CMU-MOSI: Corr | CMU-MOSI: Acc-2 | CMU-MOSI: F1 | CMU-MOSI: Acc-7 | CMU-MOSEI: MAE | CMU-MOSEI: Corr | CMU-MOSEI: Acc-2 | CMU-MOSEI: F1 | CMU-MOSEI: Acc-7 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
HPFN-L1, P = [2] | 85.7 | 86.2 | 87.8 | 71.9 | 0.973 | 0.635 | 77.1 | 77.0 | 35.9 | 0.71 | 0.54 | 75.0 | 75.3 | 45.2 |
HPFN-L2, P = [2, 2] | 86.2 | 86.6 | 88.8 | 72.5 | 0.958 | 0.652 | 77.1 | 77.1 | 36.3 | 0.71 | 0.54 | 75.4 | 75.5 | 45.5 |
HPFN-L2-S1, P = [2, 2] | 86.2 | 86.7 | 88.9 | 72.6 | 0.959 | 0.654 | 77.3 | 77.2 | 36.5 | 0.71 | 0.54 | 75.1 | 75.3 | 45.3 |
HPFN-L2-S2, P = [2, 2] | 86.2 | 86.7 | 89.0 | 72.7 | 0.957 | 0.656 | 77.3 | 77.3 | 36.5 | 0.70 | 0.55 | 75.3 | 75.4 | 45.6 |
HPFN-L3, P = [2, 2, 1] | 86.1 | 86.8 | 88.3 | 72.7 | 0.960 | 0.651 | 76.8 | 76.8 | 36.0 | 0.71 | 0.55 | 75.5 | 75.3 | 45.5 |
HPFN-L4, P = [2, 2, 2, 1] | 85.8 | 86.4 | 88.1 | 72.5 | 0.992 | 0.634 | 76.6 | 76.5 | 34.6 | 0.71 | 0.55 | 75.3 | 75.1 | 45.3 |
TMOPFN-L3-S1 | 86.6 | 86.4 | 88.4 | 72.1 | 0.960 | 0.641 | 76.1 | 76.1 | 36.4 | 0.71 | 0.54 | 75.0 | 75.3 | 45.2 |
TMOPFN-L3-S2 | 85.8 | 86.2 | 88.8 | 72.7 | 0.968 | 0.648 | 76.0 | 76.0 | 35.4 | 0.71 | 0.54 | 75.0 | 75.2 | 45.1 |
TMOPFN-L3-S3 | 85.8 | 87.4 | 88.8 | 73.1 | 0.949 | 0.652 | 77.6 | 77.7 | 37.3 | 0.71 | 0.54 | 75.1 | 75.4 | 45.5 |
Models | Parameters (M) | Input | CMU-MOSI: MAE | CMU-MOSI: Corr | CMU-MOSI: Acc-2 | CMU-MOSI: F1 | CMU-MOSI: Acc-7 | IEMOCAP: F1-Happy | IEMOCAP: F1-Sad | IEMOCAP: F1-Angry | IEMOCAP: F1-Neutral |
---|---|---|---|---|---|---|---|---|---|---|---|
2D-DenseNet | 6.97 M | (A+V+T) | 1.090 | 0.573 | 74.9 | 74.9 | 26.8 | 84.6 | 85.6 | 87.5 | 70.3 |
3D-DenseNet | 6.97 M | (A+V+T) | 1.054 | 0.630 | 75.9 | 76.0 | 29.0 | 82.5 | 84.4 | 88.3 | 67.7 |
HPFN (previous-version) [29] | 0.11 M | (A+V+T) | 0.945 | 0.672 | 77.5 | 77.4 | 36.9 | 86.2 | 86.6 | 88.8 | 72.5 |
TMOPFN | 0.12 M | (A+V+T) | 0.908 | 0.678 | 79.6 | 79.6 | 39.4 | 88.2 | 87.4 | 90.2 | 73.9 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).