Feature Extraction Network with Attention Mechanism for Data Enhancement and Recombination Fusion for Multimodal Sentiment Analysis
Abstract
1. Introduction
2. Related Work
2.1. Multimodal Sentiment Analysis
2.2. Recurrent Neural Network
3. Proposed Approach
3.1. Context-Unimodal Information Extraction Module
3.2. Information Enhancement and Data Reorganization Module
3.2.1. Information Enhancement (IE) Block
3.2.2. Data Reorganization (DR) Block
3.3. Sentiment Analysis Layer
Algorithm 1: Multimodal Sequence Feature Extraction Network (MSFE)
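As a rough, assumption-heavy illustration of how the modules named in Section 3 could fit together (context-unimodal extraction, an attention-based information-enhancement step, a data-reorganization step across modalities, and a final sentiment layer), the following PyTorch skeleton is a sketch only; every layer type, feature dimension, and connection in it is a guess for illustration, not the published MSFE architecture.

```python
# Illustrative sketch only: NOT the authors' published MSFE architecture.
# Feature sizes assume GloVe-style text (300-d), COVAREP-style audio (74-d),
# and facial features (35-d); all of these are assumptions.
import torch
import torch.nn as nn

class MSFESketch(nn.Module):
    def __init__(self, d_l=300, d_a=74, d_v=35, hidden=128):
        super().__init__()
        dims = {"l": d_l, "a": d_a, "v": d_v}
        # Sec. 3.1 Context-unimodal extraction: one recurrent encoder per
        # modality (a bidirectional GRU is assumed here).
        self.encoders = nn.ModuleDict(
            {m: nn.GRU(d, hidden, batch_first=True, bidirectional=True)
             for m, d in dims.items()})
        # Sec. 3.2.1 Information Enhancement: assumed self-attention over
        # the utterance sequence of each modality.
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads=4,
                                          batch_first=True)
        # Sec. 3.2.2 Data Reorganization: assumed projection of the
        # concatenated (recombined) modality features.
        self.recombine = nn.Linear(3 * 2 * hidden, 2 * hidden)
        # Sec. 3.3 Sentiment analysis layer: binary sentiment head.
        self.sentiment = nn.Linear(2 * hidden, 1)

    def forward(self, l, a, v):
        feats = []
        for m, x in (("l", l), ("a", a), ("v", v)):
            h, _ = self.encoders[m](x)   # (B, T, 2*hidden)
            h, _ = self.attn(h, h, h)    # enhance with self-attention
            feats.append(h.mean(dim=1))  # pool over time
        fused = torch.relu(self.recombine(torch.cat(feats, dim=-1)))
        return torch.sigmoid(self.sentiment(fused))

# Usage with dummy utterance sequences of length 20:
model = MSFESketch()
score = model(torch.randn(2, 20, 300), torch.randn(2, 20, 74),
              torch.randn(2, 20, 35))  # (2, 1) sentiment probabilities
```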
4. Experiments
4.1. Datasets
4.2. Baselines
4.3. Experimental Settings
5. Results and Discussion
5.1. Results
5.2. Qualitative Analysis
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Korobiichuk, I.; Syerov, Y.; Fedushko, S. The Method of Semantic Structuring of Virtual Community Content. In Mechatronics 2019: Recent Advances Towards Industry 4.0; Advances in Intelligent Systems and Computing; Springer: Cham, Switzerland, 2020; Volume 1044, pp. 11–18.
- Cambria, E.; Hazarika, D.; Poria, S.; Hussain, A.; Subramanyam, R.B.V. Benchmarking Multimodal Sentiment Analysis. In Computational Linguistics and Intelligent Text Processing; Springer: Cham, Switzerland, 2018; Volume 10762, pp. 166–179.
- Reiter, E.; Dale, R. Building Natural Language Generation Systems, 1st ed.; Cambridge University Press: Cambridge, UK, 2000.
- De Mulder, W.; Bethard, S.; Moens, M.-F. A survey on the application of recurrent neural networks to statistical language modeling. Comput. Speech Lang. 2015, 30, 61–98.
- Harabagiu, S.M.; Pașca, M.A.; Maiorano, S.J. Experiments with Open-Domain Textual Question Answering. In Proceedings of the 18th International Conference on Computational Linguistics (COLING 2000), Volume 2; pp. 292–298.
- Strzalkowski, T.; Harabagiu, S. Advances in Open Domain Question Answering. Comput. Linguist. 2007, 33, 597–599.
- Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv 2015, arXiv:1409.0473.
- Dodge, J.; Gane, A.; Zhang, X.; Bordes, A.; Chopra, S.; Miller, A.H.; Szlam, A.; Weston, J. Evaluating Prerequisite Qualities for Learning End-to-End Dialog Systems. arXiv 2016, arXiv:1511.06931.
- Li, J.; Monroe, W.; Ritter, A.; Galley, M.; Gao, J.; Jurafsky, D. Deep Reinforcement Learning for Dialogue Generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016; pp. 1192–1202.
- Li, H.; Lin, Z.; Shen, X.; Brandt, J.; Hua, G. A Convolutional Neural Network Cascade for Face Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 5325–5334.
- Jiang, H.; Learned-Miller, E. Face Detection with the Faster R-CNN. In Proceedings of the 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), Washington, DC, USA, 30 May–3 June 2017; pp. 650–657.
- Badshah, A.M.; Ahmad, J.; Rahim, N.; Baik, S.W. Speech Emotion Recognition from Spectrograms with Deep Convolutional Neural Network. In Proceedings of the 2017 International Conference on Platform Technology and Service (PlatCon 2017), Busan, Korea, 13–15 February 2017.
- Thongtan, T.; Phienthrakul, T. Sentiment Classification Using Document Embeddings Trained with Cosine Similarity. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, Florence, Italy, 28 July–2 August 2019; pp. 407–414.
- Pham, H.; Manzini, T.; Liang, P.P.; Poczos, B. Seq2Seq2Sentiment: Multimodal Sequence to Sequence Models for Sentiment Analysis. arXiv 2018, arXiv:1807.03915.
- Akhtar, M.S.; Chauhan, D.S.; Ghosal, D.; Poria, S.; Ekbal, A.; Bhattacharyya, P. Multi-task Learning for Multi-modal Emotion Recognition and Sentiment Analysis. In Proceedings of NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019; pp. 370–379.
- Huang, Y.; Wang, W.; Wang, L.; Tan, T. Multi-task Deep Neural Network for Multi-label Learning. In Proceedings of the 2013 IEEE International Conference on Image Processing (ICIP 2013), Melbourne, Australia, 15–18 September 2013; pp. 2897–2900.
- Yoon, S.; Byun, S.; Jung, K. Multimodal Speech Emotion Recognition Using Audio and Text. In Proceedings of the 2018 IEEE Spoken Language Technology Workshop (SLT 2018), Athens, Greece, 18–21 December 2018; pp. 112–118.
- Glodek, M.; Tschechne, S.; Layher, G.; Schels, M.; Brosch, T.; Scherer, S.; Kächele, M.; Schmidt, M.; Neumann, H.; Palm, G.; et al. Multiple Classifier Systems for the Classification of Audio-Visual Emotional States. In Affective Computing and Intelligent Interaction; Springer: Berlin/Heidelberg, Germany, 2011; Volume 6975, pp. 359–368.
- Ghosh, S.; Laksana, E.; Morency, L.-P.; Scherer, S. Representation Learning for Speech Emotion Recognition. In Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech), San Francisco, CA, USA, 8–12 September 2016; pp. 3603–3607.
- Hu, A.; Flaxman, S. Multimodal Sentiment Analysis to Explore the Structure of Emotions. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, London, UK, 19–23 August 2018; pp. 350–358.
- Wang, H.; Meghawat, A.; Morency, L.-P.; Xing, E.P. Select-Additive Learning: Improving Generalization in Multimodal Sentiment Analysis. In Proceedings of the 2017 IEEE International Conference on Multimedia and Expo (ICME), Hong Kong, China, 10–14 July 2017; pp. 949–954.
- Zadeh, A.; Zellers, R.; Pincus, E.; Morency, L.-P. MOSI: Multimodal Corpus of Sentiment Intensity and Subjectivity Analysis in Online Opinion Videos. arXiv 2016, arXiv:1606.06259.
- Williams, J.; Kleinegesse, S.; Comanescu, R.; Radu, O. Recognizing Emotions in Video Using Multimodal DNN Feature Fusion. In Proceedings of the Grand Challenge and Workshop on Human Multimodal Language (Challenge-HML), Melbourne, Australia, 20 July 2018; pp. 11–19.
- Majumder, N.; Hazarika, D.; Gelbukh, A.; Cambria, E.; Poria, S. Multimodal sentiment analysis using hierarchical fusion with context modeling. Knowl.-Based Syst. 2018, 161, 124–133.
- Poria, S.; Mazumder, N.; Cambria, E.; Hazarika, D.; Morency, L.-P.; Zadeh, A. Context-Dependent Sentiment Analysis in User-Generated Videos. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), Vancouver, BC, Canada, 30 July–4 August 2017; pp. 873–883.
- Liang, P.P.; Liu, Z.; Zadeh, A.; Morency, L.-P. Multimodal Language Analysis with Recurrent Multistage Fusion. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 150–161.
- Zadeh, A.; Poria, S.; Liang, P.P.; Cambria, E.; Mazumder, N.; Morency, L.-P. Memory Fusion Network for Multi-view Sequential Learning. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–8 February 2018; pp. 5634–5641.
- Wang, Y.; Shen, Y.; Liu, Z.; Liang, P.P.; Zadeh, A.; Morency, L.-P. Words Can Shift: Dynamically Adjusting Word Representations Using Nonverbal Behaviors. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 7216–7223.
- Delbrouck, J.-B.; Tits, N.; Brousmiche, M.; Dupont, S. A Transformer-Based Joint-Encoding for Emotion Recognition and Sentiment Analysis. In Proceedings of the Second Grand-Challenge and Workshop on Multimodal Language (Challenge-HML), Seattle, WA, USA, 10 July 2020; pp. 1–7.
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
- Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations Using RNN Encoder–Decoder for Statistical Machine Translation. arXiv 2014, arXiv:1406.1078.
- Zadeh, A.; Liang, P.P.; Vanbriesen, J.; Poria, S.; Tong, E.; Cambria, E.; Chen, M.; Morency, L.-P. Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018), Melbourne, Australia, 15–20 July 2018; pp. 2236–2246.
- Pennington, J.; Socher, R.; Manning, C.D. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543.
- Taigman, Y.; Yang, M.; Ranzato, M.; Wolf, L. DeepFace: Closing the Gap to Human-Level Performance in Face Verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 1701–1708.
- Schroff, F.; Kalenichenko, D.; Philbin, J. FaceNet: A Unified Embedding for Face Recognition and Clustering. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 815–823.
- Liu, W.; Wen, Y.; Yu, Z.; Li, M.; Raj, B.; Song, L. SphereFace: Deep Hypersphere Embedding for Face Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 21–26 July 2017; pp. 6738–6746.
- Degottex, G.; Kane, J.; Drugman, T.; Raitio, T.; Scherer, S. COVAREP—A Collaborative Voice Analysis Repository for Speech Technologies. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 960–964.
- Akhtar, M.S.; Chauhan, D.S.; Ghosal, D.; Poria, S.; Ekbal, A.; Bhattacharyya, P. Multi-task learning for multi-modal emotion recognition and sentiment analysis. arXiv 2019, arXiv:1905.05812.
| Statistics | Train | Validation | Test |
|---|---|---|---|
| Videos (total) | 3228 | | |
| Speakers (total) | 1000 | | |
| Utterances | 15,290 | 2291 | 4832 |
| Positive | 10,887 | 1673 | 3360 |
| Negative | 4403 | 618 | 1472 |
| Anger | 3433 | 427 | 971 |
| Disgust | 2720 | 352 | 922 |
| Fear | 1319 | 186 | 332 |
| Happy | 8147 | 1313 | 2522 |
| Sad | 3906 | 576 | 1334 |
| Surprise | 1562 | 201 | 479 |
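The label distribution is heavily skewed, which is worth quantifying before reading the accuracy numbers later on. A quick computation over the training counts above (the video/speaker statistics match the CMU-MOSEI corpus cited in Section 4.1):

```python
# Class priors in the training split, taken from the table above.
# Emotion labels are multi-label, so their fractions need not sum to 1.
train = {"Positive": 10887, "Negative": 4403,
         "Anger": 3433, "Disgust": 2720, "Fear": 1319,
         "Happy": 8147, "Sad": 3906, "Surprise": 1562}
n_utt = 15290  # training utterances

for label, count in train.items():
    print(f"{label:9s} {count / n_utt:6.1%}")
# Positive covers ~71% of training utterances, so a constant "positive"
# classifier would already score ~71% 2-class accuracy. This is one reason
# the result tables report F1 and macro-F1 alongside accuracy.
```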
| Parameters | Sentiment Params | Emotion Params |
|---|---|---|
| | 128 neurons | 128 neurons |
| | 64 neurons | 32 neurons |
| | 64 neurons | 64 neurons |
| | 128 neurons | 128 neurons |
| | 64 neurons | 128 neurons |
| Optimizer | Adam (lr = ) | Adam (lr = ) |
| Output activation | Sigmoid | Softmax |
| Batch size | 16 | 32 |
| Hidden activation | ReLU | ReLU |
| Early stopping | Patience 8 | Patience 8 |
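Interpreted as a training configuration, the table fixes the output activations (sigmoid for the binary sentiment head, softmax for the emotion head), Adam as the optimizer (the learning-rate value is not shown, so a default is assumed below), ReLU hidden activations, batch sizes of 16 and 32, and early stopping with patience 8. A minimal sketch of just those pieces, with an assumed 64-dimensional input to each head:

```python
# Hedged sketch of the configuration implied by the table; the 64-d head
# input is an assumption, and the hidden layer stack is omitted because its
# layer roles are not identified in the table.
import torch
import torch.nn as nn

# Output heads per the table: sigmoid for 2-class sentiment, softmax over
# the six emotion classes listed in the dataset table.
sentiment_head = nn.Sequential(nn.Linear(64, 1), nn.Sigmoid())
emotion_head = nn.Sequential(nn.Linear(64, 6), nn.Softmax(dim=-1))

# Adam as in the table; its learning rate is not given, so PyTorch's
# default (1e-3) stands in here.
optimizer = torch.optim.Adam(
    list(sentiment_head.parameters()) + list(emotion_head.parameters()))

class EarlyStopping:
    """Stop when validation loss fails to improve for `patience` epochs."""
    def __init__(self, patience=8):  # patience 8, per the table
        self.patience, self.best, self.bad = patience, float("inf"), 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best, self.bad = val_loss, 0
        else:
            self.bad += 1
        return self.bad >= self.patience  # True -> halt training
```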
Test-set results (Sentiment: 2-class Acc/F1; Emotion: per-class Acc/F1 with the Avg column giving macro-F1 over the emotion F1 scores):

| Model | 2-Class Acc | 2-Class F1 | Happy Acc | Happy F1 | Sad Acc | Sad F1 | Angry Acc | Angry F1 | Fear Acc | Fear F1 | Disgust Acc | Disgust F1 | Surprise Acc | Surprise F1 | Avg Macro-F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EF-LSTM * | - | - | 57.8 | - | 59.2 | - | - | - | 56.7 | - | - | - | - | - | - |
| TFN * | - | - | 66.5 | 66.6 | 58.9 | - | 60.5 | - | - | - | - | - | 52.2 | - | - |
| G-MFN * | 76.9 | 77.0 | 66.3 | 66.3 | 60.4 | 66.9 | 62.6 | 72.8 | 62.0 | 89.9 | 69.1 | 76.6 | 53.7 | 85.5 | - |
| MTMM-ES (S) | 79.8 | 77.6 | 61.6 | 59.3 | 65.4 | 72.4 | 64.5 | 75.6 | 51.5 | 87.7 | 72.2 | 81.0 | 53.0 | 86.5 | - |
| MTMM-ES (M) | 80.5 | 78.8 | 53.6 | 67.0 | 61.4 | 72.4 | 66.8 | 75.9 | 62.2 | 87.9 | 72.7 | 81.9 | 60.6 | 86.0 | - |
| TBJE (LA) | 82.4 | - | 66.0 | 65.5 | 73.9 | 67.9 | 81.9 | 76.0 | 89.2 | 87.2 | 86.5 | 84.5 | 90.6 | 86.1 | 77.5 |
| TBJE (LAV) | 81.5 | - | 65.0 | 64.0 | 72.0 | 67.9 | 81.6 | 74.7 | 89.1 | 84.0 | 85.9 | 83.6 | 90.5 | 86.1 | 76.7 |
| MSFE (LAV) | 79.7 | 81.8 | 67.7 | 67.8 | 72.1 | 67.0 | 81.1 | 74.4 | 93.1 | 89.8 | 80.7 | 78.9 | 90.7 | 85.4 | 77.2 |
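The Avg column is consistent with macro-F1 in its usual sense, the unweighted mean of the per-emotion F1 scores; for MSFE (LAV), the six per-class F1 values reproduce the reported 77.2 exactly:

```python
# Macro-F1 = unweighted mean of per-class F1. Per-class F1 scores for
# MSFE (LAV) are read off the results table above.
f1 = {"Happy": 67.8, "Sad": 67.0, "Angry": 74.4,
      "Fear": 89.8, "Disgust": 78.9, "Surprise": 85.4}
macro_f1 = sum(f1.values()) / len(f1)
print(f"{macro_f1:.1f}")  # 77.2, matching the table's Avg column
```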
Test-set ablation over modality subsets (emotion task; L = language, A = audio, V = visual):

| Modalities | Happy Acc | Happy F1 | Sad Acc | Sad F1 | Angry Acc | Angry F1 | Fear Acc | Fear F1 | Disgust Acc | Disgust F1 | Surprise Acc | Surprise F1 | Avg Macro-F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| L-V-A | 67.7 | 67.8 | 72.1 | 67.0 | 81.1 | 74.4 | 93.1 | 89.8 | 80.7 | 78.9 | 90.7 | 85.4 | 77.2 |
| L-V | 64.5 | 64.2 | 73.0 | 64.5 | 83.1 | 73.2 | 93.4 | 89.9 | 79.8 | 75.9 | 89.4 | 87.9 | 76.0 |
| L-A | 66.1 | 66.1 | 73.6 | 64.4 | 80.4 | 74.1 | 93.1 | 89.8 | 80.6 | 78.5 | 90.1 | 85.5 | 76.4 |
| V-A | 65.0 | 65.2 | 68.8 | 61.7 | 79.7 | 71.8 | 93.2 | 89.8 | 81.5 | 75.0 | 90.5 | 85.9 | 74.9 |
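Read as deviations from the full trimodal model, the ablation says that dropping audio costs 1.2 macro-F1 points, dropping visual costs 0.8, and dropping language costs 2.3, so language is the dominant modality here. The arithmetic from the table:

```python
# Macro-F1 drop when one modality is removed from the full L-V-A model.
macro_f1 = {"L-V-A": 77.2, "L-V": 76.0, "L-A": 76.4, "V-A": 74.9}
for subset, removed in (("L-V", "audio"), ("L-A", "visual"),
                        ("V-A", "language")):
    drop = macro_f1["L-V-A"] - macro_f1[subset]
    print(f"without {removed:8s}: -{drop:.1f}")
# Dropping language hurts most (-2.3), consistent with language being the
# strongest single modality in multimodal sentiment analysis.
```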
| # | Language (Transcript) | Visual + Audio | Truth | MSFE |
|---|---|---|---|---|
| 1 | But, I mean, if you’re going to watch a movie like that, go see Saw again or something, because this movie is really not good at all. | + frown + rapid | Sad, Angry, Disgust | Sad, Angry, Disgust, Fear |
| 2 | It’s one of the best action blockbuster I’ve seen this whole summer and I highly recommend you guys seeing it. | + smile + excited | Happy, Surprise | Happy, Surprise |
| 3 | Bruce Willis is your old, your (umm) your (stutter) old typical cop but basically this time he’s fighting internet (umm) terrorists. | + calm + smooth | No class | Fear, Disgust, Angry, Sad |
| 4 | (uhh) the story’s just kind of a rehash of the previous movie and it overall just feels very forced | + frown + slight | Disgust | Disgust, Angry, Happy, Sad |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).