Article

Heterogeneous Hierarchical Fusion Network for Multimodal Sentiment Analysis in Real-World Environments

1 Big Data & Intelligence Engineering School, Chongqing College of International Business and Economics, Chongqing 401520, China
2 College of Computer and Information Technology, China Three Gorges University, Yichang 443002, China
3 Hubei Key Laboratory of Intelligent Vision Based Monitoring for Hydroelectric Engineering, China Three Gorges University, Yichang 443002, China
4 Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, Changchun 130033, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(20), 4137; https://doi.org/10.3390/electronics13204137
Submission received: 10 September 2024 / Revised: 16 October 2024 / Accepted: 18 October 2024 / Published: 21 October 2024
(This article belongs to the Special Issue New Advances in Affective Computing)

Abstract

Multimodal sentiment analysis models can determine users’ sentiments by utilizing rich information from various sources (e.g., textual, visual, and audio). However, there are two key challenges when deploying such models in real-world environments: (1) reliance on the performance of automatic speech recognition (ASR) models can lead to errors in recognizing sentiment words, which may mislead the sentiment analysis of the textual modality, and (2) variations in information density across modalities complicate the development of a high-quality fusion framework. To address these challenges, this paper proposes a novel Multimodal Sentiment Word Optimization Module and a heterogeneous hierarchical fusion (MSWOHHF) framework. Specifically, the proposed Multimodal Sentiment Word Optimization Module optimizes the sentiment words extracted from the textual modality by the ASR model, thereby reducing sentiment word recognition errors. In the multimodal fusion phase, a heterogeneous hierarchical fusion network architecture is introduced, which first utilizes a Transformer Aggregation Module to fuse the visual and audio modalities, enhancing the high-level semantic features of each modality. A Cross-Attention Fusion Module then integrates the textual modality with the audiovisual fusion representation. Next, a Feature-Based Attention Fusion Module is proposed that enables fusion by dynamically tuning the weights of both the combined and unimodal representations. Sentiment polarity is then predicted using a nonlinear neural network. Finally, the experimental results on the MOSI-SpeechBrain, MOSI-IBM, and MOSI-iFlytek datasets show that the MSWOHHF outperforms several baseline models.

1. Introduction

With the rapid growth of social media and short video content, the amount of online information related to emotions has greatly increased [1]. This has attracted widespread attention to multimodal sentiment analysis (MSA) and its applications in many fields (e.g., sentiment recognition and emotion-intelligent robots [2]). Unlike traditional unimodal sentiment analysis, MSA extracts users’ emotional features from various modal inputs (i.e., textual, visual, audio, and others) [3]. Typically, modality interactions can enrich emotional semantics by utilizing complementary information [4]. However, the different modal information used in MSA is often heterogeneous [5]. Therefore, a key issue in MSA is how to design a multimodal fusion scheme that efficiently integrates heterogeneous data [6], thereby learning multimodal representations that contain more emotion-related information while maintaining the consistency and differentiation of each modality.
To enhance fusion representations, earlier studies like CATF-LSTM [7] emphasized designing novel LSTM architectures and attention mechanisms to capture the contextual information within each modality. With the emergence of Transformers, models such as CTFN [8] and MulT [9] introduced modality translation using Transformer-based structures, facilitating cross-modal encoding. Moreover, several representation-learning-based methods (e.g., MISA [5] and Self-MM [10]) have been developed to capture features by leveraging both consistent and divergent information across modalities, aiming to enhance the accuracy of MSA. However, these approaches often assign equal importance to all the modalities, neglecting the varying semantic depth each modality offers. This can lead to undervaluing dominant modalities while overemphasizing weaker ones. Given the distinctive features of each modality in sentiment analysis, the resulting fusion representations may lack critical emotion-related information, ultimately reducing the model performance. Previous studies (e.g., [11,12]) emphasize the dominant role of textual emotional features in MSA tasks, as the textual modality provides highly structured and semantically rich features and benefits from mature text modeling techniques in natural language processing. However, in practical scenarios, text produced by ASR systems often contains misrecognized sentiment words, which affects the accuracy of MSA models in determining sentiment polarity. In summary, the main challenges of multimodal sentiment analysis in the real world include the following: (1) the incorrect recognition of sentiment words and (2) the imbalanced impact of different modalities on sentiment analysis.
To address the above challenges, we propose a new MSWOHHF framework. This framework includes four core units: the Multimodal Sentiment Word Optimization Module, the Transformer Aggregation Module, the Cross-Attention Fusion Module, and the Feature-Based Attention Fusion Module.
In the Multimodal Sentiment Word Optimization Module, we mitigate the negative impact of ASR errors by employing a language model that captures both syntactic and semantic information. This allows us to predict the most probable positions of sentiment words. By dynamically reconstructing incomplete sentiment semantics, the model generates new word vectors using multimodal sentiment information.
The Transformer Aggregation Module is designed to elevate the low-level features of audio and visual modalities by converting them into higher-level representations. Initially, the module organizes and compresses information from both modalities. These integrated data are then processed through the aggregation module, enabling complementary feature learning and addressing disparities in the information density between the two modalities.
The Cross-Attention Fusion Module improves the effectiveness of fusion in multimodal sentiment analysis by referencing a learnable matrix. This helps the model to better capture interactions and emotional information across modalities. The attention-based interrelations facilitate the capture of semantic correlations between audiovisual fusion and textual features.
In the Feature-Based Attention Fusion Module, we integrate the representations from the three modalities through vector concatenation. This approach consolidates the features from each modality into a single unified representation. Next, we implement a feature-based attention mechanism that dynamically adjusts the weight of each dimension within these representations. This mechanism identifies and emphasizes the most significant features by establishing correlations between dimensions, thereby optimizing the fusion of multimodal data. The module enhances the overall fusion process by focusing on the key features and filtering out less relevant information. The refined joint representation, resulting from this attention mechanism, is then utilized to generate sentiment predictions, improving the accuracy and reliability of the model’s outputs.
The main contributions of this paper are as follows:
  • We propose the MSWOHHF model, which enhances the robustness of sentiment prediction by dynamically completing the emotional semantics of damaged ASR textual modalities through a Multimodal Sentiment Word Optimization Module.
  • We develop a novel heterogeneous hierarchical multimodal fusion network to facilitate effective interaction among the three modalities, each with varying information densities. This network incorporates a fusion module that utilizes attention aggregation and cross-modal attention mechanisms, enabling fair and efficient complementary learning across modalities.
  • To evaluate each modality’s significance during the fusion process, we design a feature-based attention module that dynamically adjusts the weighting of each modality’s representation.

2. Related Work

In recent years, MSA has garnered increasing interest and has become a significant research field [13]. It primarily focuses on extracting human emotions from complex multimedia data comprising text, visual, and speech components. Before multimodal information fusion can occur, it is essential to represent heterogeneous multimodal data. Extensive research has been conducted on modality-specific representations, ranging from handcrafted designs tailored for specific applications to data-driven approaches [2]. For textual feature representation, the commonly used word embedding models like Google’s Word2Vec model [14] mainly rely on either Skip-grams or the CBOW models. For instance, Yang et al. [15] used the Word2Vec model for feature extraction in the textual modality. GloVe [16] word vectors utilize a co-occurrence matrix to capture global information, while ELMo [17] word vectors can capture context-related meanings in words as the linguistic context changes. In 2018, Google introduced the BERT [18] pre-training model, which has been widely adopted by scholars who use large-scale corpora for pre-training to learn semantic relationships before applying it to downstream tasks for word vector input. Chen et al. [19] used both GloVe word vectors and the BERT model for textual feature representation and compared their performance. For visual feature representation in MSA, researchers often use specialized tools like Facet and OpenFace [20] to capture facial emotions. Manually crafted acoustic features like Mel-frequency cepstral coefficients (MFCC) are frequently used in the audio domain. Multimodal representation, which builds upon individual modal features, can be achieved through either straightforward concatenation or more sophisticated deep neural networks (DNNs) [21,22,23,24]. Xu et al. [25] introduced the MultiSentiNet model, which employs LSTM and attention mechanisms guided by visual features to fuse multimodal data.
The early research on multimodal fusion often relied on either early fusion, where the raw features from multiple sources were combined directly [26,27], or late fusion, where the decisions from multiple sentiment classifiers were aggregated [28]. However, early fusion can result in redundant input vectors and increased computational complexity, while late fusion struggles to capture cross-modal correlations. Various deep fusion and multistage fusion strategies have been introduced to overcome these limitations for multimodal sentiment analysis. For instance, Zadeh et al. [29] proposed a Tensor Fusion Network (TFN) model that leverages tensor representations and operations to better capture dynamic properties and interactions across modalities. However, the computational demands of three-dimensional tensors grow exponentially with increasing feature dimensions. To address this, Liu et al. [30] developed Low-Rank Multimodal Fusion (LMF), which reduces the complexity of high-order tensors using low-rank decomposition, although it overlooks local interaction modeling. The previous methods often treated each utterance as independent, neglecting the dependencies between the utterances within a video. Advances have occurred regarding models that incorporate contextual information using recurrent neural networks (RNNs) [31,32]. Poria et al. [33] introduced the BC-LSTM model to capture the contextual dependencies between utterances in the same video, but this approach did not account for the varying importance of each utterance across different modalities. With the introduction of the Transformer architecture [34], multimodal applications have expanded. Tsai et al. [9] proposed MulT, a multimodal model that uses pairwise cross-modal attention in a directional manner, enhancing adaptability between modalities. Among the fusion techniques, the attention-based methods have proven to be highly efficient and effective [35]. Huddar et al. [36] calculated bimodal attention matrices separately and combined them into a trimodal attention matrix, enabling the fusion of interactive information across modalities. Similarly, Wu et al. [3] developed a bimodal-information-enhanced multi-head attention mechanism to study the relative importance of different modality pairs and to fuse multimodal data.
In addition, some researchers have explored the issues of noise reduction and missing modalities. Pham et al. [37] proposed the MCTN model to handle scenarios where there is a possibility of missing visual and acoustic data. Given the importance of textual modalities in multimodal sentiment analysis tasks, Lei et al. [38] introduced the TMRN model, which uses the textual modality as the primary thread, interacting with and enhancing the other two modalities to achieve low-redundancy and denoised feature representations. Chen et al. [11] proposed a gated multimodal embedding approach to filter out noise in acoustic and visual data. Zhu et al. [39] proposed the SKEAFN model, which constructs an emotion knowledge graph and performs graph computations for MSA tasks, designing an additional emotional knowledge generation module.
Although many excellent multimodal sentiment analysis models have been proposed, they often do not address textual acquisition in the current environments. Notably, Wu et al. [40] proposed the SWRM model, which utilizes an emotion word position detection module to determine the most likely locations of emotion words in the textual modality and dynamically completes the emotional semantics of word representations at these locations through a multimodal word refinement module. However, the model does not account for the differences in information density between modalities.
In this paper, we account for the differences in information density across modalities and allow them to interact fairly with each other, while also addressing the impact of ASR errors in real-world environments.

3. Methodology

Our proposed architecture can be divided into six sub-modules: (1) the modality feature extraction module uses BERT to extract textual features and sLSTM to extract audio and visual features. BERT [18] can capture words’ complex relationships in context through bidirectional encoding. The LSTM network is the basic unit of the sLSTM model structure. LSTM is often used for sequence data processing because it effectively leverages long-distance dependencies in sequential data without encountering the vanishing gradient problem that standard RNNs face. The sLSTM model contains multiple hidden LSTM layers, each consisting of several LSTM units capable of handling long-term states. Audio features typically include pitch, intensity, speech rate, and tone, which can reflect the speaker’s emotional state. For example, a high pitch, fast speech rate, and louder volume may be associated with positive emotions such as excitement, while a low pitch and slow speech rate may be related to negative emotions such as sadness or disappointment. The sLSTM captures these changes by analyzing the temporal characteristics of the audio and extracting high-dimensional audio feature vectors for emotion recognition. Visual features include facial expressions, movements of the eyes and mouth, and head posture, all of which can directly or indirectly reflect emotions. The sLSTM processes the temporal data of the video to extract continuous facial expression changes, body posture, and other features, capturing visual sentiment cues. (2) Multimodal Sentiment Word Optimization Module: This module dynamically refines the sentiment word semantics using multimodal information. (3) Transformer Aggregation Module: This module elevates the low-level features of the audio and visual modalities into high-level features. (4) Cross-Attention Fusion Module: Based on cross-attention, this module helps to capture the semantic correlations between textual features and audiovisual fusion features. (5) Feature-Based Attention Fusion Module: This module integrates multi-source representations. (6) Prediction Module: This module generates sentiment regression labels. Figure 1 shows the overall architecture.

3.1. Multimodal Sentiment Word Optimization Module (MSWOM)

We use an MSWOM to improve the accuracy of ASR in recognizing emotional words. Figure 2 illustrates the process of this module. First, we sequentially mask each word $w_i$ in a sentence; the word is replaced with a special token [MASK]. Second, a Multimodal Gating Network (MGN) is employed to filter out irrelevant sentence information. At the same time, a Multimodal Sentiment Word Attention Network (MSWAN) is designed to incorporate valuable information from candidate words generated by the BERT model, resulting in optimized word embeddings. The unaligned representations of the three modalities (word embeddings, visual features, and audio features) are denoted as $x^m = \{x_l^m : 1 \le l \le n_m,\ x_l^m \in \mathbb{R}^{d_{x_m}}\}$, $m \in \{t, v, a\}$. To acquire the multimodal information corresponding to each word, a pseudo-alignment method is applied to align the textual features with the visual and audio features. The audio and visual features are divided into non-overlapping segments of lengths $n_a/n_t$ and $n_v/n_t$, respectively, and the average is taken for each group of features, yielding the pseudo-aligned feature representation $u^i = \{u_l^i : 1 \le l \le n_t,\ u_l^i \in \mathbb{R}^{d_{x_i}}\}$, $i \in \{v, a\}$. Next, BERT and sLSTM are used to model the textual, visual, and audio features separately, resulting in $s^i = \{s_l^i : 1 \le l \le n_t,\ s_l^i \in \mathbb{R}^{d_{s_i}}\}$, $i \in \{t, v, a\}$ in Equations (1)–(3). Additionally, we employ a Transformer network to fuse the audio and visual features, enabling the capture of high-level sentiment semantics and obtaining $s^{va} = \{s_l^{va} : 1 \le l \le n_t,\ s_l^{va} \in \mathbb{R}^{d_{s_{va}}}\}$ in Equation (4). The specific equations are as follows:
$s^t = \mathrm{BERT}_t(x^t)$  (1)
$s^v = \mathrm{sLSTM}_v(u^v)$  (2)
$s^a = \mathrm{sLSTM}_a(u^a)$  (3)
$s^{va} = \mathrm{Transformer}_{va}([s^v; s^a])$  (4)
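As an illustration of the pseudo-alignment step described above, the sketch below splits an unaligned audio or visual sequence into $n_t$ non-overlapping segments and averages each one; the helper name pseudo_align and the tensor sizes are assumptions for illustration, not the authors’ released code.

```python
# Illustrative sketch of pseudo-alignment: group frames into n_t non-overlapping
# segments and average each segment so that every word position receives one
# visual and one audio feature vector.
import torch

def pseudo_align(features: torch.Tensor, n_t: int) -> torch.Tensor:
    """features: (seq_len, dim) unaligned modality sequence; returns (n_t, dim)."""
    # Split into n_t roughly equal, non-overlapping chunks and average each chunk.
    chunks = torch.tensor_split(features, n_t, dim=0)
    return torch.stack([c.mean(dim=0) for c in chunks])

# Example: align 300 audio frames to a 20-token sentence.
audio = torch.randn(300, 74)          # dummy acoustic features
u_a = pseudo_align(audio, n_t=20)     # shape: (20, 74)
```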
Subsequently, we employ an MGN to remove irrelevant information from the input sentence. Specifically, we concatenate the unimodal context-aware representations $s_p^t$, $s_p^v$, $s_p^a$ and the bimodal representation $s_p^{va}$ at the sentiment word position $p$ detected by the language model. These are fed into a linear neural network followed by a sigmoid activation function to generate a gate value $g^m$ in Equation (5), which is applied to filter out irrelevant information from the word embeddings. To ensure that the model ignores improbable signals, we obtain the representation $r^m$ of the position in which the sentiment word is located through a gate mask $k$ in Equation (6).
$g^m = \mathrm{Sigmoid}(W_1([s_p^t; s_p^v; s_p^a; s_p^{va}]) + b_1)$  (5)
$r^m = (1 - g^m \cdot k)\, x_p^t$  (6)
where $W_1 \in \mathbb{R}^{1 \times \sum_{i \in \{t,v,a,va\}} d_{s_i}}$ and $b_1 \in \mathbb{R}^1$.
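A minimal sketch of the gating computation in Equations (5) and (6), assuming PyTorch and single-position inputs; the class name and tensor shapes are illustrative rather than the released implementation.

```python
# Sketch of the Multimodal Gating Network: a scalar gate g_m filters the
# original word embedding x_p^t at the detected sentiment word position p.
import torch
import torch.nn as nn

class MultimodalGatingNetwork(nn.Module):
    def __init__(self, d_t: int, d_v: int, d_a: int, d_va: int):
        super().__init__()
        self.gate = nn.Linear(d_t + d_v + d_a + d_va, 1)  # W_1, b_1

    def forward(self, s_t, s_v, s_a, s_va, x_word, k):
        # s_*: context-aware representations at position p; x_word: x_p^t;
        # k: gate mask value in [0, 1].
        g_m = torch.sigmoid(self.gate(torch.cat([s_t, s_v, s_a, s_va], dim=-1)))  # Eq. (5)
        r_m = (1.0 - g_m * k) * x_word                                            # Eq. (6)
        return g_m, r_m
```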
Additionally, we use an MSWAN to extract sentiment-related information from the candidate words selected by the BERT model. First, at the detected position $p$, we concatenate the word embedding $x_{c_l^p}$ of each candidate word $c_l^p$ with the multimodal representations $s_p^v$, $s_p^a$, and $s_p^{va}$, and pass the result through a linear layer to obtain the attention score $g_l^e$ in Equation (7). The attention scores are processed through a softmax function to derive the attention weights $\tau_l^e$ in Equation (8). These weights are then applied to the candidate word embeddings, resulting in the sentiment embedding $r^e$ in Equation (9).
$g_l^e = W_2([x_{c_l^p}; s_p^v; s_p^a; s_p^{va}]) + b_2$  (7)
$\tau_l^e = \mathrm{softmax}(g_l^e)$  (8)
$r^e = \sum_{l=1}^{k} \tau_l^e\, x_{c_l^p}$  (9)
where $W_2 \in \mathbb{R}^{1 \times (d_{x_t} + \sum_{i \in \{v,a,va\}} d_{s_i})}$ and $b_2 \in \mathbb{R}^1$.
Additionally, since the desired sentiment word may not be included in the candidate word list, we introduce the representation of [MASK], $x_{mask}$, during the optimization of the sentiment word representations. This allows the BERT model to handle the issue based on the context. We then design an aggregation network to balance the contributions of the special word embedding $x_{mask}$ and the sentiment embedding $r^e$. Finally, we combine the above representations to obtain the optimized word embedding $r^t$ for the target word. The specific equations are as follows:
$g^{mask} = \mathrm{Sigmoid}(W_3([r^e; x_{mask}]) + b_3)$  (10)
$r^t = (g^m \cdot k)\,(g^{mask}\, r^e + (1 - g^{mask})\, x_{mask}) + r^m$  (11)
where $W_3 \in \mathbb{R}^{1 \times 2 d_{x_t}}$ and $b_3 \in \mathbb{R}^1$.
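The candidate-word attention and the final refinement (Equations (7)–(11) as reconstructed above) can be sketched as follows; the shapes, variable names, and the handling of the gate mask k are assumptions made for illustration.

```python
# Sketch of the Multimodal Sentiment Word Attention Network plus the final
# aggregation that balances the candidate-based embedding r_e and x_mask.
import torch
import torch.nn as nn

class SentimentWordAttention(nn.Module):
    def __init__(self, d_word: int, d_v: int, d_a: int, d_va: int):
        super().__init__()
        self.score = nn.Linear(d_word + d_v + d_a + d_va, 1)  # W_2, b_2
        self.mask_gate = nn.Linear(2 * d_word, 1)              # W_3, b_3

    def forward(self, cand_emb, s_v, s_a, s_va, x_mask, g_m, k, r_m):
        # cand_emb: (num_candidates, d_word) embeddings of BERT candidate words.
        n = cand_emb.size(0)
        ctx = torch.cat([s_v, s_a, s_va], dim=-1).expand(n, -1)
        g_e = self.score(torch.cat([cand_emb, ctx], dim=-1))         # Eq. (7)
        tau = torch.softmax(g_e, dim=0)                              # Eq. (8)
        r_e = (tau * cand_emb).sum(dim=0)                            # Eq. (9)
        g_mask = torch.sigmoid(
            self.mask_gate(torch.cat([r_e, x_mask], dim=-1)))        # Eq. (10)
        r_t = (g_m * k) * (g_mask * r_e + (1 - g_mask) * x_mask) + r_m  # Eq. (11)
        return r_t
```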

3.2. Unimodal Features Extraction

After obtaining the new word embedding sequence $z^t = \{x_1^t, x_2^t, \ldots, r^t, \ldots, x_{n_t}^t\}$ from the MSWOM, we perform feature extraction using the BERT model to obtain $h^t$, which is given as
$h^t = \mathrm{BERT}_{textual}(z^t)$  (12)
We use two separate sLSTM networks to extract the visual and audio features, resulting in $h^v$ and $h^a$, which are expressed as
$h^v = \mathrm{sLSTM}_{visual}(x^v)$  (13)
$h^a = \mathrm{sLSTM}_{audio}(x^a)$  (14)
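Since no code is given for the sLSTM extractors, the following stand-in uses a stacked PyTorch LSTM; the input dimensions are placeholders, and the hidden sizes are chosen here to echo the 32- and 64-dimensional settings reported in Section 4.2. This is an assumed sketch, not the authors’ implementation.

```python
# Illustrative stand-in for the stacked-LSTM (sLSTM) feature extractors
# applied to the visual and audio streams (Eqs. (13)-(14)).
import torch
import torch.nn as nn

class StackedLSTMEncoder(nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int, num_layers: int = 2):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True, dropout=0.1)

    def forward(self, x):
        # x: (batch, seq_len, in_dim) -> (batch, seq_len, hidden_dim)
        out, _ = self.lstm(x)
        return out

h_v = StackedLSTMEncoder(in_dim=47, hidden_dim=32)(torch.randn(8, 20, 47))  # visual
h_a = StackedLSTMEncoder(in_dim=74, hidden_dim=64)(torch.randn(8, 20, 74))  # audio
```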

3.3. Transformer Aggregation Module (TAM)

Considering that visual and audio modalities have low-level semantic features, while textual features are high-level semantic features, we designed a low-level feature fusion mechanism within the TAM. This mechanism is intended to compensate for the differences in information density between different modalities, allowing them to be fused more equitably.
First, since the feature dimensions of the visual and audio modalities are inconsistent after being extracted by the sLSTM networks, the extracted visual and audio features are processed through a temporal convolutional network to unify the dimensions and further extract the features $X_v$ and $X_a$, which are denoted as
$X_v = \mathrm{Conv1D}(h^v, k_v) \in \mathbb{R}^{f_v \times d}$  (15)
$X_a = \mathrm{Conv1D}(h^a, k_a) \in \mathbb{R}^{f_a \times d}$  (16)
Then, we use a Transformer encoder to enable attention to flow independently within each modality for unimodal representation learning in Equations (17) and (18).
$T_v = \mathrm{Transformer}(X_v)$  (17)
$T_a = \mathrm{Transformer}(X_a)$  (18)
Finally, low-level feature fusion is performed through the Transformer aggregation network. This module forces each modality to organize and condense its information before sharing it with another modality. The core idea is to introduce a set of fusion tokens $T^f = [T_1^f, T_2^f, \ldots, T_B^f]$ into the input sequence, where the number of tokens $B$ is much smaller than the sequence lengths of the audio and visual modalities. The specific equations are as follows:
$[\bar{T}_v; \bar{T}^f] = \mathrm{Transformer}([T_v; T^f])$  (19)
$[\bar{T}_a; \bar{T}^{f+1}] = \mathrm{Transformer}([T_a; \bar{T}^f])$  (20)
$T_{va} = [\bar{T}_a; \bar{T}^{f+1}]$  (21)
The aggregation network sequentially compresses and shares the information of the visual and audio modalities. Introducing the $B$ fusion tokens so that only the necessary information is shared improves the performance of multimodal fusion while reducing the computational complexity.
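A compact sketch of the TAM (Equations (15)–(21)): a kernel-size-1 Conv1d stands in for the temporal convolution, single Transformer encoder layers stand in for the encoders, and B learnable bottleneck tokens carry information from the visual stream to the audio stream. All layer choices and sizes here are assumptions for illustration.

```python
# Sketch of the Transformer Aggregation Module with bottleneck fusion tokens.
import torch
import torch.nn as nn

class TransformerAggregation(nn.Module):
    def __init__(self, d_v: int, d_a: int, d: int = 96, num_tokens: int = 4):
        super().__init__()
        self.proj_v = nn.Conv1d(d_v, d, kernel_size=1)   # Eq. (15)
        self.proj_a = nn.Conv1d(d_a, d, kernel_size=1)   # Eq. (16)
        enc = lambda: nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.enc_v, self.enc_a = enc(), enc()             # Eqs. (17)-(18)
        self.enc_fuse_v, self.enc_fuse_a = enc(), enc()   # Eqs. (19)-(20)
        self.tokens = nn.Parameter(torch.randn(1, num_tokens, d))  # T^f, B << seq_len

    def forward(self, h_v, h_a):
        # h_v: (batch, seq_v, d_v); h_a: (batch, seq_a, d_a)
        X_v = self.proj_v(h_v.transpose(1, 2)).transpose(1, 2)
        X_a = self.proj_a(h_a.transpose(1, 2)).transpose(1, 2)
        T_v, T_a = self.enc_v(X_v), self.enc_a(X_a)
        B = self.tokens.size(1)
        tok = self.tokens.expand(h_v.size(0), -1, -1)
        out_v = self.enc_fuse_v(torch.cat([T_v, tok], dim=1))   # Eq. (19)
        tok_v = out_v[:, -B:]                                   # updated bottleneck tokens
        out_a = self.enc_fuse_a(torch.cat([T_a, tok_v], dim=1)) # Eq. (20)
        return out_a                                            # T_va, Eq. (21)
```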

3.4. Cross-Attention Fusion Module (CAFM)

We use a CAFM to enhance information complementarity and improve feature representation capability. Through the TAM module, we obtain high-level semantic features T v a from audiovisual fusion. Cross-attention is used to achieve mutual enhancement between the textual modality and the audiovisual fusion.
First, a fully connected layer is used to align the dimensions of the textual modality feature vector with the audiovisual fusion modality feature vector, facilitating the attention calculation. The specific equations are as follows:
$X_t = W_t h^t + b_t$  (22)
$X_{va} = W_{va} T_{va} + b_{va}$  (23)
where $W_t \in \mathbb{R}^{d_m \times d_t}$, $b_t \in \mathbb{R}^{d_m}$, $W_{va} \in \mathbb{R}^{d_m \times d_{va}}$, and $b_{va} \in \mathbb{R}^{d_m}$.
The correlation between the textual modality features and the audiovisual fusion features is assessed to capture their interrelation. To address modality differences, a learnable weight matrix $W \in \mathbb{R}^{K \times K}$ is employed, and the cross-correlation is expressed as follows:
$Z = X_{va}^{T} W X_t$  (24)
where $Z \in \mathbb{R}^{d_m \times d_m}$, $W$ represents the mutual weights between the textual and audiovisual fusion features, $d_m$ denotes the feature dimension of the textual and audiovisual fusion features, and $T$ denotes the transpose.
The matrix $Z$ quantifies the relationship between the textual and audiovisual fusion features: the higher a value in $Z$, the stronger the correlation between the textual features and the corresponding subsequence of the audiovisual fusion features. Therefore, the cross-attention weights $A_t$ and $A_{va}$ for the textual and audiovisual fusion features are computed by applying a column-wise softmax to $Z$ and $Z^T$, which is denoted as
$A_t^{i,j} = \dfrac{e^{Z_{i,j}/T}}{\sum_{k=1}^{K} e^{Z_{k,j}/T}}$  (25)
$A_{va}^{i,j} = \dfrac{e^{Z^{T}_{i,j}/T}}{\sum_{k=1}^{K} e^{Z^{T}_{i,k}/T}}$  (26)
where $i$ and $j$ represent the $i$-th row and $j$-th column of the cross-correlation matrix, respectively, and $T$ denotes the softmax temperature coefficient.
Since the weight matrix W is learned based on the mutual correlations between textual and audiovisual fusion features, the attention weights for each modality are guided by the other modality. This effectively leverages the complementary nature of the textual and audiovisual fusion modalities. Subsequently, the textual and audiovisual fusion features are weighted using cross-attention weights, making the cross-attention weights more comprehensive. The specific equation is as follows:
$\bar{X}_t = X_t A_t$  (27)
$\bar{X}_{va} = X_{va} A_{va}$  (28)
where $A_t$ and $A_{va}$ represent the cross-attention weights for the textual and audiovisual fusion features, respectively.
Then, the reweighted representations are added to the corresponding features to obtain the attended features $X_{att,t}$ and $X_{att,va}$, which are represented as
$X_{att,t} = \tanh(X_t + \bar{X}_t)$  (29)
$X_{att,va} = \tanh(X_{va} + \bar{X}_{va})$  (30)
Finally, $X_{att,t}$ and $X_{att,va}$ are concatenated to obtain the final representation $\bar{X}$ of the cross-attention fusion in Equation (31).
$\bar{X} = [X_{att,t}; X_{att,va}]$  (31)
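The cross-attention computation in Equations (22)–(31) can be sketched as follows; the single-sample shapes, the temperature handling, and the class name CrossAttentionFusion are assumptions for illustration rather than the authors’ implementation.

```python
# Sketch of the Cross-Attention Fusion Module operating on one sample.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, d_t: int, d_va: int, d_m: int, K: int, temp: float = 1.0):
        super().__init__()
        self.fc_t = nn.Linear(d_t, d_m)            # Eq. (22)
        self.fc_va = nn.Linear(d_va, d_m)          # Eq. (23)
        self.W = nn.Parameter(torch.randn(K, K))   # learnable cross-modal weights
        self.temp = temp

    def forward(self, h_t, T_va):
        # h_t: (K, d_t), T_va: (K, d_va)
        X_t, X_va = self.fc_t(h_t), self.fc_va(T_va)
        Z = X_va.T @ self.W @ X_t                       # Eq. (24), shape (d_m, d_m)
        A_t = torch.softmax(Z / self.temp, dim=0)       # Eq. (25): column-wise
        A_va = torch.softmax(Z.T / self.temp, dim=1)    # Eq. (26)
        X_t_bar = X_t @ A_t                             # Eq. (27)
        X_va_bar = X_va @ A_va                          # Eq. (28)
        X_att_t = torch.tanh(X_t + X_t_bar)             # Eq. (29)
        X_att_va = torch.tanh(X_va + X_va_bar)          # Eq. (30)
        return torch.cat([X_att_t, X_att_va], dim=-1)   # Eq. (31)
```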

3.5. Feature-Based Attention Fusion Module (FBAFM)

We design a feature-based attention fusion module to dynamically adjust the weights of each dimension of $X_t$, $X_{va}$, and $\bar{X}$ obtained from the above modules. This module identifies and emphasizes the most important features by establishing correlations between dimensions, thereby optimizing the fusion of multimodal data while filtering out less relevant information.
First, the obtained $X_t$, $X_{va}$, and $\bar{X}$ are concatenated to form a multimodal representation $H$ in Equation (32). Then, the weighted feature attention matrix $W_{att}$ is computed in Equation (33).
$H = [X_t; X_{va}; \bar{X}]$  (32)
$W_{att} = \mathrm{Sigmoid}(W_2(\mathrm{ReLU}(W_1 H)))$  (33)
where $H \in \mathbb{R}^{3 \times d_m}$, and $W_1$ and $W_2$ are learnable parameters with $W_1, W_2 \in \mathbb{R}^{3 \times d_m}$.
Additionally, to assess the importance of each feature channel, we use the sigmoid activation function to ensure that each value in $W_{att}$ lies between 0 and 1. This enhances the contribution of the key features to the multimodal fusion process. Finally, the concatenated representation is multiplied element-wise by $W_{att}$ to obtain the final output $\bar{H}$ in Equation (34).
$\bar{H} = W_{att} \odot H$  (34)
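A minimal sketch of the FBAFM (Equations (32)–(34)), assuming the three inputs are pooled vectors whose concatenation has dimension d_h; the layer sizes and class name are illustrative.

```python
# Sketch of the Feature-Based Attention Fusion Module: channel-wise weights in
# [0, 1] rescale the concatenated multimodal representation.
import torch
import torch.nn as nn

class FeatureBasedAttentionFusion(nn.Module):
    def __init__(self, d_h: int):
        super().__init__()
        self.w1 = nn.Linear(d_h, d_h)  # W_1
        self.w2 = nn.Linear(d_h, d_h)  # W_2

    def forward(self, X_t, X_va, X_bar):
        H = torch.cat([X_t, X_va, X_bar], dim=-1)                # Eq. (32)
        W_att = torch.sigmoid(self.w2(torch.relu(self.w1(H))))   # Eq. (33)
        return W_att * H                                         # Eq. (34)
```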

3.6. Regression Analysis

In the final regression analysis, we first apply a linear transformation to the multimodal features using a linear layer, followed by a ReLU activation function to introduce nonlinearity. The specific equation is as follows:
$score = \mathrm{ReLU}(W_F \bar{H} + b_F)$  (35)
where $W_F \in \mathbb{R}^{1 \times d_F}$ and $b_F \in \mathbb{R}^1$.
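The prediction head can be sketched as below, mirroring Equation (35); the module name and the input dimension d_F are placeholders.

```python
# Sketch of the regression head: a linear layer followed by ReLU, as in Eq. (35).
import torch.nn as nn

class SentimentRegressor(nn.Module):
    def __init__(self, d_F: int):
        super().__init__()
        self.fc = nn.Linear(d_F, 1)   # W_F, b_F
        self.act = nn.ReLU()

    def forward(self, H_bar):
        return self.act(self.fc(H_bar))   # score = ReLU(W_F * H_bar + b_F)
```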

4. Experiments

4.1. Dataset

The three real-world datasets include MOSI-SpeechBrain, MOSI-IBM, and MOSI-iFlytek, constructed by Wu et al. [40]. Below is a brief introduction to these datasets:
These datasets are built upon the CMU-MOSI dataset [41], which contains 93 videos collected from YouTube, ranging in length from 2 to 5 min and divided into 2199 short video clips annotated with sentiment scores. The sentiment scores range from −3 (strongly negative) to 3 (strongly positive). However, the text provided in the MOSI dataset was manually transcribed by experts, making it unsuitable for practical applications. We therefore replaced the manually transcribed text with ASR-generated transcripts to enhance the model’s applicability in real-world scenarios. We used the SpeechBrain ASR [42] model to construct the MOSI-SpeechBrain dataset, as well as two commonly used commercial APIs, IBM and iFlytek, to create the MOSI-IBM and MOSI-iFlytek datasets, respectively. Table 1 presents the sentiment word replacement accuracy scores for the experimental datasets using the three ASR models, with a maximum score of 100. In the following descriptions, dataset codes are used instead of dataset names. The MOSI-SpeechBrain dataset scores 73.5, indicating that only about 74 out of every 100 utterances are free of sentiment word errors.

4.2. Implementation Details

We employed identical parameters for the experiments across all three datasets to guarantee fairness. We trained the network using the mean square error (MSE) loss and the Adam optimizer with a learning rate of $5 \times 10^{-5}$. The batch size was set to 64. The hyperparameters $d_{x_t}$, $d_{s_v}$, $d_{s_a}$, and $d_{s_{va}}$ used in training were 768, 32, 64, and 96, respectively. All experiments were performed on an NVIDIA GeForce RTX 3090 GPU (manufactured by NVIDIA Corporation, Santa Clara, CA, USA).
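The training configuration above can be made concrete with the sketch below; the model is a trivial stand-in and the data are random tensors, shown only to illustrate the loss, optimizer, learning rate, and batch size.

```python
# Sketch of the training setup: MSE loss, Adam with lr = 5e-5, batch size 64.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(768 + 32 + 64, 1)          # placeholder for the MSWOHHF network
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)

features = torch.randn(256, 768 + 32 + 64)   # dummy fused features
labels = torch.rand(256, 1) * 6 - 3          # sentiment scores in [-3, 3]
loader = DataLoader(TensorDataset(features, labels), batch_size=64, shuffle=True)

for epoch in range(3):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
```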

4.3. Baseline

We conduct a comparative analysis with the following representative multimodal sentiment analysis models:
MulT [9], which utilizes the Transformer architecture to model the interaction process of multimodal sequences.
MISA [5], which decomposes modalities into modality-invariant and modality-specific representations, performs multimodal fusion on these representations.
Self-MM [10], which is based on a self-supervised learning module for single-modal label generation, explores single-modal supervision.
SWRM [40], which leverages multimodal information to correct text with erroneous sentiment words generated by ASR models.
MMIM [4], which addresses the issues of cross-modal information interaction and multimodal feature fusion effectiveness by using multitask learning to maximize information gain.
TATE [43], which employs a tag encoding technique to assist the Transformer network, covering cases of uncertain missing modalities through supervised joint representation learning.

4.4. Results and Analysis

In this paper, we evaluate our model’s performance on the Ds 1, Ds 2, and Ds 3 datasets using binary accuracy (Acc2), F1 score (F1), mean absolute error (MAE), and Pearson correlation (Corr). We report Acc2 and F1 scores for both the negative/non-negative (Has0-Acc, Has0-F1) and the negative/positive (Non0-Acc, Non0-F1) classification settings. Continuous predictions were converted to sentiment labels for classification. The results of the baseline models are shown in Table 2.
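The evaluation protocol can be sketched as follows; the exact zero-handling convention is an assumption based on common practice for CMU-MOSI-style benchmarks, and the helper name mosi_metrics is illustrative.

```python
# Sketch of metric computation from continuous predictions:
# Has0 keeps zero-labelled samples as non-negative; Non0 drops them.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def mosi_metrics(preds: np.ndarray, labels: np.ndarray) -> dict:
    mae = np.mean(np.abs(preds - labels))
    corr = np.corrcoef(preds, labels)[0, 1]
    # Has0: negative vs. non-negative over all samples.
    has0_acc = accuracy_score(labels >= 0, preds >= 0)
    has0_f1 = f1_score(labels >= 0, preds >= 0, average="weighted")
    # Non0: negative vs. positive, excluding zero labels.
    nz = labels != 0
    non0_acc = accuracy_score(labels[nz] > 0, preds[nz] > 0)
    non0_f1 = f1_score(labels[nz] > 0, preds[nz] > 0, average="weighted")
    return {"Has0-Acc": has0_acc, "Has0-F1": has0_f1,
            "Non0-Acc": non0_acc, "Non0-F1": non0_f1,
            "MAE": mae, "Corr": corr}
```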
Based on the experimental results in Table 2, we can conclude that the MSWOHHF model outperforms several common multimodal models across multiple datasets. Compared to the MulT model, the MSWOHHF shows significant improvements on all the datasets. This is mainly because MulT overlooks the differences in information density between modalities, whereas the MSWOHHF addresses this issue more effectively, particularly excelling in the fusion of visual and audio features. Compared to the Self-MM and MMIM models, the MSWOHHF also demonstrates superior performance on most datasets. While multi-task learning methods can share information across modalities and create synergy, their performance may be impacted when the correlation between certain modalities is weak. The MSWOHHF overcomes this challenge by better handling the heterogeneity between modalities. Additionally, the comparison with the TATE model shows that, when the probability of replacing the sentiment words in the text is low, the performance gap between the two models narrows. However, while TATE has advantages in handling missing modalities, it does not account for the heterogeneity among different modalities, resulting in insufficient fusion and redundant information in the fusion process. Lastly, the comparison between the MSWOHHF and SWRM models indicates that the MSWOHHF promotes more balanced interactions between modalities by more finely addressing the differences in information density, especially excelling in the fusion of both low-level and high-level features.

4.5. Ablation Study

4.5.1. Modal Ablation

Since the three datasets differ only in their textual content, this section performs modality ablation experiments using only the Ds 1 dataset. Table 3 presents the experimental results for the unimodal, bimodal, and trimodal combinations. The results show that the textual modality performs best among the unimodal configurations, trailing the MSWOHHF model by only 3.50% on the Has0-Acc metric. This highlights the decisive role of the textual modality. The trimodal fusion outperforms both the bimodal and unimodal configurations. The performance is slightly lower when only the bimodal information is fused, and it is the worst when the model only includes a unimodal branch. The results indicate that both the consistency and divergence between the modalities are captured through multimodal information fusion, and the accuracy of the sentiment prediction in the model is enhanced by providing the global feature information.

4.5.2. Model Ablation

This section presents an ablation study to evaluate the contribution of each model component, as shown in Table 4. The MSWOHHF represents the complete model proposed in this paper. Here, w/o MSWOM refers to the model without the MSWOM, meaning that no corrections are made for the sentiment words misrecognized by the ASR system. The experimental results show a significant decline in the model performance, although this variant still outperforms the baseline models, demonstrating the effectiveness of the MSWOHHF fusion network. Similarly, w/o TAM refers to the model without the TAM. The experimental results indicate that compensating for the differences in information density between modalities is necessary, as fair information exchange under such conditions yields better results. Further, w/o CAFM and w/o FBAFM refer to the models without the CAFM and the FBAFM, respectively. The experimental results show that both components are essential, as they enhance the interaction between the modalities and thus improve the model’s performance.

5. Conclusions

In this paper, we propose the MSWOHHF framework for multimodal sentiment analysis in real-world scenarios. By utilizing a Multimodal Sentiment Word Optimization Module, the MSWOHHF reduces the sentiment word recognition errors caused by the limitations of the ASR model performance. During the multimodal fusion process, we introduce a heterogeneous hierarchical fusion network to address the variations in information density between the different modalities. Specifically, the Transformer Aggregation Module enhances the semantic representation of the audio and visual modalities, while the Cross-Attention Fusion Module facilitates interactions among the textual, audio, and visual data. Furthermore, the Feature-Based Attention Fusion Module dynamically adjusts each modality’s weights, aiming to improve the effectiveness of the multimodal fusion. Finally, a nonlinear layer predicts the sentiment labels. The experimental results demonstrate that the MSWOHHF model outperforms the other benchmark models on the Ds 1, Ds 2, and Ds 3 datasets, showcasing its robustness in handling various types of multimodal data.
Our current model is trained primarily on datasets composed of English and Western-centric content. This may limit its generalization to multilingual and cross-cultural contexts as sentiment expressions vary. In future work, we plan to enhance the model’s generalizability by incorporating multilingual and cross-cultural training data to capture these variations better and improve its global applicability. Additionally, we acknowledge that the current model may not fully capture long-term temporal dependency features, and enhancing the modeling of these long-term dependencies will be a focus for future improvements.

Author Contributions

Conceptualization, J.H. and H.Z.; methodology, J.H. and H.Z.; software, J.H.; validation, J.H., F.W. and H.Z.; formal analysis, J.H.; investigation, H.Z. and F.W.; resources, J.H. and W.C.; data curation, J.H.; writing—original draft preparation, J.H.; writing—review and editing, H.Z. and F.W.; visualization, J.H.; supervision, H.Z. and F.W.; project administration, H.Z.; funding acquisition, H.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Statistical Theory Proof and Model Solution within the Unified Framework of Support Vector Machines (SVMs), grant number KJQN202302003.

Data Availability Statement

Publicly available datasets were analyzed in this study. The data can be found at https://github.com/huangju1/MWRCMH (accessed on 17 October 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Shi, Q.; Fan, J.; Wang, Z.; Zhang, Z. Multimodal channel-wise attention transformer inspired by multisensory integration mechanisms of the brain. Pattern Recognit. 2022, 130, 108837. [Google Scholar] [CrossRef]
  2. Baltrušaitis, T.; Ahuja, C.; Morency, L.-P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 423–443. [Google Scholar] [CrossRef] [PubMed]
  3. Wu, T.; Peng, J.; Zhang, W.; Zhang, H.; Tan, S.; Yi, F.; Ma, C.; Huang, Y. Video sentiment analysis with bimodal information-augmented multi-head attention. Knowl.-Based Syst. 2022, 235, 107676. [Google Scholar] [CrossRef]
  4. Han, W.; Chen, H.; Poria, S. Improving multimodal fusion with hierarchical mutual information maximization for multimodal sentiment analysis. arXiv 2021, arXiv:2109.00412. [Google Scholar]
  5. Hazarika, D.; Zimmermann, R.; Poria, S. Misa: Modality-invariant and-specific representations for multimodal sentiment analysis. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1122–1131. [Google Scholar]
  6. Liu, Y.; Liu, L.; Guo, Y.; Lew, M.S. Learning visual and textual representations for multimodal matching and classification. Pattern Recognit. 2018, 84, 51–67. [Google Scholar] [CrossRef]
  7. Poria, S.; Cambria, E.; Hazarika, D.; Mazumder, N.; Zadeh, A.; Morency, L.-P. Multi-level multiple attentions for contextual multimodal sentiment analysis. In Proceedings of the 2017 IEEE International Conference on Data Mining (ICDM), New Orleans, LA, USA, 18–21 November 2017; pp. 1033–1038. [Google Scholar]
  8. Tang, J.; Li, K.; Jin, X.; Cichocki, A.; Zhao, Q.; Kong, W. CTFN: Hierarchical learning for multimodal sentiment analysis using coupled-translation fusion network. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Virtual Event, 1–6 August 2021; pp. 5301–5311. [Google Scholar]
  9. Tsai, Y.-H.H.; Bai, S.; Liang, P.P.; Kolter, J.Z.; Morency, L.-P.; Salakhutdinov, R. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the Conference. Association for Computational Linguistics. Meeting, Florence, Italy, 28 July–2 August 2019; p. 6558. [Google Scholar]
  10. Yu, W.; Xu, H.; Yuan, Z.; Wu, J. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; pp. 10790–10797. [Google Scholar]
  11. Chen, M.; Wang, S.; Liang, P.P.; Baltrušaitis, T.; Zadeh, A.; Morency, L.-P. Multimodal sentiment analysis with word-level fusion and reinforcement learning. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, Glasgow, UK, 13–17 November 2017; pp. 163–171. [Google Scholar]
  12. Wu, Y.; Lin, Z.; Zhao, Y.; Qin, B.; Zhu, L.-N. A text-centered shared-private framework via cross-modal prediction for multimodal sentiment analysis. In Proceedings of the Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online Event, 1–6 August 2021; pp. 4730–4738. [Google Scholar]
  13. Prabowo, R.; Thelwall, M. Sentiment analysis: A combined approach. J. Informetr. 2009, 3, 143–157. [Google Scholar] [CrossRef]
  14. Goldberg, Y.; Levy, O. word2vec Explained: Deriving Mikolov et al.’s negative-sampling word-embedding method. arXiv 2014, arXiv:1402.3722. [Google Scholar]
  15. Yang, H.-J.; Lee, G.-S.; Kim, S.-H. End-to-end learning for multimodal emotion recognition in video with adaptive loss. IEEE Multimed. 2021, 28, 59–66. [Google Scholar]
  16. Pennington, J.; Socher, R.; Manning, C.D. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543. [Google Scholar]
  17. Peters, M.; Neumann, M.; Iyyer, M.; Gardner, M.; Zettlemoyer, L. Deep Contextualized Word Representations. arXiv 2018, arXiv:1802.05365. [Google Scholar]
  18. Devlin, J. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  19. Chen, F.; Sun, Z.; Ouyang, D.; Liu, X.; Shao, J. Learning what and when to drop: Adaptive multimodal and contextual dynamics for emotion recognition in conversation. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event, 20–24 October 2021; pp. 1064–1073. [Google Scholar]
  20. Baltrušaitis, T.; Robinson, P.; Morency, L.-P. Openface: An open source facial behavior analysis toolkit. In Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, USA, 7–10 March 2016; pp. 1–10. [Google Scholar]
  21. Chen, R.; Zhou, W.; Li, Y.; Zhou, H. Video-based cross-modal auxiliary network for multimodal sentiment analysis. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 8703–8716. [Google Scholar] [CrossRef]
  22. Ma, Y.; Hao, Y.; Chen, M.; Chen, J.; Lu, P.; Košir, A. Audio-visual emotion fusion (AVEF): A deep efficient weighted approach. Inf. Fusion 2019, 46, 184–192. [Google Scholar] [CrossRef]
  23. Tsai, Y.-H.H.; Liang, P.P.; Zadeh, A.; Morency, L.-P.; Salakhutdinov, R. Learning factorized multimodal representations. arXiv 2018, arXiv:1806.06176. [Google Scholar]
  24. Zhu, T.; Li, L.; Yang, J.; Zhao, S.; Liu, H.; Qian, J. Multimodal sentiment analysis with image-text interaction network. IEEE Trans. Multimed. 2022, 25, 3375–3385. [Google Scholar] [CrossRef]
  25. Xu, N.; Mao, W. Multisentinet: A deep semantic network for multimodal sentiment analysis. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, Singapore, 6–10 November 2017; pp. 2399–2402. [Google Scholar]
  26. Mazloom, M.; Rietveld, R.; Rudinac, S.; Worring, M.; Van Dolen, W. Multimodal popularity prediction of brand-related social media posts. In Proceedings of the 24th ACM international conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; pp. 197–201. [Google Scholar]
  27. Pérez-Rosas, V.; Mihalcea, R.; Morency, L.P. Utterance-Level Multimodal Sentiment Analysis. In Proceedings of the Association for Computational Linguistics. ACL, Sofia, Bulgaria, 4–9 August 2013. [Google Scholar]
  28. Atrey, P.K.; Hossain, M.A.; El Saddik, A.; Kankanhalli, M.S. Multimodal fusion for multimedia analysis: A survey. Multimed. Syst. 2010, 16, 345–379. [Google Scholar] [CrossRef]
  29. Zadeh, A.; Chen, M.; Poria, S.; Cambria, E.; Morency, L.-P. Tensor fusion network for multimodal sentiment analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 9–11 September 2017. [Google Scholar]
  30. Liu, Z.; Shen, Y.; Lakshminarasimhan, V.B.; Liang, P.P.; Zadeh, A.; Morency, L.-P. Efficient low-rank multimodal fusion with modality-specific factors. arXiv 2018, arXiv:1806.00064. [Google Scholar]
  31. Yang, X.; Molchanov, P.; Kautz, J. Multilayer and multimodal fusion of deep neural networks for video classification. In Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; pp. 978–987. [Google Scholar]
  32. Agarwal, A.; Yadav, A.; Vishwakarma, D.K. Multimodal sentiment analysis via RNN variants. In Proceedings of the 2019 IEEE International Conference on Big Data, Cloud Computing, Data Science & Engineering (BCD), Honolulu, HI, USA, 29–31 May 2019; pp. 19–23. [Google Scholar]
  33. Poria, S.; Cambria, E.; Hazarika, D.; Majumder, N.; Zadeh, A.; Morency, L.-P. Context-dependent sentiment analysis in user-generated videos. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, BC, Canada, 30 July–4 August 2017; pp. 873–883. [Google Scholar]
  34. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  35. Fu, Z.; Liu, F.; Xu, Q.; Qi, J.; Fu, X.; Zhou, A.; Li, Z. NHFNET: A non-homogeneous fusion network for multimodal sentiment analysis. In Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 18–22 July 2022; pp. 1–6. [Google Scholar]
  36. Huddar, M.G.; Sannakki, S.S.; Rajpurohit, V.S. Multi-level context extraction and attention-based contextual inter-modal fusion for multimodal sentiment analysis and emotion classification. Int. J. Multimed. Inf. Retr. 2020, 9, 103–112. [Google Scholar] [CrossRef]
  37. Pham, H.; Liang, P.P.; Manzini, T.; Morency, L.-P.; Póczos, B. Found in translation: Learning robust joint representations by cyclic translations between modalities. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 6892–6899. [Google Scholar]
  38. Lei, Y.; Yang, D.; Li, M.; Wang, S.; Chen, J.; Zhang, L. Text-oriented modality reinforcement network for multimodal sentiment analysis from unaligned multimodal sequences. In Proceedings of the CAAI International Conference on Artificial Intelligence, Fuzhou, China, 22–23 July 2023; pp. 189–200. [Google Scholar]
  39. Zhu, C.; Chen, M.; Zhang, S.; Sun, C.; Liang, H.; Liu, Y.; Chen, J. SKEAFN: Sentiment Knowledge Enhanced Attention Fusion Network for multimodal sentiment analysis. Inf. Fusion 2023, 100. [Google Scholar] [CrossRef]
  40. Wu, Y.; Zhao, Y.; Yang, H.; Chen, S.; Qin, B.; Cao, X.; Zhao, W. Sentiment word aware multimodal refinement for multimodal sentiment analysis with ASR errors. arXiv 2022, arXiv:2203.00257. [Google Scholar]
  41. Zadeh, A.; Zellers, R.; Pincus, E.; Morency, L.-P. Mosi: Multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. arXiv 2016, arXiv:1606.06259. [Google Scholar]
  42. Ravanelli, M.; Parcollet, T.; Plantinga, P.; Rouhe, A.; Cornell, S.; Lugosch, L.; Subakan, C.; Dawalatabad, N.; Heba, A.; Zhong, J.; et al. SpeechBrain: A general-purpose speech toolkit. arXiv 2021, arXiv:2106.04624. [Google Scholar]
  43. Zeng, J.; Liu, T.; Zhou, J. Tag-assisted multimodal sentiment analysis under uncertain missing modalities. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, 11–15 July 2022; pp. 1545–1554. [Google Scholar]
Figure 1. The architecture of MSWOHHF.
Figure 2. The module of MSWOM.
Table 1. Sentiment word replacement accuracy scores in the experimental datasets.

No.     Dataset             Score
Ds 1    MOSI-SpeechBrain    73.5
Ds 2    MOSI-IBM            82.4
Ds 3    MOSI-iFlytek        89.4
Table 2. Performance comparison with baseline models. An upward arrow (↑) signifies that higher values indicate better model performance; a downward arrow (↓) denotes that lower values represent better performance.

Dataset   Model      Has0-Acc ↑   Has0-F1 ↑   Non0-Acc ↑   Non0-F1 ↑   MAE ↓    Corr ↑
Ds 1      MulT       71.78        71.70       72.74        72.75       109.00   54.69
          Self-MM    73.67        73.72       74.85        74.98       90.95    67.23
          MMIM       73.81        73.93       75.02        75.11       90.83    67.43
          TATE       74.41        74.48       75.63        75.69       90.48    67.52
          SWRM       74.58        74.62       75.70        75.82       90.56    67.47
          Ours       76.82        76.74       78.20        78.18       87.23    68.35
Ds 2      MulT       75.57        75.54       76.74        76.79       100.32   64.34
          Self-MM    77.32        77.37       78.60        78.72       85.65    73.23
          MMIM       78.28        78.30       79.02        79.08       83.04    73.68
          TATE       78.51        78.63       79.64        79.72       83.22    73.84
          SWRM       78.43        78.47       79.70        79.80       82.91    73.91
          Ours       80.52        80.63       82.05        82.31       79.12    75.78
Ds 3      MulT       77.32        77.05       78.75        78.56       89.84    68.14
          Self-MM    80.26        80.26       81.16        81.20       78.79    75.83
          MMIM       79.24        79.36       79.89        80.08       78.50    75.62
          TATE       81.32        81.38       82.06        82.10       76.98    76.60
          SWRM       80.47        80.47       81.28        81.34       78.39    75.97
          Ours       82.31        82.25       83.73        83.78       74.74    76.89
Bolded values indicate the optimal performance.
Table 3. Ablation experiments from the modality perspective on the Ds 1 dataset.

Task                      Has0-Acc ↑   Non0-Acc ↑   MAE ↓
Textual                   73.32        75.09        94.22
Audio                     44.75        45.10        144.32
Visual                    55.24        56.30        140.30
Textual, Visual           73.92        75.36        91.43
Textual, Audio            73.56        74.70        92.46
Visual, Audio             58.23        59.86        136.75
Textual, Visual, Audio    74.20        75.12        89.10
MSWOHHF                   76.82        78.20        87.23
Table 4. Performance comparison from the model perspective.

Dataset   Model        Has0-Acc ↑   Non0-Acc ↑   MAE ↓
Ds 1      MSWOHHF      76.82        78.20        87.23
          w/o MSWOM    75.88        77.49        89.52
          w/o TAM      74.46        75.87        90.23
          w/o CAFM     75.64        77.56        88.80
          w/o FBAFM    75.37        76.62        89.69
Ds 2      MSWOHHF      80.52        82.05        79.12
          w/o MSWOM    79.72        81.16        80.76
          w/o TAM      78.23        79.88        82.02
          w/o CAFM     80.12        81.64        80.48
          w/o FBAFM    78.82        80.86        81.72
Ds 3      MSWOHHF      82.31        83.73        74.74
          w/o MSWOM    81.02        81.94        76.12
          w/o TAM      80.46        81.22        77.13
          w/o CAFM     81.56        82.74        75.98
          w/o FBAFM    80.32        81.40        77.32