1. Introduction
With the development of social media, netizens have begun to make comments and express views using diverse modalities beyond text, such as audio and video. Analyzing the emotional information embedded in these multimodal messages has become crucial for market analysis, preference analysis, and other related fields. Consequently, multimodal sentiment analysis (MSA) has gained growing attention in recent years [1,2,3].
Compared to the conventional text-specific sentiment analysis tasks, MSA incorporates two or more modalities as inputs. MSA achieves superior accuracy in sentiment prediction by integrating various pieces of modal information, including natural language, facial expressions, and voice intonation.
Figure 1 illustrates that leveraging multiple modalities provides significant advantages over relying solely on data from a single modality for sentiment analysis.
The five main challenges of multimodal tasks are representation, translation, alignment, fusion, and colearning [4]. Multimodal fusion is one of the most important topics in multimodal learning, which can be categorized into three types: early, late, and hybrid fusion [5]. A significant research focus in multimodal fusion is how to extract effective complementary information from multiple modalities and integrate it into a fused representation. Zadeh et al. [6] introduced the tensor fusion network (TFN), which captured interactions within and between modalities by computing tensor cross-products of modalities. Liu et al. [7] proposed a low-rank multimodal fusion approach, leveraging low-rank tensors to reduce the computational complexity of tensor methods while performing multimodal fusion. Hu et al. [8] proposed a multimodal sentiment knowledge-sharing framework (UniMSE), which fused modalities at the syntactic and semantic levels and incorporated contrastive learning to capture the consistency and differences between sentiments and emotions.
Based on a review of previous works, it is observed that most studies employ simple traditional neural networks to extract modal vectors, which are then directly used as inputs for the representation fusion module. These works primarily emphasize representation fusion while potentially ignoring the significance of learning modal representations. The fusion of representations can sometimes suppress the predictive power of individual modalities, despite the fact that each piece of unimodal data contains rich sentiment information. Therefore, leveraging the hidden information in these heterogeneous data sources can contribute to training an effective model and enhancing the prediction accuracy. Tsai et al. [9] proposed a joint generative–discriminative objective optimization method, which decomposed the representation into multimodal discriminative factors and modality-specific generative factors. The former were employed for sentiment classification, while the latter facilitated the learning of modality-specific generative features. Sun et al. [10] introduced the interaction canonical correlation network (ICCN), which learned multimodal embeddings by capturing hidden correlations among different modalities. For multimodal data where random modality data may be missing, Sun et al. [11] proposed the efficient multimodal transformer with dual-level feature restoration (EMT-DLFR) framework to enhance the robustness of models in such scenarios. EMT utilized utterance-level representations of each modality as a global multimodal context that interacted with local unimodal features, thereby encouraging the model to learn semantic information from incomplete data.
In this paper, we introduce a model named MCM (multitask learning and contrastive learning for multimodal sentiment analysis) to assist in learning modal representations. Our proposed model comprises two key components: a multitask learning module and a contrastive learning module. Given the diverse nature of multimodal data, it presents an excellent opportunity for leveraging multitask learning. To exploit this potential, we design subtasks specific to different modal representations, aiming to effectively extract the underlying emotional information. While multitask learning primarily serves as a method for unimodal representation learning, we extend our investigation to the learning of fusion representations. We achieve this by incorporating contrastive learning to constrain the fusion prediction, enabling the assisted learning of fusion representations. This additional constraint allows for a more comprehensive learning of the multimodal representations.
In previous studies, multitask learning was applied to multimodal sentiment analysis tasks, and a common characteristic of these studies was that they only utilized unimodal data as subtasks for auxiliary learning [12]. However, we believe that such a design tends to focus excessively on modeling within each modality while overlooking the modeling of interactions between modalities. The key distinction between multimodal and unimodal tasks lies in the interactions among different modalities. Therefore, in addition to the subtasks targeting unimodal data, we propose incorporating subtasks specifically designed for bimodal representations generated by a gating mechanism. This design allows our model to simultaneously consider both intramodal and intermodal interactions, fully harnessing the advantages of multimodal tasks.
Contrastive learning was initially used in unsupervised learning to learn sentence embeddings from unlabeled data and was later gradually extended to supervised data [13]. Previous research has demonstrated that dropout can lead to inconsistencies between the training and inference stages [14,15], which can have a detrimental effect on the final multimodal sentiment polarity prediction. Therefore, the randomness of the dropout mechanism can be exploited to maintain output consistency by constraining multiple prediction results, thereby enhancing the overall performance of the model.
For the multitask module in MCM, the main task focuses on predicting the sentiment polarity of the fusion representation, while the subtasks involve predicting the sentiment polarity of the unimodal representations extracted by traditional networks and of the bimodal representations generated through a gating unit. By jointly training these tasks, we aim to capture the sentiment information hidden in the vectors. To enhance the performance of multitask learning, we propose a strategy that dynamically acquires task weights based on the homoscedastic uncertainty [16] present in the data. This approach replaces the conventional manual weight-setting method, improving the predictive results and reducing computation time. For the contrastive learning module in MCM, we mitigate the training–inference inconsistency caused by the randomness of the dropout mechanism by constraining the consistency of the outputs of the two submodels.
The contributions of our work can be summarized as follows:
We propose a dynamic weighted multitask learning method to facilitate the learning of hidden emotional information within modal representations. By assigning dynamic weights based on homoscedastic uncertainty, our approach enhances the effectiveness of multitask learning.
Our method incorporates a contrastive learning module, which ensures the consistency between training and inference by constraining the training of the model. This module optimizes the training process and improves the overall performance of the model.
Experimental results on two widely used datasets, MOSI and MOSEI, demonstrate the superiority of our method compared to current approaches in the field of multimodal sentiment analysis. Our approach achieves comprehensive representation learning under the consideration of both intramodal and intermodal interactions, resulting in improved performance.
3. Methodology
In this section, we provide an overview of the MCM model. We first introduce the general structure of MCM, followed by a detailed description of the dynamic weighted multitask learning module and the contrastive learning module incorporated within MCM.
3.1. Overall Architecture
The architecture of the proposed MCM model is illustrated in Figure 3. Our model comprises three sets of tasks, with multimodal sentiment analysis as the main task and unimodal and bimodal sentiment analysis as subtasks.
Unimodal Sentiment Analysis. The original representations of text, audio, and video are denoted as $u_t$, $u_a$, and $u_v$, respectively, with $u_m \in \mathbb{R}^{d_m}$ for $m \in \{t, a, v\}$, where $d_m$ denotes the dimension of the representation and $\theta_m$ denotes the trainable parameters in the corresponding network. For the text modality, we extracted features using a pretrained BERT model. The vector $u_t$, which corresponds to the embedding of the first token in the output of the last layer of BERT, was taken as the representation of the whole sentence.
For the audio and video modalities, we extracted features using a bidirectional LSTM network [33] to capture the sequential characteristics of the audio and video data.
The final prediction part consisted of two linear transformations with a ReLU activation function in between:

$\hat{y}_m = W_2^m \, \mathrm{ReLU}\left(W_1^m u_m + b_1^m\right) + b_2^m, \quad m \in \{t, a, v\},$

where $W$ denotes the weights in the linear layers, $b$ denotes the bias parameters, and $\hat{y}_m$ is the prediction result of the unimodal task. We took these three unimodal prediction tasks as the first group of subtasks. The upstream network and parameters for learning unimodal representations were shared with the main task.
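To make the unimodal branches concrete, the following is a minimal PyTorch sketch of how they could be implemented. The class names, hidden sizes, and the BERT checkpoint are illustrative assumptions, not the released MCM implementation.

```python
# Hypothetical sketch of the unimodal branches (names and sizes are assumptions).
import torch
import torch.nn as nn
from transformers import BertModel


class UnimodalHead(nn.Module):
    """Two linear layers with a ReLU in between, as described for the prediction part."""
    def __init__(self, in_dim, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1)
        )

    def forward(self, x):
        return self.net(x)


class TextEncoder(nn.Module):
    """BERT encoder; the first-token embedding of the last layer represents the sentence."""
    def __init__(self, ckpt="bert-base-uncased"):  # checkpoint name is an assumption
        super().__init__()
        self.bert = BertModel.from_pretrained(ckpt)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return out.last_hidden_state[:, 0]          # (batch, 768) first-token vector


class SeqEncoder(nn.Module):
    """Bidirectional LSTM over audio/video frame features; final states are concatenated."""
    def __init__(self, feat_dim, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, x):                            # x: (batch, seq_len, feat_dim)
        _, (h_n, _) = self.lstm(x)                   # h_n: (2, batch, hidden_dim)
        return torch.cat([h_n[0], h_n[1]], dim=-1)   # (batch, 2 * hidden_dim)
```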
Bimodal Sentiment Analysis. The purpose of the gating mechanism is to generate an intermediate representation by combining data from different modalities [34]. To achieve this, we designed a bimodal gated module that learned bimodal representations with dimension $h$ by integrating the information from two unimodal representations. The module structure is shown in Figure 4.
Specifically, we first combined the three unimodal representations pairwise as $(u_i, u_j)$, where $(i, j) \in \{(t, a), (t, v), (a, v)\}$. These combined representations were then taken as the input for linear layers with a tanh activation function:

$h_i = \tanh\left(W_i u_i + b_i\right), \qquad h_j = \tanh\left(W_j u_j + b_j\right),$

where $W_i, W_j$ and $b_i, b_j$ are the weights and biases of the corresponding linear layers. We concatenated the two representations $h_i$ and $h_j$. Then, $z_{ij}$, a weight controlling the contribution of the two unimodal inputs, was calculated by a linear transformation and a ReLU activation function:

$z_{ij} = \mathrm{ReLU}\left(W_z \left[h_i; h_j\right] + b_z\right),$

where $W_z \in \mathbb{R}^{h \times 2h}$. Finally, the bimodal representation $u_{ij}$ was calculated by a weighted sum:

$u_{ij} = z_{ij} \odot h_i + \left(1 - z_{ij}\right) \odot h_j.$

The final prediction part consists of two linear transformations and a ReLU activation function:

$\hat{y}_{ij} = W_2^{ij} \, \mathrm{ReLU}\left(W_1^{ij} u_{ij} + b_1^{ij}\right) + b_2^{ij},$

where $W$ and $b$ again denote the weights and biases of the linear layers, and $\hat{y}_{ij}$ denotes the prediction result of the bimodal task. These three bimodal prediction tasks were taken as the second group of subtasks. Similar to unimodal sentiment analysis, the upstream layers used to learn the bimodal representations were shared with the main task.
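The gated fusion described above can be sketched as follows. The exact gating equation is our reading of the description (tanh projections, a ReLU gate, and a weighted sum), so the module below should be treated as an assumption-laden illustration rather than the authors' code.

```python
import torch
import torch.nn as nn


class BimodalGate(nn.Module):
    """Gated fusion of two unimodal representations into a bimodal vector of size h.

    The gate z weighs the contribution of each projected input; the exact form is an
    assumption consistent with the paper's description, not the released implementation.
    """
    def __init__(self, dim_i, dim_j, h=512):
        super().__init__()
        self.proj_i = nn.Linear(dim_i, h)
        self.proj_j = nn.Linear(dim_j, h)
        self.gate = nn.Linear(2 * h, h)

    def forward(self, u_i, u_j):
        h_i = torch.tanh(self.proj_i(u_i))             # project each modality to dimension h
        h_j = torch.tanh(self.proj_j(u_j))
        z = torch.relu(self.gate(torch.cat([h_i, h_j], dim=-1)))  # contribution weight
        return z * h_i + (1.0 - z) * h_j               # weighted sum -> bimodal representation
```

In MCM, three such gates would produce the text–audio, text–video, and audio–video representations, each feeding its own two-layer prediction head.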
Multimodal Sentiment Analysis. As the main task of MCM, this task combines the unimodal and bimodal representations and utilizes the fusion representations as input for the multimodal sentiment prediction network. In addition to the representation learning layers shared with unimodal and bimodal sentiment prediction, this task includes a task-specific sentiment polarity prediction layer to get the final results.
In the first stage, the multimodal fusion representation $u_f$ was constructed by concatenating the three unimodal and three bimodal representations:

$u_f = \left[u_t; u_a; u_v; u_{ta}; u_{tv}; u_{av}\right].$

In the second stage, the multimodal prediction result was derived through a linear regression:

$\hat{y}_f = W_f u_f + b_f,$

where $W_f$ and $b_f$ are the weight and bias of the regression layer, and $\hat{y}_f$ denotes the prediction result of the multimodal task. It represents the final sentiment analysis prediction result of MCM.
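A minimal sketch of this fusion stage follows, assuming the concatenation order above and a single linear regression layer; the class name and dimensions are hypothetical.

```python
import torch
import torch.nn as nn


class FusionHead(nn.Module):
    """Hypothetical fusion head: concatenate three unimodal and three bimodal vectors,
    then predict sentiment intensity with a single linear regression layer."""
    def __init__(self, uni_dims, h=512):
        super().__init__()
        self.reg = nn.Linear(sum(uni_dims) + 3 * h, 1)

    def forward(self, unimodal, bimodal):             # lists of (batch, dim) tensors
        u_f = torch.cat(unimodal + bimodal, dim=-1)   # multimodal fusion representation
        return self.reg(u_f)                          # final sentiment prediction
```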
3.2. Multitask Learning Module
In this paper, the independent sentiment polarity predictions of the three unimodal representations $u_t, u_a, u_v$ and the three bimodal representations $u_{ta}, u_{tv}, u_{av}$ were considered as two groups of subtasks. The prediction of the multimodal fusion representation $u_f$ was taken as the main task. Through the joint training of multiple tasks, the shared layers between the main task and the subtasks were trained simultaneously. The sharing graph between the different tasks in MCM is shown in Figure 5. From the graph, we can intuitively observe that the shared layers between the unimodal subtasks and the main task were designed to capture the emotional information present in the unimodal data and generate unimodal representations. This design enabled the model to capture underlying sentiment information within the unimodal data, thereby enhancing the effectiveness of the generated unimodal representations.
However, relying solely on unimodal subtasks may cause certain limitations. In the case of multimodal tasks, it is crucial to consider both intramodal and intermodal interactions. Training auxiliary tasks for individual modalities primarily focuses on capturing intramodal interactions, neglecting intermodal interactions. Consequently, the unimodal representations obtained through this approach may exhibit stronger independent predictive capabilities, but they may not be optimal for subsequent modal fusion stages. To address this, we introduced bimodal prediction subtasks. The bimodal representations derived from the gating unit could effectively capture intermodal information, thereby compensating for the limited learning of intermodal interactions in the unimodal subtasks. Thus, the bimodal subtasks facilitated the learning of intermodal interaction information, serving as valuable support for multimodal tasks. Additionally, they acted as constraints on the unimodal subtasks, preventing the learned unimodal representations from deviating too far from the requirements of the multimodal task.
The loss of each task was calculated by the mean squared error (MSE). The simple loss function of multitask learning was calculated as follows:

$\mathcal{L} = \sum_{k} \alpha_k \, \frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_k^{\,i} - y^{\,i} \right)^2, \quad k \in \{t, a, v, ta, tv, av, f\},$

where $\alpha_k$ represents the weight coefficient for each task and is a hyperparameter, $\hat{y}_k$ represents the predicted sentiment intensity score for each task, and $y$ represents the sentiment intensity label. The weight coefficients $\alpha_k$ described above are usually set manually, which is inaccurate and time-consuming. Thus, we proposed a method to weigh the loss function by considering the homoscedastic uncertainty of each task.
There are two types of uncertainty commonly observed in deep learning: epistemic uncertainty and aleatoric uncertainty [35]. Epistemic uncertainty arises from a lack of training samples, while aleatoric uncertainty arises from unexplained information in the training data and can be further categorized into data-dependent heteroscedastic uncertainty and task-dependent homoscedastic uncertainty [36]. The former depends on the input data, while the latter depends on the task. Both types of uncertainty can be captured using Bayesian deep learning methods [37]. In this paper, we addressed the task-dependent homoscedastic uncertainty by incorporating a dynamic weighted multitask learning loss function based on a Bayesian neural network. This approach allowed us to effectively account for the uncertainty associated with different tasks and optimize the model accordingly.
Define $f^{w}(x)$ as the final output of the neural network when the input is $x$ with weights $w$. For regression tasks, we defined a Gaussian likelihood with a noise scalar $\sigma$:

$p\left( y \mid f^{w}(x) \right) = \mathcal{N}\left( f^{w}(x), \sigma^2 \right).$

For a multitask learning model with multiple outputs, we defined $f^{w}(x)$ as the sufficient statistics. For $k$ outputs $y_1, \ldots, y_k$, the multitask likelihood was defined as follows:

$p\left( y_1, \ldots, y_k \mid f^{w}(x) \right) = p\left( y_1 \mid f^{w}(x) \right) \cdots p\left( y_k \mid f^{w}(x) \right).$

The negative log likelihood was calculated as follows:

$-\log p\left( y_1, \ldots, y_k \mid f^{w}(x) \right) \propto \sum_{m=1}^{k} \left( \frac{1}{2\sigma_m^2} \left\| y_m - f_m^{w}(x) \right\|^2 + \log \sigma_m \right).$

The optimization objective, which served as the loss function for multitask learning, was defined based on the maximum likelihood estimate. Therefore, the multitask learning loss function in this paper was modified as follows:

$\mathcal{L}_{\mathrm{multi}}\left( w, \sigma_1, \ldots, \sigma_k \right) = \sum_{m=1}^{k} \left( \frac{1}{2\sigma_m^2} \left\| y - \hat{y}_m \right\|^2 + \log \sigma_m \right),$

where $\sigma_m$ is a dynamic adaptive parameter, $\hat{y}_m = f_m^{w}(x)$ is the output of task $m$, with $m \in \{t, a, v, ta, tv, av, f\}$, for input $x$, and $y$ is the ground truth label. The ground truth labels used by all the tasks in this paper were the multimodal sentiment intensity labels provided by the dataset.
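In practice, this weighting can be implemented by learning one log-variance parameter per task jointly with the network, as in the sketch below. The task count and the log-variance parameterization are common practice for this kind of loss and are assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn


class UncertaintyWeightedLoss(nn.Module):
    """Homoscedastic-uncertainty task weighting for regression tasks.

    One log-variance parameter per task is learned jointly with the network, so the
    per-task weights 1 / (2 * sigma_k^2) adapt during training instead of being hand-tuned.
    The number of tasks (3 unimodal + 3 bimodal + 1 fusion) is an assumption.
    """
    def __init__(self, num_tasks=7):
        super().__init__()
        # log(sigma_k^2) is parameterized directly for numerical stability.
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, preds, target):
        # preds: list of (batch, 1) task predictions; target: (batch, 1) sentiment labels.
        loss = 0.0
        for k, y_hat in enumerate(preds):
            precision = torch.exp(-self.log_vars[k])          # 1 / sigma_k^2
            mse = torch.mean((y_hat - target) ** 2)
            loss = loss + 0.5 * precision * mse + 0.5 * self.log_vars[k]
        return loss
```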
3.3. Contrastive Learning Module
For the task of predicting the sentiment polarity of the fusion representation, the multimodal fusion vector $u_f$ was passed through the network twice to obtain two sets of prediction results. We then computed the mean squared error (MSE) between these two sets of results, which served as the loss function of the contrastive learning module:

$\mathcal{L}_{\mathrm{cl}} = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_f^{(1),\,i} - \hat{y}_f^{(2),\,i} \right)^2,$

where $N$ is the size of the dataset, and $\hat{y}_f^{(1)}$ and $\hat{y}_f^{(2)}$ represent the two predicted sentiment intensity scores obtained after applying dropout twice. The MSE between the outputs of the two submodels was used as the optimization objective, aiming to minimize the discrepancy between the two predictions and encourage the model to generate consistent and reliable results. The contrastive learning module played a crucial role in enhancing the alignment and coherence of the predictions, thereby improving the overall performance of the multimodal sentiment analysis task.
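A minimal sketch of this consistency term is shown below, assuming a prediction network that contains dropout and is kept in training mode so that the two forward passes use different dropout masks.

```python
import torch.nn.functional as F


def consistency_loss(predict_net, u_f):
    """Pass the fused representation through the (dropout-containing) prediction
    network twice and penalize the squared difference of the two outputs.

    `predict_net` is assumed to apply dropout internally and to be in training mode,
    so each call samples a different dropout mask for the same input `u_f`.
    """
    y1 = predict_net(u_f)   # first stochastic forward pass
    y2 = predict_net(u_f)   # second pass: different dropout mask, same input
    return F.mse_loss(y1, y2)
```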
3.4. Optimization Objectives
By incorporating both multitask learning and contrastive learning into the training objective, the final loss function was defined as follows:

$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{multi}} + \gamma \, \mathcal{L}_{\mathrm{cl}},$

where $\gamma$ represents a hyperparameter that balances the two terms.
4. Experiments
4.1. Dataset
In this paper, our model was evaluated on two open multimodal sentiment analysis datasets, MOSI [38] and MOSEI [39].
MOSI. CMU-MOSI consists of 2199 short videos edited from 93 monologue movie commentary videos available on YouTube. The dataset is divided into 1284 training samples, 229 validation samples, and 686 test samples. Each sample is manually annotated by human annotators with a sentiment score ranging from −3 to 3. A higher score indicates a stronger positive emotion, while a lower score indicates a stronger negative emotion.
MOSEI. CMU-MOSEI consists of 23,453 annotated sentences collected from videos featuring over 1000 online speakers discussing 250 different topics on YouTube. The dataset is divided into 16,326 training samples, 1871 validation samples, and 4659 test samples. Similar to MOSI, each sample in CMU-MOSEI is annotated with a sentiment score ranging from −3 to 3, representing the intensity of the sentiment expressed in the sentence.
4.2. Feature Extraction
Text. The raw text data were obtained by manually transcribing the utterances from the video sources. To extract sentence-level text features, we utilized the pretrained BERT model, motivated by the fact that BERT has undergone extensive pretraining on a large corpus and has demonstrated excellent feature-capturing and language representation capabilities. BERT’s pretraining involves two tasks, namely masked language modeling (MLM) and next sentence prediction (NSP). The pretrained model generates text features with a dimension of 768 for both datasets under consideration.
Audio. More than 32 audio features, including NAQ (normalized amplitude quotient), MFCCs (Mel-frequency cepstral coefficients), peak slope, and energy slope, were extracted using the COVAREP toolkit [40]. The dimension of the audio features was 5 for the MOSI dataset and 74 for the MOSEI dataset. These features provided valuable information about the acoustic characteristics of the utterances, enabling the model to capture audio-related cues for sentiment analysis.
Video. The video features were extracted using Facet, a tool that captures facial expression features for each frame based on the facial action coding system. These features include 16 facial action units, 68 facial landmarks, head pose and orientation, six basic emotions, and eye gaze [41,42]. The dimension of the video features was 20 for the MOSI dataset and 35 for the MOSEI dataset. These features, which captured facial expressions associated with emotions, played a crucial role in enabling the model to perform sentiment analysis effectively.
4.3. Baselines
We compared the performance of MCM with the following baseline methods.
TFN (tensor fusion network) obtains unimodal, bimodal, and trimodal interaction information by calculating the outer product of the multimodal tensor.
LMF (low-rank multimodal fusion) is an improvement of the TFN that uses a low-rank tensor for the multimodal fusion, reducing the computational complexity of tensor-based methods.
MFN (memory fusion network) is a multiview sequential gated memory network that models view-specific and cross-view interactions.
RAVEN (recurrent attended variation embedding network) is a method that assists in learning word embeddings by modeling the fine-grained structure of nonverbal modalities.
MFM (multimodal factorization model) learns modality-specific generative features and discriminative features for classification by decomposing representations into generative and discriminative factors.
MulT (multimodal transformer) proposes a multimodal transformer structure that captures the interactions between different multimodal sequences by using bidirectional cross-modal attention.
Self-MM jointly learns unimodal and multimodal tasks to capture the consistency and differences between different modal representations.
4.4. Basic Settings
Experimental Design. We used Adam as the optimizer with learning rates of . The hyperparameter was set to one. The dimension $h$ of the three bimodal representations was unified as 512. To ensure the robustness of our results, we ran our model five times with different random seeds within for each task. The final result was obtained by averaging the outcomes of these five runs.
Evaluation Metrics. In line with previous works, we employed four evaluation metrics to assess the effectiveness of our method. Specifically, we utilized Acc-2 (binary classification accuracy) and the F1 score to evaluate the classification performance. The Acc-2 metric measures the percentage of correctly classified samples out of the total number of samples and is evaluated in two ways: negative/non-negative [43] and negative/positive [18]. The former considers zero as a negative sentiment intensity, while the latter excludes zero from the classification, focusing solely on nonzero sentiment intensities. The F1 score is calculated in the same two ways, and its calculation formula is as follows:

$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$

Precision refers to the percentage of samples predicted as “positive” that are truly “positive”, while recall refers to the percentage of truly “positive” samples that are predicted as “positive”. The F1 score combines precision and recall and is particularly useful for evaluating datasets with imbalanced samples, providing a balanced measure of the model’s performance.
In addition, we used the MAE (mean absolute error) and Corr (Pearson correlation coefficient) to assess the regression performance. The MAE measures the average absolute difference between the predicted values and the true values. It is calculated as follows:

$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{y}_i - y_i \right|,$

where $N$ is the number of samples, $\hat{y}_i$ represents the predicted value, and $y_i$ represents the true value. Corr is a metric used to measure the degree of similarity between the predicted results and the true labels. It is calculated using the Pearson correlation coefficient, which is defined as follows:

$\mathrm{Corr} = \frac{\sum_{i=1}^{N} \left( X_i - \bar{X} \right) \left( Y_i - \bar{Y} \right)}{\sqrt{\sum_{i=1}^{N} \left( X_i - \bar{X} \right)^2} \, \sqrt{\sum_{i=1}^{N} \left( Y_i - \bar{Y} \right)^2}},$

where $X$ represents the predicted results and $Y$ represents the true labels in our method.
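For illustration, these metrics can be computed as in the following sketch. The binarization shown implements only the negative/positive variant (zero-labeled samples excluded), and the array inputs and function name are assumptions for this example.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score


def evaluate(preds, labels):
    """Illustrative metric computation; `preds` and `labels` are arrays of
    continuous sentiment scores in [-3, 3]."""
    preds = np.asarray(preds, dtype=float)
    labels = np.asarray(labels, dtype=float)

    # Regression metrics: mean absolute error and Pearson correlation.
    mae = np.mean(np.abs(preds - labels))
    corr = np.corrcoef(preds, labels)[0, 1]

    # Negative/positive Acc-2 and F1: zero-labeled samples are excluded,
    # and the remaining scores are binarized by their sign.
    nonzero = labels != 0
    y_true = labels[nonzero] > 0
    y_pred = preds[nonzero] > 0
    acc2 = accuracy_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)

    return {"MAE": mae, "Corr": corr, "Acc-2": acc2, "F1": f1}
```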
4.5. Results
Table 1 shows the experimental results of the MCM method proposed in this paper on the MOSI and MOSEI datasets. We reproduced the best baseline Self-MM.
The results demonstrated that MCM outperformed all the baseline methods in terms of most evaluation metrics on both datasets, particularly in Acc-2 and the F1 score. Notably, MCM surpassed the performance of Self-MM, a method that leverages automatically generated labels for unimodal subtasks in multitask learning. Similar to Self-MM, previous multitask learning methods used in multimodal sentiment analysis have typically employed unimodal representations as subtasks, neglecting the learning of interactions between modalities. However, MCM addresses this limitation by jointly training single-mode and bimodal subtasks, considering both intramodal and intermodal interactions. This advancement in learning modal representations contributes to the overall improvement in model performance.
4.6. Ablation Study
To further analyze the contribution of each module in MCM, we conducted experiments on the MOSI dataset to compare the performance of models with different combinations of modules. The results of these experiments are shown in Table 2.
The experimental results demonstrated that the complete MCM model outperformed the model without the multitask learning and contrastive learning modules across all evaluation metrics. Regarding multitask learning, the MCM model with both unimodal and bimodal subtasks achieved superior results compared to the model without any subtasks. This finding highlights the beneficial role of multitask learning in our model, as it enabled the model to leverage the shared information across different tasks and enhance overall performance.
The incorporation of contrastive learning led to better results compared to the model that solely included the multitask learning module. This outcome validated the effectiveness of contrastive learning in improving the multimodal sentiment analysis task. By encouraging the consistency in the predictions, the contrastive learning module enhanced the reliability and robustness of the generated representations.
The inclusion of both unimodal and bimodal subtasks yielded superior performance compared to models with only one set of subtasks. This outcome suggested that the bimodal subtasks effectively constrained the learning of unimodal representations and prevented the model from overly focusing on intramodal interactions at the expense of neglecting intermodal interactions.
In addition to the ablation experiments conducted for different modules, we further analyzed the impact of each subtask in the multitask learning module. Specifically, we compared the performance of various combinations of unimodal or bimodal subtasks. By examining the results from these experiments, we gained insights into the individual contributions of each subtask in the multitask learning module of our model. These findings provided a deeper understanding of how the model benefited from the integration of different subtasks and shed light on the significance of capturing both intramodal and intermodal interactions for an effective multimodal sentiment analysis. The results of these experiments are shown in Table 3 for the unimodal subtasks and Table 4 for the bimodal subtasks, respectively.
Based on the results obtained from the two sets of experiments, it is observed that the performance of the model remained relatively similar when using only one or two subtasks. However, the improvement achieved in comparison to the base model was quite limited. This finding emphasized the necessity and effectiveness of joint training for both unimodal and bimodal representations in our model, which was consistent with our previous analysis regarding the purpose of incorporating unimodal and bimodal subtasks.
In summary, the experimental results stressed the effectiveness of multitask learning and contrastive learning in improving the multimodal sentiment analysis task. The combination of these modules, along with the inclusion of both unimodal and bimodal subtasks, led to significant performance improvements for the MCM model.
4.7. Dynamic Weights in Multitask Learning
In order to analyze the effect of the dynamic weights in multitask learning, we conducted five groups of comparison experiments. In these experiments, instead of adjusting the weights dynamically, we manually specified the weights of the different tasks: the weight for the main task was set to a different fixed value in each group, and the weight for the subtasks was set accordingly. The experimental results on the MOSI dataset are shown in Figure 6.
The results demonstrated that the dynamically adjusted weights method outperformed the manually adjusted weights method in terms of both Acc-2 and F1 score. This suggested that by using dynamic weights, we could obtain optimal weight configurations that led to improved prediction results. Additionally, the dynamic weights were learned simultaneously during model training, which offered the advantage of saving a significant time compared to manually adjusting weights.
4.8. Case Study
To assess the efficacy of learning modal representations with the assistance of multitask learning, we selected two samples from the MOSI dataset and analyzed their unimodal as well as fused multimodal prediction results. For comparison, the results obtained from the model without the multitask learning module were also included. The experimental results are shown in Table 5.
To ensure the representativeness of the selected samples, we intentionally chose one sample with a positive sentiment polarity and another with a negative sentiment polarity. Analyzing the results presented in the table, we can observe that the multitask learning module had a more pronounced impact on the prediction results for the text modality compared to the video and audio modalities. This suggested that multitask learning could refine the text representations, which played a crucial role in the final prediction.
Furthermore, in the absence of the multitask learning module, the individual modal representations learned independently may contain misleading information that contradicts the true sentiment polarity. With the incorporation of multitask learning, these misleading effects could be effectively mitigated. By jointly training the model on multiple tasks, the conflicting information from individual modalities could be rectified, ensuring that the overall prediction remained consistent with the true sentiment polarity.
5. Discussion
With the development of multimedia, multimodal tasks have attracted increasing attention. Multimodal sentiment analysis aims to predict the sentiment polarity of data by integrating emotional information from multiple modalities. Current research primarily focuses on modal fusion and neglects the importance of modality representation learning for prediction tasks. Therefore, we employed a multitask learning approach to assist in learning modality representations. In contrast to previous multitask learning methods, we designed subtasks specifically for the bimodal representations generated by a gating mechanism. This allowed us to fully leverage the advantages of multimodal data, consider both intramodal and intermodal interactions, and enhance the model’s ability to capture hidden emotional information in multimodal data. Additionally, we introduced a dynamic weight computation method to improve the performance of multitask learning. Furthermore, considering that the dropout layers in the model cause inconsistencies between training and inference, we proposed a contrastive learning approach that promotes consistency among the outputs of submodels with different dropout masks, thereby strengthening the model’s robustness and enhancing its performance. Through the auxiliary learning of modality representations and the resolution of dropout-related issues, this paper effectively improves the model’s performance and holds practical value in real-world applications.
6. Conclusions
In this paper, we proposed MCM, a model that utilizes multitask learning and contrastive learning to facilitate the learning of modal representations. Many previous studies of multimodal sentiment analysis directly take the modal representations obtained from a traditional neural network as the input to the modal fusion phase and focus their research on the fusion of multiple modal representations. However, modal data contain valuable emotional information that can significantly enhance the predictive power of the model. Therefore, we introduced a dynamic weighted multitask learning module to enable our model to capture the hidden information in the modal data. In addition, to alleviate the problem of inconsistent training and inference caused by the dropout layer, a contrastive learning module was added to further improve the effectiveness of our method. Experimental results demonstrated the efficacy of our proposed method.
During the experiment, it was observed that the prediction accuracy of audio, video, and video–audio representations was much lower than that of other modalities. This indicated that there was still room for improvement in the preprocessing of audio and video source data, as well as the fusion of audio and video features. In future research, we will continue to investigate and improve these areas.