Article

Augmenting Multimodal Content Representation with Transformers for Misinformation Detection †

1
Department of Computer Science and Information Engineering, National Taipei University of Technology, Taipei 106, Taiwan
2
Electrical Engineering and Computer Science, University of Cincinnati, Cincinnati, OH 45221, USA
3
Inventory Department, Cheng Hsin General Hospital, Taipei 112, Taiwan
*
Authors to whom correspondence should be addressed.
This paper is an extension of the conference paper entitled: “Multimodal Content Veracity Assessment with Bidirectional Transformers and Self-Attention-based Bi-GRU Networks”, as published in Proceedings of IEEE International Conference on Multimedia Big Data (BigMM 2022).
Big Data Cogn. Comput. 2024, 8(10), 134; https://doi.org/10.3390/bdcc8100134
Submission received: 1 July 2024 / Revised: 16 August 2024 / Accepted: 9 September 2024 / Published: 11 October 2024
(This article belongs to the Special Issue Sustainable Big Data Analytics and Machine Learning Technologies)

Abstract
Information sharing on social media has become a common practice for people around the world. Since it is difficult to check user-generated content on social media, huge amounts of rumors and misinformation spread alongside authentic information. On the one hand, most social platforms identify rumors through manual fact-checking, which is very inefficient. On the other hand, with an emerging form of misinformation that contains inconsistent image–text pairs, it would be beneficial if we could compare the meaning of multimodal content within the same post to detect image–text inconsistency. In this paper, we propose a novel approach to misinformation detection by multimodal feature fusion with transformers and credibility assessment with self-attention-based Bi-RNN networks. Firstly, captions are derived from images using an image captioning module to obtain their semantic descriptions. These are compared with the surrounding text by fine-tuning transformers for semantic consistency checking. Then, to further aggregate sentiment features into the text representation, we fine-tune a separate transformer for text sentiment classification, whose output is concatenated to augment the text embeddings. Finally, Multi-Cell Bi-GRUs with self-attention are used to train the credibility assessment model for misinformation detection. From the experimental results on tweets, the best performance, with an accuracy of 0.904 and an F1-score of 0.921, is obtained when applying feature fusion of augmented embeddings with sentiment classification results. This shows the potential of the innovative way transformers are applied in our proposed approach to misinformation detection. Further investigation is needed to validate the performance on various types of multimodal discrepancies.

1. Introduction

With the prevalence of social network platforms, people can easily obtain the latest updates from social media and share their opinions. The increase in user-generated content (UGC) can help the growth of social network platforms, but might also cause problems if not moderated appropriately. By enabling user content generation and dissemination, most social network platforms are also open to potential risks from unwanted forms of content such as hate speech, sexual harassment, terror, and violence, to name but a few. Given such objectionable content, most platforms implement content policies and enforce content moderation to remove it. In addition to authentic information, people are receiving more and more misinformation and disinformation. This causes serious problems for social media platforms and users. People are losing trust in the media and the messages they receive, while the platforms strive to maintain their reputations with users. To improve the quality of online information, rumor and misinformation detection has become an increasingly important research topic in social media.
To moderate misinformation on most social networking sites, third-party fact-checking services such as FactCheck.org, PolitiFact, and Snopes.com are commonly used for validating the credibility of user posts. Even when automatic tools are used to help content moderation, the reliability is still limited. Manual verification or labeling is usually required. However, both fact-checking services and manual labeling require a significant amount of human effort. As the speed of content generation and dissemination grows, the quality of information cannot be effectively validated in a timely manner.
Machine learning and deep learning methods have been applied in many different fields. For example, transformers [1] are based on the multi-head attention mechanism. The idea is to simulate how human attention works by learning different weights of importance for the various components in a sequence. This resolves a problem of Seq2Seq models such as Recurrent Neural Networks (RNNs), which face bottlenecks with very long input sequences. Since its proposal in 2017, the transformer has been shown to achieve good performance in many fields such as machine translation, natural language processing, and others. In 2018, BERT [2] was proposed as a transformer encoder that is useful for extracting key features. It is very popular due to its good performance in semantic reasoning and language processing. It has also been used to train models for misinformation detection, for example, refs. [3,4,5], to name but a few. However, there are some issues in existing methods. First of all, misinformation can come in diverse forms. For any machine learning or deep learning method to work, training examples with suitable features are required, yet current studies lack sufficient examples of rumors or misinformation. Given the very diverse forms of examples, it is difficult to train a model with good performance. To deal with this issue, we target an emerging form of misinformation called text–image de-contextualization [6] or out-of-context misinformation [7], which contains inconsistent image–text pairs within the same post. It is not well addressed in related works. Secondly, although existing methods for misinformation detection utilize various features from user posts, the relations among multimodal features are not considered. Furthermore, in addition to the text content in user posts, people usually express their own opinions, and the sentiments expressed by users might help distinguish between misinformation and authentic information; however, they are often ignored in existing works.
In this paper, we propose a novel approach to misinformation detection by multimodal feature fusion with transformers [1] and credibility assessment with self-attention-based Bi-GRUs. On social networks, when people share multimedia content, including texts and images, within the same post to express their ideas, it should be consistent and related to the topic being discussed. When there is any mismatch or inconsistency in semantics, it could be attributed to a potential form of misinformation. Specifically, with regard to text–image inconsistency, we propose a deep learning model for misinformation detection that compares and fuses multimodal features from multimedia content. Firstly, since captions are usually concise semantic descriptions of images, they are derived from images using an image captioning module, which uses a CNN encoder to extract image features and an LSTM decoder to generate captions. They are compared with the surrounding text by fine-tuning transformers for semantic consistency checking. Then, to further integrate sentiment features into the text representation, we fine-tune a separate transformer model with Twitter datasets for sentiment classification. The classification output is concatenated to augment the text representation after the consistency check. Finally, a self-attention-based Bi-GRU network is used to find context dependencies among the augmented text representations and train the credibility assessment model for misinformation detection.
In the experiments, to verify the effects of transformer-based consistency checks of text–image pairs in tweets, ALBERT [8] was used instead of BERT [2] for efficiency reasons. Specifically, the hidden state of the last layer of the ALBERT model was added to the text embeddings, and the output of the ALBERT model was connected to one fully connected layer for misinformation detection. The experimental results showed good performance, with an accuracy of 0.828. This already outperforms the baseline model of Bi-GRU with attention, which shows the potential of using transformers for text–image consistency checks. Secondly, when we fed the ALBERT-augmented text representation after consistency checking to a Bi-GRU network with attention for credibility assessment, the best performance was obtained, with an accuracy of 0.90 and an F1-score of 0.92. This validates the effectiveness of using attention-based Bi-GRU networks for credibility assessment. Finally, when we further augmented the text representations by concatenating the consistency check result with the sentiment classification output as fine-tuned by a separate ALBERT model, the best performance was obtained, with an accuracy of 0.904 and an F1-score of 0.921. This shows the potential of our proposed approach to misinformation detection in the case of text–image inconsistency. Further investigation on different types of multimodal content discrepancies is needed.
The main contributions of this paper are summarized as follows. Firstly, we propose a novel approach to misinformation detection by fine-tuning transformers for multimodal content consistency checking where the captions derived from images can be semantically compared and fused with texts. Also, sentiment features in tweets as extracted by transformers are helpful in augmenting text representation for improving the performance of misinformation detection. Secondly, multi-cell Bi-GRUs with self-attention are useful in discovering semantic relations in the augmented representation of multimodal features for distinguishing between misinformation and authentic information. Finally, the experimental results on tweets validated the effectiveness of misinformation detection on social media.
The rest of the paper is structured as follows. In Section 2, we provide a review of related research works. Then, the proposed method is presented and discussed in Section 3. In Section 4, we show the experimental results of the proposed method. Finally, we conclude the paper in Section 5.

2. Related Works

The quality issue of information dissemination has been one of the major research directions in recent years. Related problems may be formulated as rumor detection, fake news detection, and misinformation or disinformation detection. Since most existing social network platforms have adopted content moderation policies while fact-checking services remain time-consuming, machine learning and deep learning methods are frequently adopted in these tasks. For example, unlike previous works that used hand-crafted features, Ma et al. [9] proposed to learn the hidden representation of microblogs with Recurrent Neural Networks (RNNs) for rumor detection. They found it feasible to improve the performance using a multi-layer Gated Recurrent Unit (GRU) for capturing higher-level feature interactions. Yu et al. [10] pointed out a drawback of RNN-based models: their bias towards the latest input elements. They proposed to use Convolutional Neural Networks (CNNs) instead, since they can extract key features scattered across the input sequence. For the early stage of rumor diffusion, Chen et al. [11] found that attention-based RNNs can help detect rumors better in terms of effectiveness and earliness. In addition to content features, the structure among posts has also been utilized for rumor detection. Sampson et al. [12] showed that the performance of rumor classification can be greatly improved by discovering the implicit linkage among conversation fragments on social media. Since determining post stances is pertinent to the success of rumor detection, Ma et al. [13] proposed a joint framework that unifies the two tasks of rumor detection and stance classification. Their experimental results showed that the performance of both can be improved with the help of inter-task connections.
Given that many users might share their opinions in texts and images within the same post, more and more multimodal feature fusion approaches have been proposed for rumor detection. For example, Jin et al. [14] proposed an RNN with attention (Att-RNN) to fuse multimodal features from texts, images, and social contexts for rumor detection. RNNs are used to learn the joint representations of texts and social contexts, while CNNs are used to represent image visual features. Then, the attention mechanism is used to capture the relations between text and visual features. However, it is not guaranteed that the matching relations between texts and visual features can be learned in the attention model.
Fu and Sui [3] proposed to enhance semantics information in textual features by extracting entities in addition to BERT-extracted text features. Then, affine fusion was applied to fuse text and visual features. Chen et al. [15] used Multi-head Self-attention Fusion (MSF) that learns the weights of multimodal features. Then, Contrary Latent Topic Memory (CLTM) was used to store semantic information for comparison. Azri et al. [16] extracted more advanced image features from image quality assessment, and proposed a multimodal fusion network that selects features by ensemble learning for rumor detection. However, they only used concatenation for feature fusion. Meel and Vishwakarma [5] extracted text features using BERT and image features using Inception-ResNetv2. Again, they are fused simply by concatenation.
Han et al. [4] extracted text features, including word and sentence dual embeddings, using SBERT, and image features using ResNet50. Then, the text and visual features went through self-attention before an additional text–visual co-attention was performed. Hu et al. [17] proposed to use multimodal retrieval for augmenting datasets for rumor detection. They used world knowledge as evidence to help detect well-known misinformation, without relying on structures in social media. Liu et al. [18] extracted text features using hierarchical textual feature extraction at local and global levels, and visual features using ResNet50. Then, a modified cross-attention was used for feature fusion. In this paper, we propose an architecture that fine-tunes transformers for contrasting texts with captions derived from images for multimodal consistency checking. The hidden state of the last layer of the fine-tuned transformer is then added to the representation of each word to augment the text embeddings. These are then fed to the Multi-cell Bi-GRU with self-attention for credibility assessment. This design is innovative in the way transformers are integrated into the system architecture.
Instead of relying on the attention mechanism for fusing multimodal features, in this paper we focus on using image captioning and transformers for semantic consistency checking between texts and visual features. Zhao et al. [19] proposed a multimodal sentiment analysis approach based on image–text consistency. Image features were extracted by CNN models, but only image tags were used in estimating the image–text similarity. Chen et al. [20] proposed to model semantic consistency by learning text and image embedding spaces jointly for cross-modal retrieval. By imposing a regularization constraint, semantic similarity is kept consistent across both embedding spaces. To match texts with image regions, Lee et al. [21] proposed to learn the full latent alignments between image regions and words in a sentence using a stacked cross-attention network for inferring image–text similarity. Since coherence between texts and images in real-world news might give hints for detecting fake news, Müller-Budack et al. [22] proposed a multimodal approach to quantifying such cross-modal entity consistency, where the cross-modal similarity of named entities such as persons, locations, and events can be calculated. Different from previous works, we propose to derive captions from images and fine-tune pretrained transformers for multimodal content consistency checking between images and texts. By capturing visual features from images and deriving the corresponding text captions, their semantic meanings are easier to understand than image feature vectors. They are then compared with the text representations by fine-tuning transformers, which are good at semantic understanding.
Since some liars have been found to use more negative emotion words [23], we further conduct feature fusion by augmenting the embeddings of text representations with the sentiment classification result. Finally, for credibility assessment, we use the feature fusion result as the input to a Multi-Cell Bi-GRU with a self-attention mechanism.
Since complex deep learning models usually involve more parameters in the training process, more computational power is required for efficient training and testing. To achieve a balance between effectiveness and efficiency, in this paper we adopt ALBERT [8] as our transformer model, a light version of BERT with far fewer parameters than the original BERT model. For the RNN models, Gated Recurrent Units (GRUs) [24] are used instead of LSTM, since they mitigate the vanishing gradient problem while being more efficient.

3. The Proposed Method

In this paper, we propose a deep learning approach to misinformation detection consisting of the following modules: feature extraction, image captioning, consistency checking, sentiment classification, and credibility assessment. The architecture is shown in Figure 1.
As shown in Figure 1, we first extract images, text contents, and sentiments from tweets using the feature extraction module. To check whether images have content consistent with the surrounding texts, captions are derived from images using the image captioning module, and they are semantically compared with the texts by fine-tuning transformers for consistency checking. Then, after sentiment classification by fine-tuning another transformer model, feature fusion is performed by concatenating the consistency checking output with the one-hot encoding of the sentiment classification result as the augmented embedding of the texts. Finally, the augmented embedding is input to Multi-cell Bi-GRUs with self-attention for credibility assessment. In the following subsections, each module is described in more detail.

3.1. Feature Extraction

The major components of a user post, such as a tweet, include text content, images, and social information. First of all, texts constitute the primary content that people want to share, including the post content (denoted as post) and responses from other people (denoted as reply). Secondly, in order to illustrate their ideas better, people usually publish image–text posts [19] where text-embedded images visually describe information related to the text content. Thirdly, people reveal various types of social information in their posts, such as hashtags, mentions, and social relations. Hashtags are user-provided words that highlight the main topics in the content. Also, people interact with each other by mentioning their friends, following others, posting articles, and replying to comments. To facilitate fair comparison, the same five social relation features as in Jin et al. [14] are considered: the number of friends, the number of followers, the ratio of friends to followers, the number of tweets, and whether the account is Twitter-verified.

3.2. Image Captioning

To discover inherent semantic meanings from images, we design our image captioning module based on our previous work on rumor detection [25]. It is adapted from the model by Vinyals et al. [26], as shown in Figure 2.
Firstly, visual features are extracted from images by a CNN image embedder, which is based on the Inception Net model with 42 layers of operations. Since these visual features might capture ideas about objects or components in the image, the output feature vector is sent to the LSTM-based sentence generator, which emits a one-hot encoded word S_i one at a time. The most probable words are predicted based on their log probabilities log p_i(S_i), and the corresponding sequence of short text descriptions is derived as the caption. Since these generated captions are in textual form, they are then compared and contrasted with the surrounding texts in the same post to check whether they are consistent.
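As a rough illustration of the greedy decoding loop described above (not the actual implementation, which uses an Inception-Net encoder and a trained LSTM decoder), the following sketch uses a toy vocabulary and random stand-in weights:

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = ["<start>", "<end>", "a", "dog", "on", "grass"]  # toy vocabulary
HIDDEN = 8

# Stand-ins for the learned parameters of the real model.
W_init = rng.normal(size=(4, HIDDEN))           # projects image features to initial state
W_emb = rng.normal(size=(len(VOCAB), HIDDEN))   # word embeddings
W_out = rng.normal(size=(HIDDEN, len(VOCAB)))   # hidden state -> vocabulary logits

def step(state, word_id):
    """One toy recurrent step (the real model uses an LSTM cell here)."""
    return np.tanh(state + W_emb[word_id])

def generate_caption(image_features, max_len=10):
    state = np.tanh(image_features @ W_init)    # CNN features seed the decoder state
    word_id, caption = VOCAB.index("<start>"), []
    for _ in range(max_len):
        state = step(state, word_id)
        logits = state @ W_out                  # proportional to log p_i(S_i)
        word_id = int(np.argmax(logits))        # greedy: pick the most probable word
        if VOCAB[word_id] == "<end>":
            break
        caption.append(VOCAB[word_id])
    return " ".join(caption)
```

A beam search over the log probabilities would replace the greedy `argmax` in a full implementation.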

3.3. Multimodal Consistency Checking

After extracting various types of features from tweets, we propose to fine-tune a pretrained transformer model for multimodal consistency checking that compares and contrasts these multimodal features. It can be divided into two major phases: model fine-tuning and embedding augmentation. Firstly, in the model fine-tuning phase, we use different combinations of the texts and the generated image captions as the input to fine-tune the pretrained transformer model. The hidden state of the last layer in the transformer model is regarded as the contextual representation of the input words. Then, in the embedding augmentation phase, the contextual representation of these input words is added to the original embedding of each word as the augmented embedding.
Specifically, for a typical tweet consisting of images and texts, the input to the transformer model consists of a sequence of sentences S = {s1, s2, …, sT}, where each sentence si is input in sequential order. Since the goal is to identify if there is any inconsistency between texts and images within the same post, we compare the semantic meaning of texts and the generated image captions by using different ways of fine-tuning for the BERT-based transformer encoder [2], as shown in Figure 3.
As shown in Figure 3, there are two possible downstream tasks when fine-tuning the BERT model: next-sentence prediction (NSP) and Single-Sentence Classification (SSC). In either task, the hidden state of the special class token [CLS] in the last layer of BERT can be regarded as the learned representation of the input sentences. In the next-sentence prediction task, the special token [SEP] is used to separate the sentences to be contrasted. In order to check the consistency between images and texts, we propose three different feature contrasting methods for combining features from texts (including post and reply) and image captions by fine-tuning the BERT model, as in the following formulas:
S(caption, text) = [ [CLS] Scaption [SEP] Stext [SEP] ],
S(caption, text) = [ [CLS] Scaption Stext [SEP] ],
S(caption, post, reply) = [ [CLS] Spost Scaption [SEP] Sreply [SEP] ]
In the first case, we want the model to predict texts using image captions as the clue as in the next-sentence prediction task. In the second case, the model is trained to classify the concatenated representation of image captions and texts as in the single-sentence classification task. The idea of these two cases is to find out the relation between image captions and texts. In the third case, we further separate each text into the post and reply parts, and train the model to predict the reply from the concatenation of post and captions as in the next-sentence prediction task. The idea is to find out the possible relations between the original post and image caption with the response in the reply. After fine-tuning BERT with these three feature contrasting methods, the hidden states of the last layer in the transformer model are considered as the contextual representation of input tokens. They are added to the original embedding of each word in the texts to augment their representation.
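The three feature contrasting methods amount to three ways of laying out the input sequence. As a minimal sketch (in practice, the tokenizer of the pretrained model inserts the [CLS] and [SEP] tokens; the raw strings are shown here only to make the templates explicit):

```python
def contrast_nsp(caption: str, text: str) -> str:
    """Method 1: caption and text as an NSP-style sentence pair."""
    return f"[CLS] {caption} [SEP] {text} [SEP]"

def contrast_ssc(caption: str, text: str) -> str:
    """Method 2: caption and text concatenated, as in single-sentence classification."""
    return f"[CLS] {caption} {text} [SEP]"

def contrast_post_reply(caption: str, post: str, reply: str) -> str:
    """Method 3: post + caption as the first segment, reply as the second."""
    return f"[CLS] {post} {caption} [SEP] {reply} [SEP]"
```

The fine-tuned model's last-layer hidden states over these inputs are then added to the original word embeddings to form the augmented text representation.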

3.4. Sentiment Classification

Since users often express their opinions on related topics on social media, we want to identify the sentiment polarity of user opinions using the sentiment classification module. Several methods are available for extracting sentiments from tweets. Firstly, the simplest method is to match words against sentiment lexicons such as SentiWordNet (https://github.com/aesuli/SentiWordNet, accessed on 15 August 2024), where the overall polarity of a text is calculated as the average of the individual word sentiments. The performance depends on the domain of the texts; since SentiWordNet is a general-purpose dictionary, it might not be directly applicable to user-generated content on social media. Secondly, since this is a classification problem, we can apply any machine learning or deep learning method for sentiment classification, such as a bidirectional LSTM with attention. However, without large amounts of labeled training data, the performance will be limited. In this paper, to obtain better classification results for sentiment prediction, we adopt a transformer-based approach for better semantic reasoning. Specifically, we fine-tune a separate transformer model with Twitter datasets to classify the sentiment of tweets into three categories: positive, neutral, and negative. The output of sentiment classification is then represented using one-hot encoding. Finally, it is appended after the augmented text representations as the input for credibility assessment in the next stage. Specifically, the augmented representation is denoted as follows: <augmented texts>+<pos>+<neutral>+<neg>.
Other social features such as hashtags are user-selected words that describe people, events, or topics related to the text content. Since these words might never appear in the posts, they might be out-of-vocabulary for the pretrained transformer model. Thus, they are also represented using one-hot encoding and then appended to the vector representation after the other features, i.e., <augmented texts>+<hashtag1>+<hashtag2>+…+<hashtagn>. The vector representation is checked in the last step of credibility assessment. We will compare the performance of sentiment classification in the experiments.
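A minimal sketch of this fusion step, using hypothetical dimensions (an 8-dimensional augmented text vector and a 4-word hashtag vocabulary; the real vectors are much larger):

```python
import numpy as np

def one_hot(index: int, size: int) -> np.ndarray:
    v = np.zeros(size)
    v[index] = 1.0
    return v

def fuse_features(augmented_text_vec, sentiment_label, hashtag_ids, n_hashtags):
    """Append one-hot sentiment and hashtag indicators after the augmented text vector."""
    sentiments = ["positive", "neutral", "negative"]
    sent_vec = one_hot(sentiments.index(sentiment_label), len(sentiments))
    hash_vec = np.zeros(n_hashtags)
    hash_vec[hashtag_ids] = 1.0   # indicator over the hashtag vocabulary
    return np.concatenate([augmented_text_vec, sent_vec, hash_vec])

# Toy example: 8-dim text vector, neutral sentiment, hashtags 0 and 2 of 4.
fused = fuse_features(np.ones(8), "neutral", [0, 2], 4)
# fused has length 8 + 3 + 4 = 15
```

The resulting vector is the input to the credibility assessment stage.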

3.5. Credibility Assessment

After feature fusion by transformer-based consistency checking and transformer-based sentiment classification, all features are fused into the augmented embedding of the texts. Then, for credibility assessment, we need to determine whether this vector representation is credible. Again, this can be viewed as a classification problem, as in many existing works. Thus, different classifiers can be used for credibility assessment, for example, Recurrent Neural Networks (RNNs) such as LSTM or GRU. It is natural to adopt a sequence-to-sequence model in assessing the credibility of texts, since it is similar to how humans read. In this paper, instead of vanilla RNNs, we adapt our previous model, called Multi-Cell Bidirectional GRU with self-attention [25], to train the classification model for credibility assessment.
This model was selected since previous research shows superior performance when stacking multiple layers of bidirectional RNNs for rumor detection. As shown in Figure 4, there are two possible designs: Multi-Cell Bi-GRUs and Multi-Layer Bi-GRUs. On the one hand, Multi-Layer Bi-GRUs simply stack two individual Bi-GRUs, as shown in Figure 4b. After the processing of the first Bi-GRU layer, the sequence is already different from the original representation. On the other hand, Multi-Cell Bi-GRUs give the best performance since each direction comprises multiple levels of GRU cells before propagating to the next direction, as shown in Figure 4a. This makes sure that each direction receives the same input representation. We will compare their performance for misinformation detection with other classification models in the experiments.
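To make the distinction concrete, the following toy NumPy sketch (random, untrained weights and tiny dimensions; not the actual model) stacks several GRU cells within each direction, so that both directions read the same input sequence:

```python
import numpy as np

rng = np.random.default_rng(1)
H = 4  # toy hidden size

def make_cell():
    # One GRU cell's weights: update gate z, reset gate r, candidate state h~.
    return {k: rng.normal(scale=0.1, size=(2 * H, H)) for k in ("z", "r", "h")}

def gru_step(cell, h, x):
    xh = np.concatenate([x, h])
    z = 1 / (1 + np.exp(-(xh @ cell["z"])))                  # update gate
    r = 1 / (1 + np.exp(-(xh @ cell["r"])))                  # reset gate
    h_tilde = np.tanh(np.concatenate([x, r * h]) @ cell["h"])
    return (1 - z) * h + z * h_tilde

def run_direction(cells, xs):
    """Multi-cell: several GRU cells are stacked within one direction at each step."""
    h = [np.zeros(H) for _ in cells]
    out = []
    for x in xs:
        inp = x
        for i, cell in enumerate(cells):
            h[i] = gru_step(cell, h[i], inp)
            inp = h[i]                  # feed the next cell in the stack
        out.append(inp)
    return out

def multi_cell_bigru(xs, depth=2):
    fwd = run_direction([make_cell() for _ in range(depth)], xs)
    bwd = run_direction([make_cell() for _ in range(depth)], xs[::-1])[::-1]
    # Both directions see the SAME input sequence xs. In Multi-layer Bi-GRUs,
    # by contrast, the second layer would see the first layer's outputs instead.
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```

The self-attention layer of the actual model would then weight these per-step outputs before the final classification.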

4. Experiments

To evaluate the performance of our proposed approach to misinformation detection, the Twitter datasets for rumor detection in MediaEval 2015 and 2016 [22] were used in this research, since they have been verified by Twitter. We follow a similar approach to Jin et al. [14] that keeps only tweets with both texts and images. Then, from the rumor and non-rumor events, the corresponding posts are identified as real and fake posts in the dataset, as shown in Table 1. In order to compare with existing works, we also collected two additional datasets: Weibo and AllData. After inspecting the datasets, we found many missing images in both the Weibo and AllData datasets. Specifically, since not all images used in the Weibo dataset were available, we did not compare performance on this dataset. In the AllData dataset, after removing texts without available images, there were 1153 items left, with 1048 real and 105 fake data items. The distribution of real and fake data is shown in Table 2. Secondly, in order to justify the effects of fine-tuning transformers for sentiment analysis, the dataset in SemEval 2016 Task 4: Sentiment Analysis in Twitter was used for evaluation. The statistics of the second dataset are shown in Table 3.
In the following subsections, we will analyze the effects of different modules on the performance of misinformation detection: credibility assessment, consistency checking, sentiment classification, and feature fusion. In fine-tuning our ALBERT model, we set the hyper-parameters as follows. The maximum input length was set to 512 tokens, with a batch size of 16. The number of epochs was set to 30, and the optimizer was LAMB, with a learning rate of 0.0005. The activation function was Leaky-ReLU.
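For reference, the fine-tuning hyper-parameters above can be collected into a single configuration (the dictionary keys and the model identifier are hypothetical names for illustration, not taken from the actual training code):

```python
# Hyper-parameters as stated in the text; key names are illustrative.
finetune_config = {
    "model": "albert-base-v2",   # ALBERT instead of BERT, for efficiency
    "max_seq_length": 512,       # maximum input length in tokens
    "batch_size": 16,
    "epochs": 30,
    "optimizer": "LAMB",
    "learning_rate": 5e-4,
    "activation": "leaky_relu",
}
```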
To evaluate the binary classification results of misinformation detection, we used the performance evaluation metrics of accuracy, precision, recall, and F1-score. In order to validate the robustness of the evaluation results, we further used Student's t-test to verify the statistical significance.
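For illustration, the pooled two-sample Student's t statistic can be computed as follows (the scores below are made-up illustrative numbers, not the paper's actual per-run results; converting |t| to a p-value additionally requires the t-distribution CDF, e.g., from scipy.stats):

```python
from statistics import mean, variance

def students_t(sample_a, sample_b):
    """Two-sample Student's t statistic with pooled (equal-variance) estimate."""
    na, nb = len(sample_a), len(sample_b)
    pooled = ((na - 1) * variance(sample_a) + (nb - 1) * variance(sample_b)) / (na + nb - 2)
    return (mean(sample_a) - mean(sample_b)) / (pooled * (1 / na + 1 / nb)) ** 0.5

# Hypothetical F1-scores from repeated runs of two models (illustrative only).
multi_cell  = [0.882, 0.879, 0.885, 0.880, 0.883]
multi_layer = [0.843, 0.840, 0.846, 0.838, 0.845]
t = students_t(multi_cell, multi_layer)
# |t| is compared against the critical value for n_a + n_b - 2 degrees of freedom.
```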

4.1. Effects of Credibility Assessment

To evaluate the effects of credibility assessment on the performance of misinformation detection, we adopted different types of RNNs using only text features as our baseline models: single-layer RNN [14], single-layer Attn-BiRNN, multi-layer Attn-BiRNN, and multi-cell Attn-BiRNN [25]. According to our previous research [25], a better performance of rumor detection can be obtained when using GRU instead of LSTM in different RNN structures. In our experiments, we only compared the performance of different variations of RNN structures using GRU with the baseline models on Twitter rumor detection datasets, as shown in Figure 5.
As shown in Figure 5, when we only use text features, the best performance can be obtained for both attention-based Multi-layer Bi-GRU and Multi-cell Bi-GRU, with the same F1-score of 0.816. Also, three different variants of the self-attention-based Bi-GRUs greatly outperform the LSTM-based model by Jin et al. [14], which is simply a single-layer RNN with attention mechanism. This shows the advantage of stacking multiple layers of Bi-RNNs with an attention mechanism for credibility assessment.
Next, we further evaluated the effects of credibility assessment using different models on the performance of misinformation detection when we combine all features, as shown in Figure 6.
As shown in Figure 6, when we combine all features, performance improvements in all models can be obtained compared with the models using text features only in Figure 5. Specifically, Multi-cell Bi-GRUs achieved the best performance with an F1-score of 0.882, followed by an F1-score of 0.859 for single-layer Bi-GRUs and an F1-score of 0.843 for Multi-layer Bi-GRUs. From our observations, Multi-cell Bi-GRUs had the advantage of incorporating deeper hidden layers with multiple cells in each separate direction, so the hidden relations among features in word sequences could be better learned. On the other hand, for Multi-layer Bi-GRUs, we stacked two individual layers of Bi-GRUs. After the first Bi-GRU layer had already learned the relations among the words in the texts, the second one could not learn more relations since its input differed from the raw text. This also explains why Multi-layer Bi-GRUs perform similarly to single-layer Bi-GRUs. To validate these results, we further conducted Student's t-test between the performance of Multi-layer and Multi-cell Bi-GRUs, obtaining a p-value of 0.039 (<0.05), which shows the statistical significance of their difference. In the remaining experiments, we will use the Multi-cell Bi-GRUs with self-attention as our baseline.

4.2. Effects of Consistency Checking

To evaluate the effects of multimodal consistency checking on the performance of misinformation detection, we temporarily removed the Multi-cell Bi-GRU and instead connected the output of consistency checking to a simple fully connected layer for misinformation detection. Specifically, we compared the effects of three different feature contrasting methods for texts and image captions when fine-tuning ALBERT models, as discussed in Section 3.3. The results are shown in Table 4.
As shown in Table 4, the best accuracy of 0.828 was obtained with the feature contrasting method in which fine-tuned ALBERT models predict the response in the reply given the concatenation of the original post text and the image caption. This already demonstrates slightly better performance than our baseline model, the Multi-cell Bi-GRU with attention for credibility assessment. It shows the benefits of consistency checking between texts and images by fine-tuning transformers, even without credibility assessment.
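A minimal sketch of how the winning feature-contrasting method might assemble its input: the post text concatenated with the image caption forms "sentence A" and the reply forms "sentence B" of an NSP-style sequence pair. The token layout follows the BERT/ALBERT convention, but the whitespace tokenizer is a stand-in for the real SentencePiece tokenizer, so treat this as an assumption-laden illustration.

```python
def build_nsp_pair(post_text, image_caption, reply):
    """Build a [CLS] A [SEP] B [SEP] sequence pair for NSP-style prediction."""
    sentence_a = post_text + " " + image_caption
    tokens = (["[CLS]"] + sentence_a.split()
              + ["[SEP]"] + reply.split() + ["[SEP]"])
    # Segment ids: 0 for sentence A (incl. [CLS] and first [SEP]), 1 for B.
    sep_idx = tokens.index("[SEP]")
    segment_ids = [0] * (sep_idx + 1) + [1] * (len(tokens) - sep_idx - 1)
    return tokens, segment_ids
```

The pooled [CLS] representation of such a pair would then feed the prediction head that decides whether the reply plausibly follows the post-plus-caption.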

4.3. Effects of Sentiment Classification

To further aggregate sentiments from texts as potential social features, we evaluated the performance of sentiment classification on the SemEval 2016 Task 4 dataset using different architectures, including ALBERT, BiLSTM, BiLSTM + attention, and SentiWordNet, as shown in Table 5. ALBERT denotes our proposed approach of fine-tuning the transformer on tweets for sentiment classification. The SentiWordNet baseline classifies the sentiment of a text by simply matching its words against the SentiWordNet lexicon.
As shown in Table 5, ALBERT achieves the best performance, with an accuracy of 0.576, outperforming the commonly used BiLSTM with attention. We therefore use it to obtain the sentiment classification results, which are then one-hot encoded as the sentiment features to be fused with other features in the remaining experiments.
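The sentiment-augmentation step can be sketched as follows. The three-class label set and the tiny embedding are illustrative assumptions; in the paper, the one-hot vector is appended to the transformer text embedding.

```python
SENTIMENT_CLASSES = ["negative", "neutral", "positive"]  # assumed label set

def one_hot_sentiment(label):
    # Encode the predicted sentiment class as a one-hot vector.
    vec = [0.0] * len(SENTIMENT_CLASSES)
    vec[SENTIMENT_CLASSES.index(label)] = 1.0
    return vec

def augment_with_sentiment(text_embedding, sentiment_label):
    # Concatenate the one-hot sentiment vector onto the text embedding.
    return text_embedding + one_hot_sentiment(sentiment_label)
```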

4.4. Effects of Feature Fusion

To further verify the effects of fusing various features on misinformation detection, we evaluated the performance of different feature combinations in the augmented text representation, used as input to the Multi-cell Bi-GRU with self-attention for misinformation detection. The baseline is a simple concatenation of all features as the input to the Multi-cell Bi-GRU with attention. The performance comparison of fusing different features is shown in Table 6.
As shown in Table 6, the best performance is obtained for the feature fusion of Text + Image + Sentiment (TIS), with an accuracy of 0.904 and an F1-score of 0.921. Compared with the baseline of simply concatenating all features, this is an improvement of 9.3% in accuracy and 4.4% in F1-score. Although the baseline achieved a higher recall, its precision is much worse, since the possible relations among features might not be well addressed. Among the three feature-fusion combinations, fusing only text and image features (TI) already performed very well, with an accuracy of 0.900 and an F1-score of 0.920. Adding sentiments to the fusion slightly improved the F1-score to 0.921, with higher precision at the cost of lower recall. This shows the potential benefit of sentiment analysis for judging misinformation. However, when we further included hashtags in the fusion (TISH), we observed a sharp degradation in performance, even below the baseline. This is because most user-defined hashtags are very diverse: they are often out-of-vocabulary in our data and seldom overlap, so including them as features drifts the semantic meaning in unpredictable ways. This underlines the importance of careful feature selection in fusion. Overall, these experimental results validate the effectiveness of our proposed approach to multimodal misinformation detection by fine-tuning transformers and using Multi-cell Bi-GRUs with attention.
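The fusion variants in Table 6 can be expressed as the concatenation of per-modality vectors. In this sketch, the vectors and their dimensions are placeholders; the paper's actual fusion operates on transformer embeddings, so only the combination logic is illustrated.

```python
def fuse(features, combo):
    """Concatenate the selected modality vectors in the order given by combo,
    e.g. combo="TIS" for Text + Image + Sentiment."""
    order = {"T": "text", "I": "image", "S": "sentiment", "H": "hashtag"}
    fused = []
    for key in combo:
        fused.extend(features[order[key]])
    return fused
```

With this framing, the TISH degradation reported above corresponds to appending a sparse, rarely overlapping hashtag vector that shifts the fused representation away from the informative text–image–sentiment features.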

4.5. Performance Comparison for AllData Dataset

Finally, to check whether our proposed method generalizes, we compared the performance of misinformation detection on the AllData dataset. Since we kept only the data items with images available, our subset differs from the original AllData dataset. As the baseline, we used the results reported by Meel and Vishwakarma [5]; since the source code of the baseline was not available, we directly used the evaluation metrics reported in their paper. In addition to our proposed method, we also evaluated the pretrained ALBERT model without fine-tuning, denoted as ALBERTp. The performance comparison is shown in Table 7.
As shown in Table 7, our proposed method achieves the best performance on the AllData dataset, with an accuracy of 0.950 and an F1-score of 0.949. This shows the potential of our proposed approach to misinformation detection when the consistency between texts and images in the same post is considered. Also, when we compare the results with fine-tuning (the proposed method) and without it (ALBERTp), we observe an improvement in all evaluation metrics when the embeddings are fine-tuned by transformers. This further validates the benefits of fine-tuning transformers in our proposed approach.

5. Discussion

From the experimental results, we can observe the following:
  • Firstly, from the performance comparison of different RNN models for misinformation detection in Figure 5 and Figure 6, the multi-layer and multi-cell Bi-GRUs showed a clear performance advantage over single-layer models. When all features were used, the multi-cell Bi-GRU with attention outperformed the other models. This shows the advantage of our proposed method of credibility assessment.
  • Secondly, from the evaluation of three different feature contrasting methods for misinformation detection, the best accuracy of 0.828 was observed when fine-tuning ALBERT models for next-sentence prediction of the reply given the post text and image caption. This shows the potential of fine-tuning transformers for consistency checking in image–text pairs.
  • Thirdly, when we augmented the text embeddings by fusing text, image, and sentiment features, fine-tuning one ALBERT model for the consistency checking module and another for sentiment classification, the best performance was obtained with an accuracy of 0.904 and an F1-score of 0.921. This further validates the effectiveness of our proposed approach to multimodal misinformation detection using transformers and multi-cell Bi-GRUs.
  • When considering different datasets, our proposed approach performed better when fine-tuned transformers were used, achieving the best performance with an accuracy of 0.950 and an F1-score of 0.949. This shows an advantage over the pretrained ALBERT model, as well as the benefits of fine-tuning transformers in our proposed approach.
  • Finally, Figure 7 offers a t-SNE visualization of fake and real posts using post embeddings with and without transformer fine-tuning. After fine-tuning the transformers, the clusters of fake and real posts are better separated. This further validates the effectiveness of transformer fine-tuning for distinguishing misinformation from real posts.

6. Conclusions

In this paper, we have proposed a transformer-based deep learning model for multimodal consistency checking and feature fusion. Firstly, we applied a Seq2Seq image captioning model to generate captions from images, which were semantically compared against the texts by fine-tuning transformers for consistency checking. Then, the sentiment classification result from fine-tuning another ALBERT model was used to augment the embeddings as the modified text representation. Finally, Multi-cell Bi-GRUs with a self-attention mechanism were used to assess the credibility of texts for misinformation detection. The best performance, with an F1-score of 0.921, was obtained when we fused texts, images, and sentiments into an augmented representation as input to our Multi-cell Bi-GRUs, which was superior to the current state-of-the-art model. In future work, we plan to evaluate the performance of our proposed method on different types of multimodal content on social networks.
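The pipeline summarized above can be tied together as a high-level composition. Every callable here is a named placeholder for the corresponding module in the paper (captioner, consistency checker, sentiment classifier, credibility assessor), not a real implementation; only the data flow is depicted.

```python
def detect_misinformation(post, caption_model, consistency_model,
                          sentiment_model, credibility_model):
    # 1. Derive a semantic caption from the post image (Seq2Seq captioner).
    caption = caption_model(post["image"])
    # 2. Check text-caption consistency (fine-tuned transformer).
    consistency = consistency_model(post["text"], caption)
    # 3. Classify text sentiment (a separately fine-tuned transformer).
    sentiment = sentiment_model(post["text"])
    # 4. Fuse everything into an augmented representation and assess
    #    credibility (Multi-cell Bi-GRU with self-attention).
    fused = post["text_embedding"] + consistency + sentiment
    return credibility_model(fused)
```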

Author Contributions

Conceptualization, J.-H.W. and M.N.; methodology, J.-H.W.; software, J.-H.W.; validation, J.-H.W., M.N., and S.M.T.; formal analysis, J.-H.W.; investigation, J.-H.W. and M.N.; resources, M.N. and S.M.T.; data curation, J.-H.W.; writing—original draft preparation, J.-H.W.; writing—review and editing, J.-H.W. and M.N.; visualization, M.N.; supervision, J.-H.W.; project administration, J.-H.W.; funding acquisition, J.-H.W. and S.M.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partly funded by the National Science and Technology Council, Taiwan, grant number NSTC112-2221-E-027-101, and partly by research grants from the National Taipei University of Technology and Cheng Hsin General Hospital Joint Research Program (NTUT-CHGH Joint Research Program), under grant number NTUT-CHGH-110-04.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available in the MediaEval 2015, MediaEval 2016, and SemEval 2016 datasets.

Acknowledgments

This paper reports the research results from the collaboration among Mehdi Norouzi, Shu Ming Tsai, and Jenq-Haur Wang. The authors would like to thank the National Taipei University of Technology and Cheng Hsin General Hospital for their research grants. Special thanks to Mehdi Norouzi, University of Cincinnati, for paper proofreading and suggestions.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
  2. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
  3. Fu, B.; Sui, J. Multi-modal affine fusion network for social media rumor detection. PeerJ Comput. Sci. 2022, 8, e928. [Google Scholar] [CrossRef] [PubMed]
  4. Han, H.; Ke, Z.; Nie, X.; Dai, L.; Slamu, W. Multimodal fusion with dual-attention based on textual double-embedding networks for rumor detection. Appl. Sci. 2023, 13, 4886. [Google Scholar] [CrossRef]
  5. Meel, P.; Vishwakarma, D.K. Multi-modal fusion using fine-tuned self-attention and transfer learning for veracity analysis of web information. Expert Syst. Appl. 2023, 229, 120537. [Google Scholar] [CrossRef]
  6. Huang, M.; Jia, S.; Chang, M.-C.; Lyu, S. Text-image de-contextualization detection using vision-language models. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2022), Virtual, 7–13 May 2022. [Google Scholar]
  7. Aneja, S.; Bregler, C.; Niessner, M. Cosmos: Catching out-of-context image misuse with self-supervised learning. In Proceedings of the AAAI 2023, Washington DC, USA, 7–14 February 2023. [Google Scholar]
  8. Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. ALBERT: A lite BERT for self-supervised learning of language representations. In Proceedings of the ICLR 2020, Virtual, 26 April–1 May 2020. [Google Scholar]
  9. Ma, J.; Gao, W.; Mitra, P.; Kwon, S.; Jansen, B.J.; Wong, K.-F.; Cha, M. Detecting rumors from microblogs with recurrent neural networks. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI 2016), New York, NY, USA, 9–15 July 2016; pp. 3818–3824. [Google Scholar]
  10. Yu, F.; Liu, Q.; Wu, S.; Wang, L.; Tan, T. A convolutional approach for misinformation identification. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI 2017), Melbourne, VIC, Australia, 19–25 August 2017; pp. 3901–3907. [Google Scholar]
  11. Chen, T.; Li, X.; Yin, H.; Zhang, J. Call attention to rumors: Deep attention based recurrent neural networks for early rumor detection. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2018), Melbourne, VIC, Australia, 3–6 June 2018; pp. 40–52. [Google Scholar]
  12. Sampson, J.; Morstatter, F.; Wu, L.; Liu, H. Leveraging the implicit structure within social media for emergent rumor detection. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management (CIKM 2016), Indianapolis, IN, USA, 24 October 2016; pp. 2377–2382. [Google Scholar]
  13. Ma, J.; Gao, W.; Wong, K.-F. Detect rumor and stance jointly by neural multi-task learning. In Companion Proceedings of the Web Conference 2018 (WWW 2018), Lyon, France, 23–27 April 2018; pp. 585–593. [Google Scholar]
  14. Jin, Z.; Cao, J.; Guo, H.; Zhang, Y.; Luo, J. Multimodal fusion with recurrent neural networks for rumor detection on microblogs. In Proceedings of the 25th ACM International Conference on Multimedia (MM 2017), Mountain View, CA, USA, 23–27 October 2017; pp. 795–816. [Google Scholar]
  15. Chen, J.; Wu, Z.; Yang, Z.; Xie, H.; Wang, F.L.; Liu, W. Multimodal fusion network with contrary latent topic memory for rumor detection. IEEE MultiMedia 2022, 29, 104–113. [Google Scholar] [CrossRef]
  16. Azri, A.; Favre, C.; Harbi, N.; Darmont, J.; Noûs, C. Rumor classification through a multimodal fusion framework and ensemble learning. Inf. Syst. Front. 2023, 25, 1795–1810. [Google Scholar] [CrossRef] [PubMed]
  17. Hu, X.; Guo, Z.; Chen, J.; Wen, L.; Yu, P.S. Mr2: A benchmark for multimodal retrieval-augmented rumor detection in social media. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, Taipei, Taiwan, 23–27 July 2023; pp. 2901–2912. [Google Scholar]
  18. Liu, X.; Pang, M.; Li, Q.; Zhou, J.; Wang, H.; Yang, D. MVACLNet: A multimodal virtual augmentation contrastive learning network for rumor detection. Algorithms 2024, 17, 199. [Google Scholar] [CrossRef]
  19. Zhao, Z.; Zhu, H.; Xue, Z.; Liu, Z.; Tian, J.; Chua, M.C.H.; Liu, M. An image-text consistency driven multimodal sentiment analysis approach for social media. Inf. Process. Manag. 2019, 56, 102097. [Google Scholar] [CrossRef]
  20. Chen, H.; Ding, G.; Lin, Z.; Zhao, S.; Han, J. Cross-modal image-text retrieval with semantic consistency. In Proceedings of the 27th ACM International Conference on Multimedia (MM 2019), Nice, France, 21–25 October 2019; pp. 1749–1757. [Google Scholar]
  21. Lee, K.-H.; Chen, X.; Hua, G.; Hu, H.; He, X. Stacked cross attention for image-text matching. In Proceedings of the European Conference on Computer Vision (ECCV 2018), Munich, Germany, 8–14 September 2018; pp. 212–228. [Google Scholar]
  22. Müller-Budack, E.; Theiner, J.; Diering, S.; Idahl, M.; Ewerth, R. Multimodal analytics for real-world news using measures of cross-modal entity consistency. In Proceedings of the 2020 International Conference on Multimedia Retrieval (ICMR 2020), Dublin, Ireland, 8–11 June 2020; pp. 16–25. [Google Scholar]
  23. Newman, M.L.; Pennebaker, J.W.; Berry, D.S.; Richards, J.M. Lying words: Predicting deception from linguistic styles. Personal. Soc. Psychol. Bull. 2003, 29, 665–675. [Google Scholar] [CrossRef] [PubMed]
  24. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modelling. In Proceedings of the NIPS, Montreal, QC, Canada, 13 December 2014. [Google Scholar]
  25. Wang, J.-H.; Norouzi, M.; Tsai, S.M. Multimodal content veracity assessment with bidirectional transformers and self-attention-based bi-GRU networks. In Proceedings of the IEEE International Conference on Multimedia Big Data (BigMM 2022), Naples, Italy, 5–17 December 2022; pp. 133–137. [Google Scholar]
  26. Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3156–3164. [Google Scholar]
Figure 1. System architecture of the proposed approach to misinformation detection.
Figure 2. The architecture of the image captioning module as adapted from [26].
Figure 3. Fine-tuning BERT model in two different downstream tasks: (a) next-sentence prediction task; (b) single-sentence classification task [2].
Figure 4. Two different ways of stacking multiple layers of RNN cells: (a) Multi-Cell BiRNN; (b) Multi-Layer Bi-RNN [25].
Figure 5. The effects of different types of RNNs for credibility assessment on misinformation detection (with only text features) [14,25].
Figure 6. The effects of different types of RNNs for credibility assessment on misinformation detection (with all features) [14,25].
Figure 7. The visual representation of the fake and real posts (a) without fine-tuning transformers and (b) with fine-tuning transformers.
Table 1. Statistics of the Twitter rumor detection dataset in the experiments.
| Twitter Rumor Detection | | # of Posts | Total |
|---|---|---|---|
| Training | Fake | 7334 | 12,933 |
| | Real | 5599 | |
| Test | Fake | 564 | 991 |
| | Real | 427 | |
| Total | Fake | 7898 | 13,924 |
| | Real | 6026 | |
Table 2. Statistics of the AllData dataset in the experiments.
| AllData * | | # of Items | Total |
|---|---|---|---|
| Training | Fake | 66 | 958 |
| | Real | 892 | |
| Test | Fake | 39 | 195 |
| | Real | 156 | |
| Total | Fake | 105 | 1153 |
| | Real | 1048 | |
* We used the subset of the AllData dataset where images can be obtained.
Table 3. Statistics of the SemEval 2016 Task 4 dataset in the experiments.
| SemEval 2016 Task 4 | Total # of Posts |
|---|---|
| Training | 20,631 |
| Test | 1965 |
| Total | 22,596 |
Table 4. Performance comparison of consistency checking for misinformation detection.
| Feature Contrasting Method | Accuracy |
|---|---|
| [NSP] Caption vs. Text (Equation (1)) | 0.783 |
| [SSC] Caption + Text (Equation (2)) | 0.678 |
| [NSP] Post + Caption vs. Reply (Equation (3)) | 0.828 |
Table 5. Performance comparison of sentiment classification.
| Model | Accuracy |
|---|---|
| SentiWordNet | 0.395 |
| BiLSTM | 0.459 |
| BiLSTM + Attention | 0.536 |
| ALBERT | 0.576 |
Table 6. Performance comparison of feature fusion on misinformation detection.
| Feature Fusion | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Text + Image (TI) | 0.900 | 0.898 | 0.942 | 0.920 |
| Text + Image + Sentiment (TIS) | 0.904 | 0.916 | 0.925 | 0.921 |
| Text + Image + Sentiment + Hashtag (TISH) | 0.649 | 0.696 | 0.745 | 0.720 |
| Baseline (concat of all features) | 0.827 | 0.816 | 0.959 | 0.882 |
Table 7. Performance comparison using AllData dataset for misinformation detection.
| Method | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| ALBERTp | 0.947 | 0.943 | 0.926 | 0.934 |
| The Proposed Method | 0.950 | 0.951 | 0.947 | 0.949 |
| Baseline (*) | 0.941 | 0.950 | 0.948 | 0.949 |
* This result for the baseline is as reported in the paper by Meel and Vishwakarma [5].
