1. Introduction
With the prevalence of social network platforms, people can easily obtain the latest updates from social media and share their opinions. The increase in user-generated content (UGC) can help social network platforms grow, but it might also cause problems if not moderated appropriately. By enabling user content generation and dissemination, most social network platforms are also open to potential risks from unwanted forms of content such as hate speech, sexual harassment, terror, and violence, to name but a few. Given such objectionable content, most platforms implement content policies and enforce content moderation to remove it. In addition to authentic information, people are receiving more and more misinformation and disinformation, which causes serious problems for social media platforms and users. People are losing trust in the media and the messages they receive, while the platforms strive to maintain their reputation among users. To improve the quality of online information, rumor and misinformation detection has become an increasingly important research topic in social media.
To moderate misinformation, most social networking sites rely on third-party fact-checking services such as FactCheck.org, PolitiFact, and Snopes.com to validate the credibility of user posts. Even when automatic tools are used to assist content moderation, their reliability is still limited, so manual verification or labeling is usually required. However, both fact-checking services and manual labeling require a significant amount of human effort. As the speed of content generation and dissemination grows, the quality of information cannot be effectively validated in a timely manner.
Machine learning and deep learning methods have been applied in many different fields. For example, transformers [1] are based on the multi-head attention mechanism. The idea is to simulate how human attention works by learning different weights of importance for various components in a sequence. This resolves a problem of Seq2Seq models such as Recurrent Neural Networks (RNNs), which face bottlenecks with very long input sequences. Since its proposal in 2017, the transformer has been shown to achieve good performance in many fields such as machine translation and natural language processing. In 2018, BERT [2] was proposed as a transformer encoder that is useful for extracting key features. It is very popular due to its good performance in semantic reasoning and language processing, and it has also been used to train models for misinformation detection, for example, refs. [3,4,5], to name but a few. However, there are some issues in existing methods. First of all, misinformation comes in diverse forms. For any machine learning or deep learning method to work, training examples with suitable features are required, but examples of rumors or misinformation are scarce in current studies. Given such diverse forms of examples, it is difficult to train a model with good performance. To deal with this issue, we target an emerging form of misinformation called text-image de-contextualization [6] or out-of-context misinformation [7], which contains inconsistent image–text pairs within the same post and is not well addressed in related works. Secondly, although existing methods for misinformation detection utilize various features from user posts, the relations among multimodal features are not considered. Furthermore, in addition to the text content in user posts, people usually express their own opinions, and the sentiments expressed by users might help distinguish misinformation from authentic information. However, they are often ignored in existing works.
In this paper, we propose a novel approach to misinformation detection by multimodal feature fusion with transformers [1] and credibility assessment with self-attention-based Bi-GRUs. On social networks, when people share multimedia content, including texts and images, within the same post to express their ideas, the content should be consistent and related to the topic being discussed. Any mismatch or inconsistency in semantics could indicate a potential form of misinformation. Specifically, with regard to text–image inconsistency, we propose a deep learning model for misinformation detection by comparing and fusing multimodal features from multimedia contents. Firstly, since captions are usually concise semantic descriptions of images, they are derived from images using an image captioning module, which uses a CNN encoder to extract image features and an LSTM decoder to generate captions. The captions are then compared with the surrounding text by fine-tuning transformers for a consistency check in semantics. Then, to further integrate sentiment features into the text representation, we fine-tune a separate transformer model with Twitter datasets for sentiment classification, and the classification output is concatenated to augment the text representation after the consistency check. Finally, a self-attention-based Bi-GRU network is used to find context dependencies in the augmented text representation and train the credibility assessment model for misinformation detection.
In the experiments, to verify the effects of transformer-based consistency checks of text–image pairs in tweets, ALBERT [8] was used instead of BERT [2] for efficiency reasons. Specifically, the hidden state of the last layer of the ALBERT model was added to the embeddings of the texts, and the output of the ALBERT model was connected to one fully connected layer for misinformation detection. The experimental results showed good performance with an accuracy of 0.828, which already outperforms the baseline model of Bi-GRU with attention and shows the potential of using transformers for text–image consistency checks. Secondly, when we input the ALBERT-augmented text representation after consistency checking to a Bi-GRU network with attention for credibility assessment, the best performance could be obtained with an accuracy of 0.90 and an F1-score of 0.92. This validates the effectiveness of using attention-based Bi-GRU networks for credibility assessment. Finally, when we further augmented text representations by concatenating the consistency check result with the sentiment classification output as fine-tuned by a separate ALBERT model, the best performance could be obtained with an accuracy of 0.904 and an F1-score of 0.921. This shows the potential of our proposed approach to misinformation detection in the case of text–image inconsistency. Further investigation into different types of multimodal content discrepancies is needed.
The main contributions of this paper are summarized as follows. Firstly, we propose a novel approach to misinformation detection by fine-tuning transformers for multimodal content consistency checking, where the captions derived from images can be semantically compared and fused with texts. In addition, sentiment features extracted from tweets by transformers help augment the text representation and improve the performance of misinformation detection. Secondly, multi-cell Bi-GRUs with self-attention are useful in discovering semantic relations in the augmented representation of multimodal features for distinguishing between misinformation and authentic information. Finally, the experimental results on tweets validated the effectiveness of the proposed approach to misinformation detection on social media.
The rest of the paper is structured as follows. In Section 2, we provide a review of related research works. Then, the proposed method is presented and discussed in Section 3. In Section 4, we show the experimental results of the proposed method. Finally, we conclude the paper in Section 5.
2. Related Works
The quality of information dissemination has been one of the major research concerns in recent years. Related problems might be formulated as rumor detection, fake news detection, and misinformation or disinformation detection. Since most existing social network platforms have adopted content moderation policies and fact-checking services are time-consuming, machine learning and deep learning methods are frequently adopted in these tasks. For example, unlike previous works that used hand-crafted features, Ma et al. [9] proposed to learn the hidden representation of microblogs with Recurrent Neural Networks (RNNs) for rumor detection. They found it feasible to improve the performance using a multi-layer Gated Recurrent Unit (GRU) for capturing higher-level feature interactions. Yu et al. [10] pointed out a drawback of RNN-based models, namely their bias towards the latest input elements, and proposed to use Convolutional Neural Networks (CNNs) instead, since CNNs can extract key features scattered across the input sequence. Chen et al. [11] found that attention-based RNNs can better detect rumors at an early stage of diffusion in terms of effectiveness and earliness. In addition to content features, the structure among posts was also utilized for rumor detection: Sampson et al. [12] showed that the performance of rumor classification can be greatly improved by discovering the implicit linkage among conversation fragments on social media. Since determining post stances is pertinent to the success of rumor detection, Ma et al. [13] proposed a joint framework that unifies the two tasks of rumor detection and stance classification. Their experimental results showed that the performance of both can be improved with the help of inter-task connections.
Given that many users might share their opinions in texts and images within the same post, more and more multimodal feature fusion approaches have been proposed for rumor detection. For example, Jin et al. [14] proposed an RNN with attention (Att-RNN) to fuse multimodal features from texts, images, and social contexts for rumor detection. RNNs are used to learn the joint representations of texts and social contexts, while CNNs are used to represent visual features of images. Then, the attention mechanism is used to capture the relations between textual and visual features. However, it is not guaranteed that the matching relations between texts and visual features can be learned in the attention model.
Fu and Sui [3] proposed to enhance semantic information in textual features by extracting entities in addition to BERT-extracted text features. Then, affine fusion was applied to fuse textual and visual features. Chen et al. [15] used Multi-head Self-attention Fusion (MSF) to learn the weights of multimodal features; then, a Contrary Latent Topic Memory (CLTM) was used to store semantic information for comparison. Azri et al. [16] extracted more advanced image features from image quality assessment and proposed a multimodal fusion network that selects features by ensemble learning for rumor detection. However, they only used concatenation for feature fusion. Meel and Vishwakarma [5] extracted text features using BERT and image features using Inception-ResNetv2; again, the features were fused simply by concatenation.
Han et al. [4] extracted text features, including word and sentence dual embeddings, using SBERT, and image features using ResNet50. Then, textual and visual features went through self-attention before an additional text–visual co-attention was performed. Hu et al. [17] proposed to use multimodal retrieval to augment datasets for rumor detection; they used world knowledge as evidence to help detect well-known misinformation and did not rely on structures in social media. Liu et al. [18] extracted text features using hierarchical textual feature extraction at local and global levels, and visual features using ResNet50. Then, a modified cross-attention was used for feature fusion. In this paper, we propose an architecture that fine-tunes transformers to contrast texts with captions derived from images for multimodal consistency checking. The hidden state of the last layer of the fine-tuned transformer is then added to the representation of each word to augment the text embeddings, which are then fed to the Multi-cell Bi-GRU with self-attention for credibility assessment. This design is innovative in the way transformers are integrated into the system architecture.
Instead of relying on the attention mechanism for fusing multimodal features, we focus on using image captioning and transformers for semantic consistency checking between texts and visual features in this paper. Zhao et al. [19] proposed a multimodal sentiment analysis approach based on image–text consistency. Image features were extracted by CNN models, but only image tags were used in estimating the image–text similarity. Chen et al. [20] proposed to model semantic consistency by jointly learning text and image embedding spaces for cross-modal retrieval. By imposing a regularization constraint, semantic similarity remains consistent across both embedding spaces. To match texts with image regions, Lee et al. [21] proposed to learn the full latent alignments between image regions and words in a sentence using a stacked cross-attention network for inferring image–text similarity. Since coherence between texts and images in real-world news might give hints for detecting fake news, Muller-Budack et al. [22] proposed a multimodal approach to quantifying such cross-modal entity consistency, where the cross-modal similarity of named entities such as persons, locations, and events could be calculated. Different from previous works, we propose to derive captions from images and fine-tune pretrained transformers for multimodal content consistency checking between images and texts. By capturing visual features from images and deriving the corresponding text captions, the semantic meaning is easier to interpret than raw image feature vectors. The captions are then compared with text representations by fine-tuning transformers, which are good at semantic understanding.
Since some liars have been found to use more negative emotion words [23], we further conduct feature fusion by augmenting the embeddings of text representations with the sentiment classification result. Finally, for credibility assessment, we use the feature fusion result as the input to a Multi-Cell Bi-GRU with a self-attention mechanism.
Since complex deep learning models usually involve more parameters in the training process, more computational power is required for efficient training and testing. To achieve a balance between effectiveness and efficiency, in this paper, we adopt ALBERT [8] as our transformer model, which is a lite version of BERT with far fewer parameters than the original BERT model. For RNN models, Gated Recurrent Units (GRUs) [24] are used instead of LSTM, since they mitigate the vanishing gradient problem with better efficiency.
3. The Proposed Method
In this paper, we propose a deep learning approach to misinformation detection consisting of the following modules: feature extraction, image captioning, consistency checking, sentiment classification, and credibility assessment. The architecture is shown in Figure 1.
As shown in Figure 1, we first extract images, text contents, and sentiments from tweets using the feature extraction module. To check whether images are consistent with the surrounding texts, captions are derived from images using the image captioning module and semantically compared with the texts by fine-tuning transformers for consistency checking. Then, after sentiment classification by fine-tuning another transformer model, feature fusion is performed by concatenating the consistency checking output with the one-hot encoding of the sentiment classification result as the augmented embedding of the texts. Finally, the augmented embedding is input to Multi-cell Bi-GRUs with self-attention for credibility assessment. In the following subsections, each module is described in more detail.
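To make the dataflow of Figure 1 concrete, the following sketch traces a tweet through the pipeline. Every function is a toy stand-in for the module described in its subsection; the strings, vectors, and threshold are illustrative placeholders, not our actual models:

```python
# Illustrative end-to-end dataflow of Figure 1; only the pipeline shape
# is shown, each stage below is a toy stub for the real module.
def generate_caption(image):            # Section 3.2: CNN + LSTM captioner
    return "a flooded street at night"

def consistency_check(caption, text):   # Section 3.3: fine-tuned transformer
    return [0.1, 0.9]                    # toy augmented text representation

def classify_sentiment(text):           # Section 3.4: one-hot sentiment
    return [0, 1, 0]                     # positive / neutral / negative

def assess_credibility(vector):         # Section 3.5: Multi-cell Bi-GRU
    return "misinformation" if sum(vector) > 1.5 else "credible"

def detect(tweet):
    caption = generate_caption(tweet["image"])
    augmented = consistency_check(caption, tweet["text"])
    fused = augmented + classify_sentiment(tweet["text"])  # feature fusion
    return assess_credibility(fused)

print(detect({"text": "Shark swimming on the highway!", "image": None}))
```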
3.1. Feature Extraction
The major components of a user post, such as a tweet, include text content, images, and social information. First of all, texts constitute the primary content that people want to share, including the post content (denoted as post) and responses from other people (denoted as reply). Secondly, in order to better illustrate their ideas, people usually publish image–text posts [19], where embedded images are used to visually describe related information surrounding the text content. Thirdly, people reveal various types of social information in their posts, such as hashtags, mentions, and social relations. Hashtags are user-provided words that highlight the main topics in the content. Also, people interact with each other by mentioning their friends, following others, posting articles, and replying to comments. To facilitate a fair comparison, the same five social relation features as in Jin et al. [14] are considered: the number of friends, the number of followers, the ratio of friends to followers, the number of tweets, and whether the account is Twitter-verified or not.
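As an illustration, the five social relation features could be computed from a tweet's user metadata as below; the dictionary keys mirror Twitter API v1.1 user-object fields and are assumptions for this sketch:

```python
def social_relation_features(user):
    """Compute the five social relation features following Jin et al. [14].
    The keys are assumed Twitter API v1.1 user-object fields."""
    friends = user["friends_count"]
    followers = user["followers_count"]
    return [
        friends,                                    # number of friends
        followers,                                  # number of followers
        friends / followers if followers else 0.0,  # friend/follower ratio
        user["statuses_count"],                     # number of tweets
        1 if user["verified"] else 0,               # Twitter-verified flag
    ]

print(social_relation_features(
    {"friends_count": 120, "followers_count": 300,
     "statuses_count": 4521, "verified": False}))
```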
3.2. Image Captioning
To discover inherent semantic meanings from images, we design our image captioning module based on our previous work on rumor detection [25]. It is adapted from the model by Vinyals et al. [26], as shown in Figure 2.
Firstly, visual features are extracted from images by a CNN image embedder, which is based on the Inception Net model with 42 layers of operations to obtain image representations. Since these visual features might contain ideas about objects or components in the image, the output feature vector is sent to the LSTM-based sentence generator, followed by the one-hot encoding of each word S_i, one at a time. The most probable words are predicted based on their log probabilities log p_i(S_i), and the corresponding sequence of short text descriptions is derived as the caption. Since the generated captions are in textual form, they are then compared and contrasted with the surrounding texts in the same post to check whether they are consistent or not.
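The following PyTorch sketch shows an encoder–decoder structure in the spirit of Vinyals et al. [26]: an Inception CNN embeds the image, and an LSTM predicts the caption word by word. The layer sizes, vocabulary, and use of torchvision are our assumptions for illustration, not the exact configuration of our module:

```python
import torch
import torch.nn as nn
from torchvision import models

class CaptionGenerator(nn.Module):
    """Minimal CNN-encoder / LSTM-decoder captioner; sizes are illustrative."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        cnn = models.inception_v3(weights="DEFAULT")        # expects 299x299 input
        cnn.fc = nn.Linear(cnn.fc.in_features, embed_dim)   # image -> embedding
        self.encoder = cnn
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)        # scores over words

    def forward(self, images, captions):
        feats = self.encoder(images)
        if not torch.is_tensor(feats):      # training mode returns (main, aux)
            feats = feats.logits
        # The image feature vector is fed as the first "word" of the sequence;
        # a log-softmax over self.out yields the log probabilities log p_i(S_i).
        inputs = torch.cat([feats.unsqueeze(1), self.embed(captions)], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)

model = CaptionGenerator(vocab_size=1000)
scores = model(torch.randn(2, 3, 299, 299), torch.randint(0, 1000, (2, 12)))
print(scores.shape)                          # torch.Size([2, 13, 1000])
```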
3.3. Multimodal Consistency Checking
After extracting various types of features from tweets, we propose to fine-tune a pretrained transformer model for multimodal consistency checking that compares and contrasts these multimodal features. It can be divided into two major phases: model fine-tuning and embedding augmentation. Firstly, in the model fine-tuning phase, we use different combinations of the texts and the generated image captions as the input to fine-tune the pretrained transformer model. The hidden state of the last layer in the transformer model is regarded as the contextual representation of the input words. Then, in the embedding augmentation phase, this contextual representation is added to the original embedding of each word to form the augmented embedding.
Specifically, for a typical tweet consisting of images and texts, the input to the transformer model consists of a sequence of sentences S = {s_1, s_2, …, s_T}, where each sentence s_i is input in sequential order. Since the goal is to identify if there is any inconsistency between texts and images within the same post, we compare the semantic meaning of texts and the generated image captions by using different ways of fine-tuning the BERT-based transformer encoder [2], as shown in Figure 3.
As shown in Figure 3, there are two possible downstream tasks when fine-tuning the BERT model: next-sentence prediction (NSP) and Single-Sentence Classification (SSC). In either task, the hidden state of the special token or class label [CLS] in the last layer of BERT can be regarded as the learned representation of the input sentences. In the next-sentence prediction task, the special token [SEP] is used to separate the sentences to be contrasted. In order to check the consistency between images and texts, we propose three different feature contrasting methods for combining features from texts (including post and reply) and image captions by fine-tuning the BERT model, as follows:

(1) NSP: [CLS] caption [SEP] text [SEP]
(2) SSC: [CLS] caption + text [SEP]
(3) NSP: [CLS] post + caption [SEP] reply [SEP]
In the first case, we want the model to predict the text using the image caption as the clue, as in the next-sentence prediction task. In the second case, the model is trained to classify the concatenated representation of the image caption and text, as in the single-sentence classification task. The idea of these two cases is to find out the relation between image captions and texts. In the third case, we further separate each text into the post and reply parts and train the model to predict the reply from the concatenation of the post and caption, as in the next-sentence prediction task. The idea is to find out the possible relations of the original post and image caption with the response in the reply. After fine-tuning BERT with these three feature contrasting methods, the hidden states of the last layer of the transformer model are considered the contextual representation of the input tokens. They are added to the original embedding of each word in the texts to augment the representation, as illustrated below.
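The snippet below illustrates, with the Hugging Face transformers library, how the three feature contrasting inputs could be built and how the last-layer hidden states can augment the word embeddings. The model name and the example caption/post/reply strings are placeholders, and only the input construction is shown, not the fine-tuning loop:

```python
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

caption = "a flooded street at night"                  # generated image caption
post, reply = "Shark swimming on the highway!", "looks photoshopped to me"
text = post + " " + reply

# Method 1 (NSP-style): predict the text from the image caption.
m1 = tok(caption, text, return_tensors="pt", truncation=True, max_length=512)
# Method 2 (SSC-style): classify caption and text as one single segment.
m2 = tok(caption + " " + text, return_tensors="pt", truncation=True, max_length=512)
# Method 3 (NSP-style): predict the reply from post + caption.
m3 = tok(post + " " + caption, reply, return_tensors="pt", truncation=True, max_length=512)

# Embedding augmentation: add the last-layer hidden states token-wise to
# the original word embeddings (same shape, so the addition is well-defined).
hidden = model(**m3).last_hidden_state                 # (1, seq_len, 768)
augmented = model.get_input_embeddings()(m3["input_ids"]) + hidden
```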
3.4. Sentiment Classification
Since users often express their opinions on related topics in social media, we want to identify the sentiment polarity of user opinions using the sentiment classification module. Several methods are available to extract sentiments from tweets. Firstly, the simplest method is to match words with sentiment lexicons such as SentiWordNet (https://github.com/aesuli/SentiWordNet, accessed on 15 August 2024), where the overall polarity of a text is calculated as the average of the individual sentiments of its words. The performance depends on the domain of the texts; since SentiWordNet is a general-purpose dictionary, it might not directly apply to user-generated content on social media. Secondly, since sentiment prediction is a classification problem, we can apply any machine learning or deep learning method, such as a bidirectional LSTM with attention. However, without large amounts of labeled training data, the performance will be limited. In this paper, to obtain better classification results, we adopt a transformer-based approach for better semantic reasoning. Specifically, we fine-tune a separate transformer model with Twitter datasets to classify the sentiment of tweets into three categories: positive, neutral, and negative. The output of sentiment classification is represented using one-hot encoding and appended after the augmented text representation as the input for credibility assessment in the next stage. Specifically, the augmented representation is denoted as follows: <augmented texts>+<pos>+<neutral>+<neg>.
Other social features such as hashtags are user-selected words that describe people, events, or topics related to the text contents. Since these words might never appear in the posts, they might be out-of-vocabulary in the pretrained transformer model. Thus, they are also represented using one-hot encoding and appended to the vector representation after the other features, i.e., <augmented texts>+<hashtag_1>+<hashtag_2>+…+<hashtag_n>. The resulting vector representation is checked in the last step of credibility assessment. We will compare the performance of sentiment classification in the experiments.
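A minimal sketch of the sentiment module follows. It assumes an ALBERT classifier head with three labels that would first be fine-tuned on Twitter sentiment data (e.g., SemEval 2016 Task 4); the stand-in text vector only illustrates how the one-hot output is appended:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("albert-base-v2")
# Three labels: positive / neutral / negative. The head is randomly
# initialized here and would need fine-tuning before real use.
clf = AutoModelForSequenceClassification.from_pretrained("albert-base-v2", num_labels=3)

def sentiment_one_hot(text):
    inputs = tok(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        label = clf(**inputs).logits.argmax(dim=-1)    # class index 0/1/2
    return torch.nn.functional.one_hot(label, num_classes=3).float()

# Append the 3-dim one-hot vector to the augmented text representation:
# <augmented texts> + <pos> + <neutral> + <neg>
augmented = torch.randn(1, 768)                        # stand-in text vector
fused = torch.cat([augmented, sentiment_one_hot("great news!")], dim=-1)
print(fused.shape)                                     # torch.Size([1, 771])
```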
3.5. Credibility Assessment
After feature fusion by transformer-based consistency checking and transformer-based sentiment classification, all features are fused into the augmented embedding of the texts. Then, for credibility assessment, we need to determine whether this vector representation is credible or not. Again, this can be viewed as a classification problem, as in many existing works. Thus, different classifiers can be used for credibility assessment, for example, Recurrent Neural Networks (RNNs) such as LSTM or GRU. It is natural to adopt a sequence-to-sequence model for assessing the credibility of texts since it is similar to human reading. In this paper, instead of vanilla RNNs, we adapt our previous model, called the Multi-Cell Bidirectional GRU with self-attention [25], to train the classification model for credibility assessment.
This model is selected because previous research showed superior performance when stacking multiple layers of bidirectional RNNs for rumor detection. As shown in Figure 4, there are two possible designs: Multi-Cell Bi-GRUs and Multi-Layer Bi-GRUs. On the one hand, Multi-Layer Bi-GRUs simply stack two individual Bi-GRUs, as shown in Figure 4b; after the processing of the first Bi-GRU layer, the sequence is already different from the original representation. On the other hand, Multi-Cell Bi-GRUs give the best performance since each direction comprises multiple levels of GRU cells before propagating to the next direction, as shown in Figure 4a. This makes sure that each direction receives the same representation. We will compare their performance for misinformation detection with other classification models in the experiments.
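The following sketch conveys the Multi-cell idea under our reading of Figure 4a: each direction runs its own stack of GRU cells over the same raw input, so neither direction consumes the other's output, and a simple additive self-attention pools the sequence. All sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MultiCellBiGRU(nn.Module):
    """Each direction stacks several GRU cells over the SAME raw input,
    unlike Multi-Layer Bi-GRUs, where the second bidirectional layer
    reads the first layer's output."""
    def __init__(self, input_dim, hidden_dim, num_cells=2):
        super().__init__()
        self.fwd = nn.GRU(input_dim, hidden_dim, num_layers=num_cells, batch_first=True)
        self.bwd = nn.GRU(input_dim, hidden_dim, num_layers=num_cells, batch_first=True)

    def forward(self, x):                          # x: (batch, seq, input_dim)
        out_f, _ = self.fwd(x)                     # left-to-right pass
        out_b, _ = self.bwd(torch.flip(x, [1]))    # right-to-left pass on raw input
        return torch.cat([out_f, torch.flip(out_b, [1])], dim=-1)

class SelfAttentionPool(nn.Module):
    """Additive self-attention that pools the GRU outputs into one vector."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, h):                          # h: (batch, seq, dim)
        w = torch.softmax(self.score(h), dim=1)    # weight per time step
        return (w * h).sum(dim=1)

x = torch.randn(4, 20, 771)                        # toy augmented embeddings
pooled = SelfAttentionPool(512)(MultiCellBiGRU(771, 256)(x))
print(pooled.shape)                                # torch.Size([4, 512])
```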
4. Experiments
To evaluate the performance of our proposed approach to misinformation detection, the Twitter datasets for rumor detection in MediaEval 2015 and 2016 [22] were used in this research since they have been verified by Twitter. We follow a similar approach to Jin et al. [14] that keeps only tweets with both texts and images. Then, from the rumor and non-rumor events, the corresponding posts are identified as real and fake posts in the dataset, as shown in Table 1. In order to compare with existing works, we also collected two additional datasets: Weibo and AllData. After inspecting the datasets, we found many missing images in both the Weibo and AllData datasets. Specifically, since not all images used in the Weibo dataset were available, we did not compare performance on this dataset. In the AllData dataset, after removing texts without available images, there were 1153 items left, with 1048 real and 105 fake data items. The distribution of real and fake data is shown in Table 2. Secondly, in order to justify the effects of fine-tuning transformers for sentiment analysis, the dataset from SemEval 2016 Task 4: Sentiment Analysis in Twitter was used for evaluation. The statistics of this second dataset are shown in Table 3.
In the following subsections, we will analyze the effects of different modules on the performance of misinformation detection: credibility assessment, consistency checking, sentiment classification, and feature fusion. In fine-tuning our ALBERT model, we set the hyper-parameters as follows. The maximum input length was set to 512 tokens, with a batch size of 16. The number of epochs was set to 30, and the optimizer was LAMB, with a learning rate of 0.0005. The activation function was Leaky-ReLU.
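For reference, an illustrative training setup mirroring these hyper-parameters might look as follows. LAMB is not part of core PyTorch, so the third-party torch-optimizer package is assumed here, and the small linear stack merely stands in for the ALBERT classification head:

```python
import torch
import torch_optimizer  # third-party package providing LAMB (an assumption)

# Stand-in for the ALBERT classifier head, with the Leaky-ReLU activation.
model = torch.nn.Sequential(
    torch.nn.Linear(768, 128), torch.nn.LeakyReLU(), torch.nn.Linear(128, 2))

config = {"max_length": 512, "batch_size": 16, "epochs": 30, "lr": 5e-4}
optimizer = torch_optimizer.Lamb(model.parameters(), lr=config["lr"])
```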
To evaluate the binary classification results of misinformation detection, we used standard evaluation metrics including accuracy, precision, recall, and F1-score. To validate the robustness of the evaluation results, we further used Student's t-test to verify statistical significance.
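A sketch of the evaluation procedure with scikit-learn and SciPy is shown below; the per-run scores fed to the t-test are toy numbers for illustration:

```python
from scipy.stats import ttest_ind
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def evaluate(y_true, y_pred):
    """Binary-classification metrics used in the experiments."""
    return {"accuracy":  accuracy_score(y_true, y_pred),
            "precision": precision_score(y_true, y_pred),
            "recall":    recall_score(y_true, y_pred),
            "f1":        f1_score(y_true, y_pred)}

print(evaluate([1, 0, 1, 1, 0], [1, 0, 0, 1, 0]))

# Student's t-test on the per-run scores of two models: a p-value below
# 0.05 indicates a statistically significant difference.
runs_a = [0.88, 0.89, 0.87, 0.90, 0.88]   # e.g., Multi-cell Bi-GRU (toy)
runs_b = [0.84, 0.85, 0.83, 0.86, 0.84]   # e.g., Multi-layer Bi-GRU (toy)
t_stat, p_value = ttest_ind(runs_a, runs_b)
print(p_value)
```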
4.1. Effects of Credibility Assessment
To evaluate the effects of credibility assessment on the performance of misinformation detection, we adopted different types of RNNs using only text features as our baseline models: single-layer RNN [14], single-layer Attn-BiRNN, multi-layer Attn-BiRNN, and multi-cell Attn-BiRNN [25]. According to our previous research [25], better rumor detection performance can be obtained when using GRU instead of LSTM in different RNN structures. In our experiments, we therefore only compared the performance of different variations of RNN structures using GRU with the baseline models on the Twitter rumor detection datasets, as shown in Figure 5.
As shown in Figure 5, when only text features are used, the best performance is obtained by both the attention-based Multi-layer Bi-GRU and Multi-cell Bi-GRU, with the same F1-score of 0.816. Also, the three variants of self-attention-based Bi-GRUs greatly outperform the LSTM-based model by Jin et al. [14], which is simply a single-layer RNN with an attention mechanism. This shows the advantage of stacking multiple layers of Bi-RNNs with an attention mechanism for credibility assessment.
Next, we further evaluated the effects of credibility assessment using different models on the performance of misinformation detection when combining all features, as shown in Figure 6.
As shown in Figure 6, when we combine all features, performance improvements in all models can be obtained compared with the models using text features only in Figure 5. Specifically, Multi-cell Bi-GRUs achieved the best performance with an F1-score of 0.882, followed by single-layer Bi-GRUs with an F1-score of 0.859 and Multi-layer Bi-GRUs with an F1-score of 0.843. From our observations, Multi-cell Bi-GRUs had the advantage of incorporating deeper hidden layers with multiple cells in each separate direction, so the hidden relations among features in word sequences could be better learned. On the other hand, for Multi-layer Bi-GRUs, we stacked two individual layers of Bi-GRUs; after the first Bi-GRU layer had learned the relations among words in texts, the second could not learn more relations since its input differed from the raw text. That also explains why Multi-layer Bi-GRUs perform similarly to single-layer Bi-GRUs. To validate these results, we further conducted a Student's t-test between the performance of Multi-layer and Multi-cell Bi-GRUs, obtaining a p-value of 0.039 (<0.05), which shows the statistical significance of their difference. In the remaining experiments, we will use the Multi-cell Bi-GRUs with self-attention as our baseline.
4.2. Effects of Consistency Checking
To evaluate the effects of multimodal consistency checking on the performance of misinformation detection, we temporarily removed the Multi-Cell Bi-GRU and instead connected the output of consistency checking to a simple fully connected layer for misinformation detection. Specifically, we compared the effects of the three different feature contrasting methods for texts and image captions when fine-tuning ALBERT models, as discussed in Section 3.3. The results are shown in Table 4.
As shown in Table 4, the best accuracy of 0.828 was obtained when we used the feature contrasting method that fine-tunes ALBERT models to predict the response in reply given the concatenation of the original text in post and the image caption. It already demonstrated slightly better performance than our baseline model of Multi-Cell Bi-GRU with attention for credibility assessment. This shows the benefits of consistency checking between texts and images by fine-tuning transformers, even without credibility assessment.
4.3. Effects of Sentiment Classification
In order to further aggregate sentiments from texts as potential social features, we evaluated the performance of sentiment classification on the SemEval 2016 Task 4 dataset using different architectures, including ALBERT, BiLSTM, BiLSTM + attention, and SentiWordNet, as shown in Table 5. ALBERT is the proposed approach of fine-tuning transformers on tweets for sentiment classification. The SentiWordNet baseline classifies the sentiment of a text through simple matching of words in the SentiWordNet lexicon.
As shown in Table 5, the best performance is observed with ALBERT, with an accuracy of 0.576, which is better than the commonly used BiLSTM with attention. Thus, ALBERT will be used to obtain the sentiment classification results, which are then encoded using one-hot encoding as the sentiment features to be fused with other features in the remaining experiments.
4.4. Effects of Feature Fusion
To further verify the effects of fusing various features on misinformation detection, we evaluated the performance using different feature combinations in the augmented representation of the texts as input to the Multi-cell Bi-GRU with self-attention for misinformation detection. The baseline involves only a simple concatenation of all features as the input to the Multi-Cell Bi-GRU with attention. The performance comparison of fusing different features is shown in Table 6.
As shown in Table 6, the best performance is obtained with the feature fusion of Text + Image + Sentiment (TIS), with an accuracy of 0.904 and an F1-score of 0.921. Compared with the baseline of simply concatenating all features, this is an improvement of 9.3% in accuracy and 4.4% in F1-score. Although the baseline achieved a higher recall, its precision is much worse, since the possible relations among features might not be well addressed. Among the three different combinations in feature fusion, when we only fused text and image features (TI), the performance was already very good, with an accuracy of 0.90 and an F1-score of 0.92. By adding sentiments to the feature fusion, the F1-score was slightly improved to 0.921, with a higher precision at the cost of a lower recall. This shows the potential benefit of sentiment analysis for judging misinformation. However, when we further included hashtags in feature fusion (TISH), we observed a great degradation in performance, which was even worse than the baseline. This is due to the fact that user-defined hashtags are very diverse: since they might be out-of-vocabulary in our data and seldom overlap, including them as features drifts the semantic meaning in unpredictable ways. This shows the importance of feature selection in fusion. From the experimental results, we can validate the effectiveness of our proposed approach to multimodal misinformation detection by fine-tuning transformers and Multi-cell Bi-GRUs with attention.
4.5. Performance Comparison for AllData Dataset
Finally, to check whether our proposed method works in general, we compared the performance of misinformation detection using the AllData dataset. Since we only kept the data items with available images, our version differs from the original AllData dataset. The baseline was the results reported by Meel and Vishwakarma [5]; since the source code of the baseline was not available, we directly used the evaluation metrics reported in the paper. In addition to our proposed method, we also evaluated the pretrained ALBERT model without fine-tuning, denoted as ALBERTp. The performance comparison is shown in Table 7.
As shown in Table 7, our proposed method achieves the best performance on the AllData dataset, with an accuracy of 0.950 and an F1-score of 0.949. This shows the potential of our proposed approach to misinformation detection when considering the consistency between texts and images in the same post. Also, when we compare the results with fine-tuning (the proposed method) and without fine-tuning (ALBERTp), we can observe an improvement in all evaluation metrics when the embeddings are fine-tuned by transformers. This further validates the benefits of fine-tuning transformers in our proposed approach.