This section presents the architecture of the proposed MiBeMC for media credibility evaluation. MiBeMC evaluates the credibility of media by simulating a human's process of reading and verifying information, which allows it to effectively extract and integrate features within and between the different parts of a news article.
As shown in Figure 3, MiBeMC consists of six modules: two different intra-module feature extractors, one similarity module, one interaction module, one aggregation module, and one media credibility evaluation module. Specifically, these modules model the three stages of human evaluation behavior as the following five steps: (1) the intra modules extract the latent semantics of each component of the information; (2) the similarity module retrieves additional information through search operations; (3) the interaction module simulates reading and understanding behavior between the intra modules; (4) the aggregation module integrates the interaction features by mimicking human verification behavior; and (5) the media credibility evaluation module predicts the factuality label of the news outlet from the output of the final aggregation module.
3.1. Problem Formulation
To better simulate a human reader's process of reading and verifying information, we divide a news article into two components: the title and the content.
Following the definition in CLEF 2023 CheckThat! Task 4 [22], we consider news media credibility evaluation as an ordinal classification task corresponding to three levels of credibility at the media level: low, mixed, and high. A news article $a$ can be divided into two elements: title ($t$) and content ($c$), formally written as $a = \{t, c\}$. Our task is to find an evaluator $f$ that can model and integrate the information between the different components, thereby evaluating the media into three levels of credibility: low ($y = 0$), mixed ($y = 1$), and high ($y = 2$), i.e., $f: a \rightarrow y$, $y \in \{0, 1, 2\}$.
3.3. Information Association Similarity Module
This module simulates the operation of humans obtaining additional information from search pages and historical memories.
For content related to the search pages, we use the title as the search term to obtain the results returned on the first page of Google News, and we use only the title and the truncated content on the results page. We use the special separator [CLS] to connect each news article.
For historical memories, we utilize the historical news released by the same media. We first vectorize all titles using Sentence-BERT [24], and then choose the top 10 most similar items based on cosine similarity. For these similar results, we use the special separator [CLS] to connect each news article. In addition, considering the characteristics of news-style text, we found that the beginning and end of a news article also provide relatively important semantic information. In particular, the first sentence or paragraph of a news article often serves as the introduction to the entire article, briefly revealing its content and restating the essential semantics of the title. The last sentence or paragraph contains a summary of the entire article or the major points the author wants to convey. Therefore, we adopted three different content interception methods to recombine the text before feeding it into the subsequent network.
Head truncation. For each news article in the media, we keep the content from the start position up to a specified position, then join the truncated segments with the special separator [SEP].
Tail truncation. For each news article in the media, we keep the content from a specified position, counting backward from the end, then join the truncated segments with the special separator [SEP].
Middle content truncation. For each news article in the media, we keep a fixed-length span of content starting from a specified position, then join the truncated segments with the special separator [SEP].
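The three truncation strategies can be sketched as follows. This is a minimal illustration, not the authors' implementation: whitespace tokenization, the token budget `max_len`, and the helper names are all assumptions, since the paper does not specify exact positions or lengths.

```python
def head_truncate(text, max_len):
    """Head truncation: keep the first max_len tokens of an article."""
    return " ".join(text.split()[:max_len])

def tail_truncate(text, max_len):
    """Tail truncation: keep the last max_len tokens of an article."""
    return " ".join(text.split()[-max_len:])

def middle_truncate(text, start, max_len):
    """Middle content truncation: keep a fixed-length span from `start`."""
    return " ".join(text.split()[start:start + max_len])

def combine(articles, truncate_fn):
    """Truncate every article from the same outlet and join with [SEP]."""
    return " [SEP] ".join(truncate_fn(a) for a in articles)

docs = ["alpha beta gamma delta epsilon", "one two three four five"]
joined = combine(docs, lambda a: head_truncate(a, 3))
# joined == "alpha beta gamma [SEP] one two three"
```

Each truncated view is then fed to the subsequent network as an alternative recombination of the same media's content.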
3.4. Feature Extraction Intra Module
In this work, we use two different intra-module feature extractors to mine the latent semantics of the title, content, searched news results, and historically similar news, written as TEF, CEF, SsEF, and HsEF, respectively. Specifically, each feature extractor has one attention layer and, at most, two Recurrent Neural Network (RNN) layers.
Since all textual articles consist of sentences and words, we naturally write the title and content as $t = \{s^t_1, \ldots, s^t_I\}$ and $c = \{s^c_1, \ldots, s^c_J\}$, where $I$ and $J$ are the numbers of title and content sentences, $s^t_i$ is a sentence with $m$ words, and, similarly, $s^c_j$ represents a sentence containing $n$ words. In this work, we limit the title length to one sentence, i.e., $I = 1$. In addition, we introduce GloVe [25] to vectorize each word into a $d$-dimensional embedding.
Many studies have shown the significant ability of hierarchical structures to extract features [9,26], so we constructed a hierarchical attention network to capture internal semantic information from words to sentences, namely TEF, CEF, SsEF, and HsEF; just as humans read a news text, it reads the entire text from local words up to whole sentences.
Taking the content element as an instance, and considering the information loss of long inputs, we use two Bi-GRU [27] layers to encode words and sentences separately. The first layer is used for word-level encoding and mimics the human forward and reverse reading process:
$$\overrightarrow{h}_{j,k} = \overrightarrow{\mathrm{GRU}}(w_{j,k}), \quad \overleftarrow{h}_{j,k} = \overleftarrow{\mathrm{GRU}}(w_{j,k}), \quad h_{j,k} = [\overrightarrow{h}_{j,k}; \overleftarrow{h}_{j,k}],$$
where $w_{j,k}$ is the $k$-th word of the $j$-th sentence $s^c_j$ of the content, and $h_{j,k}$ is the output feature of the Bi-GRU in both directions around the word $w_{j,k}$.
Due to the strong ability of the attention mechanism in natural language processing [26,28], we use an attention layer to redistribute the weights of words according to their importance. Each word's weight can be calculated as follows:
$$u_{j,k} = \tanh(W_w h_{j,k} + b_w), \quad \alpha_{j,k} = \frac{\exp(u_{j,k}^{\top} u_w)}{\sum_{k} \exp(u_{j,k}^{\top} u_w)},$$
where $W_w$, $b_w$, and $u_w$ are the learnable parameters, $\alpha_{j,k}$ is the attention weight of the word $w_{j,k}$, and $u_{j,k}$ is a hidden representation of $h_{j,k}$ achieved from a fully connected layer with $\tanh$ activation. Therefore, the weighted sentence embedding $s_j$ can be calculated as follows:
$$s_j = \sum_{k} \alpha_{j,k} h_{j,k}.$$
After the attention layer, we apply the second Bi-GRU layer to explore the contextual information of sentences by simulating the human reading process. Therefore, the feature of the $j$-th sentence can be expressed as follows:
$$h_j = [\overrightarrow{\mathrm{GRU}}(s_j); \overleftarrow{\mathrm{GRU}}(s_j)], \quad j \in [1, J],$$
where $h_j$ is the output of the Bi-GRU in two directions around sentence $s_j$.
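The word-level attention pooling can be sketched in NumPy. The parameter names mirror the learnable weights in the equations ($W_w$, $b_w$, $u_w$), but the random values are purely illustrative; this is a sketch of the mechanism, not the authors' implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(H, W_w, b_w, u_w):
    """H: (n_words, d) Bi-GRU outputs for one sentence.
    Returns attention weights and the weighted sentence embedding."""
    U = np.tanh(H @ W_w + b_w)   # hidden representation of each word
    alpha = softmax(U @ u_w)     # importance weight per word
    s = alpha @ H                # weighted sum -> sentence embedding
    return alpha, s

rng = np.random.default_rng(0)
d = 4
H = rng.normal(size=(5, d))      # 5 words, d-dim Bi-GRU outputs
W_w = rng.normal(size=(d, d))
b_w = rng.normal(size=d)
u_w = rng.normal(size=d)
alpha, s = attention_pool(H, W_w, b_w, u_w)
assert np.isclose(alpha.sum(), 1.0)  # weights form a distribution
assert s.shape == (d,)
```

The sentence embeddings produced this way are then passed to the second Bi-GRU layer.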
The structures of SsEF and HsEF are the same as that of CEF, and TEF can be considered a variant of CEF. To give readers a direct understanding of an article, a news editor often needs to describe the entire story as concisely as possible, so a news title usually contains the most critical semantic information: a highly condensed form of the overall content. Therefore, for the news title, we use Bi-LSTM instead of Bi-GRU because the LSTM unit contains more parameters.
3.5. Intra Feature Interaction Module
This module simulates the human reading and understanding process, comprising five interaction modules over the four intra features. As shown in Figure 1, different news articles on the internet can complement and verify each other, which is also one of the bases humans use to distinguish fake news and low-credibility news outlets more effectively [29,30]. In addition, research has shown that mining information from two different components is effective [31,32]. Therefore, replicating the human process of reading and understanding news components, we combine pairs of components and use inter-component feature extractors to separately mine the interaction affinity of each combination.
In this work, we use parallel co-attention [30] as the intra-feature interaction module and simulate the understanding process of each combination according to its interaction affinity, such as the complementarity of the title and the content. The parallel attention operation assigns weights to two different inputs based on their interaction affinity; the higher the interaction affinity, the higher the media credibility perceived by a human. The specific module for two different intra modules is shown in Equations (6)–(13). We calculate the interaction matrix $C$ to measure the interaction affinity between two different intra modules:
$$C = \tanh(A^{\top} W_c B),$$
where $A \in \mathbb{R}^{d \times N}$ and $B \in \mathbb{R}^{d \times Q}$ are the inputs from two different modules, $N$ and $Q$ are the numbers of sentences in each, and $W_c \in \mathbb{R}^{d \times d}$ is a learnable matrix. For the case where four intra modules are accessible, $A$ and $B$ are drawn from the four intra features. The new feature representations of $A$ and $B$ are calculated as follows:
$$H^A = \tanh(W_a A + (W_b B) C^{\top}), \quad H^B = \tanh(W_b B + (W_a A) C),$$
$$\alpha^A = \mathrm{softmax}(w_{ha}^{\top} H^A), \quad \alpha^B = \mathrm{softmax}(w_{hb}^{\top} H^B),$$
$$\hat{a} = \sum_{n=1}^{N} \alpha^A_n a_n, \quad \hat{b} = \sum_{q=1}^{Q} \alpha^B_q b_q,$$
where $W_a$, $W_b$, $w_{ha}$, and $w_{hb}$ are the learnable weight matrices, $\alpha^A$ and $\alpha^B$ respectively represent the interaction co-attention scores of $A$ and $B$, and $\hat{a}$ and $\hat{b}$ are the outputs of $A$ and $B$, which contain each other's weighted information. $[\hat{a}; \hat{b}]$ is the feature output of the attention operation with the two different inputs.
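The parallel co-attention operation can be sketched in NumPy. The dimension sizes, weight names, and random inputs below are illustrative assumptions; the sketch only demonstrates the shape of the computation, not the trained model.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def parallel_co_attention(A, B, W_c, W_a, W_b, w_ha, w_hb):
    """A: (d, N), B: (d, Q). Returns co-attended summaries of each input."""
    C = np.tanh(A.T @ W_c @ B)                # (N, Q) interaction affinity
    H_a = np.tanh(W_a @ A + (W_b @ B) @ C.T)  # (k, N)
    H_b = np.tanh(W_b @ B + (W_a @ A) @ C)    # (k, Q)
    alpha_a = softmax(w_ha @ H_a)             # attention over A's sentences
    alpha_b = softmax(w_hb @ H_b)             # attention over B's sentences
    a_hat = A @ alpha_a                       # (d,) weighted summary of A
    b_hat = B @ alpha_b                       # (d,) weighted summary of B
    return a_hat, b_hat

rng = np.random.default_rng(1)
d, k, N, Q = 6, 4, 3, 5
A, B = rng.normal(size=(d, N)), rng.normal(size=(d, Q))
W_c = rng.normal(size=(d, d))
W_a, W_b = rng.normal(size=(k, d)), rng.normal(size=(k, d))
w_ha, w_hb = rng.normal(size=k), rng.normal(size=k)
a_hat, b_hat = parallel_co_attention(A, B, W_c, W_a, W_b, w_ha, w_hb)
assert a_hat.shape == (d,) and b_hat.shape == (d,)
```

Each pairwise combination of intra features runs through the same operation, yielding one co-attended summary per pair.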
3.7. Media Credibility Evaluation Module
We apply a three-layer MLP to evaluate the credibility level of the news media from the final aggregated feature $F$:
$$\hat{y} = \mathrm{softmax}(\mathrm{MLP}(F)),$$
where $\hat{y}$ represents the evaluated probability of each credibility level, and $\mathrm{MLP}$ represents the three MLP layers used to predict the three labels.
In real scenarios of media credibility evaluation, high-credibility media often account for the majority, while mixed- and low-credibility media account for a relatively small proportion, which is also reflected in the public dataset. To solve this problem, we use the multi-category Focal Loss [33] to alleviate category imbalance. This loss function effectively reduces the weight of easy samples during training so that the model pays more attention to hard, error-prone samples. We provide each category with a different $\alpha$ coefficient in Focal Loss, such as an $\alpha$ coefficient of 0.35 for the high category:
$$FL(p) = -\alpha (1 - p)^{\gamma} \log(p),$$
where $p$ is the predicted probability for the sample's credibility-level label, $\alpha$ is the weight coefficient corresponding to each category, and $\gamma$ is used to control the relative importance of the loss across categories, representing the difference between hard and easy samples.
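The per-sample focal loss can be sketched as follows. Only the 0.35 coefficient for the high category comes from the text; the other $\alpha$ values and $\gamma = 2$ are illustrative assumptions.

```python
import numpy as np

def focal_loss(probs, label, alphas, gamma=2.0):
    """Multi-category focal loss for one sample.
    probs: predicted class probabilities; label: true class index;
    alphas: per-class weight coefficients (0.35 for 'high' per the text)."""
    p = probs[label]
    return -alphas[label] * (1.0 - p) ** gamma * np.log(p)

alphas = {0: 0.40, 1: 0.25, 2: 0.35}  # low, mixed, high (first two assumed)
easy = focal_loss(np.array([0.05, 0.05, 0.90]), 2, alphas)  # well classified
hard = focal_loss(np.array([0.45, 0.35, 0.20]), 2, alphas)  # misclassified
assert hard > easy  # hard samples dominate the loss
```

The $(1 - p)^{\gamma}$ factor is what shrinks the contribution of easy samples: a confidently correct prediction ($p = 0.9$) yields a loss orders of magnitude smaller than a hard one ($p = 0.2$).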
3.8. Regularized Adversarial Training
Despite data preprocessing of the input text, many articles in news media frequently contain a great deal of duplicate information. To improve the model's generalization capability and to lessen the effect of redundant information on performance, we include adversarial training. Adversarial training is a strategy that improves model robustness by introducing adversarial samples into the training process, but it has an obvious disadvantage: it can easily degrade performance due to inconsistencies between the training and testing stages. Ni et al. proposed a regularized adversarial training strategy, R-AT [34], which introduces Dropout and regularizes the output probability distributions of different submodels so that they remain consistent under the same adversarial sample. Specifically, R-AT generates adversarial samples by dynamically perturbing the input embedding vector, then passes each adversarial sample through two Dropout-sampled submodels to produce two probability distributions. By reducing the bidirectional KL divergence between the two output distributions, R-AT regularizes the model's predictions.
In this work, we adopt the R-AT strategy and use the Focal Loss function in place of the original cross-entropy loss function. In each training step, the forward propagation and backward computation on the original samples are completed first. Following the original R-AT configuration, we use the Fast Gradient Method (FGM) [35] proposed by Miyato et al. to perturb the language-model layer and obtain two adversarial samples. The corresponding classification losses and the KL divergence of the adversarial samples are calculated as follows:
$$\mathcal{L}_{FL} = FL(\hat{y}^{adv}_{1}) + FL(\hat{y}^{adv}_{2}),$$
$$\mathcal{L}_{KL} = \mathrm{KL}(\hat{y}^{adv}_{1} \,\|\, \hat{y}^{adv}_{2}) + \mathrm{KL}(\hat{y}^{adv}_{2} \,\|\, \hat{y}^{adv}_{1}).$$
Finally, the loss of the second backward propagation is obtained:
$$\mathcal{L}_{adv} = \mathcal{L}_{FL} + k \, \mathcal{L}_{KL},$$
where $\hat{y}^{adv}$ is the output of the model under the FGM perturbation, and $\mathrm{KL}(\cdot \,\|\, \cdot)$ denotes the KL distance function. We rewrite the original loss function of R-AT by replacing its fixed weight parameter with an adaptive parameter $k$, which adjusts the weights of the multiple loss terms during the second forward propagation.
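The combination of the two adversarial classification losses with the bidirectional KL consistency term can be sketched as below. The gradient-based FGM perturbation itself is omitted (it requires backpropagation through the model); the function names, probabilities, and loss values are hypothetical placeholders for illustration.

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete probability distributions."""
    return float(np.sum(p * np.log(p / q)))

def rat_loss(fl_1, fl_2, p1, p2, k):
    """Combine the two adversarial-sample focal losses with the
    bidirectional KL consistency term, weighted by the adaptive k."""
    l_kl = kl(p1, p2) + kl(p2, p1)  # bidirectional KL divergence
    return fl_1 + fl_2 + k * l_kl

p1 = np.array([0.20, 0.30, 0.50])   # outputs of the two Dropout submodels
p2 = np.array([0.25, 0.25, 0.50])
loss = rat_loss(0.4, 0.5, p1, p2, k=0.1)
assert loss > 0.9  # KL term adds a non-negative penalty on top of 0.4 + 0.5
```

When the two submodels agree exactly, the KL term vanishes and the total reduces to the sum of the two focal losses; disagreement between the Dropout submodels is what the regularizer penalizes.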