1. Introduction
With the proliferation of social media platforms such as Twitter (https://twitter.com/, accessed on 3 November 2023) and Sina Weibo (https://weibo.com/, accessed on 3 November 2023) and the rapid development of smart mobile devices, people increasingly tend to consume news from social media platforms rather than from traditional news sources [1]. According to a report by the Pew Research Center, more than half of Twitter users regularly access news on the site [2].
The anonymity and openness of social media enable users to consume and share news as well as to generate real-time information. When events such as earthquakes or accidents occur, smart mobile devices can act as real-time news sensors, allowing people to immediately upload information to social media. This has greatly changed the propagation and timeliness of traditional news media.
Nevertheless, the convenience of information dissemination on social media platforms also facilitates the proliferation of rumors. Rumors generally refer to information whose truth and sourcing are unreliable, and they are particularly likely to emerge in emergency situations [3]. Notably, most rumors exhibit distinct characteristics that enable them to propagate faster, deeper, and further throughout social networks [4]. Beyond the inadvertent spreading of rumors, social media users may deliberately initiate and circulate rumors using sophisticated generative models, often motivated by commercial or political interests. Startlingly, it has been reported that more than a third of trending events on microblogs contain rumors [5].
The spread of rumors can pose significant threats to the credibility of the internet and have far-reaching real-life consequences, including causing public panic, disrupting the social order, eroding government credibility, and even endangering national security [6]. A notable case of rampant rumor propagation occurred during the 2016 U.S. presidential election. During the election, as many as 529 different rumor stories pertaining to presidential candidates Donald Trump and Hillary Clinton were spreading on Twitter, instantly reaching millions of voters and potentially influencing the election’s outcome [7]. A more recent example revolves around the plethora of rumors regarding the COVID-19 pandemic [8]. These rumors on social platforms have significantly undermined the credibility and reliability of information shared on these platforms, consequently diminishing users’ willingness to turn to social media for information. A 2021 survey conducted by the Pew Research Center [2] further underscores this decline in trust and reliance on social media for news. It revealed a decrease in the percentage of adult American users who frequently or occasionally obtain news from social media platforms, dropping from 53% in 2020 to 48% in 2021. This decline coincides with mounting criticism directed at social media and technology companies for their perceived inadequacy in curbing the spread of misleading information on their platforms. Therefore, it is of paramount importance to detect rumors spreading on social media platforms as early as possible.
Rumor detection has attracted significant attention from both social media platforms and researchers over the past decade. Typically, users on various social media platforms are encouraged to report or annotate suspicious posts as potential rumors. Subsequently, the veracity of these potential rumors is verified with the assistance of human moderators and third-party fact-checkers. While this approach yields high-quality results, the substantial human effort required, including manual labeling and rumor verification, is difficult to reconcile with the sheer volume of emerging rumors. Therefore, there is a need for robust and efficient automated rumor detection approaches.
Automatic rumor detection is normally deemed a binary classification task, in which classifiers are employed to distinguish between rumors and non-rumors. These methods encompass a range of approaches, including traditional machine learning models [3,9] and neural network-based approaches [10,11,12], which all follow a supervised learning paradigm. In this paradigm, posts are first transformed into representations, which are then fed into a supervised learning model guided by ground-truth labels. Traditional machine learning-based approaches often rely on hand-crafted features, while neural network-based models automatically learn latent deep feature representations of rumors. However, both approaches require a sizable annotated dataset, such as RUMDECT [10] or PHEME [13], for training reliable classifiers.
While the aforementioned methods have demonstrated promising results, they face several significant challenges, as highlighted by previous research in the field of automatic rumor detection. One of the most critical challenges pertains to the labor-intensive and costly nature of constructing rumor datasets [14]. Labeling rumors within the ever-flowing stream of social media is a resource-intensive task associated with substantial costs. To illustrate this, consider the Sina Community Management Center’s rumor reporting process (https://service.account.weibo.com/, accessed on 3 November 2023) depicted in Figure 1.
A social media user must navigate through three stages for rumor reporting: the reporting stage, the evidence stage, and the results announcement stage. The evidence stage demands that the reporting user provide proof that the post in question is indeed a rumor. Subsequently, this evidence is scrutinized by experts from the Sina platform. This process is both time-consuming and financially burdensome.
Moreover, the rapid advancement of artificial intelligence (AI), particularly the emergence of generative models such as Generative Adversarial Networks (GANs) and diffusion models, has led to an increase in manipulated multimodal rumors. These rumors may incorporate image, audio, and video data, rendering them increasingly challenging for ordinary social media users to differentiate from genuine content. A notable example is the use of DeepFakes, which leverage deep learning models to fabricate audio and video clips of real individuals uttering or performing actions that never actually occurred. This makes rumors appear both more realistic and harder to discern [15,16].
Furthermore, certain rumors may contain domain-specific knowledge and can only be debunked by experts in the respective field. Annotation of previously unseen rumors often requires in-depth domain knowledge. A notable example occurred during the COVID-19 pandemic, when rumors such as “5G caused the virus” or “facemasks do not work” had to be confirmed as false by professional or authoritative medical experts rather than ordinary social media users. In more challenging scenarios, slight modifications to aspects of a non-rumor can lead to the creation of new and more convincing rumors. For instance, altering details such as the timing, location, or individuals associated with a non-rumor event can result in the fabrication of a convincing rumor. In such situations, it becomes significantly more arduous for experts to distinguish rumors from normal posts, making it a time-consuming and domain knowledge-intensive task.
Despite the growing volume of posts on social media platforms, including rumors, obtaining high-quality, large-scale, and authoritative benchmark datasets remains a daunting task. In comparison to benchmarks such as ImageNet [17], which contains 14,197,122 images and serves as a standard in visual object recognition, Table 1 shows that datasets used in recent research on rumor detection are relatively small in scale or confined to specific rumor categories.
This discrepancy underscores the need for developing comprehensive benchmark datasets, particularly in the current revolutionary era in deep learning. This epoch is frequently characterized by the phrase “Data is the new oil” [24], signifying the pivotal role of data in driving advancements across various tasks and applications through data-driven learning approaches. These approaches place heightened demands on both the quality and quantity of data. It is crucial to recognize that the size and quality of datasets wield a profound influence on the performance and scalability of state-of-the-art (SOTA) rumor detection models [25].
In addition to the aforementioned challenges around labeling rumors and the limited scale of datasets, the performance of learned models may deteriorate due to concept drift. This phenomenon occurs when the distribution of features related to rumors changes over time. Typically, mitigating concept drift requires the continuous annotation of new datapoints and regular model updates. Unfortunately, this practice can be both costly and impractical. In summary, the field of automatic rumor detection faces a significant challenge in large-scale data annotation.
To address the challenges associated with rumor detection, an intuitive idea is to selectively label valuable data instead of annotating the entire dataset for training rumor detection models. Active Learning (AL) has emerged as a promising solution to overcome the key challenges outlined earlier. As a subfield of machine learning, active learning aims to create efficient training datasets by iteratively enhancing model performance through strategic sample selection. The goal is to achieve or even surpass the expected model performance with as few labeled samples as possible [26].
Active learning recognizes that not all samples in a dataset are equally crucial for training a machine learning model. Therefore, it intelligently selects a subset of the dataset for labeling by an oracle, such as a human annotator, to optimize model performance. This approach mitigates the labeling bottleneck and minimizes the costs associated with acquiring labeled data. Consequently, active learning is well-suited for rumor detection scenarios, in which a surplus of unlabeled data is available from real social media streams while labeled data remain a costly resource.
Despite the existence of comparative studies across various tasks and domains, active learning has not been extensively explored in the context of rumor detection. In this work, we present a comparative analysis of active learning techniques for rumor detection on social media platforms, aiming to answer the following key questions:
Can active learning effectively reduce labeling costs in the context of rumor detection while maintaining high performance?
Which active learning query strategies are most suitable for specific rumor detection methods?
This research seeks to shed light on the potential of active learning in improving rumor detection while addressing the practical challenges associated with labeling large datasets. Hence, we evaluate the feasibility of utilizing active learning for rumor detection on social media platforms. To assess the effectiveness of active learning, we conduct a comparative analysis of multiple supervised machine learning methods. Our evaluation is performed on two distinct datasets, and we explore how much active learning can reduce the required sample size as well as how it influences various supervised machine learning models. The significant contributions of our work can be summarized as follows:
To the best of our knowledge, this is the first comprehensive and comparative investigation of rumor detection using active learning, addressing an important gap in the literature.
We examine active learning query strategies suitable for different supervised learning models in the context of automatic rumor detection within pool-based scenarios.
Through extensive evaluation on Twitter and Weibo datasets, we demonstrate that active learning achieves faster convergence with a limited amount of annotated data, offering practical benefits for rumor detection.
The rest of this paper is organized as follows. In Section 2, we provide a comprehensive review of the related literature. Section 3 outlines the process of automatic rumor detection using active learning. Section 4 presents our experimental setup. In Section 5, we discuss our experimental results. Finally, Section 6 concludes the paper and discusses directions for future work.
3. Methodology
The aim of this paper is to assess the effectiveness of various active learning query strategies in the context of rumor detection models. Specifically, we seek to determine whether active learning enhances rumor detection performance. Our study involves comparing different query strategies and their application to different datasets using various machine learning models. In particular, we aim to identify the most suitable query strategy for different rumor detection models. In this section, the following three aspects are introduced: active learning, active learning query strategies, and rumor detection classifiers.
3.1. Active Learning
Active learning is a subfield of machine learning and artificial intelligence. It falls under the category of semi-supervised machine learning, where a learning model can iteractively request information from the user or another information source to obtain desired outputs at new datapoints [
26]. In the statistics literature, it is sometimes known as “query learning” or “optimal experimental design” [
38]. Active learning encompasses various problem scenarios in which a machine learning model can ask queries, such as membership query synthesis, stream-based selective sampling, and pool-based active learning. In the context of rumor detection, as discussed earlier, it is often possible to gather a substantial volume of unlabeled data, aligning with the common scenario in pool-based active learning.
Figure 2 illustrates the typical workflow cycle of active learning in pool-based scenarios. The raw dataset for rumor detection contains a small portion of labeled data and a large amount of unlabeled data, designated $\mathcal{U}$. The labeled dataset is divided into an initial training dataset $\mathcal{L}$ and a test dataset $\mathcal{T}$ according to a certain proportion. Let $(x, y)$ be an instance in the raw dataset, where $x$ is a $d$-dimensional feature vector and $y$ is its corresponding label. A machine learning model, denoted as $\theta$, begins with the small labeled training dataset $\mathcal{L}$ and undergoes standard supervised training to establish an initial model. This initial model is then applied to the unlabeled dataset $\mathcal{U}$, with $\hat{y}$ representing the predicted label for an instance $x$.
A query strategy is employed to compute a measurement criterion from the predictions $\hat{y}$ on $\mathcal{U}$, which is used to select one or a few instances from $\mathcal{U}$. The selected unlabeled instances are typically informative or representative samples and are referred to as query instances. The query instances are then sent to an oracle for labeling. Once the query instances have been labeled by the oracle, they are added to the training dataset $\mathcal{L}$. The machine learning model is then retrained on the updated labeled dataset and tested on the test dataset $\mathcal{T}$ to evaluate its current performance. This process is repeated until the model achieves satisfactory performance on $\mathcal{T}$ or until specific preset conditions are met.
The primary objective of active learning is to maximize the model’s effectiveness while minimizing the number of samples that require manual labeling. The main challenge lies in identifying informative or representative query instances that facilitate the rapid convergence of model training.
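To make the pool-based workflow above concrete, the following minimal sketch (in Python with scikit-learn) runs the loop with least-confidence uncertainty sampling as the query strategy; the synthetic data, logistic regression learner, batch size of one, and fixed number of iterations are illustrative assumptions rather than the exact setup used in our experiments.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Illustrative data: in practice, X would hold features extracted from posts.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

labeled_idx = list(range(20))                 # small initial labeled set L
unlabeled_idx = list(range(20, len(X_pool)))  # large unlabeled pool U

model = LogisticRegression(max_iter=1000)
for iteration in range(30):                   # preset stopping condition
    # 1) Train the model on the current labeled set L.
    model.fit(X_pool[labeled_idx], y_pool[labeled_idx])

    # 2) Evaluate the current model on the held-out test set.
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"iter {iteration:2d}  |L|={len(labeled_idx):3d}  test acc={acc:.3f}")

    # 3) Query strategy: pick the least-confident instance in the pool U.
    proba = model.predict_proba(X_pool[unlabeled_idx])
    query = unlabeled_idx[int(np.argmax(1.0 - proba.max(axis=1)))]

    # 4) The oracle labels the query instance (simulated here by the known
    #    label), and the instance moves from U to L before retraining.
    labeled_idx.append(query)
    unlabeled_idx.remove(query)
```

In a real deployment, the simulated label lookup in step 4 would be replaced by a human annotator or fact-checker acting as the oracle.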
3.2. Active Learning Query Strategy
As mentioned before, the critical challenge in the workflow cycle of active learning lies in selecting an appropriate query strategy, known as a selector. The query strategy evaluates the “worthiness” of unlabeled samples using a specific criterion and determines whether a sample is worthy of annotation based on its suitability. Therefore, choosing the right query strategy is pivotal in enabling the model to converge effectively with minimal training data. The choice of query strategy holds significant implications in active learning.
To date, numerous strategies have been proposed in the literature for querying unlabeled instances. These query strategies can be categorized into three main groups based on the nature of the instances they select: informative-based, representative-based, and both of these in combination. Informative-based strategies focus on the informativeness of unlabeled instances, prioritizing those with higher information content for labeling by the oracle. Typically, the informativeness of unlabeled data is assessed based on the model’s uncertainty. However, informative-based strategies may overlook relationships among unlabeled instances, and often lead to the selection of multiple instances of a similar type.
On the other hand, representative strategies aim to make efficient use of the structure within the unlabeled data when selecting candidate query instances. Additionally, they strive to address the challenges encountered by informative query strategies. Representative strategies can help to alleviate the issue of sampling bias by selecting instances from diverse regions within the input space. Combining informative and representative strategies can strike a balance between measures of informativeness and representativeness. It is worth noting that an increase in the informativeness of a selected instance may come at the cost of reduced representativeness.
In the following subsections, we provide a detailed description of the strategies employed in this paper.
3.2.1. Uncertainty Sampling
Uncertainty sampling is a typical informative-based strategy and is the most popular query strategy for active learning. It assumes that uncertain samples, once labeled, provide more information for training a machine learning model. The rationale is that instances with lower certainty are typically located near the classification decision boundary, while highly certain instances are usually far from it; instances that are distant from the decision boundary are therefore often considered redundant. The uncertainty sampling strategy selects the instance with the lowest prediction confidence under the current machine learning model as the query instance.
Common criteria for evaluating the uncertainty include least confidence, uncertainty margin, and entropy.
Least confidence is a strategy based on prediction uncertainty. It measures uncertainty as the level of confidence in the most likely label, i.e., the posterior probability of the top class label for a given instance. The uncertainty of an unlabeled instance is defined by Equation (1):
$\phi_{LC}(x) = 1 - P_{\theta}(\hat{y} \mid x), \qquad x^{*}_{LC} = \arg\max_{x \in \mathcal{U}} \phi_{LC}(x), \quad (1)$
where $P_{\theta}(\hat{y} \mid x)$ is the probability of the top class label $\hat{y}$ with the highest posterior probability for instance $x$, $\mathcal{U}$ represents the unlabeled data pool, and $\phi_{LC}(x)$ is the uncertainty score of the query instance. The least confidence criterion strives to find the instance that the current model finds most indistinguishable and selects it as the query instance.
Least confidence considers only the probability of the best predicted class label, ignoring the information from the other class labels. As an improved query strategy, margin sampling calculates the difference between the two most confident posterior probabilities, as defined by Equation (2):
$x^{*}_{M} = \arg\min_{x \in \mathcal{U}} \left( P_{\theta}(\hat{y}_{1} \mid x) - P_{\theta}(\hat{y}_{2} \mid x) \right), \quad (2)$
where $P_{\theta}(\hat{y}_{1} \mid x)$ and $P_{\theta}(\hat{y}_{2} \mid x)$ are the top-1 and top-2 posterior probabilities, respectively. The instance with the smallest margin is regarded as hard to classify and is selected for labeling.
To further take the information of all classes into account, another typical measure of uncertainty is entropy, which is defined by Equation (3):
$x^{*}_{H} = \arg\max_{x \in \mathcal{U}} \left( -\sum_{c=1}^{C} P_{\theta}(y_{c} \mid x) \log P_{\theta}(y_{c} \mid x) \right), \quad (3)$
where $C$ is the number of classes. Entropy measures the purity of the predicted class distribution for a sample, with larger entropy denoting higher uncertainty. The instance with the largest entropy is selected as the query instance.
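As a brief illustration of the three criteria, the scores below are computed directly from a model’s predicted class probabilities; the probability matrix is a made-up example for a binary rumor/non-rumor task.

```python
import numpy as np

def least_confidence(proba):
    # phi_LC(x) = 1 - P(y_hat | x): higher means more uncertain.
    return 1.0 - proba.max(axis=1)

def margin(proba):
    # Difference between the top-1 and top-2 posterior probabilities;
    # smaller margins indicate harder-to-classify instances.
    part = np.sort(proba, axis=1)
    return part[:, -1] - part[:, -2]

def entropy(proba):
    # -sum_c P(y_c | x) log P(y_c | x); larger entropy means higher uncertainty.
    return -np.sum(proba * np.log(proba + 1e-12), axis=1)

# Predicted probabilities for four unlabeled posts (rumor vs. non-rumor).
proba = np.array([[0.95, 0.05],
                  [0.60, 0.40],
                  [0.52, 0.48],
                  [0.80, 0.20]])

print(np.argmax(least_confidence(proba)))  # least-confident instance
print(np.argmin(margin(proba)))            # smallest-margin instance
print(np.argmax(entropy(proba)))           # highest-entropy instance
```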
3.2.2. Query by Committee (QBC)
QBC is another typical informative-based strategy; it is based on the disagreement within an ensemble of learners. In this strategy, a committee is formed by training multiple classifiers on different subsets of instances drawn from the labeled dataset. The fundamental assumption of QBC is that the different classifiers should be consistent with the provided labeled data instances. Hence, the query instance is selected as the unlabeled instance on which the committee members disagree most in their label predictions.
There are two ways to construct this committee, namely, bagging and boosting. In query by bagging, a committee of m classifiers is created by applying bootstrap aggregating, which involves randomly sampling with replacement m times from the labeled training data. In query by boosting, the committee is instead built by applying a boosting procedure to the labeled training data, so that each successive classifier focuses on the instances that earlier classifiers found hard to classify.
There are two kinds of indicators for measuring disagreement, namely, vote entropy and the average Kullback–Leibler (KL) divergence. Vote entropy identifies the instances with the largest entropy among the predicted class labels; such instances are considered hard samples and are selected as query instances for labeling. The average KL divergence measure identifies the most informative query as the one with the largest average difference between the label distributions of any one committee member and the consensus [26].
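A possible sketch of query by bagging with vote-entropy disagreement is shown below; the committee size, decision tree base learner, and integer label encoding are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def vote_entropy(committee, X_unlabeled, n_classes=2):
    """Vote-entropy disagreement of the committee for each unlabeled instance."""
    votes = np.stack([clf.predict(X_unlabeled) for clf in committee])  # shape (m, n)
    entropies = []
    for col in votes.T:
        fractions = np.bincount(col, minlength=n_classes) / len(committee)
        nonzero = fractions[fractions > 0]
        entropies.append(-np.sum(nonzero * np.log(nonzero)))
    return np.array(entropies)

def query_by_bagging(X_labeled, y_labeled, X_unlabeled, m=5, seed=0):
    """Build a bootstrap committee and return the index of the most disputed instance."""
    rng = np.random.default_rng(seed)
    committee = []
    for _ in range(m):
        # Bootstrap sample (with replacement) of the labeled data.
        idx = rng.integers(0, len(X_labeled), size=len(X_labeled))
        clf = DecisionTreeClassifier(random_state=0).fit(X_labeled[idx], y_labeled[idx])
        committee.append(clf)
    disagreement = vote_entropy(committee, X_unlabeled)
    return int(np.argmax(disagreement))
```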
3.2.3. Expected Error Reduction (EER)
EER is an informative-based strategy that selects the next instance which maximally reduces the generalization “error” or “loss” in expectation [39]. It takes into account the uncertainty or informativeness of unlabeled instances and measures the potential impact of querying them on the overall error reduction of the learning model.
The key idea behind EER is to estimate the expected reduction in error that can be achieved by labeling specific instances. It involves selecting those samples expected to have the greatest impact on improving the model’s performance. This is achieved by considering the uncertainty or lack of confidence in the current predictions made by the learning model. The intuition is that by querying instances that are difficult to classify or that lie near the decision boundary, the model can obtain crucial information to refine its decision-making process.
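A brute-force sketch of this idea, assuming a probabilistic scikit-learn classifier and the average entropy over the remaining pool as the error estimate, is given below; in practice the candidate set is usually subsampled, since every candidate and every possible label requires retraining the model.

```python
import numpy as np
from sklearn.base import clone

def expected_error_reduction(model, X_labeled, y_labeled, X_pool, candidates):
    """Return the candidate index whose labeling minimizes the expected pool loss."""
    model.fit(X_labeled, y_labeled)
    proba = model.predict_proba(X_pool)          # current P(y | x) over the pool
    best_idx, best_expected_loss = None, np.inf

    for i in candidates:
        expected_loss = 0.0
        for label in range(proba.shape[1]):
            # Pretend the oracle assigned `label` to candidate i and retrain.
            X_aug = np.vstack([X_labeled, X_pool[i:i + 1]])
            y_aug = np.append(y_labeled, label)
            tmp = clone(model).fit(X_aug, y_aug)
            # Expected future error: average entropy of the retrained model
            # over the remaining pool, weighted by the current P(label | x_i).
            p = tmp.predict_proba(np.delete(X_pool, i, axis=0))
            pool_loss = -np.mean(np.sum(p * np.log(p + 1e-12), axis=1))
            expected_loss += proba[i, label] * pool_loss
        if expected_loss < best_expected_loss:
            best_idx, best_expected_loss = i, expected_loss
    return best_idx
```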
3.2.4. Graph Density Strategy
The graph density strategy is a representative-based strategy that employs a graph structure to identify the most representative unlabeled datapoints. The underlying intuition of the graph density strategy is that representative data points for a specific class are typically well-embedded in the graph structure, resulting in many edges $e_{ij}$ with high weights. To implement the graph density strategy [40], a $k$-nearest neighbor graph is constructed in which $e_{ij} = 1$ if $d(x_i, x_j)$ is one of the $k$ smallest distances of $x_i$ under the Manhattan distance $d$. The strategy then uses a weight matrix $W$ with a Gaussian kernel to rank all data points based on their representativeness, as defined in Equation (4):
$\mathrm{rank}(x_i) = \frac{1}{k} \sum_{j} e_{ij} W_{ij}, \qquad W_{ij} = \exp\!\left( -\frac{d(x_i, x_j)}{2\sigma^{2}} \right). \quad (4)$
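A compact sketch of one such ranking, assuming a Manhattan-distance k-nearest neighbor graph and Gaussian-kernel edge weights (the kernel width sigma and the value of k are illustrative parameters), is shown below.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def graph_density_ranking(X_unlabeled, k=10, sigma=1.0):
    """Rank unlabeled points by how densely they are embedded in a k-NN graph."""
    d = pairwise_distances(X_unlabeled, metric="manhattan")
    # e_ij = 1 if x_j is among the k nearest neighbors of x_i (excluding itself).
    neighbors = np.argsort(d, axis=1)[:, 1:k + 1]
    e = np.zeros_like(d)
    rows = np.repeat(np.arange(len(X_unlabeled)), k)
    e[rows, neighbors.ravel()] = 1.0
    # Gaussian-kernel weights on the neighbor edges.
    w = e * np.exp(-d / (2.0 * sigma ** 2))
    density = w.sum(axis=1) / k
    return np.argsort(-density)  # most representative points first
```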
3.2.5. Querying Informative and Representative Examples (QUIRE)
QUIRE combines the informative and representative strategies, taking a min–max view of active learning and providing a systematic way to measure and combine informativeness and representativeness. QUIRE measures both the informativeness and representativeness of an instance; specifically, the informativeness of an instance $x$ is measured using its prediction uncertainty based on the labeled data, while the representativeness of $x$ is measured by its prediction uncertainty based on the unlabeled data [41].
3.2.6. Information Density Weighted Strategy
The information density weighted strategy [42] is another combination of the informative and representative strategies. Informative-based strategies may tend to select unlabeled instances that lie along the classification boundaries even when these instances are outliers that are not representative of the broader distribution in the input space. This strategy therefore introduces the concept of information density (ID), as defined in Equation (5):
$x^{*}_{ID} = \arg\max_{x \in \mathcal{U}} \; \phi_{A}(x) \times \left( \frac{1}{|\mathcal{U}|} \sum_{x' \in \mathcal{U}} \mathrm{sim}(x, x') \right)^{\beta}, \quad (5)$
where $\phi_{A}(x)$ measures the “base” informativeness of an unlabeled instance $x$, the term in parentheses in Equation (5) represents the average similarity of $x$ to all other unlabeled instances in $\mathcal{U}$, and the parameter $\beta$ controls the relative importance of the representativeness term. The information density weighted strategy effectively combines uncertainty and diversity in active selection.
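A simple sketch of this weighting, using least confidence as the base informativeness measure and average cosine similarity over the unlabeled pool as the representativeness term (both illustrative choices, with beta = 1 by default), could look as follows.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def information_density_query(model, X_unlabeled, beta=1.0):
    """Select the instance maximizing uncertainty weighted by average similarity."""
    proba = model.predict_proba(X_unlabeled)
    base_informativeness = 1.0 - proba.max(axis=1)   # phi_A(x): least confidence
    sim = cosine_similarity(X_unlabeled)             # similarity of x to the whole pool
    avg_similarity = sim.mean(axis=1)
    scores = base_informativeness * (avg_similarity ** beta)
    return int(np.argmax(scores))
```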
3.3. Rumor Detection Classifier
In this paper, we explore a wide range of supervised learning classification models for rumor detection and subject them to extensive study using different active learning strategies. These classifiers include LR, SVM, DTC, NB, RFC, KNN, the Gaussian Process (GP) classifier, Multi-Layer Perceptron (MLP), Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), and AdaBoost (Ada). Unless otherwise specified, all model parameters were set to their default values.
We employed two LR models: one trained with the standard approach and the other utilizing Stochastic Gradient Descent (SGD). For simplicity, we refer to these as LR and LR(SGD), respectively. Additionally, we employed three SVM classifier models: one with a linear kernel, denoted as SVM(Linear), another with a Radial Basis Function (RBF) kernel, denoted as SVM(RBF), and the third with an RBF kernel trained using SGD, which we denote as SVM(SGD). It is important to note that, unlike LR, SVM models with a linear kernel or trained with SGD lack the ability to predict probabilities for object classes. This limitation confines their usage to representative-based query strategies. Furthermore, we assessed the performance of DTC using both the Gini and Entropy criteria, which are denoted as DTC(Gini) and DTC(Entropy), respectively.
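For reference, one plausible way to instantiate these classifier variants with scikit-learn is sketched below (default parameters unless noted); this mapping is an assumption for illustration rather than the exact object construction used in our experiments, and the kernel approximation used for SVM(SGD) is one possible realization of an RBF kernel trained with SGD.

```python
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

classifiers = {
    "LR": LogisticRegression(),
    "LR(SGD)": SGDClassifier(loss="log_loss"),       # logistic loss optimized with SGD
    "SVM(Linear)": SVC(kernel="linear"),              # no class probabilities by default
    "SVM(RBF)": SVC(kernel="rbf", probability=True),
    "SVM(SGD)": make_pipeline(RBFSampler(), SGDClassifier(loss="hinge")),  # approx. RBF + SGD
    "DTC(Gini)": DecisionTreeClassifier(criterion="gini"),
    "DTC(Entropy)": DecisionTreeClassifier(criterion="entropy"),
    "NB": GaussianNB(),
    "RFC": RandomForestClassifier(),
    "KNN": KNeighborsClassifier(),
    "GP": GaussianProcessClassifier(),
    "MLP": MLPClassifier(),
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "Ada": AdaBoostClassifier(),
}
```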
Feature extraction is one of the most crucial phases of supervised machine learning, and has a significant influence on classification accuracy. Researchers seeking to achieve better rumor classification performance have experimented with combinations of various features and supervised machine learning classifiers. Several common features, including content-based, user-based, propagation-based and behavior-based features, were selected for experimentation in our study. The diverse features extracted from online social media posts play a vital role in rumor detection using machine learning models.
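As a purely illustrative example, hand-crafted features of these kinds can be assembled into a single vector per post; the post fields and the specific features below are hypothetical and do not correspond exactly to the feature set used in our experiments.

```python
import re

def extract_features(post):
    """Build a simple hand-crafted feature vector from a (hypothetical) post dict."""
    text = post["text"]
    return [
        len(text),                                # content: text length
        text.count("?") + text.count("!"),        # content: question/exclamation marks
        len(re.findall(r"https?://\S+", text)),   # content: number of URLs
        text.count("#"),                          # content: number of hashtags
        post["user_followers"],                   # user: follower count
        int(post["user_verified"]),               # user: verified account flag
        post["retweets"],                         # propagation: retweet count
        post["comments"],                         # behavior: comment count
    ]

example_post = {
    "text": "Breaking: 5G towers cause illness!? http://example.com #5G",
    "user_followers": 120,
    "user_verified": False,
    "retweets": 37,
    "comments": 12,
}
print(extract_features(example_post))
```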