1. Introduction
With the proliferation of information online, we are now subjected to a barrage of advertisements and news headlines on virtually every page we access. Because so many people have access to the internet, websites and news outlets are constantly competing for viewers. As a result, they are under pressure to create ever more appealing, catchy and provocative article headlines, regardless of their accuracy. Consequently, recent years have seen a rise in sensationalist “news” headlines that tell readers nothing of value about the story but are designed to grab their attention [
1]. The term ‘clickbait’ refers to misleading links with sensationalized headlines that intend to attract the viewers’ attention and entice them to click on the link [
2]. In a broad sense, clickbait headlines meet two main criteria: (1) they mislead readers about the article’s contents, and (2) they exploit the so-called “curiosity gap” by withholding part of the article’s contents. Put simply, the text of these headlines either makes the reader curious about the rest of the article or promises topics that are not addressed in the body of the article itself. It is important to draw a clear distinction between clickbait and fake news, a topic that has received growing attention of late. The difference is that fake news deliberately presents its audience with information its creators know to be false in order to gain their trust. Clickbait, on the other hand, usually consists of “junk” news that lacks real journalistic value but is not designed to trick the reader into believing false claims. Clickbait can pose a threat to Internet users and has become more prevalent across the web, not just on less reputable sites [
3]. Recent research from Stanford University highlights how clickbait is making its way onto more reputable journalism sites [
4]. Clickbait can have an even more malicious purpose, such as phishing for personal information or, even worse, hosting malware. For these reasons, clickbait is a serious issue that must be addressed. The first step in addressing this problem is to distinguish clickbait headlines from legitimate ones. On the surface, clickbait headlines can be hard to recognize, as they are designed to fool the user, but there are key semantic features that can help identify them. AI tools based on machine learning algorithms can detect and block clickbait in a systematic manner. Machine learning unlocks the power of data in novel ways [
5]. This technology helps computer systems learn from and improve on their experience by creating programs that can automatically access data and perform tasks through predictions and detections. As more data are fed into the system, the algorithms learn from it and the results improve [
6]. Several works based on machine learning techniques have been proposed for clickbait classification.
Razaque et al. [
7] developed ClickBaitSecurity, which uses a recurrent neural network (RNN) to distinguish between legitimate and illegitimate links. In comparison to existing solutions, the test results showed that their proposed model detects malicious and safe links with high accuracy.
Shang et al. [
8] introduced a content-agnostic scheme, Online Video Clickbait Protector (OVCP), to effectively detect clickbait videos by analyzing the comments left by viewers of the video. Unlike other solutions, OVCP does not directly analyze the video’s content and pre-click data. As a result, it is resistant to sophisticated content creators who frequently create clickbait videos that can evade current clickbait detectors. Their experiments proved that OVCP could accurately identify clickbait videos.
Using social media datasets, Liao et al. [
9] proposed federated hierarchical hybrid networks to build clickbait detection models. The titles and contents are stored by different parties whose relationships must be exploited for clickbait detection. In comparison to other cutting-edge methods, their proposed approach demonstrated high efficacy.
Agrawal et al. [
10] introduced a compiled clickbait corpus and proposed a model for detecting clickbait using convolutional neural networks (CNN). The corpus was built from various social media platforms, and deep learning was used to learn the features. The model outperformed other models in detecting clickbait.
Setlur et al. [
11] presented a semi-supervised classification-based approach utilizing attentions sampled from a Gumbel–Softmax distribution. An additional loss over the attention weights was applied to encode prior knowledge. The authors also presented a confidence network, which enables learning over weak labels and improves resiliency to noisy labels. According to the results, the model achieved over 97% accuracy with only 30% of strongly labeled samples.
Fakhruzzaman et al. [
12] proposed a neural network classifier based on a pre-trained multilingual bidirectional encoder representations from transformers (M-BERT) language model to classify clickbait and non-clickbait headlines. The model was evaluated on a dataset of 6632 headlines using five-fold cross-validation, achieving an F1-score of 0.914.
Thomas et al. [
13] presented a system based on the fusion of neural networks, which incorporates various forms of available data. The proposed system requires no linguistic preprocessing and generalizes more quickly to new domains and languages. The model achieved an F1-score of 0.564.
Kumar et al. [
14] proposed a bidirectional LSTM with an attention mechanism to learn the extent to which each word contributes differently to the clickbait score of a social media post. They also used a Siamese network to detect similarities between the source and target data. To further enrich the model, they used a CNN to learn image embeddings from large amounts of data. Their experiments were carried out on a test corpus of 19,538 social media posts and achieved an F1-score of 0.65.
Cao et al. [
15] used a random forest regression algorithm to create a computational clickbait detection system. A dataset of over 21,000 headlines/titles was used and the 60 most relevant features were extracted. On the clickbait class, the model achieved an F1-score of 0.61.
While previous studies have attempted to address this issue, they have limitations. For example, some studies rely solely on lexical analysis or shallow features, which may not fully capture the semantic meaning of headlines. Others do not consider the impact of different machine learning techniques on classification accuracy. To address these limitations, this paper presents an effective method for categorizing clickbait and non-clickbait headlines using semantic analysis and machine learning techniques. Thirty unique semantic features were investigated and six machine learning classification algorithms were explored, both individually and as ensembles: decision tree, logistic regression, naïve Bayes, support vector machine, k-nearest neighbor and gradient-boosted decision tree. These algorithms were selected because they are widely used for text classification in the literature and have been shown to produce good results. They also represent a diverse range of techniques and approaches, which allows us to evaluate the effectiveness of different methods and to identify the best approach for categorizing clickbait and non-clickbait headlines. To train, test and validate the six algorithms, a large dataset of 32,000 sample headlines collected from different news websites was used. The dataset contained a 50/50 mix of clickbait and non-clickbait headlines.
The main contributions of this paper can be summarized as follows:
1. A method for identifying clickbait headlines using semantic analysis and machine learning techniques is presented.
2. Thirty unique semantic features are investigated and six different machine learning classification algorithms (decision tree, logistic regression, naïve Bayes, support vector machine, k-nearest neighbor and gradient-boosted decision tree) are explored, both individually and as ensembles.
3. A large dataset of 32,000 sample headlines collected from different news websites is used to train, test and validate the techniques; this dataset has a 50/50 mix of clickbait and non-clickbait headlines.
This paper is organized as follows.
Section 2 discusses the composition and details of the dataset used. In
Section 3, previous research on the topic is examined to identify key semantic features that are commonly associated with clickbait headlines. The correlation between these features and clickbait headlines is analyzed in
Section 4. The classification method using different models, both individually and in ensemble form, along with their results, is described in
Section 5. In
Section 6, the accuracy of the models is compared with similar studies. Finally, the conclusions and findings of the research are presented in
Section 7, including a discussion of the limitations of the study and suggestions for future research.
3. Feature Formulation
Related works in the area of clickbait classification offer insight into semantic features that occur more frequently in clickbait headlines compared to non-clickbait headlines. This section leverages these related works to identify 30 key semantic features linked to clickbait headlines for use in classification modeling. Semantic features associated with clickbait include sentence structure, parts-of-speech, forward referencing, punctuation, common clickbait words and informality. The classification approach used in this project focuses on analyzing the semantic styles of the text in headlines and not the content of the linked pages.
Chakraborty et al. [
1] found that clickbait headlines typically have a greater word count than conventional non-clickbait headlines. In addition, they determined that even though clickbait headlines have more words, the average word length is shorter. They also recognized that stop words, the most common English words, occur more frequently in clickbait headlines. Finally, they concluded that clickbait headlines often employ determiners and contractions. Determiners comprise articles (
a/an, the), demonstratives (
this, that, these, those), possessives (
my, your, his, her, its, our, their) and quantifiers (
many, much, more, most, some).
Similarly, Blom et al. [
17] postulate that forward referencing is another key semantic headline style used to create anticipation and curiosity to lure readers to click. Forward referencing refers to referencing forthcoming parts of the headline upfront or using a word that gets its meaning from a subsequent word or phrase. Forward referencing can be identified by the presence of demonstrative pronouns (
this, that, these, those), personal pronouns (
I, you, he, she, it, we, they, me, him, her, us and them), superlative adverbs (
–est, –ly) and definite articles (
the).
Biyani et al. [
18] describe how clickbait headlines are made more attention-grabbing through the use of acronyms, numbers, uppercase letters, questions, quotes, exclamations and other punctuation patterns. They determined that clickbait headlines are more likely to begin with 5W1H words (what, why, when, who, which, how) than non-clickbait headlines. They also observed that the language of clickbait headlines tends to be less formal than that of conventional non-clickbait headlines. To capture this difference in informality, Biyani et al. utilized four indices that measure the readability/informality level of text. The Coleman Liau Index [
19] for readability is based on the number of letters per word and words per sentence. The index, CLI, is computed by:

$$CLI = 0.0588L - 0.296S - 15.8$$

where L is the average number of letters per 100 words and S is the average number of sentences per 100 words.
Anderson’s RIX Readability Index, RIX, is a simplified version of Bjornsson’s LIX Readability Index [20], LIX. Both indices are based on the number of words per sentence. The indices are computed by:

$$LIX = \frac{W}{S} + \frac{100 \cdot LW}{W}$$

and

$$RIX = \frac{LW}{S}$$

where W is the number of words, LW is the number of long words (7 or more letters) and S is the number of sentences.
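To make the index definitions concrete, a minimal Python sketch of how CLI, LIX and RIX could be computed for a headline is shown below; the regular-expression tokenization and sentence splitting are simplifying assumptions, not the implementation used in this study.

```python
import re

def readability_indices(text):
    """Compute the CLI, LIX and RIX readability indices for a piece of text (sketch)."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()] or [text]

    num_words = len(words) or 1
    num_sentences = len(sentences)
    num_letters = sum(len(w) for w in words)
    num_long_words = sum(1 for w in words if len(w) >= 7)   # LW: words with 7+ letters

    # Coleman-Liau: L = letters per 100 words, S = sentences per 100 words
    L = num_letters / num_words * 100
    S = num_sentences / num_words * 100
    cli = 0.0588 * L - 0.296 * S - 15.8

    lix = num_words / num_sentences + 100 * num_long_words / num_words   # words/sentence + % long words
    rix = num_long_words / num_sentences                                 # long words per sentence
    return {"CLI": cli, "LIX": lix, "RIX": rix}

print(readability_indices("You Won't Believe What This Dog Did Next"))
```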
The formality measure index (F-Score) developed by Heylighen and Dewaele [21] provides a measure for formality based on the frequencies of different parts of speech of words in the text. They found nouns, adjectives, articles and prepositions are more frequent in formal styles; pronouns, adverbs, verbs and interjections are more frequent in informal styles. The F-Score is computed by:

$$F = \frac{(f_{noun} + f_{adjective} + f_{preposition} + f_{article}) - (f_{pronoun} + f_{verb} + f_{adverb} + f_{interjection}) + 100}{2}$$

where $f_{x}$ is the frequency of part of speech x, expressed as a percentage of the total number of words.
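The F-Score requires part-of-speech frequencies. The sketch below shows one way to obtain them with NLTK's Penn Treebank tagger; the tagger choice and the mapping of tags to the formal/informal categories are assumptions here, not the authors' implementation.

```python
import nltk  # assumes the punkt and averaged_perceptron_tagger resources are downloaded

def f_score(text):
    """Heylighen-Dewaele formality F-score, sketched with NLTK Penn Treebank tags.

    Formal categories: nouns (NN*), adjectives (JJ*), prepositions (IN),
    articles (a/an/the). Informal: pronouns (PRP*), verbs (VB*),
    adverbs (RB*), interjections (UH). Frequencies are % of all tokens.
    """
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    total = len(tagged) or 1

    def pct(predicate):
        return 100 * sum(1 for word, tag in tagged if predicate(word, tag)) / total

    formal = (pct(lambda w, t: t.startswith("NN")) +
              pct(lambda w, t: t.startswith("JJ")) +
              pct(lambda w, t: t == "IN") +
              pct(lambda w, t: w.lower() in {"a", "an", "the"}))
    informal = (pct(lambda w, t: t.startswith("PRP")) +
                pct(lambda w, t: t.startswith("VB")) +
                pct(lambda w, t: t.startswith("RB")) +
                pct(lambda w, t: t == "UH"))
    return (formal - informal + 100) / 2

print(f_score("This Is Why You Should Never Skip Breakfast"))
```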
Biyani et al. also determined that the key words “reason”, “why”, “just”, “this” and “one” occur with high frequency in clickbait headlines.
Of the 30 key semantic features formulated above, 22 features consist of binary values (indicating the presence or absence of the feature). The four readability/informality indices and the two ratios, plus the word count and average word length features, are continuous numeric values. A summary of the 30 key semantic features to be used in the classification of clickbait headlines is presented in
Table 1.
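As an illustration of how features of this kind can be derived from raw headline text, the following sketch extracts a handful of them; the tokenization, stop-word list and feature names are assumptions, and the remaining features in Table 1 would follow the same pattern.

```python
import re
from nltk.corpus import stopwords   # assumes the NLTK stopwords corpus has been downloaded

STOP_WORDS = set(stopwords.words("english"))
DEMONSTRATIVES = {"this", "that", "these", "those"}
FIVE_W1H = {"what", "why", "when", "who", "which", "how"}

def headline_features(headline):
    """Extract a handful of the semantic features summarized in Table 1 (sketch)."""
    words = headline.split()
    lower = [w.strip(".,!?\"'").lower() for w in words]
    n = len(words) or 1
    return {
        "word_count": len(words),
        "avg_word_length": sum(len(w) for w in words) / n,
        "ratio_stop_words": sum(w in STOP_WORDS for w in lower) / n,
        "ratio_uppercase_start": sum(w[:1].isupper() for w in words) / n,
        "begins_with_number": int(headline[:1].isdigit()),
        "begins_with_5w1h": int(bool(lower) and lower[0] in FIVE_W1H),
        "contains_demonstrative": int(any(w in DEMONSTRATIVES for w in lower)),
        "contains_contraction": int(bool(re.search(r"\w'\w", headline))),
        "contains_question": int("?" in headline),
        "contains_exclamation": int("!" in headline),
    }

print(headline_features("17 Things You Won't Believe These Celebrities Did"))
```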
4. Feature Analysis
The statistical analysis results in
Table 2 show that all 30 features exhibit definitively different occurrence rates between clickbait and non-clickbait headlines, confirming their usefulness for classifying headlines. The statistical differences of several features stand out more than others, signifying potential top classifiers. The features
“Ratio words begin w/uppercase”, “Ratio of stop-words”, “Begins w/Number”, “Contains possessive/pronoun” and “Contains demonstrative” have significant occurrence rate differences between clickbait and non-clickbait headlines.
The stacked bar chart in
Figure 1 shows the normalized occurrence rate of the 22 binary features in clickbait and non-clickbait headlines. Features on the right with normalized rates above 0.5 are more associated with clickbait headlines; the higher the rate, the stronger the association with clickbait headlines. The opposite is true on the left: features with normalized rates below 0.5 are more associated with non-clickbait headlines, and the lower the rate, the stronger the association with non-clickbait headlines.
The eight continuous numeric value features (ratio and count features) were transformed into binary values using discretization. The stacked bar chart in
Figure 2 shows the normalized occurrence rate in clickbait and non-clickbait headlines for these eight transformed features. The two features
“Uppercase2” (headlines with all words beginning in uppercase) and
“Word Count2” have a higher rate of association with clickbait headlines. The other six features are more strongly associated with non-clickbait headlines.
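A sketch of how the normalized occurrence rates and the discretization could be computed with pandas and scikit-learn is given below; the file name, column names, the reading of the normalized rate and the median-based binning are assumptions, as the paper does not spell these out.

```python
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

df = pd.read_csv("headline_features.csv")   # hypothetical feature table with a binary 'clickbait' column

# One plausible reading of the normalized occurrence rate for a binary feature:
# its mean in clickbait headlines divided by the sum of its means in both classes,
# so values above 0.5 lean toward the clickbait class.
def normalized_rate(feature):
    rate_cb = df.loc[df["clickbait"] == 1, feature].mean()
    rate_ncb = df.loc[df["clickbait"] == 0, feature].mean()
    return rate_cb / (rate_cb + rate_ncb)

# Discretize the eight continuous features into two bins; the split point
# (median here) is an assumption, as the paper does not state the binning used.
continuous = ["word_count", "avg_word_length", "ratio_stop_words",
              "ratio_uppercase_start", "cl_score", "lix", "rix", "f_score"]
discretizer = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="quantile")
df[[c + "2" for c in continuous]] = discretizer.fit_transform(df[continuous])
```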
Figure 3 contains the correlation matrix of all 30 features. The correlation matrix uses the Pearson coefficient of correlation between each pair of features. The Pearson coefficient is a measure of linear correlation with a range of −1 to +1; a value of −1 signifies a strong negative correlation, while +1 indicates a strong positive correlation. The matrix is also color coded, with shades of red associated with positive (+) coefficients and shades of blue with negative (−) coefficients. The darker the color, the higher the correlation.
Overall, the feature correlation +/− groupings in the matrix match the predicted clickbait/non-clickbait associations in the stacked bar charts. There are two features that have very high correlation values. The feature
“Uppercase2” has a coefficient of correlation of 0.9, indicating a very strong correlation with the clickbait target classification. On the other end of the spectrum, the feature
“Stop Words2” has a coefficient of correlation of −0.8, which indicates a very strong correlation with a non-clickbait target classification.
Table 3 contains a listing of the 15 features with the highest correlation values.
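A short sketch of how the correlation matrix in Figure 3 and the ranking in Table 3 could be reproduced with pandas and seaborn is shown below; the feature table file name and column names are assumptions.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("headline_features.csv")   # hypothetical feature table with a binary 'clickbait' column

# Pearson correlation matrix of the 30 features plus the target (cf. Figure 3).
corr = df.corr(method="pearson")
sns.heatmap(corr, cmap="RdBu_r", vmin=-1, vmax=1)   # red = positive, blue = negative coefficients
plt.show()

# Features ranked by |correlation| with the clickbait target (cf. Table 3, top-15 shown).
top15 = corr["clickbait"].drop("clickbait").abs().sort_values(ascending=False).head(15)
print(top15)
```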
5. Modeling
This section describes the modeling approach used to classify clickbait and non-clickbait headlines. The classification modeling was performed in Python [
22] using the scikit-learn machine learning library [
23]. The data analysis and modeling were conducted in Jupyter notebook, an open-source web application that allows for interactive data science and scientific computing using the Anaconda distribution.
5.1. Individual Modeling
Six individual classification models are tested: decision tree, logistic regression, naïve Bayes, support vector machine (SVM), k-nearest neighbor (KNN) and gradient-boosted decision tree (GBDT). The dataset is randomly split 80:20 into 25,600 training headlines and 6400 test headlines. To optimize the performance of the six machine learning models, a thorough search for the best hyperparameters was conducted. The hyperparameters of each model were selected through a randomized search, a probabilistic method for hyperparameter tuning. The randomized search was conducted for 100 iterations for each model and the best hyperparameters were chosen based on the highest performance metric score on the validation set. The hyperparameters selected for each machine learning model are displayed in
Table 4.
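The following sketch illustrates this randomized hyperparameter search with scikit-learn's RandomizedSearchCV for the SVM model; the parameter ranges and the five-fold cross-validation used as the validation step are illustrative assumptions, not the settings reported in Table 4.

```python
import pandas as pd
from scipy.stats import uniform
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.svm import SVC

df = pd.read_csv("headline_features.csv")   # hypothetical feature table
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="clickbait"), df["clickbait"], test_size=0.2, random_state=42)

# Shown for the SVM only; the other five models would be tuned the same way.
# The parameter ranges are illustrative, not the values reported in Table 4.
param_distributions = {"C": uniform(0.1, 100),
                       "gamma": uniform(1e-4, 1.0),
                       "kernel": ["rbf", "linear"]}
search = RandomizedSearchCV(SVC(), param_distributions, n_iter=100,
                            scoring="accuracy", cv=5, random_state=42)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))
```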
After selecting the hyperparameters for each of the six machine learning models, the next step is to evaluate their effectiveness in categorizing clickbait headlines. Before evaluating the models, it is important to understand the key metrics used to assess their performance. The selected evaluation metrics are accuracy, precision and recall.
Accuracy measures the proportion of correct predictions out of all predictions made and is calculated as:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$

where TP (true positive) is the number of actual clickbait headlines that were correctly classified as clickbait, TN (true negative) is the number of actual non-clickbait headlines that were correctly classified as non-clickbait, FP (false positive) is the number of actual non-clickbait headlines that were incorrectly classified as clickbait and FN (false negative) is the number of actual clickbait headlines that were incorrectly classified as non-clickbait.
Precision measures the proportion of correct clickbait predictions out of all predictions made as clickbait and is calculated as:

$$Precision = \frac{TP}{TP + FP}$$

Recall measures the proportion of actual clickbait headlines that were correctly classified as clickbait out of all actual clickbait headlines and is calculated as:

$$Recall = \frac{TP}{TP + FN}$$
In addition to the full set of 30 classification features, the models were trained and tested using reduced feature sets (i.e., the top-25/20/15/10/5/2 features). The goal is to find the simplest model that produces the best accuracy while utilizing the fewest features.
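A sketch of this feature-subset experiment is shown below; the feature table, the ordering of ranked_features and the default model hyperparameters are assumptions (the tuned values from Table 4 would be used in practice), and GaussianNB stands in for the unspecified naïve Bayes variant.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("headline_features.csv")                        # hypothetical feature table
ranked_features = list(df.columns.drop("clickbait"))             # assumed already ordered by |correlation|

X_train, X_test, y_train, y_test = train_test_split(
    df[ranked_features], df["clickbait"], test_size=0.2, random_state=42)

# Default hyperparameters shown for brevity; the tuned values from Table 4 would be used.
models = {"decision tree": DecisionTreeClassifier(),
          "logistic regression": LogisticRegression(max_iter=1000),
          "naive Bayes": GaussianNB(),
          "SVM": SVC(),
          "KNN": KNeighborsClassifier(),
          "GBDT": GradientBoostingClassifier()}

for k in (30, 25, 20, 15, 10, 5, 2):
    cols = ranked_features[:k]
    for name, model in models.items():
        model.fit(X_train[cols], y_train)
        pred = model.predict(X_test[cols])
        print(f"top-{k:2d} {name:20s} "
              f"acc={accuracy_score(y_test, pred):.3f} "
              f"prec={precision_score(y_test, pred):.3f} "
              f"rec={recall_score(y_test, pred):.3f}")
```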
Table 5 shows the validation performance of each of the models across the different feature set sizes. The results in
Table 5 show all six models produced an accuracy, precision and recall greater than 0.96. The
SVM and
GBDT models produced the best results, with an accuracy of 0.98, a precision of 0.98 and a recall of 0.97. Furthermore, the performance of these two models did not change significantly between using the full 30 features and only the top-15 features. When fewer than 15 features were used, the performance of all the models started dropping slightly.
5.2. Ensemble Modeling
The top-5 most accurate models (SVM, GBDT, decision tree, logistic regression and KNN) were combined into an ensemble model. The ensemble model was run 10 times with random 80:20 training/test splits using only the top-15 features. Majority voting between the models was used on each run to produce the accuracy results. The average accuracy over the 10 runs was 0.976, with a standard deviation of 0.001. The standard deviation was very low, indicating a very tight grouping around the average. The average accuracy of the combined models did not improve over the accuracy of the individual top models.
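One way to realize such a majority-voting ensemble is scikit-learn's VotingClassifier with hard voting, sketched below; the default hyperparameters and the top-15 column selection are placeholders for the tuned values and the actual top-15 features.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("headline_features.csv")                 # hypothetical feature table
top15 = list(df.columns.drop("clickbait"))[:15]           # assumed: top-15 features by correlation

# Top-5 individual models combined by hard (majority) voting; default
# hyperparameters shown for brevity, the tuned values from Table 4 would be used.
estimators = [("svm", SVC()),
              ("gbdt", GradientBoostingClassifier()),
              ("dt", DecisionTreeClassifier()),
              ("lr", LogisticRegression(max_iter=1000)),
              ("knn", KNeighborsClassifier())]

accuracies = []
for run in range(10):                                     # 10 runs with fresh random 80:20 splits
    X_tr, X_te, y_tr, y_te = train_test_split(
        df[top15], df["clickbait"], test_size=0.2, random_state=run)
    ensemble = VotingClassifier(estimators=estimators, voting="hard")
    ensemble.fit(X_tr, y_tr)
    accuracies.append(accuracy_score(y_te, ensemble.predict(X_te)))

print(f"mean accuracy = {np.mean(accuracies):.3f}, std = {np.std(accuracies):.3f}")
```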
One explanation for the lack of improvement in accuracy for the ensemble model is that the individual models may produce closely matching sets of false-positive and false-negative headlines. When analyzed, over 75% of the false-positive and over 65% of the false-negative results are the same headlines in all five models. Examining the false-positive results, the “Uppercase2” feature appears to be the primary culprit. However, this feature is also the top feature for classifying headlines as clickbait. Similarly, the “Stop Words2” feature appears to be the primary cause of the misclassifications resulting in false-negative results. Yet, this feature is the top feature for classifying headlines as non-clickbait. The conclusion from this analysis is that there are no adjustments to the ensemble models and feature set that would further improve the accuracy.
5.3. Factor Analysis Modeling
In a final attempt to improve the classification accuracy, exploratory factor analysis [
24,
25] was conducted on the features, and then the individual and ensemble models were re-run with the combined factors. Exploratory factor analysis is a linear statistical method used to summarize a large set of features into a smaller set of variables called factors. To confirm that factor analysis was indeed feasible for the given headline features, the Bartlett sphericity test and the Kaiser–Meyer–Olkin (KMO) test were used.
The Bartlett sphericity test checks whether or not the features (observed variables) are intercorrelated by comparing the observed correlation matrix and the identity matrix. If the two are not the same, the test is significant. For the test of our feature set, the Chi-Square was 260,470.01 and the p-value was 0, signifying that factor analysis is feasible.
The Kaiser–Meyer–Olkin (KMO) [
10] test estimates the proportion of variance among all the observed variables. KMO values range between 0 and 1 with a value of 0.6 or more indicating factor analysis is feasible. For the test of our feature set, the KMO value was 0.73, again indicating factor analysis is feasible.
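Both feasibility tests are available in the Factor Analyzer package used later for the factor extraction; a minimal sketch is shown below, with the feature table file name as an assumption.

```python
import pandas as pd
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

df = pd.read_csv("headline_features.csv")       # hypothetical feature table
features = df.drop(columns="clickbait")

chi_square, p_value = calculate_bartlett_sphericity(features)
kmo_per_variable, kmo_overall = calculate_kmo(features)

print(f"Bartlett chi-square = {chi_square:.2f}, p-value = {p_value:.3g}")
print(f"Overall KMO = {kmo_overall:.2f}")       # >= 0.6 suggests factor analysis is feasible
```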
Next, the Kaiser criterion [
26] was used to determine the number of factors. The Kaiser criterion is an analytical approach based on selecting the factors that explain a significant proportion of the variance. The eigenvalue is used as an index of a factor's share of the total variance and indicates how good a component is as a summary of the data. An eigenvalue of 1 means that the factor contains the same amount of information as a single feature. Generally, an eigenvalue greater than 1 is considered a good selection criterion for a factor. The scree plot in
Figure 4, which plots the eigenvalues against the feature/factor numbers, is a graphical representation of the Kaiser criterion. The “elbow” in the curve, just before the line flattens out, corresponds to the number of factors to select; here it indicates five factors as the optimum choice.
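A sketch of how the eigenvalues and scree plot can be obtained with Factor Analyzer is given below; fitting an unrotated model simply to read off the eigenvalues is one convenient route, and the file name is an assumption.

```python
import matplotlib.pyplot as plt
import pandas as pd
from factor_analyzer import FactorAnalyzer

df = pd.read_csv("headline_features.csv")       # hypothetical feature table
features = df.drop(columns="clickbait")

# Fit an unrotated model just to obtain the eigenvalues for the scree plot.
fa = FactorAnalyzer(rotation=None)
fa.fit(features)
eigenvalues, _ = fa.get_eigenvalues()

plt.plot(range(1, len(eigenvalues) + 1), eigenvalues, marker="o")
plt.axhline(1.0, linestyle="--")                # Kaiser criterion: keep factors with eigenvalue > 1
plt.xlabel("Factor number")
plt.ylabel("Eigenvalue")
plt.title("Scree plot")
plt.show()
```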
Python’s Factor Analyzer with oblique rotation was used to extract the five factors and produce a factor loading matrix in
Table 6. The factor loading indicates how well a factor explains a feature: a low loading indicates the feature does not belong to the factor, while a high loading indicates that it does. A factor loading of 0.30 was used as the cutoff for pairing features to factors. Not all of the 30 features are rated highly enough to be included in a factor. The features included in the five factors are:
Factor-1: “Word Length2”, “LIX Index2”, “RIX Index2” and “CL Score2”
Factor-2: “Begins w/5W1H”, “Possessive Pronouns”, “Uppercase2”, “Stop Words2” and “F Score2”
Factor-3: “Begins w/Det Sup”, “Determiner/Superlative”, “Demonstrative” and “Includes “this””
Factor-4: “Begins w/Number” and “Includes Number”
Factor-5: “Contraction”, “Mult Quotes”, “Word Count2” and “RIX Index2”
The new factor values were computed for each instance (headline) in the dataset by summing the products of the feature values and their corresponding factor loadings. Each new factor was then normalized across the dataset.
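The extraction of the five factors and the computation of the factor values described above can be sketched as follows; oblimin is used here as one oblique rotation option, and the min-max normalization and file name are assumptions where the paper does not specify the details.

```python
import pandas as pd
from factor_analyzer import FactorAnalyzer

df = pd.read_csv("headline_features.csv")              # hypothetical feature table
features = df.drop(columns="clickbait")

fa = FactorAnalyzer(n_factors=5, rotation="oblimin")   # an oblique rotation
fa.fit(features)

# Factor loading matrix (cf. Table 6); loadings below |0.30| are treated as "not in the factor".
loadings = pd.DataFrame(fa.loadings_, index=features.columns,
                        columns=[f"Factor-{i}" for i in range(1, 6)])
masked = loadings.where(loadings.abs() >= 0.30, 0.0)
print(masked)

# Factor values per headline: sum of feature values weighted by their loadings,
# followed by min-max normalization (the paper's exact normalization is not stated).
scores = features.values @ masked.values
scores = (scores - scores.min(axis=0)) / (scores.max(axis=0) - scores.min(axis=0))
factor_df = pd.DataFrame(scores, columns=masked.columns)
```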
The factor-based headline dataset was randomly split 80:20 into training and test sets. The six individual classification models were trained and tested using the five factors.
Table 7 shows the validation performance of each of the models. The factor-based models, with an accuracy of 0.98, performed no better than the feature-based models.
Next, the ensemble model was run 10 times with random 80:20 training/test splits using the five factors. The average accuracy over the 10 runs was 0.975, with a standard deviation of 0.002. The standard deviation was very low, indicating a very tight grouping around the average. The average accuracy of the combined models again did not improve over the accuracy of 0.976 for the individual top models.