1. Introduction
Text classification, also known as text categorization, is a classical problem in Natural Language Processing (NLP) that aims to assign labels to textual units such as documents, sentences, paragraphs, and queries. It has a wide range of applications including sentiment analysis, news categorization, question answering, user intent classification, spam detection, and content moderation, to name a few [1]. Opinion mining and sentiment analysis are popular NLP research areas in which user opinions are analyzed to detect sentiment polarity [2]. Polarity determination has been performed for product reviews, forums, blogs, news articles, and micro-blogs. The field of sentiment analysis is greatly aided by the rich and large source of information available from platforms such as Twitter. The tasks of Twitter sentiment analysis include sentiment polarity detection, Twitter opinion retrieval, tracking sentiments over time [3], irony detection, and emotion detection [4,5,6,7,8]. Due to the 280-character limit, micro-blogs often do not contain complete sentences; moreover, they frequently contain abbreviations and noisy text. Therefore, standard pre-processing techniques such as Parts-of-Speech (POS) tagging, removal of URLs, hashtags, usernames, and stopwords, stemming, and spelling correction need to be applied to tweets. Twitter sentiment classification identifies different polarities (e.g., positive, negative, or neutral). Classification is based on textual features, which can take different forms such as (i) syntactic (e.g., n-grams, term frequencies, dependency trees), (ii) semantic (e.g., opinion and sentiment words), usually with the aid of lexicons, (iii) stylistic (e.g., emoticons), and (iv) Twitter-specific features (e.g., hashtags and retweets). The main challenges encountered with tweets are the length of the text (a maximum of 280 characters) and incorrect or improper use of language. The following machine learning approaches are popular in text classification: supervised [9], semi-supervised [10], and unsupervised [11,12]. Well-known sentiment lexicons such as VADER (Valence Aware Dictionary for Sentiment Reasoning) [13] were developed as an improvement over tools such as NLTK and TextBlob. In [14,15,16,17], deep convolutional and recurrent neural networks were used for sentiment analysis.
In this paper, we present a soft computing technique-based algorithm (TSC) to classify sentiment polarities of tweets and news categories from text. The TSC algorithm is a novel supervised learning method based on tolerance near sets. Near set theory [18,19] is a more recent soft computing methodology inspired by rough sets [20] where, instead of the set approximation operators used by rough sets to induce tolerance classes, the tolerance classes are induced directly from the feature vectors using a tolerance level parameter ε and a distance function. The tolerance forms of rough sets have been shown to be more effective in text categorization applications [21], where overlapping classes are induced by a tolerance relation. The tolerance near set-based classification algorithm was first introduced in [22]. Other applications of near sets in audio signal classification, music genre classification, and community detection in social networks can be found in [23]. A theoretical treatment of the relationship between near and rough sets can be found in [24].
In this paper, we explore the effect of different vector-generation methods, tolerance class sizes, balanced and imbalanced datasets, as well as the number of sentiment classes on the TSC algorithm. This paper is an extension of our previous work [25]. The extensions include experimentation on three additional text datasets, a new formal definition of the text-based tolerance relation, comparative work using TF-IDF vectors, additional metrics besides the weighted F1-score, and a statistical test to observe the difference between classifiers. We also demonstrate that, with transformer-based vectors, our proposed TSC outperforms five well-known machine learning algorithms on four datasets and is comparable on all other datasets based on the weighted F1, precision, and recall scores. The highest AUC-ROC score was obtained on two datasets, with comparable scores on six others. The highest AUC-PRC score was obtained on one dataset, with comparable scores on four others. Additionally, significant differences were observed in most comparisons when examining the statistical difference between the weighted F1-score of TSC and the other classifiers using a Wilcoxon signed-ranks test.
The proposed sentiment/text classification pipeline is given in Figure 1. In step 2, feature vectors are generated using two pre-trained deep learning models, BERT [26] and SBERT [27]. In step 3, a cosine distance matrix is created using the training set. This distance matrix is used to create tolerance classes in step 4. In step 5, a mean vector is computed for each tolerance class; this vector represents a prototype class. In step 6, each prototype class is labeled using the majority class of its tolerance class members. In step 7, for each test example (in the testing set), the cosine distance is computed to every prototype class. In step 8, the prototype class with the smallest distance to the test example is selected. In step 9, the label of this prototype class is assigned to the test example. In the final step, the predicted label is checked against the original label.
This paper is organized as follows. In Section 2, we introduce formal definitions for the text-based tolerance relation and an illustration of sample tolerance classes. In Section 3, we present the datasets used in this work as well as the proposed supervised TSC algorithm. In Section 4, we discuss our findings in terms of the weighted F1-score using both TF-IDF and transformer-based vectors on all ten datasets, followed by concluding remarks in Section 5.
3. Materials and Methods
We have created a subset of ten selected benchmark datasets, which are a mix of long and short texts (indicated by words per sentence) with a varying number and sizes of sentiment classes (positive, negative, neutral, and irrelevant). Due to memory limitations, some large datasets were trimmed and only a subset was used in our experiments.
COVID-Sentiment is a manually labeled dataset derived as a subset from [28] using tweet IDs for 1 April 2020 and 1 May 2020. We extracted 47,386 tweets with the help of the Twitter API. Tweets in languages other than English (e.g., French, Hindi, Mandarin, and Portuguese) were removed. Extensive pre-processing of the 29,981 English-language tweets from the original dataset, such as removal of HTML tags, @usernames, hashtags, URLs, and incorrect spellings, was also performed. A total of 8003 hand-labeled tweets were prepared for experimentation. The Python regex module and NLTK stemming and lemmatization were used in pre-processing before generating vector embeddings for this dataset.
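The exact cleaning rules are not reproduced in this paper, so the following is only a minimal sketch of this kind of pre-processing with the regex module and NLTK; the function name and patterns are illustrative, not the rules actually used for COVID-Sentiment.

```python
# Illustrative tweet clean-up; patterns and function name are our own.
# Requires nltk.download("stopwords") and nltk.download("wordnet") once.
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def clean_tweet(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)       # strip HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # strip URLs
    text = re.sub(r"[@#]\w+", " ", text)       # strip @usernames and #hashtags
    tokens = re.findall(r"[a-z']+", text.lower())
    # NLTK stemming could be applied analogously via PorterStemmer
    return " ".join(lemmatizer.lemmatize(t) for t in tokens if t not in stop_words)
```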
The U.S. Airline Sentiment dataset consists of 14,621 tweets; a pre-processed set of 13,000 tweets was used after the removal of duplicate and short tweets. For the IMDB Movie Review dataset, we used a subset of 22,000 reviews of the original 50,000 reviews, and for the SST-2 dataset, only 16,500 of the original 69,723 phrases were used. For the Sentiment140 dataset, a subset of 16,000 tweets out of 1,600,000 was used. SemEval 2017 includes 62,671 tweets in the original dataset; we were able to use only 20,547 tweets in our experiments due to memory limitations. The AG-News dataset contains 496,835 categorized news articles from more than 2000 news sources. Only the four largest classes from this corpus were selected to construct this dataset. The title and description fields were included and used as features for classification; to generate vector embeddings, these two columns are combined into a single column (as sketched below). The dataset contains four categories of news: "World", "Sports", "Business" and "Science". We used 3000 samples from each category as our training set. This dataset did not require any further pre-processing because it did not contain grammar or spelling mistakes.
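As a sketch, combining the two AG-News columns and drawing 3000 training samples per category might look as follows; the file and column names are assumptions, not the ones used in our scripts.

```python
# Hypothetical AG-News preparation; file and column names are assumed.
import pandas as pd

df = pd.read_csv("ag_news.csv")  # columns: label, title, description (assumed)
df["text"] = df["title"].str.strip() + ". " + df["description"].str.strip()
train = df.groupby("label").sample(n=3000, random_state=42)  # 3000 per category
```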
3.1. Materials
Table 1 gives details of the datasets. For each dataset, the total size of the training and testing sets is given in column 3. In addition, columns 4, 5, 6, and 7 give the size of each sentiment class used for training and testing (except for the AG-News dataset). The last column shows the words per sentence (WPS) for each dataset. Only one dataset, the Sanders corpus, has four sentiment classes with an imbalanced distribution. The UCI Sentence, Sentiment140, SST-2, and IMDB Movie Review datasets have two sentiment classes with a fairly balanced distribution. Three datasets, COVID-Sentiment, U.S. Airline Sentiment, and SemEval 2017, have three sentiment classes with an imbalanced distribution. The 20 Newsgroups dataset is a common benchmark used for evaluating the performance of text classification algorithms. The dataset, introduced in [29], contains approximately 20,000 newsgroup posts partitioned (nearly) evenly across 20 different newsgroups organized into broader categories: computers, recreation, religion, science, sale, and politics. Scikit-learn was used to prepare the training and testing datasets and to remove noisy data. This dataset has 10,314 training samples and 782 testing samples with 209 words per document.
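A minimal sketch of this preparation with scikit-learn, assuming the standard options for stripping noisy metadata (the exact options used here are not restated):

```python
# Loading 20 Newsgroups with scikit-learn; the remove options strip noisy
# metadata (headers, footers, quoted replies) from each post.
from sklearn.datasets import fetch_20newsgroups

remove = ("headers", "footers", "quotes")
train = fetch_20newsgroups(subset="train", remove=remove)
test = fetch_20newsgroups(subset="test", remove=remove)
print(len(train.data), len(test.data), len(train.target_names))  # posts and 20 categories
```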
Table 2 gives the number of documents used in the training process and the size of the tolerance classes for each newsgroup category with the best value for ε.
3.2. Methods
In this section, we present our proposed Tolerance Sentiment Classifier (TSC) in terms of the two Algorithms 1 and 2. The TSC algorithm was implemented in Python on a machine with 16 GB RAM, an Nvidia RTX 2060 GPU, and a 512 GB SSD, using SBERT base vectors (1 × 768-dimensional vectors). We considered mean and median values for determining the prototype class vectors for the TSC algorithm. In addition, we experimented with TF-IDF vectors for the TSC algorithm on all the datasets. However, the classification results with TF-IDF vectors were unsatisfactory, primarily because the cosine distance values converged to one and hence produced many identical values in the distance matrix shown in Figure 2, which restricts the use of TF-IDF vectors.
Vector Embeddings with SBERT: Sentence-BERT (SBERT) is a modification of the pre-trained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared using cosine similarity. SBERT is fine-tuned on the SNLI [30] and Multi-Genre NLI [31] data, which creates sentence embeddings that significantly outperform other state-of-the-art sentence embedding methods such as InferSent [32] and Universal Sentence Encoder [33] in terms of accuracy.
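A minimal sketch of generating SBERT embeddings and a cosine distance with the sentence-transformers library follows; the specific pre-trained checkpoint named below is an assumption, not necessarily the one used in our experiments.

```python
# SBERT embeddings (1 x 768 per sentence) and the cosine distance
# between two of them; the checkpoint name is an assumption.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("bert-base-nli-mean-tokens")
emb = model.encode(["The flight was great!", "Terrible delays, never again."])
dist = 1 - util.cos_sim(emb[0], emb[1]).item()  # cosine distance used by TSC
print(emb.shape, dist)  # (2, 768) and a distance in [0, 2]
```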
Training Phase: Representative Class Generation (Algorithm 1): In this phase, given a tolerance level ε, tolerance classes are induced from the training set vectors using the cosine distance, and the representative of each tolerance class is computed as the mean of the feature vectors of its members. The polarity (or category) of the representative vector is determined by majority voting.
Testing Phase: Polarity Assignment (Algorithm 2): In the classification phase, TSC uses the representative class vectors generated in the training phase and their associated polarity/text category. The computeCosineDist function calculates the cosine distance between each test set vector and all the representative class vectors. The DeterminePolarity function chooses the representative class that is closest to the test set vector and assigns the polarity of that representative to the test set vector. In the training phase, for n training vectors, the complexity of the computeCosineDist function is O(n²), as is the complexity of the generatetolerantpairs function. In the testing phase, the complexity of the DeterminePolarity function is O(mk) for m test vectors and k representative vectors. In the testing phase, the cosine distance is computed for each vector in the testing set by comparison with all representative vectors obtained in the training phase, and each test set vector is assigned the polarity (or category) of the representative with the lowest cosine distance value.
Algorithm 1: Training Phase: Generating class representative vectors
Algorithm 2: Testing Phase: Assigning Sentiment Polarities
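Since the algorithm listings themselves are not reproduced above, the following is a minimal Python sketch of the two phases on precomputed embeddings. It is an approximation, not the authors' code: tolerance classes are approximated here by the ε-neighborhood of each training vector, while the helper names follow those mentioned in the text.

```python
# Sketch of Algorithms 1 and 2 on precomputed embedding matrices.
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

def generate_representatives(X_train, y_train, eps):
    """Algorithm 1: induce tolerance classes and label their mean vectors."""
    y_train = np.asarray(y_train)
    D = cosine_distances(X_train)                   # computeCosineDist: O(n^2)
    reps, rep_labels = [], []
    for i in range(len(X_train)):
        members = np.where(D[i] <= eps)[0]          # generatetolerantpairs
        reps.append(X_train[members].mean(axis=0))  # prototype = mean vector
        cls, counts = np.unique(y_train[members], return_counts=True)
        rep_labels.append(cls[counts.argmax()])     # majority-vote polarity
    return np.vstack(reps), np.array(rep_labels)

def assign_polarity(X_test, reps, rep_labels):
    """Algorithm 2 (DeterminePolarity): nearest representative's polarity."""
    D = cosine_distances(X_test, reps)              # O(mk) distances
    return rep_labels[D.argmin(axis=1)]
```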
4. Results and Discussion
In this section, we discuss the performance of the TSC algorithm.
Figure 3 gives the weighted F1-score for all datasets over various tolerance values using the mean value (TSC-mean) for the prototype class vector. The range of tolerance values is from 0.08 to 0.38. The TSC algorithm performs best with the UCI Sentence dataset and worst with the COVID-Sentiment dataset. With the 20-Newsgroups dataset, our proposed algorithm shows a steep improvement for tolerance values between 0.12 and 0.19. Figure 4 gives the weighted F1-score using the median value (TSC-median) for the prototype class vectors. Note that even though the relative performance is similar for all datasets, the most noteworthy difference is with the U.S. Airline and IMDB datasets. Since the overall results with the mean value are slightly better in terms of the weighted F1-score for all the datasets, we used the mean value for the TSC algorithm (TSC-mean) in all subsequent experiments.
Table 3 shows the number of tolerance classes for the best tolerance value (column 2) for each dataset. The TSC algorithm generates these classes as described in Algorithm 1. It should be noted that the SST-2 and UCI Sentence datasets have an approximately similar number of tolerance classes and both have two sentiment classes, and that the best ε values range from 0.16 to 0.32. The other datasets, with three and four sentiment classes, do not generate balanced tolerance classes.
Random Forest (RF) [34], Maximum Entropy (ME), Support Vector Machine (SVM) [35], Stochastic Gradient Descent (SGD) [36] and Light Gradient Boosting Machine (LGBM) [37] classifier implementations in Scikit-learn (https://scikit-learn.org/stable/) (accessed on 1 May 2022) with the following parameters were used. For the RF classifier, 100 trees and the Gini index were used to determine the quality of a split. The minimum numbers of samples required to split an internal node and to form a leaf were set to 2 and 1, respectively, the maximum number of features was set to 27 (the square root of the size of the vector), bootstrap samples were used when building trees, and the random_state parameter was set to 42. For the ME (logistic regression) classifier, the l2 penalty term was used with the stopping criterion set to 10⁻⁴. The RBF kernel was used for the SVM classifier with a kernel cache size of 200 MB, gamma set to scale, a C value of 1, and the l2 penalty term. The hinge loss function was used in the SGD classifier with default values for the other parameters (l2 penalty, alpha set to 0.0001, maximum iterations set to 1000, and learning rate set to optimal); since the loss is hinge, this classifier is a linear SVM. For the LGBM classifier, the maximum number of leaves was set to 31, and the learning_rate, n_estimators, min_child_weight, and min_child_samples were set to 0.1, 100, 10⁻³, and 20, respectively.
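Assembled from the parameters listed above, the classifier configurations might be instantiated as follows; values not clearly stated in the text (e.g., tol and min_child_weight) are the library defaults and should be read as assumptions.

```python
# Classifier settings per the description above; tol and min_child_weight
# are library defaults, assumed rather than confirmed.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC
from lightgbm import LGBMClassifier

rf = RandomForestClassifier(n_estimators=100, criterion="gini",
                            min_samples_split=2, min_samples_leaf=1,
                            max_features="sqrt",  # sqrt(768) ~ 27 features
                            bootstrap=True, random_state=42)
me = LogisticRegression(penalty="l2", tol=1e-4)              # Maximum Entropy
svm = SVC(kernel="rbf", cache_size=200, gamma="scale", C=1.0)
sgd = SGDClassifier(loss="hinge", penalty="l2", alpha=0.0001,
                    max_iter=1000, learning_rate="optimal")  # a linear SVM
lgbm = LGBMClassifier(num_leaves=31, learning_rate=0.1, n_estimators=100,
                      min_child_weight=1e-3, min_child_samples=20)
```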
Table 4 shows the AG-News dataset with a well-balanced tolerance class distribution for the four categories (TC-World, TC-Sports, TC-Business and TC-Science) at the best ε value. It should be noted that all algorithms perform well on this dataset (TSC ranks third best). The 20-Newsgroups dataset has the highest number of categories among all the datasets: it has twenty classes and a fairly balanced tolerance class distribution, as shown in Table 2. Due to the better semantic similarity of its vectors, the performance of TSC on this dataset is better than on the COVID-Sentiment and SemEval 2017 datasets.
Table 5 gives the weighted F1-score for all datasets with the TF-IDF-based ML algorithms. The size of a TF-IDF vector depends on the vocabulary of the dataset, which means that longer sentences result in a richer vocabulary for the TF-IDF approach, an advantage over transformer vectors. While building the vocabulary with TF-IDF, the frequency of the words was considered to compute the vectors. We used the default TF-IDF parameters to build the vocabulary, with the minimum and maximum document frequencies set to their default values (1 and 1.0, respectively). The results show that TF-IDF-based ML algorithms give the best results on longer sentences or document-level classification tasks such as the IMDB and 20-Newsgroups datasets. A minimal sketch of this setup is given below.
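In the sketch, train_texts and test_texts are placeholders for a dataset's pre-processed text.

```python
# TF-IDF vectors with default document-frequency settings; the vector
# size equals the size of the training vocabulary.
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=1, max_df=1.0)  # library defaults
X_train = vectorizer.fit_transform(train_texts)     # placeholder corpus
X_test = vectorizer.transform(test_texts)
print(len(vectorizer.vocabulary_))
```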
Table 6 gives the experimental results with the transformer-based vectors. Here, our proposed TSC algorithm performs best on the U.S. Airline, IMDB, UCI Sentence and 20-Newsgroups datasets and is comparable on the COVID-Sentiment, SST-2 and Sentiment140 datasets. It can be observed that balanced tolerance classes are an indication of good semantic similarity between the vectors generated by the transformer model. This can be seen with the SST-2 and UCI Sentence datasets, which have an approximately similar number of tolerance classes and weighted F1-scores of over 85%; both of these datasets contain only two sentiment classes. The TSC algorithm also performs very well on the 20-Newsgroups dataset.
Table 7 and Table 8 give the weighted precision and recall scores for all the tested classifiers. Based on these scores, the proposed TSC algorithm performs best on the UCI, SST-2, IMDB and 20-Newsgroups datasets. These results mirror the values obtained with the weighted F1-score, except for the U.S. Airline dataset.
Table 9 gives the AUC-ROC score for all the tested classifiers. A pair-wise comparison of all class combinations (ovo) with the weighted average score parameter was used to obtain the results. Based on the AUC-ROC score, the proposed TSC algorithm performs best on the UCI and SST-2 datasets, which have an approximately similar number of tolerance classes and two sentiment classes each. It is also noteworthy that the UCI and AG-News datasets have the best separability (90%), while IMDB (88%), 20-Newsgroups (87%) and SST-2 (85%) have scores above 84% across the tested classifiers. Since the SGD classifier is a linear SVM, its weighted F1, precision and recall scores are similar or identical to those of SVM for most datasets.
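In scikit-learn terms, the reported AUC-ROC corresponds to the following call, where y_test and y_score (per-class prediction scores) are placeholders:

```python
# Pairwise (one-vs-one) AUC-ROC with weighted averaging, as described above.
from sklearn.metrics import roc_auc_score

auc = roc_auc_score(y_test, y_score, multi_class="ovo", average="weighted")
```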
Table 10 gives the AUC-PRC score for all the tested classifiers. The loss function for the SGD classifier was changed from hinge (the default) to modified_huber to enable probabilistic outputs (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html, accessed on 20 July 2022). Since all other metrics for SGD were based on the hinge loss function, the results for this classifier (indicated in blue) could be omitted in the overall analysis. Based on the AUC-PRC score, the proposed TSC algorithm performs best on the IMDB dataset and is comparable on the UCI, SST-2, Sentiment140 and COVID-Sentiment datasets.
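A sketch of this SGD variant: with the modified_huber loss, SGDClassifier exposes predict_proba, which the PRC computation requires (X_train, y_train, and X_test are placeholders).

```python
# SGD with modified_huber loss to obtain probability estimates for AUC-PRC.
from sklearn.linear_model import SGDClassifier

sgd_prob = SGDClassifier(loss="modified_huber", penalty="l2",
                         alpha=0.0001, max_iter=1000)
sgd_prob.fit(X_train, y_train)
y_score = sgd_prob.predict_proba(X_test)  # class probabilities for PRC curves
```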
In terms of the three reported metrics (weighted F1, AUC-ROC, and AUC-PRC) from Table 6, Table 9 and Table 10, respectively, the proposed TSC algorithm overall performs best with mostly balanced datasets having two sentiment classes (binary classification), with the U.S. Airline dataset being the exception. TSC performs poorly on two highly imbalanced datasets having more than two sentiment classes (i.e., the Sanders corpus and SemEval 2017). The weighted F1-score is computed only on predicted classes, whereas the AUC scores reflect the performance of a classifier over a range of prediction scores. In comparison with the other classifiers, the proposed TSC algorithm does better on more datasets with the AUC-ROC score. However, if we examine the overall AUC scores ≥ 80, TSC gives better performance on five datasets using the AUC-PRC score as compared to four datasets with the AUC-ROC score. Another point to note is that the size of the tolerance classes for the negative sentiment is larger than the size of the tolerance classes for the positive sentiment in two of these datasets (i.e., IMDB and U.S. Airline). This leads us to conclude that balanced tolerance classes may not be a significant factor and may depend on the vector-generation method.
Table 11 gives the results of the Wilcoxon signed-ranks test based on the weighted F1-scores of the classifiers, using a two-sided test of the null hypothesis that there is no difference between TSC-mean and the other classifiers. The results are pair-wise tests over all datasets. Based on these results, the null hypothesis can be rejected (i.e., there is a difference between the classifiers based on the weighted F1-score).
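A minimal sketch of this test with SciPy, assuming paired per-dataset weighted F1-scores (f1_tsc, f1_other are placeholders) and the conventional significance level of 0.05, which is an assumption rather than a value restated from Table 11:

```python
# Two-sided Wilcoxon signed-ranks test on paired weighted F1-scores of
# TSC-mean vs. another classifier across the ten datasets.
from scipy.stats import wilcoxon

stat, p = wilcoxon(f1_tsc, f1_other, alternative="two-sided")
print("reject H0" if p < 0.05 else "fail to reject H0", p)  # alpha assumed
```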