1. Introduction
The term ‘offensive’ refers to behavior that irritates, angers, or upsets a person or group. The rise of social media has led to a surge in offensive and hateful content on the internet. Because users can remain anonymous, they believe they may express themselves freely and without constraint. Such unpleasant comments deteriorate the mental health of the targeted individual or group. One way to prevent this is to remove offensive comments, but a manual approach to finding and removing offending comments takes considerable effort. Therefore, automated systems that use machine learning and natural language processing tools are being applied in research and technology around the world to detect and reduce offensive content on social media.
Several researchers have studied the issue of identifying offensive content, but as the amount of multilingual content has increased, it has become more challenging. The majority of research has focused on high-resource languages such as English, and only recently have low-resource languages such as the various Indic languages received more attention [1,2,3]. Dravidian languages, which are mainly spoken in south India and northeast Sri Lanka, include Tamil, Malayalam, Telugu, and Kannada. These languages also have a large number of offensive comments on social media, and there is a huge demand for automated systems that can classify regional YouTube comments as offensive or non-offensive. The limitations of these low-resource languages include a small corpus and the lack of standard, benchmarked annotated data.
In this paper, the focus is on identifying YouTube comments written in Tamil script as either offensive or non-offensive. The dataset from the HASOC’21 shared task [4], which includes YouTube comments in Tamil, is used in this work. The annotated corpus contains a total of 5877 YouTube comments, each labeled as offensive or non-offensive. The distribution of data among classes is substantially skewed: the number of offensive comments is considerably lower than that of non-offensive comments. Out of the 5877 comments, 4724 are non-offensive and 1153 are offensive, a ratio of roughly 1:4. As a result, text classifiers become biased toward the non-offensive class, which has far more samples, and offensive class samples are frequently misclassified.
To address this class imbalance, a Multilingual Translation based Data augmentation technique for Offensive content identification in Tamil text data (MTDOT) is proposed in this work. In data augmentation, new data points are generated artificially from existing data points. Data augmentation is useful for reducing the cost of collecting and labeling data as well as improving model prediction accuracy. The class imbalance problem is addressed through data augmentation by generating artificial samples of the rare class in the dataset. Data augmentation approaches are divided into two categories: linguistic and non-linguistic [5]. In the linguistic category, the meaning is preserved after augmentation: a word or sentence is replaced, or an entirely new statement is generated. Back translation is one such linguistic data augmentation technique, in which text is translated into another language and then back into the original language. In this work, offensive class comments are generated using the back-translation data augmentation technique.
Extensive experiments are conducted by applying the widely used SMOTE data-level method [6], the single-level back-translation augmentation method, and our proposed MTDOT method for balancing the HASOC’21 Tamil dataset. Then, text embedding vectors are generated for the balanced dataset using the MuRIL pre-trained model embedding layer. Six different classifiers, namely Support Vector Machine, Naive Bayes, K-Nearest Neighbor (K-NN), Decision Tree, Random Forest, and Majority Voting, are trained using the text embedding vectors. The experimental findings demonstrate that the balanced dataset achieved precision, recall, and F1-score of 0.82, 0.80, and 0.81, respectively, using the MTDOT class balancing approach.
The key contributions of this paper are:
We propose a data-level class balancing technique for addressing class imbalance in offensive content datasets. To the best of our knowledge, we are the first to use data augmentation for handling unequal class distribution in a non-English offensive content dataset, i.e., a Tamil dataset.
In order to achieve a completely balanced dataset, new offensive comments are generated through single-level back translation and multi-level back translation using Malayalam and English as intermediate languages.
Extensive experiments are conducted to demonstrate the effectiveness of the proposed method using various classifiers and existing oversampling and undersampling methods.
The rest of this paper is structured as follows. Section 2 discusses existing work in offensive content detection and class balancing methods. Section 3 contains a description of the dataset used in our study. Section 4 describes the methodology, and Section 5 provides details of the experiments conducted. The results and discussion are in Section 6. Finally, Section 7 concludes the paper.
4. Methodology
In this work, a class balancing method is proposed for balancing the HASOC’21 [4] Tamil dataset for offensive content identification. The architecture of the proposed system is shown in Figure 2.
The proposed system (Figure 2) for the detection of offensive content from YouTube comments in Tamil consists of the following steps:
In the first step, the proposed Multilingual Translation based Data augmentation technique (MTDOT) is applied to the imbalanced Tamil text data to obtain balanced data.
In the second step, MuRIL (Multilingual Representation for Indian Languages) is used as an embedding layer to obtain the representation of YouTube comments written in Tamil script.
The embedding vectors generated from MuRIL layers are then provided to different classifiers.
The detailed description of each step in the proposed system is provided in the following subsections.
4.1. Balancing the Dataset
In Data Augmentation (DA) algorithms, synthetic data are constructed from the samples of the dataset. One application of data augmentation is fixing imbalance by augmenting minority class samples. In the back-translation method, text is translated from one language to another and then back to the original language. The process of data augmentation using back translation is shown in Figure 3.
In this work, different back-translation chains are used. The comments in the dataset are in the Tamil language and written in Tamil script. Because the imbalance ratio is 4, each offensive comment must be augmented three times to balance the dataset completely. For each offensive comment, we generated three augmented comments using the following methods:
Tamil comment is translated to English and then back to Tamil (TET).
Tamil comment is translated to Malayalam; then, the Malayalam comment is translated to English, and it is then back-translated to Tamil (TMET). Here, Malayalam and English are used as intermediate languages.
Tamil comment is translated to Malayalam; then, the Malayalam comment is translated to English, then back to Malayalam, and finally to Tamil (TMEMT). Here, Malayalam is used as an intermediate language twice.
The first balanced dataset (TET+TMET+TMEMT) contains all comments from the original dataset and augmented comments generated by the above methods.
The second balanced dataset (TMT+TKT+TTeT) contains the original dataset and the augmented offensive comments. Each offensive comment is augmented three times using the following methods (a code sketch of these translation chains is given after the list):
Tamil comment is translated to Malayalam and then back to Tamil (TMT).
Tamil comment is translated to Kannada and then back to Tamil (TKT).
Tamil comment is translated to Telugu and then back to Tamil (TTeT).
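These translation chains can be implemented with the Translators Python library used in this work (Section 5.1). The sketch below is illustrative only: the translate_text call and the helper back_translate are assumptions about the API, not the authors’ exact code.

```python
import translators as ts  # the Translators library mentioned in Section 5.1

def back_translate(text: str, pivots: list) -> str:
    """Translate a Tamil comment through a chain of pivot languages and back.

    pivots examples: ["en"] -> TET, ["ml", "en"] -> TMET,
    ["ml", "en", "ml"] -> TMEMT, ["ml"] -> TMT, ["kn"] -> TKT, ["te"] -> TTeT.
    """
    src = "ta"
    for lang in pivots + ["ta"]:  # the final hop returns to Tamil
        # translate_text is assumed here; the exact call may differ
        # across versions of the translators package
        text = ts.translate_text(text, translator="google",
                                 from_language=src, to_language=lang)
        src = lang
    return text

# Three augmented variants per offensive comment (second balanced dataset)
comment = "..."  # an offensive Tamil comment from the dataset
augmented = [back_translate(comment, chain)
             for chain in (["ml"], ["kn"], ["te"])]
```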
4.2. Word Embeddings
Word embedding refers to the representation of words for text analysis, typically as a real-valued vector that encodes the meaning of a word such that words closer in the vector space are expected to be similar in meaning.
In text embedding, tokens are represented as vectors, which are a numerical representation of the word’s semantic meaning. These vectors can be used to train machine learning models for NLP-related tasks. TF-IDF is one such text-embedding technique. Classic embedding models have one disadvantage, though: they cannot disambiguate polysemy, so a token or word with multiple meanings is represented by the same vector.
MuRIL [23], or Multilingual Representations for Indian Languages, is a BERT model that has been pre-trained by Google’s India research unit. It is a multilingual language model trained exclusively on Indian text corpora. The corpora used by the authors are also augmented by translation and transliteration. The model has been trained on 17 languages: English and 16 different Indian languages. It is trained with two objectives: masked language modeling and translation language modeling. By evaluating the model on Indian language tasks and comparing it to the mBERT model, the authors conclude that MuRIL outperforms mBERT on all objectives. In this work, the MuRIL model is used as an embedding layer. After tokenization, the tokens are provided to the MuRIL pre-trained model to generate embedding vectors.
The first dimension of the output represents the number of layers (12 layers + the embedding layer), the second represents the number of tokens, and the third is the hidden size. The sentence embedding can be extracted by averaging over the layers and tokens (usually, the last four layers are considered, but one can take the average over all layers as well). Contextual word embeddings can be extracted by summing the corresponding token outputs (the input tokens here are subword units, not words) and averaging over the layers.
Although some comments in the offensive content dataset contain multiple sentences, the average number of sentences per comment is one. Each comment is labeled as ‘OFF’ (Offensive) or ‘NOT’ (Non-offensive). The text in the comment (single or multi-sentence) is tokenized and provided to the MuRIL embedding layer, which generates embedding vectors for the input tokens. These embedding vectors and their labels are fed into classifiers as training data.
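A minimal sketch of this embedding step, assuming the Hugging Face model id google/muril-base-cased and averaging over the last four hidden layers as described above:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("google/muril-base-cased")
model = AutoModel.from_pretrained("google/muril-base-cased",
                                  output_hidden_states=True)
model.eval()

def embed(comment: str) -> torch.Tensor:
    inputs = tokenizer(comment, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # hidden_states: 13 tensors (embedding layer + 12 layers),
    # each of shape (1, num_tokens, 768)
    last_four = torch.stack(outputs.hidden_states[-4:])
    # average over layers and tokens -> one 768-dimensional sentence vector
    return last_four.mean(dim=0).mean(dim=1).squeeze(0)
```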
4.3. Classifiers
Various classifiers that were trained on the dataset are used to study the effect of balancing the dataset. The embedding vectors generated from the pre-trained model are provided as input to these classifiers. Classifiers such as Support Vector Machine, Naive Bayes, K-NN, Decision Tree, Random Forest and Majority Voting are used for the experiments.
Support Vector Machine (SVM) is a supervised learning model used for classification, regression, and outlier detection. The objective of the SVM algorithm is to generate the optimal line or decision boundary that divides n-dimensional space into classes, so that subsequent data points can be classified easily. This optimal decision boundary is referred to as a hyperplane.
Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes’ theorem. It is a highly scalable classifier used in various classification application domains. The fundamental idea behind Naive Bayes is that each feature makes an equal and independent contribution to the outcome.
K-Nearest Neighbor (K-NN) is a supervised machine learning algorithm. The K-NN algorithm assumes similarity between the new case/data and existing cases and places the new case in the category that is most similar to the existing categories.
A Decision Tree is a simple yet powerful tool for data prediction and classification. The primary purpose of this model is to predict an instance’s class by learning decision rules based on data attributes. An instance begins at the root node and descends to the corresponding child node based on the outcome of the test attribute; it then advances down the tree branch and repeats the process in the next sub-tree.
In ensemble learning, several models are combined to achieve better predictive performance than a single model. Ensemble learning is frequently used to enhance a model’s performance (classification, prediction, function approximation, etc.). Bagging, stacking, and boosting are the three main categories of ensemble learning methods. In bagging, many Decision Trees are trained on different samples of the same dataset, and the predictions are averaged. In stacking, different models are trained on the same data, and another model is used to determine the best way to combine the predictions. In boosting, models are added sequentially to correct the prior models’ predictions, and a weighted average of the predictions is produced.
One such ensemble model is the Random Forest, which uses the bagging ensemble approach and decision trees as individual models. When working collectively, a large number of highly non-correlated models will outperform each of the component models separately. Here, the main element is the low correlation between models. Despite the fact that individual trees may make false predictions, the majority of them will be right; thus, the tree moves in the right direction as a group.
The Majority Voting Ensemble is yet another approach to ensemble learning in which the predictions from multiple other models are combined. For classification, the predictions for each label are added together, and the label with the most votes is predicted. It is a meta-model, i.e., a model built over a collection of existing machine learning models. The models used here are Naive Bayes, K-NN, Decision Tree, and Random Forest.
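A sketch of this meta-model, assuming scikit-learn (the paper does not name its implementation); GaussianNB is assumed for Naive Bayes since MuRIL embeddings are continuous-valued:

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

voting = VotingClassifier(
    estimators=[
        ("nb", GaussianNB()),
        ("knn", KNeighborsClassifier()),
        ("dt", DecisionTreeClassifier()),
        ("rf", RandomForestClassifier()),
    ],
    voting="hard",  # majority vote over the predicted labels
)
# usage: voting.fit(X_train, y_train); voting.predict(X_val)
```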
The above-mentioned models have been abbreviated for ease of representation, as shown in Table 2.
4.4. Evaluation Metrics
To assess the effectiveness of the downstream model, performance metrics such as accuracy, precision, recall, and F1-score were used. Accuracy is the ratio of the number of correct predictions to the total number of predictions and is given in Equation (1). Precision (Equation (2)), recall (Equation (3)), and F1-score were used as performance metrics because accuracy alone is insufficient to measure true performance. The F1-score is the harmonic mean of precision and recall; its formula is given in Equation (4).
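The referenced formulas are the standard definitions in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN):

```latex
\begin{align}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \tag{1} \\
\text{Precision} &= \frac{TP}{TP + FP} \tag{2} \\
\text{Recall}    &= \frac{TP}{TP + FN} \tag{3} \\
F_1 &= \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}
\end{align}
```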
Since the dataset is imbalanced, with more non-offensive comments than offensive ones, accuracy is not a suitable evaluation metric. The offensive class, which has fewer samples in the dataset used for this study, is more important, but its misclassification has little effect on accuracy. Therefore, the weighted average of precision, recall, and F1-score has been used: the precision, recall, and F1-score are calculated for each label, and the average is then weighted by support.
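Support-weighted averaging is available in scikit-learn (assumed here); a toy illustration:

```python
from sklearn.metrics import precision_recall_fscore_support

y_true = ["NOT", "NOT", "NOT", "NOT", "OFF"]  # toy labels for illustration
y_pred = ["NOT", "NOT", "OFF", "NOT", "OFF"]
# per-label scores are computed first, then averaged weighted by support
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted"
)
```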
In addition to the above metrics, the Bilingual Evaluation Understudy score (BLEU score) proposed by Kishore Papineni [24] has also been used. It is a metric for comparing a generated sentence to a reference sentence: a score of 1.0 indicates a perfect match, whereas a score of 0.0 indicates a perfect mismatch. Although it was designed for translation, it can also be used to assess text output for a variety of natural language processing tasks, and it has been applied to various language generation problems addressed with deep learning techniques, such as text summarization, language generation, image caption generation, and speech recognition. Here, the BLEU score is used to evaluate how the generated augmented comments vary from the original ones. An implementation of the BLEU score is provided by the Python Natural Language Toolkit library, NLTK.
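A minimal example of scoring an augmented comment against its original with NLTK’s sentence-level BLEU; whitespace tokenization and the smoothing function are assumptions for illustration:

```python
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

original = "original Tamil comment tokenized into words".split()
augmented = "back-translated Tamil comment tokenized into words".split()
smooth = SmoothingFunction().method1  # avoids zero scores on short sentences
# 1.0 = perfect match with the reference, 0.0 = perfect mismatch
score = sentence_bleu([original], augmented, smoothing_function=smooth)
```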
5. Experiments
In this study, we have used the HASOC’21 dataset [4] for offensive content identification. Since the data are written in native Tamil script, the task of identifying offensive content is challenging, and techniques designed for commonly used English-language NLP models are not directly applicable to this type of data.
Various experiments are conducted with the following objectives:
To study the classifiers’ performance for the original imbalanced Tamil text data.
To study the effect of balancing the data using the NearMiss undersampling technique.
To study the effect of balancing the data using the SMOTE oversampling technique.
To study the effect of balancing the data using the single-level back-translation augmentation method.
To study the impact of the proposed MTDOT class balancing method.
5.1. Experimental Setup
The dataset is preprocessed by removing special characters such as [, +, /, #, @, &, etc. After preprocessing, the dataset is split into a training and validation set, with a validation set size of 20% and a random state of 42. The tokenizer provided with the MuRIL model is used to tokenize the dataset’s texts. The model is imported using the Hugging Face Transformers Python package. The maximum sequence length is 512 tokens. The pre-trained MuRIL model embedding layer receives the tokenized text as input and returns embedding vectors. After max pooling, the model generates a final vector of length 768. The downstream classifiers, namely Support Vector Machine, Naive Bayes, K-Nearest Neighbor, Decision Tree, Random Forest, and Majority Voting, are then trained using these vectors. The Translators Python library is used for translating the Tamil text into different languages.
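A sketch of the preprocessing and split described above, assuming scikit-learn’s train_test_split; the exact character pattern and the comments/labels variables are placeholders:

```python
import re
from sklearn.model_selection import train_test_split

def preprocess(text: str) -> str:
    # remove special characters such as [ + / # @ & (pattern assumed)
    return re.sub(r"[\[\]+/#@&]", " ", text).strip()

texts = [preprocess(c) for c in comments]  # comments, labels: loaded dataset
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.20, random_state=42
)
```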
5.2. Baseline Experiment with Original Dataset
In the first experiment, the classifiers were trained on the original offensive content identification dataset. The steps followed in training are described in the experimental setup section. The precision, recall, and F1-score of all classifiers for the non-offensive class and the offensive class are shown in Figure 4 and Figure 5, respectively. For the non-offensive class, the highest precision of 0.93 (Naive Bayes), recall of 1.00 (Support Vector Machine), and F1-score of 0.90 (Random Forest and Majority Voting) are achieved. Except for Naive Bayes (NB), all classifiers achieved almost the same precision for the non-offensive class.
For the offensive class, which is of more interest, the highest precision of 0.66 (Random Forest), recall of 0.75 (Naive Bayes), and F1-score of 0.53 (Naive Bayes) are achieved. SVM does not perform well in classifying offensive class comments. With imbalanced data, the separating hyperplane produced by an SVM is skewed toward the minority class: the ratio of positive (offensive) to negative (non-offensive) support vectors becomes more imbalanced as the data imbalance increases, so samples at the boundary of the hyperplane are more likely to be labeled as non-offensive. Naive Bayes performs comparatively well for both the offensive and non-offensive classes compared to the other classifiers.
5.3. Near Miss Undersampling Method for Addressing Class Imbalance
The classifiers perform poorly for the offensive class in comparison with the non-offensive class, which is evident from Figure 4 and Figure 5. This degradation in performance is due to the class imbalance in the dataset. In this experiment, the Near Miss undersampling method [19] is used to obtain a balanced dataset. Near Miss is a group of undersampling techniques that pick examples based on the distance between majority and minority class examples. The technique comes in three versions: NearMiss-1, NearMiss-2, and NearMiss-3. In this experiment, NearMiss-3 is used, where the majority class examples that are closest to each minority class example are selected.
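NearMiss-3 is available in the imbalanced-learn library (assumed here, as the paper does not name its implementation); X_embeddings and y_labels stand for the MuRIL embedding vectors and their labels:

```python
from imblearn.under_sampling import NearMiss

# version=3: keep the majority class examples closest to each minority example
near_miss = NearMiss(version=3)
X_resampled, y_resampled = near_miss.fit_resample(X_embeddings, y_labels)
```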
The classifier results for the offensive class are shown in Figure 6. Except for Naive Bayes, the offensive class F1-score increased after applying the Near Miss undersampling algorithm. For Random Forest and Majority Voting, the highest F1-score of 0.48 is achieved. For the dataset balanced using Near Miss, the classifiers’ F1-scores range from 0.36 to 0.48.
5.4. SMOTE Based Method for Addressing Class Imbalance
In this experiment, the dataset is balanced using the SMOTE oversampling method [6]. Synthetic Minority Oversampling TEchnique (SMOTE) is a widely used oversampling approach in which new synthetic examples are generated for the minority class (the offensive class). In this technique, a random minority class example is first selected, and its k nearest neighbors are located (k is normally equal to 5). A synthetic example is then created at a randomly chosen point in feature space between the example and one of its randomly chosen neighbors. As many synthetic examples of the minority class as needed can be created using this approach.
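A corresponding sketch with imbalanced-learn’s SMOTE (again an assumed implementation; X_embeddings and y_labels denote the MuRIL vectors and their labels):

```python
from imblearn.over_sampling import SMOTE

# interpolate new minority points between a sample and one of its k neighbors
smote = SMOTE(k_neighbors=5, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_embeddings, y_labels)
```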
The classifier results for the offensive class are shown in Figure 7. The F1-score for the offensive class increased after balancing, with the exception of Naive Bayes. Because some minority samples are effectively repeated in SMOTE, no new information is gained, and Naive Bayes does not perform well in this situation. Support Vector Machine achieves the highest F1-score of 0.53. For the Random Forest ensemble classifier, the maximum precision of 0.52 is achieved. For the dataset balanced using SMOTE, the classifiers’ F1-scores range from 0.39 to 0.53.
5.5. Single-Level Back-Translation Augmentation Method for Addressing Class Imbalance
The ratio of offensive comments to non-offensive comments in the offensive content identification dataset is 1:4. In this experiment, data augmentation is used to generate new offensive comments to obtain a balanced dataset. The offensive class samples are generated using the back-translation data augmentation method: each offensive comment is augmented three times using TMT, TKT, and TTeT back-translation, with Malayalam, Kannada, and Telugu as the intermediate language, respectively. Figure 3 illustrates the back-translation process. The number of comments per class before and after balancing using data augmentation is shown in Figure 8.
The offensive class F1-score achieved by all the classifiers is shown in Figure 9. There is a significant improvement in performance compared to SMOTE. The F1-score improvement is highest for the ensemble classifiers, Random Forest and Majority Voting, rising from 0.49 to 0.82.
5.6. Multi-Level Back-Translation Augmentation Method for Addressing Class Imbalance
In this experiment, we studied how the quality of the generated samples and the performance of the classifiers varied when multiple languages were used as intermediate languages instead of just one. Based on the back-translation augmentation method, we propose the class balancing method MTDOT, in which more than one intermediate language is used. Malayalam and English are used for augmenting the offensive comments in three ways. The first method translates the Tamil comment into English and then back into Tamil. In the second method, the Tamil comment is translated to Malayalam, then to English, and then back to Tamil. Finally, in the third method, the Tamil comment is translated to Malayalam, then to English, then back to Malayalam, and finally to Tamil.
Figure 10 depicts the offensive class F1-score obtained by all classifiers. Compared to SMOTE, this method also achieved a significant improvement in offensive class F1-score. The highest improvement is achieved for the ensemble classifiers, Random Forest and Majority Voting, rising from 0.49 to 0.81.
6. Results and Discussion
To study the effect of balancing the offensive content data, experiments are conducted using four methods, namely: Near Miss, SMOTE, the single-level back-translation augmentation method and our proposed MTDOT class balancing method.
Table 3 displays the precision, recall, and F1-score for each class, as well as the weighted average, for all classifiers on the original dataset and the class-balanced datasets. The precision (P), recall (R), and F1-score (F1) for the HASOC’21 dataset are listed in the column named Original Dataset.
The results of the Near Miss undersampling algorithm are displayed in the Near Miss column of Table 3. The performance of the majority of classifiers improved after balancing the dataset using the Near Miss algorithm. In the third column, named SMOTE, results from the widely used SMOTE oversampling method are presented to demonstrate the efficacy of our balancing technique. SMOTE performed better than Near Miss. The dataset used in this work is small, containing 5877 samples; in undersampling, majority class samples are removed, making the dataset even smaller. This reduced amount of training data is the reason Near Miss performs worse than SMOTE.
The results for the HASOC’21 dataset balanced using the back-translation data augmentation method with Malayalam (TMT), Kannada (TKT), and Telugu (TTeT) as intermediate languages are displayed in the column named TMT + TKT + TTeT. In the last column, named TET + TMET + TMEMT, the results for the HASOC’21 dataset balanced using the multi-level back-translation data augmentation method (MTDOT) with Malayalam and English as intermediate languages are listed. Because accurately identifying offensive comments is of more importance, the F1-scores for the offensive class are shown in bold, and the highest F1-score for the offensive class is displayed in blue. The comparative performance of all classifiers for the offensive class is shown in Figure 11.
The F1-score for the ‘OFF’ (offensive) class in the original dataset is extremely low, ranging from 0 to 0.31, which is a serious issue. The F1-score of the offensive class for the balanced dataset has improved significantly for all classifiers, as seen in Table 3. The highest F1-score of 0.81 is achieved for the ‘OFF’ (offensive) class by the Random Forest and Majority Voting ensemble models. The proposed data-augmentation-based class balancing technique (MTDOT) outperforms the SMOTE oversampling method. For Support Vector Machine, the performance improvement over SMOTE is the greatest.
Data augmentation is implemented in two ways. In the first, three intermediate languages are used for back-translation: Tamil–Malayalam–Tamil (TMT), Tamil–Kannada–Tamil (TKT), and Tamil–Telugu–Tamil (TTeT). In multi-level back translation, English and Malayalam are used as intermediate languages at three levels: Tamil–English–Tamil (TET), Tamil–Malayalam–English–Tamil (TMET), and Tamil–Malayalam–English–Malayalam–Tamil (TMEMT). The performance improvement over the original unbalanced dataset is nearly identical for both methods; as a result, either method can be adopted for data augmentation. The BLEU score is computed to assess the quality of the generated statements, as discussed in Section 4.4. The BLEU scores for the balanced datasets are shown in Table 4.
For the TMT + TKT + TTeT dataset, the BLEU score is 0.258, and for TET + TMET + TMEMT, it is 0.205. This signifies that multi-level back-translation augmentation generates more diversified data than single-level back-translation augmentation.
The F1-score for the offensive class, as well as the weighted average of the F1-scores for the offensive and non-offensive classes, increased significantly for the balanced dataset, as shown in Table 3. Support Vector Machine shows the highest improvement in offensive class F1-score, from 0.0 to 0.73; Naive Bayes improves from 0.53 to 0.75, K-NN from 0.33 to 0.80, Decision Tree from 0.37 to 0.71, Random Forest from 0.33 to 0.81, and Majority Voting from 0.31 to 0.81.