1. Introduction
Access to online social media is currently associated with free speech and the exchange of ideas, thoughts, and opinions. However, this freedom and anonymity come with a downside: the virtual world can give rise to various forms of undesired behavior, including hate speech, harassment, and derogatory comments. Such messages can negatively impact not only the targeted individuals, but also society as a whole. Sureka and Agarwal [1] showed a prominent correlation between extremist messages on social networks and the spread of harmful ideologies. Furthermore, due to the anonymity offered by the Internet, users find it easier to engage in abusive behavior without facing immediate consequences. This shift towards online communication and the accompanying proliferation of offensive language is a cause for concern, requiring the development of automated mechanisms to identify and address such content.
Following the global outbreak of the COVID-19 pandemic in early 2020, there was a notable surge in online content consumption (https://www.iab.com/research/global-consumer-insights-four-fundamental-shifts-in-media-advertising-during-2020/, accessed 30 October 2023). This heightened activity led to a further radicalization of online discourse, characterized by the proliferation of extremist content, hate speech, and toxic behavior on digital platforms, and has become a focal point of concern for various stakeholders. An illustrative instance of this alarming trend is the observed positive correlation between hate speech on Twitter and crime levels in London [2]. The need to address this issue is underscored by the growing influence and impact of dangerous ideologies that find fertile ground in the virtual realm and affect society on a broader level [3,4].
Automated systems for detecting offensive language have gained significant attention, as they offer a promising solution to the challenge of moderating the ever-increasing volume of content on online platforms. Traditional content moderation on these platforms, including social media, discussion forums, and chat applications, is struggling to keep up with the sheer quantity of material, underscoring the need for automated solutions. However, despite significant advancements in the field [5,6,7], challenges persist, particularly for low-resource languages. Much of the research and development effort has centered on widely spoken languages such as English, German, French, and Spanish, leaving a resource gap for less-commonly spoken languages. Bridging this disparity is essential to ensure that online content in diverse languages can be effectively moderated, as offensive content knows no linguistic boundaries.
In the context of Romanian social media, Meza et al. [8] highlighted that the most-prevalent target groups for offensive messages are welfare recipients or poor people, followed by the Roma and Hungarian communities. This finding supported the 2018 study [9] by the Elie Wiesel National Institute for Studying the Holocaust in Romania, which analyzed Facebook posts and underlined that hate speech is not limited to one specific minority group; rather, the same users target several groups based on race, nationality, religion, or sexual orientation. This behavior was aggravated by the COVID-19 context in the years 2020–2022. In its 2020 report [10], ActiveWatch pointed out that, apart from the Roma population, the Romanian Diaspora became a target of online hate speech. A recent paper by Manolescu and Çöltekin [11] introduced a Romanian offensive language corpus that provides a detailed categorization of offensive language into targeted and untargeted offenses. Höfels et al. [12] presented CoRoSeOf, a large social media corpus of the Romanian language annotated for sexist and offensive language; their work encompassed a detailed account of the annotation process and preliminary analyses, as well as a baseline classification for sexism detection using SVM-based models.
In this article, our primary focus was to enhance the toolkit available for offensive language detection in Romanian. Our main objective was to narrow the gap between Romanian, which currently lacks human-annotated datasets for hate speech and offensive language detection, and well-resourced languages such as English, German, Italian, and Chinese. In addition to this main objective, we pursued two secondary research objectives. First, we aimed to investigate the sequence tagging capabilities of state-of-the-art models applied to the newly developed dataset; to this end, we adopted a metric in which precision and recall are redefined over text spans, granting credit for partial overlaps while penalizing overly broad selections. Second, we sought to harness the capabilities of Large Language Models (LLMs) for detecting toxic language sequences; to our knowledge, these specific capabilities have not been thoroughly investigated thus far.
As such, our main contributions were as follows:
6. Discussion
Upon analyzing the results, an interesting observation was that, for the largest model (i.e., GPT-4), the performance of the LLM resembled that of the narrower-scoped BERT-based models. Surprisingly, when evaluating at the word level, there was a more-significant performance gap between these two types of models. This suggests that GPT-based models face challenges when predicting shorter sequences. Indeed, for the GPT-4 model, when considering records with shorter sequences containing less than 15 characters, we observed a drop in character-based recall from 75.81% to 52.59%, resulting in an F1-score of 61.56%. In contrast, the best-performing model, RoBERT+CRF, experienced a drop in recall to only 70.82% for the same shorter sequences. Table 7 provides examples of misclassifications for reference.
Compared to the performance reported on other datasets, our results fall within a broad range of values. For instance, the KOLD dataset [37] reported precision and recall rates of 50.8 and 47.8, respectively, whereas the best model in the “SemEval-2021 Task 5” toxic-span-detection competition, Ref. [61], achieved a precision and recall of 75.01 and 89.66. Thus, our results are within the current range reported for offensive span detection.
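For clarity, the character-based precision and recall referenced above can be computed over sets of character offsets, so that partial overlaps earn partial credit while overly broad predictions lower precision; this is one common instantiation of such a metric (used, e.g., in SemEval-2021 Task 5). The following is a minimal sketch under the assumption that spans are given as (start, end) character offsets; the function and variable names are illustrative and not taken from our codebase:

```python
def char_precision_recall_f1(pred_spans, gold_spans):
    """Character-offset precision/recall/F1 for one record.

    Spans are (start, end) pairs over the comment text; `end` is exclusive.
    Partial overlaps contribute partial credit, while overly broad
    predictions inflate the denominator of precision and are penalized.
    """
    pred_chars = {i for start, end in pred_spans for i in range(start, end)}
    gold_chars = {i for start, end in gold_spans for i in range(start, end)}

    if not pred_chars and not gold_chars:
        # No offensive content predicted or annotated: count as a perfect match.
        return 1.0, 1.0, 1.0

    overlap = len(pred_chars & gold_chars)
    precision = overlap / len(pred_chars) if pred_chars else 0.0
    recall = overlap / len(gold_chars) if gold_chars else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1


# Example: a prediction broader than the annotation keeps full recall
# but loses precision.
print(char_precision_recall_f1(pred_spans=[(10, 30)], gold_spans=[(12, 25)]))
```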
We employed Layer Integrated Gradients, a layer-wise variant of the Integrated Gradients method described by Sundararajan et al. [67], to examine the impact of specific text features on the sequence-detection process. This approach was applied to the RoBERT-based model. To execute this analysis, we used the “Transformers-Interpret” package (https://github.com/cdpierse/Transformers-interpret, accessed 30 November 2023) in Python, which provides the necessary tools for in-depth model interpretation and feature impact assessment. We illustrate this with a detailed analysis of the phrase “Nu mamai cert cu toti analfabetii” (“I don’t argue with all the illiterates anymore”) and its individual textual components. We noticed that the original text contains a misspelling: “nu ma mai” is written as “nu mamai”. Interestingly, this erroneous concatenation closely resembles the Romanian word “mama”, meaning “mother”, and many offensive expressions in Romanian are directed towards the target’s mother. Therefore, this spelling mistake inadvertently introduced ambiguity and raised red flags for the model, as it may have been associated with common offensive expressions. This observation is highlighted in Figure 6, where the tokens comprising the misspelled word contribute negatively to its attribution score, lowering the confidence of its prediction. This illustrates how the model assigns significance to each token in the context of the overall assessment, underscoring the influence of each text feature on the final decision for each token.
In contrast, as depicted in Figure 7, when the misspelling was corrected, its influence shifted positively, and the model’s confidence in the interpretation or classification of the text significantly increased. The correct classification by the model, despite the presence of a misspelling, underscores the robustness of the pretrained models. Such resilience in handling variations and inaccuracies in input data highlights the model’s effectiveness in real-world applications.
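To make the procedure concrete, the following is a minimal sketch of how Layer Integrated Gradients can be applied to a Hugging Face token-classification model using Captum, the library on which Transformers-Interpret is built. The checkpoint path, token index, and target label below are placeholders, not the exact configuration used in our experiments:

```python
import torch
from captum.attr import LayerIntegratedGradients
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Placeholder path; substitute the fine-tuned RoBERT span-tagging checkpoint.
MODEL_PATH = "path/to/finetuned-robert-offensive-span-model"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForTokenClassification.from_pretrained(MODEL_PATH)
model.eval()

text = "Nu mamai cert cu toti analfabetii"
encoding = tokenizer(text, return_tensors="pt")

def forward_func(input_ids, attention_mask, token_index, target_label):
    # Score to be explained: the logit of `target_label` at one token position.
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    return logits[:, token_index, target_label]

# Attribute the score with respect to the input embedding layer.
lig = LayerIntegratedGradients(forward_func, model.get_input_embeddings())

token_index, target_label = 2, 1          # illustrative choices
baseline_ids = torch.full_like(encoding["input_ids"], tokenizer.pad_token_id)

attributions = lig.attribute(
    inputs=encoding["input_ids"],
    baselines=baseline_ids,
    additional_forward_args=(encoding["attention_mask"], token_index, target_label),
    n_steps=50,
)

# Collapse the embedding dimension to obtain one attribution score per token.
scores = attributions.sum(dim=-1).squeeze(0)
tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"][0])
for token, score in zip(tokens, scores):
    print(f"{token:>15s} {score.item():+.4f}")
```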
In our evaluation, we encountered an issue with responses from the GPT-based models where we received malformed JSON answers, specifically lacking a trailing “}”. This issue was only observed in the GPT-3.5 models and occurred in approximately 0.5% of the responses. Fixing these types of errors requires additional attention and programming. Since generative models do not guarantee that the input is always present unchanged in their output, we had to verify that the model did not alter the text being annotated. This verification process also demands extra effort and is a potential error source. We identified some common text modifications that appeared in both GPT-3.5 and GPT-4: a tendency to add or remove spaces from the original text, spontaneous spellchecking of some words, and annotation sequences that were not properly closed with the required “]]”.
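A lightweight post-processing step can absorb these failure modes. The sketch below illustrates the kind of checks described above, assuming the model is asked to return a JSON object containing the annotated comment, with offensive spans wrapped in “[[” and “]]”; the field name and helper functions are illustrative rather than the exact code used in our pipeline:

```python
import json

def parse_response(raw: str) -> dict:
    """Parse a model response, repairing a missing trailing '}' when needed."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        repaired = raw.rstrip()
        if not repaired.endswith("}"):
            # A small fraction of GPT-3.5 responses lacked the closing brace.
            repaired += "}"
        return json.loads(repaired)

def strip_markers(annotated: str) -> str:
    """Remove the span markers so the text can be compared with the original."""
    return annotated.replace("[[", "").replace("]]", "")

def is_unaltered(original: str, annotated: str) -> bool:
    """Check that the model did not rewrite the comment (spacing, spelling, etc.)."""
    return strip_markers(annotated) == original

# Example usage on a hypothetical response with a missing trailing "}".
raw_response = '{"annotated_text": "Nu ma mai cert cu toti [[analfabetii]]"'
parsed = parse_response(raw_response)
assert is_unaltered("Nu ma mai cert cu toti analfabetii", parsed["annotated_text"])
```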
The results in Table 5 and Table 6 show a substantial performance jump between GPT-3.5 and GPT-4. For instance, in 25% of the positive test records, GPT-3.5 did not provide any overlap with the golden record: it either did not detect any offensive sequence or focused on non-offensive terms such as team names (e.g., “STEAUA CAMPIOANA”, “DINAMO”, “stelei, STEAUA”) or repeated words that are not offensive (e.g., “dinamo, rapid, Steaua, blat, Blat, BLAT, BLAT”, “bistritza, BISTRITZA”). For comparison, BERT+CRF had only 34 non-overlapping predictions and GPT-4 only 30.
In order to offer a more-comprehensive understanding of our model’s performance, we conducted a manual analysis of the misclassifications made by our best-performing model, BERT+CRF, in parallel with GPT-4. We categorized these misclassifications into three types: Partial Disagreements (PDs), False Positives (FPs), and False Negatives (FNs). In the PD category, we considered all predictions that overlapped with the ground truth, but did not match the annotated spans exactly. We observed a higher number of PDs for our BERT-based model, specifically 311 out of 638 predicted sequences. Some of these discrepancies were due to the model’s tendency to append trailing punctuation tokens to the detected sequences, while most can be attributed to the trade-off between achieving higher recall and lower precision. In contrast, we found only 187 PDs out of 571 predicted sequences for GPT-4. For the FP category, we analyzed records that contained predicted spans, but had no offensive spans in the ground truth annotations. Both types of models had similar numbers of FPs, with 96 for the BERT-based model and 94 for GPT-4. Finally, FNs referred to records with offensive spans that the models failed to detect. Here, again, we observed a notable difference between the two models: out of the 718 annotated offensive spans, the BERT-based model had 121 FNs, while GPT-4 had 187 FNs. This indicates that almost 26% of the annotated sequences were completely missed by GPT-4 and 17% by the BERT+CRF model. Some examples of these types of errors are presented in Table 8.
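The categorization itself can be reproduced from the predicted and gold character spans of each record. The following is a simplified sketch of how such a breakdown can be computed; the exact boundary rules and record-level bookkeeping used in our analysis may differ:

```python
def overlaps(a, b):
    """True if two (start, end) character spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]

def categorize_record(gold_spans, pred_spans):
    """Return the error counts for one record.

    PD: predicted span that overlaps a gold span but does not match it exactly.
    FP: predicted span in a record with no gold offensive spans.
    FN: gold span with no overlapping prediction.
    """
    counts = {"PD": 0, "FP": 0, "FN": 0, "exact": 0}

    if not gold_spans:
        counts["FP"] += len(pred_spans)
        return counts

    for pred in pred_spans:
        if pred in gold_spans:
            counts["exact"] += 1
        elif any(overlaps(pred, gold) for gold in gold_spans):
            counts["PD"] += 1

    for gold in gold_spans:
        if not any(overlaps(gold, pred) for pred in pred_spans):
            counts["FN"] += 1

    return counts

# Example: one broad prediction overlapping a gold span, plus one missed gold span.
print(categorize_record(gold_spans=[(5, 12), (30, 40)], pred_spans=[(3, 14)]))
# -> {'PD': 1, 'FP': 0, 'FN': 1, 'exact': 0}
```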
The error analysis highlighted areas with room for improvement in the detection models. Of particular interest are offensive sequences that manage to evade both types of models. By directing the research community’s attention toward these specific examples, we can work on enhancing the detection of offensive language. Additionally, we improved the explainability and interpretability of the classification results by providing an annotated sequence corpus, enabling us to narrow our focus on problematic phrases and expressions.
Limitations
There are certain limitations associated with our dataset. One such limitation is that the dataset’s domain could exhibit bias towards sports, with mentions of athletes, teams, and coaches appearing in some offensive texts. There are several effective ways to address the issue of biases related to domain-specific datasets. First, we provided explicit and comprehensive guidelines to help identify offensive language regardless of the domain; such guidelines should cover different forms of offensive language, such as insults, hate speech, and profanity, ensuring that annotators can label the various types of offensive content. Second, for future work, diversifying the data sources would help reduce any existing biases, as collecting data from diverse sources would mean gathering text from a wide range of domains and contexts. Third, using data augmentation techniques, researchers could introduce more diversity into the dataset, helping their models generalize better.
We are also aware of another limitation in the annotation process, which involved a single dominant annotator with whom we went through several iterations before defining the final guidelines. To mitigate this potential personal bias, we validated the annotations with a third-party annotator, who reviewed a random selection of 100 comments. Nevertheless, we recognize the opportunity for further improvement by involving multiple annotators and thus offering a more-balanced perspective. Even so, our dataset remains one of the largest corpora to date annotated at this granular, span-based level.
7. Conclusions and Future Work
In this comprehensive study, we introduced the RO-Offense-Sequences dataset for offensive sequence detection in the Romanian language. The dataset, comprising 4800 comments from a popular Romanian sports website, covers a wide spectrum of offensive language, ranging from mild insults to complex offensive text sequences related to racism, homophobia, profanity, abuse, and threats. In this dataset, we manually labeled the character sequences that give the comments their offensive or abusive character.
Additionally, we conducted a thorough analysis of different state-of-the-art Natural Language Processing models, from narrowly focused language models (i.e., BERT) to generic Large Language Models. Our experiments showed that BERT-based models, especially those pre-trained on the target language (Romanian), outperformed the other models, thus improving offensive language detection. Adding a CRF layer to these models provided more-robust and -reliable results. In contrast, LLMs such as ChatGPT and GPT-4 can perform the specified task with only a few (three to five) examples. Given a well-crafted prompt, these models can reach a performance comparable to that of narrower models trained on thousands of examples.
We also highlighted some specific areas for improvement in the models. For instance, GPT-based models showed difficulties with shorter sequences, indicating the need for further refinement. Additionally, we observed challenges related to malformed responses, stemming from the probabilistic nature of generative models. These findings underscore the potential for ongoing model enhancements.
While the dataset and models present valuable contributions to offensive language detection, several avenues exist for future work. Expanding the dataset by incorporating comments from additional websites could increase the diversity of examples. Moreover, involving multiple annotators for data labeling could enhance dataset robustness.
In conclusion, our research demonstrated the effectiveness of Transformer-based models for offensive language detection in Romanian. This work serves as a foundational step toward creating safer and more-respectful online environments, emphasizing the importance of developing tools to combat hate speech and offensive content on digital communication platforms.