1. Introduction
Access to online social media is currently associated with free speech and the exchange of ideas, thoughts, and opinions. However, this freedom and anonymity come with a downside: the virtual world can give rise to various forms of undesired behavior, including hate speech, harassment, and derogatory comments. Such messages can negatively impact not only the targeted individuals, but also society as a whole. Sureka and Agarwal [1] showed a prominent correlation between extremist messages on social networks and the spread of harmful ideologies. Furthermore, due to the anonymity offered by the Internet, users find it easier to engage in abusive behavior without facing immediate consequences. This shift towards online communication and the accompanying proliferation of offensive language is a cause for concern, requiring the development of automated mechanisms to identify and address such content.
Following the global outbreak of the COVID-19 pandemic in early 2020, there was a notable surge in online content consumption (https://www.iab.com/research/global-consumer-insights-four-fundamental-shifts-in-media-advertising-during-2020/, accessed 30 October 2023). This heightened activity led to a further radicalization of online discourse, characterized by the proliferation of extremist content, hate speech, and toxic behavior on digital platforms, and has become a focal point of concern for various stakeholders. An illustrative instance of this alarming trend is the observed positive correlation between hate speech on Twitter and crime levels in London [2]. The need to address this issue is underscored by the growing influence and impact of dangerous ideologies that find fertile ground in the virtual realm and affect society on a broader level [3,4].
Automated systems for detecting offensive language have gained significant attention, as they offer a promising solution to the challenge of moderating the ever-increasing volume of content on online platforms. Traditional content moderation on these platforms, including social media, discussion forums, and chat applications, is struggling to keep up with the sheer quantity of material, underscoring the need for automated solutions. However, despite significant advancements in the field [5,6,7], challenges persist, particularly for low-resource languages. Much of the research and development effort has centered on widely spoken languages such as English, German, French, and Spanish, leaving a resource gap for less-commonly spoken languages. Bridging this disparity is essential to ensure that online content in diverse languages can be effectively moderated, as offensive content knows no linguistic boundaries.
In the context of Romanian social media, Meza et al. [8] highlighted that the most-prevalent target groups for offensive messages are welfare recipients or poor people, followed by the Roma and Hungarian communities. This finding supported the 2018 study [9] by the Elie Wiesel National Institute for Studying the Holocaust in Romania, which analyzed Facebook posts and underlined that hate speech is not limited to one specific minority group; rather, the same users target several groups based on race, nationality, religion, or sexual orientation. This behavior was aggravated by the COVID-19 context in the years 2020–2022. In its 2020 report [10], ActiveWatch pointed out that, apart from the Roma population, the Romanian Diaspora became a target of online hate speech. A recent paper by Manolescu and Çöltekin [11] introduced a Romanian offensive language corpus that provides a detailed categorization of offensive language into targeted and untargeted offenses. Höfels et al. [12] presented CoRoSeOf, a large social media corpus of the Romanian language annotated for sexist and offensive language; their work encompassed a detailed account of the annotation process and preliminary analyses, as well as a baseline classification for sexism detection using SVM-based models.
In this article, our primary focus was to enhance the toolkit available for offensive language detection in Romanian. Our main objective was to narrow the gap between Romanian, which currently lacks human-annotated datasets for hate speech and offensive language detection, and well-resourced languages such as English, German, Italian, and Chinese. In addition to this main objective, we pursued two secondary research objectives. First, we aimed to investigate the sequence tagging capabilities of state-of-the-art models applied to the newly developed dataset; to this end, we adopted a metric in which precision and recall are redefined over text spans, granting credit for partial overlaps while penalizing overly broad selections. Second, we sought to harness the capabilities of Large Language Models (LLMs) for detecting toxic language sequences; to our knowledge, these specific capabilities have not been thoroughly investigated thus far.
As such, our main contributions were as follows:
6. Discussion
Upon analyzing the results, an interesting observation was that, for the largest model (i.e., GPT-4), the performance of the LLM resembled that of the narrower-scoped BERT-based models. Surprisingly, when evaluating at the word level, there was a more-significant performance gap between these two types of models. This suggests that GPT-based models face challenges when predicting shorter sequences. Indeed, for the GPT-4 model, when considering records with shorter sequences containing less than 15 characters, we observed a drop in character-based recall from 75.81% to 52.59%, resulting in an F1-score of 61.56%. In contrast, the best-performing model, RoBERT+CRF, experienced a drop in recall to only 70.82% for the same shorter sequences. Table 7 provides examples of misclassifications for reference.
Compared to the performance reported on other datasets, our results fall within a broad range of values. For instance, the KOLD dataset [37] reported precision and recall rates of 50.8 and 47.8, respectively, whereas the best model in the “SemEval-2021 Task 5” toxic-span-detection competition, Ref. [61], achieved a precision and recall of 75.01 and 89.66. Thus, our results are within the current range reported for offensive span detection.
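For clarity, the character-based precision and recall referenced above can be computed over sets of character offsets, so that partial overlaps earn partial credit while overly broad predictions lower precision; this is one common instantiation of such a metric (used, e.g., in SemEval-2021 Task 5). The following is a minimal sketch under the assumption that spans are given as (start, end) character offsets; the function and variable names are illustrative and not taken from our codebase:

```python
def char_precision_recall_f1(pred_spans, gold_spans):
    """Character-offset precision/recall/F1 for one record.

    Spans are (start, end) pairs over the comment text; `end` is exclusive.
    Partial overlaps contribute partial credit, while overly broad
    predictions inflate the denominator of precision and are penalized.
    """
    pred_chars = {i for start, end in pred_spans for i in range(start, end)}
    gold_chars = {i for start, end in gold_spans for i in range(start, end)}

    if not pred_chars and not gold_chars:
        # No offensive content predicted or annotated: count as a perfect match.
        return 1.0, 1.0, 1.0

    overlap = len(pred_chars & gold_chars)
    precision = overlap / len(pred_chars) if pred_chars else 0.0
    recall = overlap / len(gold_chars) if gold_chars else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1


# Example: a prediction broader than the annotation keeps full recall
# but loses precision.
print(char_precision_recall_f1(pred_spans=[(10, 30)], gold_spans=[(12, 25)]))
```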
We employed Layer Integrated Gradients, a layer-wise variant of the Integrated Gradients method described by Sundararajan et al. [67], to examine the impact of specific text features on the sequence-detection process. This approach was applied to the RoBERT-based model. To execute this analysis, we used the “Transformers-Interpret” package (https://github.com/cdpierse/Transformers-interpret, accessed 30 November 2023) in Python, which provides the necessary tools for in-depth model interpretation and feature impact assessment. We illustrate this with a detailed analysis of the phrase “Nu mamai cert cu toti analfabetii” (“I don’t argue with all the illiterates anymore”) and its individual textual components. We noticed that the original text contains a misspelling: “nu ma mai” is written as “nu mamai”. Interestingly, this erroneous concatenation closely resembles the Romanian word “mama”, meaning “mother”, and many offensive expressions in Romanian are directed towards the target’s mother. Therefore, this spelling mistake inadvertently introduced ambiguity and raised red flags for the model, as it may have been associated with common offensive expressions. This observation is highlighted in Figure 6, where the tokens comprising the misspelled word contribute negatively to its attribution score, lowering the confidence of its prediction. This illustrates how the model assigns significance to each token in the context of the overall assessment, underscoring the influence of each text feature on the final decision for each token.
In contrast, as depicted in Figure 7, when the misspelling was corrected, its influence shifted positively, and the model’s confidence in the interpretation or classification of the text significantly increased. The correct classification by the model, despite the presence of a misspelling, underscores the robustness of the pretrained models. Such resilience in handling variations and inaccuracies in input data highlights the model’s effectiveness in real-world applications.
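To make the procedure concrete, the following is a minimal sketch of how Layer Integrated Gradients can be applied to a Hugging Face token-classification model using Captum, the library on which Transformers-Interpret is built. The checkpoint path, token index, and target label below are placeholders, not the exact configuration used in our experiments:

```python
import torch
from captum.attr import LayerIntegratedGradients
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Placeholder path; substitute the fine-tuned RoBERT span-tagging checkpoint.
MODEL_PATH = "path/to/finetuned-robert-offensive-span-model"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForTokenClassification.from_pretrained(MODEL_PATH)
model.eval()

text = "Nu mamai cert cu toti analfabetii"
encoding = tokenizer(text, return_tensors="pt")

def forward_func(input_ids, attention_mask, token_index, target_label):
    # Score to be explained: the logit of `target_label` at one token position.
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
    return logits[:, token_index, target_label]

# Attribute the score with respect to the input embedding layer.
lig = LayerIntegratedGradients(forward_func, model.get_input_embeddings())

token_index, target_label = 2, 1          # illustrative choices
baseline_ids = torch.full_like(encoding["input_ids"], tokenizer.pad_token_id)

attributions = lig.attribute(
    inputs=encoding["input_ids"],
    baselines=baseline_ids,
    additional_forward_args=(encoding["attention_mask"], token_index, target_label),
    n_steps=50,
)

# Collapse the embedding dimension to obtain one attribution score per token.
scores = attributions.sum(dim=-1).squeeze(0)
tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"][0])
for token, score in zip(tokens, scores):
    print(f"{token:>15s} {score.item():+.4f}")
```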
In our evaluation, we encountered an issue with responses from the GPT-based models where we received malformed JSON answers, specifically lacking a trailing “}”. This issue was only observed in the GPT-3.5 models and occurred in approximately 0.5% of the responses. Fixing these types of errors requires additional attention and programming. Since generative models do not guarantee that the input is always present unchanged in their output, we had to verify that the model did not alter the text being annotated. This verification process also demands extra effort and is a potential error source. We identified some common text modifications that appeared in both GPT-3.5 and GPT-4: a tendency to add or remove spaces from the original text, spontaneous spellchecking of some words, and annotation sequences that were not properly closed with the required “]]”.
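A lightweight post-processing step can absorb these failure modes. The sketch below illustrates the kind of checks described above, assuming the model is asked to return a JSON object containing the annotated comment, with offensive spans wrapped in “[[” and “]]”; the field name and helper functions are illustrative rather than the exact code used in our pipeline:

```python
import json

def parse_response(raw: str) -> dict:
    """Parse a model response, repairing a missing trailing '}' when needed."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        repaired = raw.rstrip()
        if not repaired.endswith("}"):
            # A small fraction of GPT-3.5 responses lacked the closing brace.
            repaired += "}"
        return json.loads(repaired)

def strip_markers(annotated: str) -> str:
    """Remove the span markers so the text can be compared with the original."""
    return annotated.replace("[[", "").replace("]]", "")

def is_unaltered(original: str, annotated: str) -> bool:
    """Check that the model did not rewrite the comment (spacing, spelling, etc.)."""
    return strip_markers(annotated) == original

# Example usage on a hypothetical response with a missing trailing "}".
raw_response = '{"annotated_text": "Nu ma mai cert cu toti [[analfabetii]]"'
parsed = parse_response(raw_response)
assert is_unaltered("Nu ma mai cert cu toti analfabetii", parsed["annotated_text"])
```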
The results in Table 5 and Table 6 show a substantial performance jump between GPT-3.5 and GPT-4. For instance, in 25% of the positive test records, GPT-3.5 did not provide any overlap with the golden record: it either did not detect any offensive sequence or focused on non-offensive terms such as team names (e.g., “STEAUA CAMPIOANA”, “DINAMO”, “stelei, STEAUA”) or repeated words that are not offensive (e.g., “dinamo, rapid, Steaua, blat, Blat, BLAT, BLAT”, “bistritza, BISTRITZA”). For comparison, BERT+CRF had only 34 non-overlapping predictions and GPT-4 only 30.
In order to offer a more-comprehensive understanding of our model’s performance, we conducted a manual analysis of the misclassifications made by our best-performing model, BERT+CRF, in parallel with GPT-4. We categorized these misclassifications into three types: Partial Disagreements (PDs), False Positives (FPs), and False Negatives (FNs). In the PD category, we considered all predictions that overlapped with the ground truth, but did not match the annotated spans exactly. We observed a higher number of PDs for our BERT-based model, specifically 311 out of 638 predicted sequences. Some of these discrepancies were due to the model’s tendency to append trailing punctuation tokens to the detected sequences, while most can be attributed to the trade-off between achieving higher recall and lower precision. In contrast, we found only 187 PDs out of 571 predicted sequences for GPT-4. For the FP category, we analyzed records that contained predicted spans, but had no offensive spans in the ground truth annotations. Both types of models had similar numbers of FPs, with 96 for the BERT-based model and 94 for GPT-4. Finally, FNs referred to records with offensive spans that the models failed to detect. Here, again, we observed a notable difference between the two models: out of the 718 annotated offensive spans, the BERT-based model had 121 FNs, while GPT-4 had 187 FNs. This indicates that almost 26% of the annotated sequences were completely missed by GPT-4 and 17% by the BERT+CRF model. Some examples of these types of errors are presented in Table 8.
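The categorization itself can be reproduced from the predicted and gold character spans of each record. The following is a simplified sketch of how such a breakdown can be computed; the exact boundary rules and record-level bookkeeping used in our analysis may differ:

```python
def overlaps(a, b):
    """True if two (start, end) character spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]

def categorize_record(gold_spans, pred_spans):
    """Return the error counts for one record.

    PD: predicted span that overlaps a gold span but does not match it exactly.
    FP: predicted span in a record with no gold offensive spans.
    FN: gold span with no overlapping prediction.
    """
    counts = {"PD": 0, "FP": 0, "FN": 0, "exact": 0}

    if not gold_spans:
        counts["FP"] += len(pred_spans)
        return counts

    for pred in pred_spans:
        if pred in gold_spans:
            counts["exact"] += 1
        elif any(overlaps(pred, gold) for gold in gold_spans):
            counts["PD"] += 1

    for gold in gold_spans:
        if not any(overlaps(gold, pred) for pred in pred_spans):
            counts["FN"] += 1

    return counts

# Example: one broad prediction overlapping a gold span, plus one missed gold span.
print(categorize_record(gold_spans=[(5, 12), (30, 40)], pred_spans=[(3, 14)]))
# -> {'PD': 1, 'FP': 0, 'FN': 1, 'exact': 0}
```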
The error analysis highlighted areas with room for improvement in the detection models. Of particular interest are offensive sequences that manage to evade both types of models. By directing the research community’s attention toward these specific examples, we can work on enhancing the detection of offensive language. Additionally, we improved the explainability and interpretability of the classification results by providing an annotated sequence corpus, enabling us to narrow our focus on problematic phrases and expressions.
Limitations
There are certain limitations associated with our dataset. One such limitation is that the dataset’s domain could exhibit bias towards sports, with mentions of athletes, teams, and coaches appearing in some offensive texts. There are several effective ways to address the issue of biases related to domain-specific datasets. First, we provided explicit and comprehensive guidelines to help identify offensive language regardless of the domain; such guidelines should cover different forms of offensive language, such as insults, hate speech, and profanity, ensuring that annotators can label the various types of offensive content. Second, for future work, diversifying the data sources would help reduce any existing biases, as collecting data from diverse sources would mean gathering text from a wide range of domains and contexts. Third, using data augmentation techniques, researchers could introduce more diversity into the dataset, helping their models generalize better.
We are also aware of another limitation in the annotation process, which involved a single dominant annotator with whom we went through several iterations before defining the final guidelines. To mitigate this potential personal bias, we validated the annotations with a third-party annotator, who reviewed a random selection of 100 comments. Nevertheless, we recognize the opportunity for further improvement by involving multiple annotators and thus offering a more-balanced perspective. Even so, our dataset remains one of the largest corpora to date annotated at this granular, span-based level.
7. Conclusions and Future Work
In this comprehensive study, we introduced the RO-Offense-Sequences dataset for offensive sequence detection in the Romanian language. The dataset, comprising 4800 comments from a popular Romanian sports website, covers a wide spectrum of offensive language, ranging from mild insults to complex offensive text sequences related to racism, homophobia, profanity, abuse, and threats. In this dataset, we manually labeled the character sequences that give the comments their offensive or abusive character.
Additionally, we conducted a thorough analysis of different state-of-the-art Natural Language Processing models, from narrowly focused language models (i.e., BERT) to generic Large Language Models. Our experiments showed that BERT-based models, especially those pre-trained on the target language (Romanian), outperformed the other models, thus improving offensive language detection. Adding a CRF layer to these models provided more-robust and -reliable results. In contrast, LLMs such as ChatGPT and GPT-4 can perform the specified task with only a few (three to five) examples. Given a well-crafted prompt, these models can reach a performance comparable to that of narrower models trained on thousands of examples.
We also highlighted some specific areas for improvement in the models. For instance, GPT-based models showed difficulties with shorter sequences, indicating the need for further refinement. Additionally, we observed challenges related to malformed responses, stemming from the probabilistic nature of generative models. These findings underscore the potential for ongoing model enhancements.
While the dataset and models present valuable contributions to offensive language detection, several avenues exist for future work. Expanding the dataset by incorporating comments from additional websites could increase the diversity of examples. Moreover, involving multiple annotators for data labeling could enhance dataset robustness.
In conclusion, our research demonstrated the effectiveness of Transformer-based models for offensive language detection in Romanian. This work serves as a foundational step toward creating safer and more-respectful online environments, emphasizing the importance of developing tools to combat hate speech and offensive content on digital communication platforms.