Offensive Text Span Detection in Romanian Comments Using Large Language Models
Round 1
Reviewer 1 Report (Previous Reviewer 1)
Comments and Suggestions for Authors
All my previous comments have been properly addressed. I have no further comments to add.
Reviewer 2 Report (Previous Reviewer 2)
Comments and Suggestions for Authors
The revised manuscript resolved all of my concerns, and the updated version should be suitable for publication.
This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The topic of offensive language detection with large language models is interesting as well as trending. I read the paper with great interest. However, I am confident that a few critical details need to be highlighted by the authors. The following are some suggestions for critical improvements:
1) The discussion of related work should be written in a way that raises research problems and research gaps in the mind of the inquisitive scientific community. The current version of the paper fails to do that (in reference to Section 2, Related Work). The authors should rectify this issue by clearly summarizing the features and disadvantages of existing studies in a TABULAR manner. At present, Section 2 only describes the state of current research without properly highlighting the research gaps or problems.
2) Please describe the dataset in greater detail. Providing only the number of records is not sufficient. The authors need to provide the number of fields, constraints, ranges of values, maximum length, minimum length, average length, distribution, etc. Provide details of the data transformation tasks (e.g., elimination of stop words, elimination of emojis/icons, or any other type conversions). These detailed steps would ensure research reproducibility. Preferably, the authors could add a block diagram to showcase the detailed data transformation/preprocessing/manipulation steps. Please highlight any software or tools used for the data preprocessing/post-processing.
3) The Method section should introduce the high-level methodology in a clear manner using a block diagram, flow diagram, or high-level conceptual diagram.
4) The authors need to discuss the possible implications of an offensive text being classified as non-offensive, and vice versa (i.e., a non-offensive text being classified as offensive). Based on these justifications, the authors need to prioritize and recommend the best evaluation metric for this scenario (e.g., precision or recall).
Author Response
The topic of offensive language detection with large language models is interesting as well as trending. I read the paper with great interest. However, I am confident that a few critical details need to be highlighted by the authors. The following are some suggestions for critical improvements:
Response: Thank you kindly for your thorough review.
1) The discussion of related work should be written in a way that raises research problems and research gaps in the mind of the inquisitive scientific community. The current version of the paper fails to do that (in reference to Section 2, Related Work). The authors should rectify this issue by clearly summarizing the features and disadvantages of existing studies in a TABULAR manner. At present, Section 2 only describes the state of current research without properly highlighting the research gaps or problems.
Response: Indeed, this was a great addition to our paper. We have introduced descriptions and summative tables (see new Tables 1 & 2) in the Related Work section.
2) Please describe the dataset in greater detail. Providing only the number of records is not sufficient. The authors need to provide the number of fields, constraints, ranges of values, maximum length, minimum length, average length, distribution, etc. Provide details of the data transformation tasks (e.g., elimination of stop words, elimination of emojis/icons, or any other type conversions). These detailed steps would ensure research reproducibility. Preferably, the authors could add a block diagram to showcase the detailed data transformation/preprocessing/manipulation steps. Please highlight any software or tools used for the data preprocessing/post-processing.
Response: Thank you for your guidance on improving the clarity of our paper. In line with your suggestions, we have expanded our analysis of the dataset, extending the statistics presented in Table 4 (previously Table 2). This improved analysis includes a more detailed breakdown of the dataset's characteristics, such as the range of values, offensive span size distribution, and the distribution of text lengths. Additionally, to better illustrate the key features of the dataset, we have created a word cloud based on the content of the offensive sequences. This visualization effectively highlights the most prominent and recurrent words within the dataset, providing a clearer picture of its content and focus.
Furthermore, in Section 4.2 of our paper, we have thoroughly described the preprocessing steps employed in our methodology. This section now includes detailed descriptions of all data transformation tasks and any type conversions performed. We have also specified the software tools and libraries used throughout the processing and post-processing stages. We believe these enhancements improve the comprehensiveness and clarity of our paper, offering readers a deeper understanding of our dataset and methodologies.
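To make these profiling steps concrete, the sketch below shows one way such statistics and the word cloud could be computed with pandas and the wordcloud package. The CSV path and the column names ("text" for the comment, "spans" for a list of [start, end] character offsets) are hypothetical placeholders rather than the exact fields of our dataset.

```python
# Minimal sketch of the dataset profiling described above; file path and
# column names are illustrative placeholders, not the paper's actual schema.
import ast
import pandas as pd
from wordcloud import WordCloud

df = pd.read_csv("ro_offense_spans.csv")
df["spans"] = df["spans"].apply(ast.literal_eval)  # e.g. "[[0, 5], [12, 20]]"

# Comment-length statistics (characters per comment).
print(df["text"].str.len().describe())

# Offensive-span size distribution (characters per annotated span).
span_sizes = [end - start for spans in df["spans"] for start, end in spans]
print(pd.Series(span_sizes).describe())

# Word cloud over the annotated offensive fragments.
fragments = " ".join(
    row.text[start:end] for row in df.itertuples() for start, end in row.spans
)
WordCloud(width=800, height=400).generate(fragments).to_file("offensive_wordcloud.png")
```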
3) The Method section should introduce the high-level methodology in a clear manner using a block diagram, flow diagram, or high-level conceptual diagram.
Response: Thank you for your suggestion to enhance the clarity of the Method section of our paper. In response to your recommendation, we have included a high-level schema (see Figure 3) that succinctly illustrates the methodology employed in our study. We believe that this addition effectively addresses your feedback by offering a comprehensive yet accessible overview of our approach, thereby facilitating a better understanding of our methods and processes.
4) The authors need to discuss the possible implications of an offensive text being classified as non-offensive, and vice versa (i.e., a non-offensive text being classified as offensive). Based on these justifications, the authors need to prioritize and recommend the best evaluation metric for this scenario (e.g., precision or recall).
Response: Thank you for highlighting the necessity of discussing the implications of misclassifications in our offensive text detection system. Addressing this concern, we have now included a detailed discussion in our paper about the potential consequences of such misclassifications, both in terms of offensive texts being wrongly labeled as non-offensive and non-offensive texts being classified as offensive. For this purpose, we added two paragraphs at the beginning of Section 4.4 that cover these scenarios. We hope this addition provides a comprehensive understanding of the implications of misclassification and guides the appropriate choice of evaluation metrics.
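As an illustration of this trade-off, the following is a minimal, hypothetical sketch of character-offset precision and recall for a single comment, in the spirit of the SemEval-2021 toxic spans evaluation; it is illustrative only and not the exact scoring script used in the paper.

```python
def span_precision_recall(pred_offsets, gold_offsets):
    """Character-offset precision/recall for one comment.

    Both arguments are iterables of character indices marked as offensive.
    """
    pred, gold = set(pred_offsets), set(gold_offsets)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 1.0
    recall = tp / len(gold) if gold else 1.0
    return precision, recall

# Over-predicting offsets keeps recall high but hurts precision (risking
# over-moderation); under-predicting does the opposite (missed offenses).
p, r = span_precision_recall(range(0, 12), range(0, 8))
print(p, r)  # 0.666..., 1.0
```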
Reviewer 2 Report
Comments and Suggestions for Authors
In this work, the authors leverage large language models to detect potentially offensive and abusive Romanian language, where different pre-trained models, such as BERT and GPT-4, are explored. The authors find that Transformer-based models can efficiently detect offensive language in Romanian, particularly in online environments.
There are several concerns in this work:
(1) The combination of BERT and CRF is unusual. Why not combine BERT with DNN models for downstream tasks?
(2) Since the authors can access GPT-4, why do they use a degraded version of ChatGPT?
(3) It is not very clear if some discriminative features are helpful for the improvement of the detection system.
(4) The precision rates in Table 3 are still low; it is not clear whether the detection system is useful in practice.
Author Response
In this work, the authors leverage large language models to detect potentially offensive and abusive Romanian language, where different pre-trained models, such as BERT and GPT-4, are explored. The authors find that Transformer-based models can efficiently detect offensive language in Romanian, particularly in online environments.
Response: Thank you kindly for your thorough review.
(1) The combination of BERT and CRF is unusual. Why not combine BERT with DNN models for downstream tasks?
Response: Thank you for your insightful observation regarding our choice of model architecture, particularly the combination of BERT with a Conditional Random Field (CRF) layer, as opposed to integrating BERT with Deep Neural Network (DNN) models for downstream tasks. The rationale behind our choice is rooted in the specific requirements and nuances of our task – offensive span detection in text. CRF layers are notably effective in sequence tagging tasks such as Named Entity Recognition (NER). When paired with BERT, this combination enhances robustness and stability, particularly in sequence tagging scenarios. These properties are crucial for the focus of our research, which is the detection of offensive spans in text.
In our study, we explored various model architectures, including BERT with Deep Neural Network (DNN) models. One such model we utilized was RoBERT, a BERT variant pretrained on the Romanian language. This choice allowed us to evaluate and compare the efficacy of different architectures in our context.
Moreover, our decision to investigate BERT+CRF was also inspired by its successful application in the "SemEval-2021 Task 5: Toxic Spans Detection" competition, where it was used by the top-performing team (Zhu et al., SemEval 2021). We believed it was imperative to assess this model's performance in our specific task of detecting offensive spans, given its proven effectiveness in a closely related application.
We hope this clarifies the reasoning behind our methodology and model choices. We aimed to thoroughly investigate the capabilities of various combinations to ensure robust and effective offensive span detection. These details were added to the paper.
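For illustration, the following is a minimal sketch of a BERT+CRF tagging head of the kind discussed above, built with the Hugging Face transformers library and the pytorch-crf package; the checkpoint name and the three-tag BIO scheme are illustrative assumptions, not the exact configuration reported in the paper.

```python
# Sketch of a BERT encoder feeding per-token emissions into a CRF layer
# for offensive span tagging; checkpoint and tag set are placeholders.
import torch.nn as nn
from torchcrf import CRF
from transformers import AutoModel

class BertCrfTagger(nn.Module):
    def __init__(self, encoder_name="readerbench/RoBERT-base", num_tags=3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_tags)
        self.crf = CRF(num_tags, batch_first=True)  # BIO tags: O, B-OFF, I-OFF

    def forward(self, input_ids, attention_mask, tags=None):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        emissions = self.classifier(hidden)           # (batch, seq_len, num_tags)
        mask = attention_mask.bool()
        if tags is not None:
            # Training: negative log-likelihood of the gold tag sequence.
            return -self.crf(emissions, tags, mask=mask)
        # Inference: Viterbi decoding of the most likely tag sequence.
        return self.crf.decode(emissions, mask=mask)
```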
(2) Since the authors can access GPT-4, why do they use a degraded version of ChatGPT?
Response: Various considerations influenced our decision to employ ChatGPT alongside GPT-4. First, our aim was to conduct an extensive comparison across different models, and ChatGPT, with its substantial user base and ease of access due to the free tier, was an ideal candidate for this purpose. Second, the cost disparity between the ChatGPT API and the GPT-4 API is critical, particularly for researchers seeking to replicate or apply our findings in their own research, where budget constraints can be a significant factor. Lastly, incorporating these results in future studies allows for ongoing assessment of the evolution and performance of generative models, drawing on their historical performance data. This explanation was added to the manuscript.
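To illustrate how such a side-by-side comparison can be run, the hypothetical sketch below queries both chat models with the same span-extraction prompt through the OpenAI Python SDK (v1+); the prompt wording is a placeholder, not the exact prompt used in our experiments.

```python
# Hypothetical comparison of the two chat models on one prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
prompt = "List the offensive word spans in the following Romanian comment: ..."

for model in ("gpt-3.5-turbo", "gpt-4"):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic-ish output for easier comparison
    )
    print(model, response.choices[0].message.content)
```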
(3) It is not very clear if some discriminative features are helpful for the improvement of the detection system.
Response: Thank you for your inquiry about the methods we used for model interpretation and feature impact assessment in our study. To address this aspect, we employed Layer Integrated Gradients, a variant of Integrated Gradients. This approach was specifically applied to our RoBERT-based model. For the execution of this analysis, we utilized the 'Transformers-Interpret' package in Python. Using Layer Integrated Gradients provides visualizations that help understand how different textual components and their nuances contribute to the model's output. By employing this technique, we gained valuable insights into the features that play a pivotal role in the detection of offensive content. This understanding helps enhance the model's accuracy and effectiveness in identifying such content. We hope this addition to Section 6 (pages 12-14) clarifies our methodological approach and its relevance to the research objectives.
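A minimal sketch of this kind of analysis with the transformers-interpret package is shown below; the checkpoint name is a placeholder, and a sequence-classification explainer is used purely for illustration, whereas the paper applies Layer Integrated Gradients to its fine-tuned RoBERT-based span model.

```python
# Illustrative attribution analysis with transformers-interpret, which uses
# Layer Integrated Gradients under the hood; model name is a placeholder.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers_interpret import SequenceClassificationExplainer

model_name = "readerbench/RoBERT-base"  # placeholder checkpoint
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

explainer = SequenceClassificationExplainer(model, tokenizer)
attributions = explainer("exemplu de comentariu")  # per-token attribution scores
print(attributions)    # list of (token, score) pairs
explainer.visualize()  # renders an HTML heat map of token contributions
```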
(4) The precision rates in Table 3 are still low; it is not clear whether the detection system is useful in practice.
Response: Thank you for your observation regarding the precision rates in Table 3, particularly concerning the generative models. Indeed, there is room for improvement in precision for these models. Notably, GPT-4 demonstrates a significant advancement over GPT-3.5 in our analysis, although it has not yet reached the performance level of a fine-tuned, specialized Transformer model.
We recognize the potential practical application of GPT-4, especially in the area of content moderation. As argued by Wei et al. (2023), recall is considered a more pertinent metric for evaluating success in moderation tasks. Our findings align with this perspective, highlighting the value of GPT-4 in scenarios where high recall is critical.
For context, when we compare metrics in tasks similar to our dataset, there is a broad range of performances. For instance, the KOLD dataset (Jeong et al. 2022) reports precision and recall rates of 50.8 and 47.8, respectively. In contrast, the best model in the "SemEval-2021 Task 5" toxic span detection competition (Zhu et al. 2021) achieved precision and recall of 75.01 and 89.66. These benchmarks help to situate our results within the current landscape of offensive span detection and highlight areas for further improvement.
Our ongoing efforts are focused on enhancing the precision of these models while maintaining or improving their recall performance, aiming to strike a balance that maximizes both the practical utility and accuracy of our detection system. Key details were also added in the paper.