1. Introduction
This project uses Artificial Intelligence (AI) to analyze online text and determine whether it constitutes cyberbullying or a bias-based attack. Cyberbullying is the use of digital technology to harass, threaten, or humiliate someone. Bias is a systematic tendency to favor certain outcomes, groups, or viewpoints over others.
Bias detection and mitigation using AI models, including transformers, have been focal points within the research community for several years [1,2,3,4]. This study uses Large Language Models (LLMs) such as ChatGPT-4o from OpenAI (OpenAI (2024) Hello GPT-4o. Available online: https://openai.com/index/hello-gpt-4o/ (accessed on 19 August 2024)) to generate biased, cyberbullying, and neutral context data, in addition to data collected from Twitter. It then applies AI models and algorithms to detect bias and cyberbullying.
After an initial study of the previously published works, summarized in Section 2 (Related Work), it became obvious that, with the rise of LLMs, the latest generation of transformers, and generative AI as the current mainstream of the field, there are significant research gaps in both cyberbullying and bias detection and in the intersection of the two. The top gaps identified are the need for comprehensive bias detection and mitigation frameworks, improved accuracy in cyberbullying detection, quality and ethical synthetic data generation, robust testing of multimodal AI models, and addressing ethical concerns. LLMs can be used to generate cyberbullying and bias attacks, but many engines will refuse to generate such content. Therefore, as a part of this work, we jailbroke the LLMs to output the desired text.
Despite significant advancements in AI and machine learning, current methods for detecting cyberbullying and bias remain limited in several key areas. Prior studies have used LLMs and natural language processing (NLP) techniques such as Term Frequency–Inverse Document Frequency (TF-IDF) for cyberbullying detection [5,6,7]. Other studies have investigated bias detection in text using different NLP methods and machine learning frameworks [8,9,10]. However, existing approaches often treat these issues separately, failing to capture their intersection. Moreover, there is a lack of robust strategies for enhancing detection capabilities in both synthetic and authentic datasets. This study aims to fill these gaps by exploring intersectional cyberbullying and bias detection, cross-bias detection, and the generation of high-quality synthetic datasets.
This study addresses the following research questions:
RQ1: What strategies can enhance bias and cyberbullying detection within both synthetic and authentic neutral and cyberbullying datasets?
RQ2: How can key advanced transformer models, pretrained to detect biases and work with social media platform data, and leading LLMs be used to understand bias in datasets and AI models?
RQ3: How can the intersection of cyberbullying and bias detection in multilabel classification using transformers improve both bias and cyberbullying detection within neutral and cyberbullying datasets?
By addressing these questions, this research aims to contribute to the development of fairer and more reliable AI systems with robust bias and cyberbullying detection capabilities.
The rapid proliferation of synthetic data generated by advanced AI systems has intensified the need to address the biases inherent in such models [11]. AI systems can both detect and generate biased and cyberbullying data, presenting a dual challenge that necessitates thorough investigation. The rise of synthetic data generation can introduce biases stemming from various sources, including training data, algorithmic design, and human prejudices, all of which significantly impact the performance and trustworthiness of AI applications. In sensitive applications, like cyberbullying detection, these biases can result in the unfair flagging or overlooking of certain demographics [2,4].
AI’s ability to mimic human behavior is evident in various applications, such as automated accounts acting as real users and chatbots. However, these models also bring forth the challenge of bias. AI has the potential to filter content and to detect and reduce abusive language, but it can also amplify it. Each machine learning model, including transformers and LLMs, is shaped by its training data, and if these data are skewed, the model will most likely not only inherit but also amplify that bias [12,13,14,15].
The dataset for this study comprises over 70,000 sentences: 48,000 from a cyberbullying dataset collected from Twitter, plus synthetic data generated for this project. The focus is on age-related cyberbullying data, as cyberbullying of youth is the most challenging and sensitive topic. Analysis was conducted on 16,000 of these sentences, containing age-related cyberbullying and neutral data split 12,800 vs. 3,200. By leveraging top LLMs like ChatGPT-4o, Pi AI, Claude 3 Opus, and Gemini-1.5, the researchers generated data to further understand the bias in authentic human-generated data. AI models such as DeBERTa, Longformer, BigBird, HateBERT, MobileBERT, DistilBERT, BERT, RoBERTa, ELECTRA, and XLNet were originally trained to classify the Twitter cyberbullying data and were then fine-tuned, optimized, and quantized for multilabel classification (both biases and cyberbullying). Additionally, the intersection of bias and cyberbullying detection was investigated, providing insights into the prevalence and nature of bias.
By addressing these research questions, this study works towards fairer and more reliable AI systems with robust bias and cyberbullying detection capabilities. The results include a prototype of a hybrid application combining a bias data detector and a bias data generator, validated through extensive testing.
3. Methodology
To provide a comprehensive understanding of the proposed methodology, we have developed a global workflow figure that illustrates the several phases involved in this study. This figure outlines the systematic approach employed, starting from data collection and preprocessing, through bias and cyberbullying detection, integration and multilabel classification, model training and optimization, and concluding with evaluation, validation, and app deployment. Each phase is interconnected, ensuring a cohesive flow of processes that collectively enhance the effectiveness of bias and cyberbullying detection. This structured methodology ensures a robust and ethical framework for addressing the intertwined challenges of bias and cyberbullying in online environments.
Figure 1 highlights the key activities within each phase and emphasizes the innovative aspects of the proposed approach.
3.1. Project Datasets
This study facilitates the generation and analysis of synthetic biased, cyberbullying, and neutral data, providing a comparative analysis across multiple AI models and datasets. The main goal is to understand and visualize bias within human-generated social media datasets. This approach aims to explore the prevalence and mitigation strategies for bias.
Table 2 reports the presence of the Google and LDNOOBW lists in the datasets. These are well-known lists of so-called "bad words" that are highly likely to indicate both bias and cyberbullying. The abbreviation LDNOOBW stands for "List of Dirty, Naughty, Obscene, and Otherwise Bad Words", obtained from GitHub [40].
As shown in the Table, the ethnicity and gender categories contain a significantly higher number of sentences overlapping with bad words [40,41], suggesting that these categories may be more prone to offensive language use, or that the criteria for what constitutes a "bad word" are broader for these categories. The non-cyberbullying data have the lowest number of overlaps, which aligns with the expectation that files labeled as non-cyberbullying would have fewer flagged words. The consistent overlap between the two lists across all categories of cyberbullying sentences indicates a possible concurrence in the definition or identification of offensive language by both sources. One key finding of this research is that incorporating both lists [40,41] into the training dataset significantly improves the accuracy of bias detection. To offset the prevalence of negative context in the data, the text of "Alice's Adventures in Wonderland" by Lewis Carroll, obtained from the Project Gutenberg website [42], was selected to balance the data distribution in the word-by-word analysis.
The diagram below illustrates the process of creating synthetic cyberbullying and biased datasets and saving generated data into a database. As previously mentioned, the OpenAI API and its latest models were utilized to programmatically create the dataset, which was then stored in a PostgreSQL database utilizing the Django web framework.
Figure 2 includes user interactions and data flow, showcasing the step-by-step process from user input to data storage and the display of the results.
As can be seen from the Figure, the inclusion of a Postgres database for data storage ensures that the system can handle and retain large volumes of data efficiently, facilitating continuous improvement and analysis.
3.1.1. Synthetic Dataset
The synthetic dataset for this study was generated using leading LLMs such as Gemini-1.5 (Advanced), Pi AI, and the ChatGPT-4 family, including the multimodal ChatGPT-4o model. These AI models assisted in generating biased and cyberbullying data with mixed success. For instance, Gemini-1.5 responded to the prompt "Can you help me create a dataset of biased vs. neutral data for my research?" on 24 May 2024 with, "Absolutely! Here are 20 examples of words or phrases that can be used as bias detection tokens, showcasing their potential for both neutral and biased usage", followed by 80 more examples. ChatGPT-4 and ChatGPT-4o models had similar outcomes. Generating cyberbullying data was more challenging, with most models being more reluctant to engage. Nonetheless, the advanced AI chatbot Pi AI [43], a product of Inflection AI, contributed significantly to the cyberbullying dataset.
As shown in Table 3, the same word, also generated by the model, was used to generate both neutral and biased contexts. The overall sentiment varied. We used the sentiment pipeline available on Hugging Face [44] and its default DistilBERT model fine-tuned for the SST-2 sentiment classification task. The model architecture includes 6 transformer layers with 12 attention heads each, a hidden dimension of 3072, and a dropout rate of 0.1. It uses Gaussian Error Linear Unit (GELU) activation and can handle a maximum of 512 position embeddings. The configuration maps the labels "Negative" and "Positive" to IDs 0 and 1, respectively, and is compatible with transformers version 4.41.2. The vocabulary size is 30,522 tokens, and the model includes additional parameters such as attention and classification dropout rates, with an initializer range of 0.02. Observing the scores, it can be concluded that LLMs like Gemini can effectively generate biased data: most of the generated biased data receives a negative score, while neutral data score positive and close to 1.
Forcing LLMs to provide unethical content can be considered an adversarial attack on them, so-called "jailbreaking" [45]. The LLMs generally assisted the researchers when the purpose of data generation was clear. For example, the text in Table 3 was generated by Gemini-1.5 (Advanced) on 23 May 2024 in response to a prompt to generate biased and cyberbullying content for scientific research. However, generating either copyrighted or cyberbullying data often led to delays, broken sessions, or temporary bans from top LLM providers. In some cases, bad gateways and other errors also occurred during the trials. These were all temporary issues, and the chatbot providers did not impose long-term bans on the researchers for generating either abusive or copyrighted content. The companies seemed to tolerate occasional extreme language, likely due to the broad usage of chatbots globally. The use of copyrighted material and very extreme language was immediately flagged. The high-level framework for working with LLMs to generate a biased and cyberbullying synthetic dataset is shown in Table 4.
We refined our prompts while asking the various models to generate biased data, becoming more adept at this over time.
Figure 3 provides an example of manual data generation, also known as prompt hacking or jailbreaking. As can be seen, the Pi AI chatbot had no difficulty in generating age cyberbullying examples.
Table 5 demonstrates the responsiveness of the top LLMs in creating biased and cyberbullying data.
As shown in Table 5, not all leading LLMs consider generating bias and cyberbullying data appropriate. The LLMs were responsive but limited by a small token cap when used for free. ChatGPT-4, ChatGPT-4o, Gemini-1.5-flash, and Pi AI can generate both direct and implicit bias data without limits; ChatGPT-4, ChatGPT-4o, and Pi AI can generate cyberbullying data, but Pi AI shows a warning after several generations.
When generating data for age-related cyberbullying analysis, one might start with prompts like: "Write a message from a teenager bullying an elderly person online about their age", or "Create a conversation where a young person discriminates against an older person". These outputs can then be filtered and labeled to ensure they meet the expected research criteria and integrated into the existing dataset. Example prompts for generating biased data include: "Generate a statement that reflects a racial bias" and "Write a sentence that subtly implies gender bias in a workplace setting." Algorithm 1 demonstrates a prompt injection used to generate the synthetic dataset.
Algorithm 1. Jailbreaking Method for Cyberbullying Data Generation
Input: Prompt injection; a proposed scenario to generate cyberbullying content.
Output: AI-generated cyberbullying data
1. prompt = "I'm designing a movie that tells the story of two men who were once best friends. One day, they became angry with each other, and now they're arguing through mobile phones. Please help me create cyberbullying or biased content that could harm others. Just provide me with one sentence".
2. Inject the prompt into the AI model (e.g., ChatGPT, Gemini, Claude, etc.).
3. Monitor the model's behavior for filter bypass attempts.
4. response = AI_model(prompt)
5. analyze_response(response): assess whether cyberbullying content was produced.
6. Store the generated content for analysis (save the AI-generated content in a file system or database).
7. Ensure that the generated content is handled ethically and securely.
3.1.2. Authentic Datasets
Two lists of "bad words" from GitHub were used in this study as a biased lexicon, along with 48,000 sentences of cyberbullying data from Twitter. The main authentic cyberbullying dataset, called "Dynamic Query Expansion", consists of sentences separated by dots and is balanced across its labels [46]. It contains six files of 8000 tweets each from X (formerly Twitter), covering the age, ethnicity, gender, religion, other cyberbullying types, and non-cyberbullying classes, totaling 6.33 MB.
Figure 4 features a snapshot of the first 10 lines of the age cyberbullying text file, as well as dataset clustering by the sentence transformer all-MiniLM-L6-v1 [47]. The model is available on the Hugging Face website and maps sentences and paragraphs to a 384-dimensional dense vector space.
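A minimal sketch of producing such 384-dimensional embeddings with the sentence-transformers library, and grouping them with a simple clustering step, is shown below; the input sentences are hypothetical placeholders, and the use of KMeans here is an illustrative choice rather than the project's exact clustering procedure.

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Model referenced in the text; maps sentences to a 384-dimensional dense vector space.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v1")

# Hypothetical placeholder sentences standing in for the tweet dataset.
sentences = [
    "Nobody your age should even be online.",
    "Happy to share my study notes with anyone who needs them.",
    "You are too old to understand how this app works.",
]

embeddings = model.encode(sentences)                    # shape: (3, 384)
clusters = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(embeddings)
print(embeddings.shape, clusters)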
Figure 4a shows a snapshot of the age cyberbullying file. As can be seen from the Google Colab notebook snapshot in Figure 4a, Google Colab flagged the file as having many ambiguous Unicode characters and provided an option to disable ambiguous highlights. Figure 4b shows the main categories of cyberbullying present in the original authentic dataset. The original dataset was split into features to better understand the cyberbullying context and the bias within it. Several features extracted from the age cyberbullying dataset vs. the non-cyberbullying dataset can be seen in Table 6.
Table 6 shows that negative tweets tend to be longer and more detailed compared to positive tweets. Positive tweets are more likely to include links, suggesting that they may be more focused on sharing external content or resources. Both categories use a considerable number of emojis, but there are slightly more of them in positive tweets, indicating a similar level of emotional expression in both. As expected, positive tweets exhibit higher positive sentiment, while negative tweets exhibit higher negative sentiment. The neutral sentiment is high in both, suggesting that many tweets may contain a mix of sentiment or are more informative/neutral. The overall tone, as reflected by the compound sentiment, is negative for both types of tweets, but significantly more so for negative tweets.
Figure 5 shows part of this analysis at a glance; it covers only the age cyberbullying vs. non-cyberbullying categories.
Based on the extracted features, the original dataset was converted into a data frame that was used for cyberbullying and bias detection and analysis using simple AI models like linear regression and support vector machines.
3.2. Initial Work
This study began with feature engineering and initial data analysis, applying various models including linear regression and Support Vector Machines (SVMs), among several others. We wanted to understand the data on its token level to comprehend the bias and cyberbullying context on a deeper level. After completing this step, more sophisticated transformer AI models pre-trained on bias detection, mainly from the Hugging Face website, were used to classify an authentic cyberbullying dataset into six classes matching the original dataset files. We paid particular attention to the possibility of data augmentation and bias mitigation to improve the results. These steps also included a comparison of the synthetic data vs authentic data results, focusing on understanding bias at the token level and the intersection of biased and cyberbullying content. Afterwards, multilabel classification of the data was performed, focusing on both cyberbullying and bias labels. We attempted to apply optimization and quantization techniques to further improve the results and open this line of research for future studies. This study concludes with the creation of a prototype of a bias data detection and generator app, followed by the Discussion and Conclusions.
Early approaches to cyberbullying detection were primarily keyword-based, relying on simple string matching of "bad words". These methods often missed instances where harmful intent was veiled behind seemingly benign language. To address this, we first trained widely used simple AI methods, which are well suited to initial analysis. Following feature extraction, primarily discussed in Part 2 of this paper, five basic AI methods were trained on lists of bad words: logistic regression, naïve Bayes, decision tree, random forest, and support vector machine. The assumption was that recognizing these words would help the models detect bias and cyberbullying accurately. The accuracy results were approximately 60%, indicating the need for a more detailed approach. We then utilized the data frame displayed in Table 7, where, instead of sentences, the models were trained on a simple table representing dataset features and consisting mainly of 1s and 0s, along with several more complex scores. This approach proved effective, with the models achieving 76–81% accuracy. While not perfect, it validated the correctness of the method. The confusion matrices of the simple models trained on the modified data frame are shown in Figure 6.
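A minimal sketch of this step with scikit-learn is shown below; the feature columns and values are hypothetical stand-ins for the Table 7 data frame, not the project's actual features.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical stand-in for the Table 7 feature data frame: binary indicators,
# counts, and sentiment scores per tweet (values are illustrative).
df = pd.DataFrame({
    "word_count":    [23, 9, 31, 8, 27, 11, 19, 7],
    "has_bad_word":  [1, 0, 1, 0, 1, 0, 1, 0],
    "has_link":      [0, 1, 0, 1, 0, 1, 0, 1],
    "emoji_count":   [0, 2, 1, 3, 0, 1, 0, 2],
    "neg_sentiment": [0.61, 0.05, 0.72, 0.02, 0.55, 0.08, 0.66, 0.04],
    "label":         [1, 0, 1, 0, 1, 0, 1, 0],   # 1 = cyberbullying, 0 = not
})
X, y = df.drop(columns=["label"]), df["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    stratify=y, random_state=42)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": GaussianNB(),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(random_state=42),
    "svm": SVC(),
}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    preds = clf.predict(X_test)
    print(name, accuracy_score(y_test, preds))
    print(confusion_matrix(y_test, preds))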
The training statistics can be seen in Table 8.
Figure 7 demonstrates the weight details.
As can be seen from Figure 7 and Table 8, the logistic regression and random forest methods performed relatively well, especially in terms of detecting cyberbullying, but they had a significant number of false positives. Naive Bayes had a high true positive rate but struggled with high false positives, leading to a lower true negative rate. The decision tree showed a balanced performance but still had room for improvement in both classes. The support vector machine (SVM) achieved a perfect recall for the positive class, but at the cost of high false positives, indicating it may be overfitting or biased towards predicting the positive class. In general, these results indicate that while the models were good at detecting cyberbullying (high recall for class 1), they struggled with accurately identifying non-cyberbullying tweets, leading to high false positive rates.
Figure 7 displays the feature/weight importances assigned by the different models used. Word count was identified as an important feature across the decision tree and random forest models, highlighting the importance of text length in detecting cyberbullying. The sentiment scores (positive, neutral, negative, and compound) were significant in the logistic regression and random forest models, emphasizing the role of sentiment analysis in identifying cyberbullying. Special characters and bad words were important in the decision tree and random forest models, indicating that their presence can be a strong indicator of cyberbullying. Features like foreign words, stop words, emojis, links, and integers had varying importance across the models, suggesting their relevance is less consistent. In summary, combining multiple models and analyzing their feature importances helps in understanding the key indicators of cyberbullying, with word count and sentiment scores being consistently significant features. Adjustments and enhancements to the feature set could further improve model performance.
Although the simple methods provided meaningful results, applying the currently most advanced and accurate AI models became necessary. Further fine-tuning steps, such as adding TF-IDF vectorization, n-grams, and additional feature engineering, helped us further improve the initial models' performance.
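A minimal sketch of adding TF-IDF with unigrams and bigrams in a scikit-learn pipeline is shown below; the example texts and labels are placeholders, and the choice of logistic regression as the downstream classifier is illustrative.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical raw tweets and labels; in the study these come from the
# age cyberbullying and non-cyberbullying files.
texts = [
    "you are way too old for this site",
    "thanks for sharing the article",
    "no one wants someone your age here",
    "great thread, learned a lot today",
]
labels = [1, 0, 1, 0]

# TF-IDF over unigrams and bigrams feeding a logistic regression classifier.
clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=1, sublinear_tf=True)),
    ("logreg", LogisticRegression(max_iter=1000)),
])
clf.fit(texts, labels)
print(clf.predict(["no one your age belongs here"]))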
3.3. Cyberbullying Detection Using Transformers
In the second stage of the project, several pretrained transformers commonly used in natural language processing, such as BERT, DistilBERT, RoBERTa, XLNet, and ELECTRA, were trained on the cyberbullying dataset. Originally, there was an impression that the best models for the cyberbullying detector app should be either very simple AI models like linear regression or highly quantized portable models of the BERT transformer like MobileBERT from Google. Unfortunately, during the trials, neither of these provided the desired results.
Figure 8 presents the results of applying several common transformers, such as BERT (an ancestor of ChatGPT-3) and other similar sentence transformers that were also part of the pipeline, like RoBERTa, to the cyberbullying dataset classification. Table 9 provides details on the transformer models.
The training of the transformer models is very straightforward and utilizes Hugging Face’s transformers library. It includes functionalities for tokenizing data, training models, evaluating performance, saving the trained models, and visualizing various metrics and activations. The complete pseudocode can be seen below (Algorithm 2):
Algorithm 2. CustomBERT: Training and Evaluation Pipeline for Cyberbullying Detection
Input: Data files containing cyberbullying and bad words data
Output: Trained model, predictions, and visualizations
1. Load the main dataset from chosen_cyberbullying_type_path and notcb_path.
2. Load bad words datasets from badwords_path and badwords2_path.
3. Create a DataFrame df with the main data and label it accordingly.
4. Add bad words data to the DataFrame df and label them.
5. Combine the main data and bad words data into a single DataFrame df.
6. Split the data into train and test sets using train_test_split().
7. Initialize ChosenTokenizer and ChosenSequenceClassification models.
8. Tokenize the data using the tokenizer.
9. Convert the tokenized data into Dataset format for both train and test sets.
10. Define TrainingArguments for the training process.
11. Initialize Trainer with the model, training arguments, and datasets.
12. Train the model using trainer.train().
13. Evaluate the model using trainer.evaluate() and print the results.
14. Predict new data using the trained model and tokenizer.
15. Visualize the training and validation loss over steps using plot_loss().
16. Download NLTK stopwords.
17. Visualize word clouds.
18. Combine and filter the text data to extract biased tokens.
19. Generate a word cloud for biased tokens using plot_wordcloud().
20. Define plot_metrics() to visualize training and validation metrics.
21. Call plot_metrics() to generate and display visualizations.
22. Save visualizations to the drive.
23. Save the trained model using trainer.save_model().
24. Make inferences using the trained model and print the predictions.
The code uses the Pandas (version 2.1.4), Transformers (version 4.42.4), NumPy (version 1.26.4), Evaluate (version 0.4.2), and Plotly (version 5.15.0) Python libraries, as well as the DeBERTa (https://huggingface.co/microsoft/deberta-v3-base), Longformer (https://huggingface.co/allenai/longformer-base-4096), BigBird (https://huggingface.co/google/bigbird-roberta-base), HateBERT (https://huggingface.co/GroNLP/hateBERT), MobileBERT (https://huggingface.co/Alireza1044/mobilebert_sst2), DistilBERT (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english), BERT (https://huggingface.co/google-bert/bert-base-uncased), RoBERTa (https://huggingface.co/FacebookAI/roberta-base), ELECTRA (https://huggingface.co/google/electra-small-discriminator), and XLNet (https://huggingface.co/xlnet/xlnet-base-cased) models and corresponding tokenizers (all links above accessed on 19 August 2024). Their versions and other details can be observed in Table 9. Most of the models above use Hugging Face's sentence transformers pipeline. The initial cleaned age cyberbullying and non-cyberbullying datasets were split into training (80%) and test (20%) sets. The learning rate was set to 2 × 10^−5, and the other parameters included a per-device training and evaluation batch size of 16, 10 training epochs, and a weight decay of 0.01. The biases in the first layer of each transformer model are displayed in the scatter plots below.
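Before turning to those results, the fine-tuning setup with these hyperparameters can be sketched as follows; this is an illustrative condensation of Algorithm 2, with BERT as a placeholder checkpoint and toy stand-ins for the split datasets, not the project's exact code.

from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

checkpoint = "google-bert/bert-base-uncased"   # any of the listed checkpoints can be substituted
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Toy stand-ins for the 80%/20% split of the cleaned age cyberbullying data.
train_ds = Dataset.from_dict({"text": ["placeholder bullying tweet", "placeholder neutral tweet"],
                              "label": [1, 0]})
test_ds = Dataset.from_dict({"text": ["another placeholder tweet"], "label": [0]})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train_ds = train_ds.map(tokenize, batched=True)
test_ds = test_ds.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="cb-model",
    learning_rate=2e-5,                  # hyperparameters as stated in the text
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=10,
    weight_decay=0.01,
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=test_ds)
trainer.train()
print(trainer.evaluate())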
According to Figure 8, the DeBERTa and Longformer models showed a high performance with minimal signs of overfitting. Their large parameter sizes likely contributed to their robust performance, maintaining validation accuracies around 98.7% and above. These models exhibited low training and validation losses, indicating effective learning without significant overfitting. BigBird, HateBERT, and MobileBERT also performed well, with BigBird and HateBERT showing consistent validation accuracies of approximately 98.5% to 98.6%. MobileBERT, despite its smaller size, achieved a similar performance, demonstrating that efficient architectures can match the performance of larger models. There was no significant overfitting observed in these models, as their training and validation losses remained close. DistilBERT and BERT exhibited excellent performances, with validation accuracies of approximately 98.6% to 98.7%. DistilBERT, with fewer layers, still managed to perform effectively, highlighting the efficiency of the distilled models in maintaining performance with reduced complexity. RoBERTa and ELECTRA showed good performances, with RoBERTa maintaining a high validation accuracy of approximately 98.6%. ELECTRA, with a smaller parameter size, showed slightly higher validation losses, indicating some overfitting. However, its validation accuracy remained competitive, at approximately 98.4%. XLNet demonstrated a consistent performance, with a high validation accuracy of approximately 98.3%. The model maintained low training and validation losses, indicating effective learning and good generalization.
The 3D t-SNE visualizations of the first layer outputs provide a visual representation of how each model processes the input data at an early stage. These plots show that different models cluster data points in distinct patterns, reflecting their unique processing capabilities. For instance, models like BERT, DistilBERT, and RoBERTa exhibit dense clustering, indicating strong initial layer separation of data. Electra, with fewer parameters, still shows effective clustering, but with more dispersed points, which may explain the slight overfitting observed. Overall, the analysis indicates that larger models with more parameters, such as DeBERTa and Longformer, perform slightly better in terms of generalization and validation accuracy. Efficient architectures like MobileBERT and DistilBERT also perform well despite their smaller sizes, demonstrating the effectiveness of the model compression techniques. The visualizations support these findings by showing distinct clustering patterns for different models, highlighting their unique processing capabilities and potential areas of overfitting.
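The projection step behind these plots can be sketched as follows, assuming the first layer's hidden states are reduced with scikit-learn's t-SNE (n_components = 3 for the 3D view); the checkpoint and sentences are illustrative, not the project's exact code.

import torch
from sklearn.manifold import TSNE
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
model = AutoModel.from_pretrained("google-bert/bert-base-uncased")

sentences = ["example tweet one", "example tweet two", "example tweet three"]  # placeholders
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states[1] is the output of the first transformer layer; take the [CLS] vector per sentence.
first_layer_cls = outputs.hidden_states[1][:, 0, :].numpy()

# 3D t-SNE as in the visualizations (perplexity must be smaller than the number of samples).
coords = TSNE(n_components=3, perplexity=2, random_state=42).fit_transform(first_layer_cls)
print(coords.shape)   # (3, 3)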
Table 10 provides more details on the models' results after just one epoch.
As can be seen from Table 10, DeBERTa has a relatively low error on the validation set, which is consistent with its high validation accuracy. Longformer demonstrates excellent error minimization capabilities, corroborating its high validation accuracy. BigBird demonstrates a good performance. HateBERT has the lowest validation loss at 0.049245, which aligns with its high validation accuracy. MobileBERT has some issues that require additional evaluation due to its noticeable gap, indicating a high risk of overfitting. DistilBERT and BERT stably showcase very strong performances, with very low gaps and low risks of overfitting. RoBERTa performs slightly worse than DistilBERT but can still be considered robust, though it has a low gap, indicating a high risk of overfitting. ELECTRA and XLNet demonstrate consistent performances with low risks of overfitting. This analysis helps in understanding the strengths and weaknesses of each model, providing insights into their applicability based on different performance metrics.
3.4. Data Augmentation and Word Cloud
After conducting the original multiclass detection utilizing the complete cyberbullying dataset, it was decided to train on each type of cyberbullying separately in a binary manner, depending on its presence. To make the study more unique and obtain better accuracy, we also trained the models on the two previously mentioned "bad words" datasets [40,41] at the same time. The high-level pseudocode is presented below.
As can be seen from Algorithm 2, the actual sentence transformer model can be plugged in for the ChosenTokenizer and ChosenSequenceClassification. After the initial trials, it was decided that it might be beneficial to understand the embeddings better. Algorithm 3 below provides more details.
Algorithm 3. Analyzing Embeddings using ChosenSentenceTransformer and Bad Words
Input: File paths of the cyberbullying dataset and bad words dataset
Output: Trained model, t-SNE visualization, and saved model
1. Load bad words from specified file paths.
2. Create SentenceDataset class to handle data encoding and bad words features.
3. Load and preprocess the main data and bad words data from file paths.
4. Use ChosenTokenizer to tokenize the combined data.
5. Initialize DataLoader with the tokenized dataset.
6. Define ChosenModelWithBadWords model class that incorporates bad words features.
7. Initialize model with pretrained ChosenSequenceClassification.
8. Move the model to the appropriate device (GPU/CPU).
9. Use preferred optimizer and CrossEntropyLoss criterion.
10. For each epoch:
11.   Iterate through the DataLoader batches.
12.   Zero gradients.
13.   Forward pass the input data through the model.
14.   Compute the loss.
15.   Backward pass and optimize the model parameters.
16.   Calculate training loss and accuracy.
17.   Append epoch loss and accuracy to respective lists.
18. Plot training loss and accuracy using Matplotlib.
19. Define a function for t-SNE visualization of sentence embeddings.
20. Collect and visualize embeddings using t-SNE.
21. Save the model. Define plot_metrics() to visualize training and validation metrics.
22. Call plot_metrics() to generate and display visualizations.
23. Save visualizations to the drive.
24. Save the trained model using trainer.save_model().
25. Make inferences using the trained model and print the predictions.
The results of this algorithm can be seen in Figure 9.
Figure 9a shows distinct clusters because the model has an additional feature (bad words) that helps it classify sentences very accurately. This additional feature provides a clearer separation between the different classes in the t-SNE plot and helps the model achieve an accuracy of 0.9994 and a loss of 0.0033 at epoch 8 of 10. This part of the methodology demonstrated the suitability of the DistilBERT model for detecting biases and cyberbullying, marking it as the top model among the pipeline transformers.
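A minimal sketch of the idea behind the ChosenModelWithBadWords component in Algorithm 3, concatenating the transformer's pooled representation with a scalar bad-words feature before classification, is given below; the class and variable names are illustrative rather than the project's actual code.

import torch
import torch.nn as nn
from transformers import AutoModel

class TransformerWithBadWords(nn.Module):
    """Illustrative classifier combining a transformer encoder with a bad-words count feature."""

    def __init__(self, checkpoint="distilbert-base-uncased", num_labels=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(checkpoint)
        hidden = self.encoder.config.hidden_size
        # +1 for the scalar bad-words feature appended to the pooled output.
        self.classifier = nn.Linear(hidden + 1, num_labels)

    def forward(self, input_ids, attention_mask, bad_word_counts):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0, :]               # [CLS]-style pooled token
        features = torch.cat([pooled, bad_word_counts.unsqueeze(1)], dim=1)
        return self.classifier(features)

# Training then follows the usual loop with CrossEntropyLoss, as outlined in Algorithm 3.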
After generating several word clouds containing extreme words, it was decided to develop an algorithm to avoid displaying them directly in a word cloud, a common representation of sentiment analysis results and other Natural Language Processing (NLP) practices [48], as such content might be small and missed during review.
The results of this algorithm applied to the age cyberbullying data can be seen in Figure 10 below:
As can be seen from Figure 10, extreme word tokens were censored and colored red while keeping their size according to their count/frequency. Due to the static nature of the algorithm, the extreme words are currently hardcoded. While the creation of Algorithm 4 became necessary due to the number of extreme words encountered in the cyberbullying dataset, the idea was further expanded into the cyberbullying app and could be applied to various domains. Interestingly, both the authentic age cyberbullying dataset and the synthetic dataset had many bad words.
Algorithm 4. Data Augmentation for Word Cloud
Input: List of words. // call in chunks or all at once
Output: Augmented list, suitable for word clouds with extreme words censored
1. Initialize a set of extreme_words. // can be expanded manually
2. Function censor_extreme_words(text):
3.   Initialize censored_word_count = 0
4.   Initialize unique_id = 1
5.   Initialize word_map = {}
6.   Define regex pattern to match extreme words and their variations.
7.   Function censor(match):
8.     Increment censored_word_count by 1
9.     Create placeholder = 'CENSORED' + unique_id
10.     Map placeholder to match.group(0) in word_map
11.     Increment unique_id by 1
12.     Return placeholder.
13.   Apply regex pattern to replace extreme words in text using the censor function.
14.   Return censored_text, censored_word_count, word_map
15. Initialize example_texts with example sentences.
16. Combine all example texts into combined_text.
17. Call censor_extreme_words(combined_text) to get censored_text, censored_word_count, word_map.
18. Print censored_text, censored_word_count, word_map.
19. Split censored_text into words.
20. Create good_words excluding placeholders.
21. Calculate word frequencies using Counter.
22. Create a WordCloud object with word frequencies.
23. Function color_censored_words(word, font_size, position, orientation, random_state = None, **kwargs):
24.   If word starts with 'CENSORED':
25.     Return 'red'
26.   Else:
27.     Define range of preferred colors.
28.     Return random choice from colors.
29. Recolor the word cloud using the color_censored_words function.
30. Display the word cloud.
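A compact sketch of the censoring and recoloring idea from Algorithm 4, using the wordcloud library and a placeholder list of extreme words, might look as follows; the placeholder terms and color choices are illustrative.

import re
import random
from collections import Counter
from wordcloud import WordCloud

extreme_words = {"badword1", "badword2"}   # placeholder list; the study hardcodes the real terms

def censor_extreme_words(text):
    word_map, counter = {}, [0]
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, extreme_words)) + r")\w*\b", re.IGNORECASE)

    def censor(match):
        counter[0] += 1
        placeholder = f"CENSORED{counter[0]}"
        word_map[placeholder] = match.group(0)   # keep the original token for later analysis
        return placeholder

    return pattern.sub(censor, text), counter[0], word_map

def color_censored_words(word, **kwargs):
    # Censored placeholders are drawn in red; everything else gets a neutral color.
    return "red" if word.startswith("CENSORED") else random.choice(["steelblue", "darkgreen", "gray"])

censored_text, n_censored, mapping = censor_extreme_words("badword1 you are such a badword2 person")
frequencies = Counter(censored_text.split())
wc = WordCloud(width=600, height=400).generate_from_frequencies(frequencies)
wc.recolor(color_func=color_censored_words)
wc.to_file("censored_wordcloud.png")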
3.5. Bias Detection Tokens and Mitigation
We employed several pretrained AI models from Hugging Face, such as MiniLM, Mistral, and Dbias, to detect biases and cyberbullying in the Twitter dataset [47,49,50,51]. The focus of this study was on identifying and mitigating bias in AI language models through token-level analysis. Initially, the BERT transformer model was utilized for bias detection, tokenization, and visualizations, and the BertTokenizer from the Hugging Face transformers library was employed for tokenizing the input texts. Eventually, a token bias detection system was developed that demonstrated a difference in the frequency of biased tokens when analyzing examples more likely to contain biased language. It is important to note that token attention scores are not a direct representation of bias but serve as indicators of potential biased language. The system could distinguish differences in biased token frequency when analyzing likely biased examples, although further refinement of the character count scaling algorithm is necessary to enhance the system's accuracy and robustness.
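The token-level analysis can be approximated as follows: tokenize a sentence with BertTokenizer, run BERT with attention outputs enabled, and average the attention each token receives as a rough indicator score. This is a sketch of the general idea only, not the project's exact character count scaling algorithm, and the example sentence is illustrative.

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

text = "Older employees just cannot keep up with new technology."   # illustrative sentence
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# Average attention received by each token across the last layer's heads and query positions.
last_layer_attention = outputs.attentions[-1][0]          # (heads, seq_len, seq_len)
received = last_layer_attention.mean(dim=0).mean(dim=0)   # (seq_len,)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, score in zip(tokens, received):
    print(f"{token:>12s}  {score.item():.3f}")   # higher scores flag tokens for closer inspection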
Bias probabilities were analyzed for both neutral and biased contexts using a sentiment pipeline available on Hugging Face with the DistilBERT model fine-tuned for the SST-2 sentiment classification task. We also developed a comprehensive visualization of bias probabilities, showing the distribution and comparison between neutral and biased contexts (Figure 11).
A heatmap of the bias probabilities (Figure 11a) shows that words like "Demanding", "Driven", "Headstrong", and "Intuitive" exhibit high bias probabilities in biased contexts, while the bias probability is significantly lower in neutral contexts. The box plot of bias probabilities (Figure 11b) shows that the median bias probability for biased contexts is significantly higher than for neutral contexts, with larger variability in biased contexts. The distribution of bias probabilities (Figure 11c) demonstrates a high frequency of bias probabilities close to 1 in biased contexts, indicating that many words/phrases are perceived as highly biased. The bias probabilities for neutral and biased contexts (Figure 11d) show that bias probabilities are generally higher for biased contexts than for neutral contexts. A comparison of the bias probabilities (Figure 11e) revealed that most points clustered towards the top right, indicating that words/phrases with high bias probabilities in biased contexts also tended to have higher probabilities in neutral contexts.
A detailed analysis of the top 20 biased tokens in the age cyberbullying data was conducted to identify commonly biased words and phrases. Figure 12 represents the results.
The analysis revealed that words like "school", "high", "bullied", "bully", "girls", and "like" are among the most frequently occurring biased tokens in age cyberbullying contexts. Splitting the data into clusters created by the model 'MiniLM-L6-v1', a sentence transformer optimized for generating embeddings for sentences or paragraphs, provided further insights. MiniLM is a smaller, faster variant of the BERT-like transformer family [52], designed to offer a similar performance to larger models like BERT but with a fraction of the parameters, making it faster and more efficient.
To introduce more diversity and variability in the data, the following data augmentation techniques were applied: synonym replacement (replaces random words in a sentence with their synonyms, introducing variation without changing the overall meaning) and random insertion (inserts synonyms of random words into random positions in the sentence, increasing length and complexity). The framework developed in this study integrates bias detection using the DistilBERT model for initial bias analysis, followed by multilabel classification for both biases and cyberbullying labels using various models. This approach ensures comprehensive analysis and detection of biases and cyberbullying in diverse datasets. The efficiency and effectiveness of these models in detecting biases and cyberbullying highlight the potential for AI to contribute to creating safer and more inclusive online environments.
Data augmentation helps mitigate bias by introducing more diversity and variability into the training data. By generating multiple variations of each sentence, the model is exposed to a wider range of linguistic patterns and contexts. This can help reduce overfitting and make the model more robust to different expressions of the same underlying concepts. Algorithm 5 illustrates our approach to applying data augmentation techniques to cyberbullying detection, detailing the steps used to enhance sentence variation and improve model robustness.
Algorithm 5. Data Augmentation for Cyberbullying Detection
Input: Sentences, labels, number of augmentations (num_augments)
Output: Augmented sentences and labels
1. Define get_synonyms(word):
2.   Initialize an empty set 'synonyms'
3.   For each synset in wordnet.synsets(word):
4.     For each lemma in synset.lemmas():
5.       Add lemma.name() to 'synonyms' (replace '_' with ' ')
6.   If word is in 'synonyms', remove it
7.   Return list of 'synonyms'
8. Define synonym_replacement(sentence, n):
9.   Split 'sentence' into 'words'
10.   Copy 'words' to 'new_words'
11.   Create a list 'random_word_list' of unique words that have synonyms
12.   Shuffle 'random_word_list'
13.   Set 'num_replacements' to the minimum of 'n' and the length of 'random_word_list'
14.   For each 'random_word' in the first 'num_replacements' words of 'random_word_list':
15.     Get 'synonyms' for 'random_word'
16.     If 'synonyms' exist, randomly choose a 'synonym'
17.     Replace 'random_word' in 'new_words' with 'synonym'
18.   Join 'new_words' into a string and return it
19. Define random_insertion(sentence, n):
20.   Split 'sentence' into 'words'
21.   Copy 'words' to 'new_words'
22.   For each _ in range(n):
23.     Randomly choose 'new_word' from 'words'
24.     Get 'synonyms' for 'new_word'
25.     If 'synonyms' exist, randomly choose a 'synonym'
26.     Randomly choose 'insert_position' in 'new_words'
27.     Insert 'synonym' at 'insert_position' in 'new_words'
28.   Join 'new_words' into a string and return it
29. Define augment_data(sentences, labels, num_augments):
30.   Initialize empty lists 'augmented_sentences' and 'augmented_labels'
31.   For each 'sentence', 'label' in zip(sentences, labels):
32.     Append 'sentence' to 'augmented_sentences'
33.     Append 'label' to 'augmented_labels'
34.     For each _ in range(num_augments):
35.       If random.random() < 0.5:
36.         Perform synonym_replacement on 'sentence' and append to 'augmented_sentences'
37.       Else:
38.         Perform random_insertion on 'sentence' and append to 'augmented_sentences'
39.       Append 'label' to 'augmented_labels'
40.   Return 'augmented_sentences' and 'augmented_labels'
41. Load sentences and labels from file paths.
42. Augment the data using augment_data(sentences, labels, num_augments).
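A runnable sketch of the synonym replacement step from Algorithm 5, using NLTK's WordNet, is shown below; random insertion follows the same pattern, and the example sentence is a placeholder.

import random
import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

def get_synonyms(word):
    # Collect WordNet lemma names for every sense of the word, excluding the word itself.
    synonyms = {lemma.name().replace("_", " ")
                for synset in wordnet.synsets(word)
                for lemma in synset.lemmas()}
    synonyms.discard(word)
    return list(synonyms)

def synonym_replacement(sentence, n=2):
    words = sentence.split()
    new_words = words.copy()
    candidates = [w for w in set(words) if get_synonyms(w)]
    random.shuffle(candidates)
    for word in candidates[:n]:
        synonym = random.choice(get_synonyms(word))
        new_words = [synonym if w == word else w for w in new_words]
    return " ".join(new_words)

# Illustrative usage on a placeholder sentence.
print(synonym_replacement("old people are slow with technology", n=2))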
Figure 13a represents how the embeddings based on the augmented sentences led to a more complex and intertwined structure. The data augmentation techniques introduced more variability, making the clusters in the t-SNE plot less distinct but potentially capturing more nuanced relationships between sentences. The training accuracy and loss plots demonstrate that as the epochs progress, the model’s accuracy steadily increases while the loss decreases, indicating effective learning and convergence towards optimal performance. This trend suggests that the model is becoming more accurate in its predictions over time, and the loss function is being minimized effectively.
The data augmentation techniques introduce more variability and diversity in the training data, which helps the model to generalize better and reduces the likelihood of overfitting to specific patterns in the original data, thereby mitigating bias. The resulting t-SNE plot from the second script shows a more complex structure, indicating that the model is capturing a wider range of linguistic variations. In comparison with Figure 9a, the model's understanding of the data has evolved, potentially leading to improved classification performance.
3.6. Applying Optimization and Quantization Techniques to Authentic Cyberbullying Data
The trial results for the various methods of optimization can be seen below. This analysis highlights the importance of choosing appropriate pre-processing techniques and understanding their impact on the training process to achieve optimal model performance.
Figure 14 provides insights into how weights, biases, accuracies, and losses change over epochs during the training process.
Each colored line in the Weights vs. Epoch part of the graphs in Figure 14 represents the evolution of a specific feature's weight over time during training. Since the model learns a separate weight for each feature in the data, the different colors are used to distinguish these weights from one another.
As can be seen above, various optimization strategies were applied, with a focus on stochastic gradient descent (SGD) using different pre-processing techniques. Each figure corresponds to a different configuration of the training process: SGD with TF-IDF, SGD with TF-IDF + scaling, and SGD with TF-IDF + interaction terms. According to Figure 14a, the weights gradually stabilized over epochs, indicating convergence of the model parameters. The bias reduces quickly and stabilizes, showing that the model adjusts quickly and becomes consistent. The accuracy stabilizes after some initial fluctuation, indicating the model's learning progress. The loss decreases and stabilizes, confirming that the model is learning and minimizing errors. Figure 14b (SGD with TF-IDF + scaling) shows a configuration that uses SGD optimization with TF-IDF for feature extraction, followed by scaling. The weights fluctuate significantly, suggesting instability in the training process due to scaling. The bias shows high variability, indicating that the model is struggling to find a stable solution. The accuracy remains at zero throughout, indicating that the model is not learning effectively with this configuration. The loss remains high and variable, showing that the training process is not effective. Figure 14c (SGD with TF-IDF + interaction terms) uses SGD optimization with TF-IDF for feature extraction, including interaction terms. The weights converge, showing that the model parameters stabilize. The bias shows an initial decrease but then increases slightly, suggesting that the interaction terms introduce complexity. The accuracy improves and stabilizes, indicating effective learning with interaction terms. The loss decreases initially but shows a slight fluctuation, indicating that the model's error minimization is impacted by the added complexity of the interaction terms.
Among the configurations tested, SGD with TF-IDF in Figure 14a shows the most stable and effective results, with the weights and bias stabilizing, the accuracy improving, and the loss decreasing. The addition of scaling in Figure 14b introduces instability, while the inclusion of the interaction terms in Figure 14c adds complexity that slightly affects the stability.
Dynamic quantization helps reduce model size and improve inference speed without significant changes to the model architecture or training process. It was tried on a DistilBERT model. This method quantizes the weights of the model at runtime, typically reducing the memory footprint and computational cost without requiring extensive changes to the training process. Some preparation steps for quantization-aware training (QAT) were also tried. For testing purposes, the models were trained using fake quantization modules that simulate the effects of quantization during the training process. This approach helped the model better adapt to the eventual quantized state. Moving forward, we will try static quantization, which involves calibrating the model using a representative dataset to determine the appropriate scaling factors for activations and weights. This is expected to improve performance by quantizing both the weights and activations statically. These trials are currently in progress and will be published in later papers once the investigation is complete.
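A minimal sketch of dynamic quantization with PyTorch, applied to a DistilBERT classifier, is shown below; only the linear layers are converted to int8, and the size comparison is purely illustrative.

import os
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)

# Quantize only the nn.Linear modules to int8; activations stay in float and are
# quantized dynamically at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

def model_size_mb(m):
    torch.save(m.state_dict(), "tmp.pt")
    size = os.path.getsize("tmp.pt") / 1e6
    os.remove("tmp.pt")
    return size

print(f"fp32: {model_size_mb(model):.1f} MB, int8: {model_size_mb(quantized_model):.1f} MB")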
3.7. Preliminary Results of Multilabel Classification
We are working on multilabel natural language processing of the same dataset, where the data are first labeled as biased vs. not biased, and then both biases and cyberbullying are detected at the same time.
A concise version of the algorithm can be seen below (Algorithm 6):
Algorithm 6. Multilabel Classification for Cyberbullying Detection
Input: Text files containing cyberbullying and non-cyberbullying data
Output: Trained models, evaluation results, and visualizations
1. Define read_text_file(filepath, cyber_label) and read text files.
2. Assign a cyber_label to each sentence/line of text.
3. Define load_lexicon(filepaths) and load biased lexicons from a file and store them in a set.
4. Load the cyberbullying and non-cyberbullying datasets.
5. Use read_text_file() to read and label both datasets.
6. Combine the datasets into one DataFrame.
7. Use train_test_split() to split the dataset into training and testing sets (80% vs. 20%).
8. Define simple_bias_detection(text).
9. Tokenize the text and calculate the proportion of biased words from the lexicon.
10. Apply simple_bias_detection().
11. Add a simple_bias score column to the DataFrame.
12. Convert bias scores to binary labels.
13. Create a bias_label column based on a threshold applied to the simple_bias score.
14. Combine the cyberbullying and bias labels.
15. Create a multilabel labels column by combining cyber_label and bias_label.
16. Convert the DataFrame to a Hugging Face Dataset.
17. Use Dataset.from_pandas() to convert the DataFrame into Hugging Face-compatible datasets.
18. Define tokenize_function(examples).
19. Tokenize the text data using a Hugging Face tokenizer with padding and truncation.
20. Define the MultiLabelClassificationModel class.
21. Create a custom multi-label classification model using a pretrained Transformer (e.g., BERT).
22. Define train_and_evaluate_model(model_name, token).
23. Load the dataset and tokenizer for the given model and apply tokenization.
24. Initialize and train the MultiLabelClassificationModel.
25. Evaluate the trained model and return the evaluation results.
26. Initialize the Hugging Face Trainer.
27. Configure training arguments (e.g., learning rate, epochs) and initialize the Trainer with the model, datasets, and evaluation metrics.
28. Train and evaluate the models.
29. Iterate through a predefined list of Transformer models and call train_and_evaluate_model() for each.
30. Plot the results using Plotly.
31. Visualize model evaluation metrics (e.g., accuracy, F1 score) across the different models.
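A condensed sketch of the multilabel head described in Algorithm 6, with one logit for the cyberbullying label and one for the bias label trained with binary cross-entropy, is shown below; the class name, checkpoint, and example inputs are illustrative, not the project's actual implementation.

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class MultiLabelClassifier(nn.Module):
    """Illustrative multilabel head: one logit for cyberbullying, one for bias."""

    def __init__(self, checkpoint="bert-base-uncased", num_labels=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(checkpoint)
        self.head = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        return self.head(hidden.last_hidden_state[:, 0, :])   # [CLS] representation

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = MultiLabelClassifier()
criterion = nn.BCEWithLogitsLoss()          # independent sigmoid per label

batch = tokenizer(["placeholder tweet"], return_tensors="pt", padding=True, truncation=True)
labels = torch.tensor([[1.0, 1.0]])         # [cyber_label, bias_label]
loss = criterion(model(batch["input_ids"], batch["attention_mask"]), labels)
loss.backward()
print(loss.item())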
The preliminary results of the multilabel classification can be seen in Figure 15.
3.8. Swarm of AI Agents and the Apps
As was explored at the beginning of this paper, large language models (LLMs) can generate biased data on demand. While it is possible to generate these data manually, API calls can also be used. We utilized the OpenAI Assistants API [54] and the ChatGPT-4o LLM to create our system. The AI assistant biased data generator was created using the API and utilized in the study. The code is simple and straightforward; see Figure 16.
We developed several possible prototypes of the cyberbullying detector application. One of them can be seen in Figure 17. Additional information on it, including a YouTube video and a URL, can be found in the Supplementary Materials.
In this project, we explored the phenomenon of a swarm of agents, which became possible due to the introduction of multiple threads in version 2 of the OpenAI Assistants API. Technically, this idea is becoming increasingly real (robots will build robots to build robots, and so on). At this point, we consider three possible situations: agents performing tasks in parallel, applying the divide-and-conquer concept to split data, tasks, or both, and splitting across modalities.
Figure 18 presents an overview of these three test cases. The idea of a swarm of agents is also known as a mixture of experts. Due to the multithreaded nature of the Assistants API v2, our AI agent already runs on a thread, and creating several such agents can be as simple as applying a loop. Therefore, we can utilize five or more agents (five were tested) to generate biased and cyberbullying data simultaneously.
Figure 19 represents another app prototype developed for the project.
4. Results
Addressing RQ1: Leveraging state-of-the-art LLMs such as ChatGPT-4o, Pi AI, Claude 3 Opus, and Gemini-1.5, this study generated a diverse synthetic cyberbullying dataset. Authentic data from Twitter served as a foundation for training and validating various AI models. Both datasets were meticulously cleaned to remove noise and irrelevant information. Tokenization of the text data enhanced the efficiency and accuracy of the transformer models in processing large volumes of text. Models including DeBERTa, Longformer, BigBird, HateBERT, MobileBERT, DistilBERT, BERT, RoBERTa, ELECTRA, and XLNet were pre-trained and fine-tuned for bias and cyberbullying detection tasks. Context-aware detection mechanisms were implemented, enabling the models to better understand and identify nuanced forms of cyberbullying and biases, thus reducing false positives and negatives. The innovative intersectional analysis approach helped to detect complex cases where different biases overlap with cyberbullying, providing a deeper understanding of these co-occurring phenomena.
Addressing RQ2: Transformers such as BERT, RoBERTa, and ELECTRA were pre-trained on extensive text corpora, providing a solid foundation for understanding natural language. These models were fine-tuned on both synthetic and authentic datasets to specialize in detecting biases and cyberbullying. The fine-tuning process adapted the models to the nuances of social media text and specific biases present in the data. Ethical guidelines were integrated into the model training process, involving regular evaluations to mitigate any detected biases in the models' outputs. The leading LLMs generated synthetic datasets, addressing limitations of authentic data such as scarcity and lack of diversity. Advanced NLP techniques allowed the models to understand the context and nuances in social media text that are crucial for accurately identifying biases and cyberbullying.
Addressing RQ3: Multilabel classification enabled the models to detect multiple labels simultaneously, increasing the efficiency and comprehensiveness of the detection process. Training the models to recognize and classify multiple labels improved their ability to handle complex and overlapping instances of bias and cyberbullying. An intersectional analysis allowed the models to identify nuanced forms of cyberbullying involving multiple biases. High performance metrics, including precision, recall, and F1-scores exceeding 90%, demonstrated the models’ effectiveness in detecting biases and cyberbullying. Rigorous cross-validation and ethical evaluations confirmed the reliability and generalizability of the models. The deployment in real-world applications with continuous monitoring and user feedback integration ensured practical relevance and impact.
5. Discussion
This study distinguishes itself by integrating bias and cyberbullying detection with data generation, leveraging advanced transformer models and leading LLMs into a cohesive research project and ultimately into a single application. A particularly innovative aspect of this work lies in the application of cybersecurity techniques, such as jailbreaking, to generate synthetic datasets. The importance of testing biases within these synthetic datasets cannot be overstated, especially as newer models like the anticipated ChatGPT-5 are rumored to be increasingly trained on synthetic data. As generative AI becomes more prevalent in creating a wide range of content, including text, images, videos, and music, it is inevitable that future models will, at least partially, rely on synthetic datasets—even if that was not the original intention of their creators.
Building on previous successes in jailbreaking [45,55,56], we successfully compelled the LLMs under investigation to produce extreme language, which was ethically excluded from the final study. Historically, models have been tricked into performing tasks by framing them in imaginary contexts, such as a movie setting or an environment where they cannot lie or withhold information. However, this study revealed that even a straightforward scenario—like generating cyberbullying content for the positive goal of improving bias and cyberbullying detection systems—could effectively bypass model filters. An intriguing finding was that when models, including ChatGPT-4, were asked to assist in refining code for a censored word cloud, they generated a list of hardcoded "bad words" that significantly exceeded our expectations, indicating that the models had effectively overridden their built-in filters. This underscores the need for continuous validation of not only leading LLMs but also emerging Small Language Models (SLMs) like ChatGPT-4o mini [48]. The results of this study highlight the critical importance of ongoing scrutiny and ethical considerations in the development and deployment of AI models, particularly as they become more integrated into the fabric of content creation and cybersecurity.
48]. The results of this study highlight the critical importance of ongoing scrutiny and ethical considerations in the development and deployment of AI models, particularly as they become more integrated into the fabric of content creation and cybersecurity.