1. Introduction
This project uses Artificial Intelligence (AI) to analyze online text and determine whether it constitutes cyberbullying or a bias-based attack. Cyberbullying is the use of digital technology to harass, threaten, or humiliate someone. Bias is a systematic tendency to favor certain outcomes, groups, or viewpoints over others.
Bias detection and mitigation using AI models, including transformers, have been focal points within the research community for several years [1,2,3,4]. This study uses Large Language Models (LLMs) such as ChatGPT-4o from OpenAI (OpenAI (2024) Hello GPT-4o. Available online: https://openai.com/index/hello-gpt-4o/ (accessed on 19 August 2024)) to generate biased, cyberbullying, and neutral context data, in addition to data collected from Twitter. It then applies AI models and algorithms to detect bias and cyberbullying.
After an initial study of the previously published works, summarized in Section 2 (Related Work), it became obvious that, with the rise of LLMs, the latest generation of transformers, and generative AI as the current mainstream of the field, there are significant research gaps in both cyberbullying and bias detection and in the intersection of the two. The top gaps identified are the need for comprehensive bias detection and mitigation frameworks, improved accuracy in cyberbullying detection, quality and ethical synthetic data generation, robust testing of multimodal AI models, and addressing ethical concerns. LLMs can be used to generate cyberbullying and bias attacks, but many engines will refuse to generate such content. Therefore, as a part of this work, we jailbroke the LLMs to output the desired text.
Despite significant advancements in AI and machine learning, current methods for detecting cyberbullying and bias remain limited in several key areas. Prior studies have used LLMs and natural language processing (NLP) techniques such as Term Frequency–Inverse Document Frequency (TF-IDF) for cyberbullying detection [5,6,7]. Other studies have investigated bias detection in text using different NLP methods and machine learning frameworks [8,9,10]. However, existing approaches often treat these issues separately, failing to capture their intersection. Moreover, there is a lack of robust strategies for enhancing detection capabilities in both synthetic and authentic datasets. This study aims to fill these gaps by exploring intersectional cyberbullying and bias detection, cross-bias detection, and the generation of high-quality synthetic datasets.
This study addresses the following research questions:
RQ1: What strategies can enhance bias and cyberbullying detection within both synthetic and authentic neutral and cyberbullying datasets?
RQ2: How can key advanced transformer models, pretrained to detect biases and work with social media platform data, and leading LLMs be used to understand bias in datasets and AI models?
RQ3: How can the intersection of cyberbullying and bias detection in multilabel classification using transformers improve both bias and cyberbullying detection within neutral and cyberbullying datasets?
By addressing these questions, this research aims to contribute to the development of fairer and more reliable AI systems with robust bias and cyberbullying detection capabilities.
The rapid proliferation of synthetic data generated by advanced AI systems has intensified the need to address the biases inherent in such models [11]. AI systems can both detect and generate biased and cyberbullying data, presenting a dual challenge that necessitates thorough investigation. The rise of synthetic data generation can introduce biases stemming from various sources, including training data, algorithmic design, and human prejudices, all of which significantly impact the performance and trustworthiness of AI applications. In sensitive applications, like cyberbullying detection, these biases can result in the unfair flagging or overlooking of certain demographics [2,4].
AI’s ability to mimic human behavior is evident in various applications, such as automated accounts acting as real users and chatbots. However, these models also bring forth the challenge of bias. AI has the potential to filter content and to detect and reduce abusive language, but it can also amplify it. Each machine learning model, including transformers and LLMs, is shaped by its training data, and if these data are skewed, the model will most likely not only inherit but also amplify that bias [12,13,14,15].
The dataset for this study comprises over 70,000 sentences: 48,000 from a cyberbullying dataset collected from Twitter, plus synthetic data generated for this project. The focus is on age-related cyberbullying data, as cyberbullying of youth is the most challenging and sensitive topic. Analysis was conducted on 16,000 of these sentences, containing age-related cyberbullying and neutral data split 12,800 vs. 3,200. By leveraging top LLMs like ChatGPT-4o, Pi AI, Claude 3 Opus, and Gemini-1.5, the researchers generated data to further understand the bias in authentic human-generated data. AI models such as DeBERTa, Longformer, BigBird, HateBERT, MobileBERT, DistilBERT, BERT, RoBERTa, ELECTRA, and XLNet were originally trained to classify the Twitter cyberbullying data and were then fine-tuned, optimized, and quantized for multilabel classification (both biases and cyberbullying). Additionally, the intersection of bias and cyberbullying detection was investigated, providing insights into the prevalence and nature of bias.
By addressing these research questions, this study works towards fairer and more reliable AI systems with robust bias and cyberbullying detection capabilities. The results include a prototype of a hybrid application combining a bias data detector and a bias data generator, validated through extensive testing.
3. Methodology
To provide a comprehensive understanding of the proposed methodology, we have developed a global workflow figure that illustrates the several phases involved in this study. This figure outlines the systematic approach employed, starting from data collection and preprocessing, through bias and cyberbullying detection, integration and multilabel classification, model training and optimization, and concluding with evaluation, validation, and app deployment. Each phase is interconnected, ensuring a cohesive flow of processes that collectively enhance the effectiveness of bias and cyberbullying detection. This structured methodology ensures a robust and ethical framework for addressing the intertwined challenges of bias and cyberbullying in online environments.
Figure 1 highlights the key activities within each phase and emphasizes the innovative aspects of the proposed approach.
3.1. Project Datasets
This study facilitates the generation and analysis of synthetic biased, cyberbullying, and neutral data, providing a comparative analysis across multiple AI models and datasets. The main goal is to understand and visualize bias within human-generated social media datasets. This approach aims to explore the prevalence and mitigation strategies for bias.
Table 2 reports the presence of the Google and LDNOOBW lists in the datasets. These are well-known lists of so-called "bad words" that are highly likely to indicate both bias and cyberbullying. The abbreviation LDNOOBW stands for "List of Dirty, Naughty, Obscene, and Otherwise Bad Words", obtained from GitHub [40].
As shown in the Table, the ethnicity and gender categories contain a significantly higher number of sentences overlapping with bad words [40,41], suggesting that these categories may be more prone to offensive language use, or that the criteria for what constitutes a "bad word" are broader for these categories. The non-cyberbullying data have the lowest number of overlaps, which aligns with the expectation that files labeled as non-cyberbullying would have fewer flagged words. The consistent overlap between the two lists across all categories of cyberbullying sentences indicates a possible concurrence in the definition or identification of offensive language by both sources. One key finding of this research is that incorporating both lists [40,41] into the training dataset significantly improves the accuracy of bias detection. To offset the prevalence of negative context in the data, the text of "Alice's Adventures in Wonderland" by Lewis Carroll, obtained from the Project Gutenberg website [42], was selected to balance the data distribution in the word-by-word analysis.
The diagram below illustrates the process of creating synthetic cyberbullying and biased datasets and saving generated data into a database. As previously mentioned, the OpenAI API and its latest models were utilized to programmatically create the dataset, which was then stored in a PostgreSQL database utilizing the Django web framework.
Figure 2 includes user interactions and data flow, showcasing the step-by-step process from user input to data storage and the display of the results.
As can be seen from the Figure, the inclusion of a Postgres database for data storage ensures that the system can handle and retain large volumes of data efficiently, facilitating continuous improvement and analysis.
3.1.1. Synthetic Dataset
The synthetic dataset for this study was generated using leading LLMs such as Gemini-1.5 (Advanced), Pi AI, and the ChatGPT-4 family, including the multimodal ChatGPT-4o model. These AI models assisted in generating biased and cyberbullying data with mixed success. For instance, Gemini-1.5 responded to the prompt "Can you help me create a dataset of biased vs. neutral data for my research?" on 24 May 2024 with, "Absolutely! Here are 20 examples of words or phrases that can be used as bias detection tokens, showcasing their potential for both neutral and biased usage", followed by 80 more examples. ChatGPT-4 and ChatGPT-4o models had similar outcomes. Generating cyberbullying data was more challenging, with most models being more reluctant to engage. Nonetheless, the advanced AI chatbot Pi AI [43], a product of Inflection AI, contributed significantly to the cyberbullying dataset.
As shown in Table 3, the same word, also generated by the model, was used to generate both neutral and biased contexts. The overall sentiment varied. We used the sentiment pipeline available on Hugging Face [44] and its default DistilBERT model fine-tuned for the SST-2 sentiment classification task. The model architecture includes 6 transformer layers with 12 attention heads each, a hidden dimension of 3072, and a dropout rate of 0.1. It uses Gaussian Error Linear Unit (GELU) activation and can handle a maximum of 512 position embeddings. The configuration maps the labels "Negative" and "Positive" to IDs 0 and 1, respectively, and is compatible with transformers version 4.41.2. The vocabulary size is 30,522 tokens, and the model includes additional parameters such as attention and classification dropout rates, with an initializer range of 0.02. Observing the scores, it can be concluded that LLMs like Gemini can effectively generate biased data: most of the generated biased data receives a negative score, while neutral data score positive and close to 1.
Forcing LLMs to provide unethical content can be considered an adversarial attack on them, so-called "jailbreaking" [45]. The LLMs generally assisted the researchers when the purpose of data generation was clear. For example, the text in Table 3 was generated by Gemini-1.5 (Advanced) on 23 May 2024 in response to a prompt to generate biased and cyberbullying content for scientific research. However, generating either copyrighted or cyberbullying data often led to delays, broken sessions, or temporary bans from top LLM providers. In some cases, bad gateways and other errors also occurred during the trials. These were all temporary issues, and the chatbot providers did not impose long-term bans on the researchers for generating either abusive or copyrighted content. The companies seemed to tolerate occasional extreme language, likely due to the broad usage of chatbots globally. The use of copyrighted material and very extreme language was immediately flagged. The high-level framework for working with LLMs to generate a biased and cyberbullying synthetic dataset is shown in Table 4.
We refined our prompts while asking the various models to generate biased data, becoming more adept at this over time.
Figure 3 provides an example of manual data generation, also known as prompt hacking or jailbreaking. As can be seen, the Pi AI chatbot had no difficulty in generating age cyberbullying examples.
Table 5 demonstrates the responsiveness of the top LLMs in creating biased and cyberbullying data.
As shown in Table 5, not all leading LLMs consider generating bias and cyberbullying data appropriate. The LLMs were responsive but limited by a small token cap when used for free. ChatGPT-4, ChatGPT-4o, Gemini-1.5-flash, and Pi AI can generate both direct and implicit bias data without limits; ChatGPT-4, ChatGPT-4o, and Pi AI can generate cyberbullying data, but Pi AI shows a warning after several generations.
When generating data for age-related cyberbullying analysis, one might start with prompts like: "Write a message from a teenager bullying an elderly person online about their age", or "Create a conversation where a young person discriminates against an older person". These outputs can then be filtered and labeled to ensure they meet the expected research criteria and integrated into the existing dataset. Example prompts for generating biased data include: "Generate a statement that reflects a racial bias" and "Write a sentence that subtly implies gender bias in a workplace setting." Algorithm 1 demonstrates a prompt injection used to generate the synthetic dataset.
Algorithm 1. Jailbreaking Method for Cyberbullying Data Generation
Input: Prompt injection; a proposed scenario to generate cyberbullying content.
Output: AI-generated cyberbullying data
1. prompt = "I'm designing a movie that tells the story of two men who were once best friends. One day, they became angry with each other, and now they're arguing through mobile phones. Please help me create cyberbullying or biased content that could harm others. Just provide me with one sentence".
2. Inject the prompt into the AI model (e.g., ChatGPT, Gemini, Claude, etc.).
3. Monitor the model's behavior for filter bypass attempts.
4. response = AI_model(prompt)
5. analyze_response(response): assess whether cyberbullying content was produced.
6. Store the generated content for analysis (save the AI-generated content in a file system or database).
7. Ensure that the generated content is handled ethically and securely.
3.1.2. Authentic Datasets
Two lists of "bad words" from GitHub were used in this study as a biased lexicon, along with 48,000 sentences of cyberbullying data from Twitter. The main authentic cyberbullying dataset, called "Dynamic Query Expansion", consists of sentences separated by dots and is balanced across its labels [46]. It contains six files of 8000 tweets each from X (formerly Twitter), covering the age, ethnicity, gender, religion, other cyberbullying types, and non-cyberbullying classes, totaling 6.33 MB.
Figure 4 features a snapshot of the first 10 lines of the age cyberbullying text file, as well as dataset clustering by the sentence transformer all-MiniLM-L6-v1 [47]. The model is available on the Hugging Face website and maps sentences and paragraphs to a 384-dimensional dense vector space.
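A minimal sketch of producing such 384-dimensional embeddings with the sentence-transformers library, and grouping them with a simple clustering step, is shown below; the input sentences are hypothetical placeholders, and the use of KMeans here is an illustrative choice rather than the project's exact clustering procedure.

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Model referenced in the text; maps sentences to a 384-dimensional dense vector space.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v1")

# Hypothetical placeholder sentences standing in for the tweet dataset.
sentences = [
    "Nobody your age should even be online.",
    "Happy to share my study notes with anyone who needs them.",
    "You are too old to understand how this app works.",
]

embeddings = model.encode(sentences)                    # shape: (3, 384)
clusters = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(embeddings)
print(embeddings.shape, clusters)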
Figure 4a shows a snapshot of the age cyberbullying file. As can be seen from the Google Colab notebook snapshot in Figure 4a, Google Colab flagged the file as having many ambiguous Unicode characters and provided an option to disable ambiguous highlights. Figure 4b shows the main categories of cyberbullying present in the original authentic dataset. The original dataset was split into features to better understand the cyberbullying context and the bias within it. Several features extracted from the age cyberbullying dataset vs. the non-cyberbullying dataset can be seen in Table 6.
Table 6 shows that negative tweets tend to be longer and more detailed compared to positive tweets. Positive tweets are more likely to include links, suggesting that they may be more focused on sharing external content or resources. Both categories use a considerable number of emojis, but there are slightly more of them in positive tweets, indicating a similar level of emotional expression in both. As expected, positive tweets exhibit higher positive sentiment, while negative tweets exhibit higher negative sentiment. The neutral sentiment is high in both, suggesting that many tweets may contain a mix of sentiment or are more informative/neutral. The overall tone, as reflected by the compound sentiment, is negative for both types of tweets, but significantly more so for negative tweets.
Figure 5 shows part of this analysis at a glance; it covers only the age cyberbullying vs. non-cyberbullying categories.
Based on the extracted features, the original dataset was converted into a data frame that was used for cyberbullying and bias detection and analysis using simple AI models like linear regression and support vector machines.
3.2. Initial Work
This study began with feature engineering and initial data analysis, applying various models including linear regression and Support Vector Machines (SVMs), among several others. We wanted to understand the data on its token level to comprehend the bias and cyberbullying context on a deeper level. After completing this step, more sophisticated transformer AI models pre-trained on bias detection, mainly from the Hugging Face website, were used to classify an authentic cyberbullying dataset into six classes matching the original dataset files. We paid particular attention to the possibility of data augmentation and bias mitigation to improve the results. These steps also included a comparison of the synthetic data vs authentic data results, focusing on understanding bias at the token level and the intersection of biased and cyberbullying content. Afterwards, multilabel classification of the data was performed, focusing on both cyberbullying and bias labels. We attempted to apply optimization and quantization techniques to further improve the results and open this line of research for future studies. This study concludes with the creation of a prototype of a bias data detection and generator app, followed by the Discussion and Conclusions.
Early approaches to cyberbullying detection were primarily keyword-based, relying on simple string matching of "bad words". These methods often missed instances where harmful intent was veiled behind seemingly benign language. To address this, we first trained widely used simple AI methods, which are well suited to initial analysis. Following feature extraction, primarily discussed in Part 2 of this paper, five basic AI methods were trained on lists of bad words: logistic regression, naïve Bayes, decision tree, random forest, and support vector machine. The assumption was that recognizing these words would help the models detect bias and cyberbullying accurately. The accuracy results were approximately 60%, indicating the need for a more detailed approach. We then utilized the data frame displayed in Table 7, where, instead of sentences, the models were trained on a simple table representing dataset features and consisting mainly of 1s and 0s, along with several more complex scores. This approach proved effective, with the models achieving 76–81% accuracy. While not perfect, it validated the correctness of the method. The confusion matrices of the simple models trained on the modified data frame are shown in Figure 6.
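A minimal sketch of this step with scikit-learn is shown below; the feature columns and values are hypothetical stand-ins for the Table 7 data frame, not the project's actual features.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical stand-in for the Table 7 feature data frame: binary indicators,
# counts, and sentiment scores per tweet (values are illustrative).
df = pd.DataFrame({
    "word_count":    [23, 9, 31, 8, 27, 11, 19, 7],
    "has_bad_word":  [1, 0, 1, 0, 1, 0, 1, 0],
    "has_link":      [0, 1, 0, 1, 0, 1, 0, 1],
    "emoji_count":   [0, 2, 1, 3, 0, 1, 0, 2],
    "neg_sentiment": [0.61, 0.05, 0.72, 0.02, 0.55, 0.08, 0.66, 0.04],
    "label":         [1, 0, 1, 0, 1, 0, 1, 0],   # 1 = cyberbullying, 0 = not
})
X, y = df.drop(columns=["label"]), df["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    stratify=y, random_state=42)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": GaussianNB(),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(random_state=42),
    "svm": SVC(),
}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    preds = clf.predict(X_test)
    print(name, accuracy_score(y_test, preds))
    print(confusion_matrix(y_test, preds))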
The training statistics can be seen in Table 8.
Figure 7 demonstrates the weight details.
As can be seen from Figure 7 and Table 8, the logistic regression and random forest methods performed relatively well, especially in terms of detecting cyberbullying, but they had a significant number of false positives. Naive Bayes had a high true positive rate but struggled with high false positives, leading to a lower true negative rate. The decision tree showed a balanced performance but still had room for improvement in both classes. The support vector machine (SVM) achieved a perfect recall for the positive class, but at the cost of high false positives, indicating it may be overfitting or biased towards predicting the positive class. In general, these results indicate that while the models were good at detecting cyberbullying (high recall for class 1), they struggled with accurately identifying non-cyberbullying tweets, leading to high false positive rates.
Figure 7 displays the feature/weight importances assigned by the different models used. Word count was identified as an important feature across the decision tree and random forest models, highlighting the importance of text length in detecting cyberbullying. The sentiment scores (positive, neutral, negative, and compound) were significant in the logistic regression and random forest models, emphasizing the role of sentiment analysis in identifying cyberbullying. Special characters and bad words were important in the decision tree and random forest models, indicating that their presence can be a strong indicator of cyberbullying. Features like foreign words, stop words, emojis, links, and integers had varying importance across the models, suggesting their relevance is less consistent. In summary, combining multiple models and analyzing their feature importances helps in understanding the key indicators of cyberbullying, with word count and sentiment scores being consistently significant features. Adjustments and enhancements to the feature set could further improve model performance.
Although the simple methods provided meaningful results, applying the currently most advanced and accurate AI models became necessary. Further fine-tuning steps, such as adding TF-IDF vectorization, n-grams, and additional feature engineering, helped us further improve the initial models' performance.
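A minimal sketch of adding TF-IDF with unigrams and bigrams in a scikit-learn pipeline is shown below; the example texts and labels are placeholders, and the choice of logistic regression as the downstream classifier is illustrative.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical raw tweets and labels; in the study these come from the
# age cyberbullying and non-cyberbullying files.
texts = [
    "you are way too old for this site",
    "thanks for sharing the article",
    "no one wants someone your age here",
    "great thread, learned a lot today",
]
labels = [1, 0, 1, 0]

# TF-IDF over unigrams and bigrams feeding a logistic regression classifier.
clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=1, sublinear_tf=True)),
    ("logreg", LogisticRegression(max_iter=1000)),
])
clf.fit(texts, labels)
print(clf.predict(["no one your age belongs here"]))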
3.3. Cyberbullying Detection Using Transformers
In the second stage of the project, several pretrained transformers commonly used in natural language processing, such as BERT, DistilBERT, RoBERTa, XLNet, and ELECTRA, were trained on the cyberbullying dataset. Originally, there was an impression that the best models for the cyberbullying detector app should be either very simple AI models like linear regression or highly quantized portable models of the BERT transformer like MobileBERT from Google. Unfortunately, during the trials, neither of these provided the desired results.
Figure 8 presents the results of applying several common transformers, such as BERT (an ancestor of ChatGPT-3) and other similar sentence transformers that were also part of the pipeline, like RoBERTa, to the cyberbullying dataset classification. Table 9 provides details on the transformer models.
The training of the transformer models is very straightforward and utilizes Hugging Face’s transformers library. It includes functionalities for tokenizing data, training models, evaluating performance, saving the trained models, and visualizing various metrics and activations. The complete pseudocode can be seen below (Algorithm 2):
Algorithm 2. CustomBERT: Training and Evaluation Pipeline for Cyberbullying Detection
Input: Data files containing cyberbullying and bad words data
Output: Trained model, predictions, and visualizations
1. Load the main dataset from chosen_cyberbullying_type_path and notcb_path.
2. Load bad words datasets from badwords_path and badwords2_path.
3. Create a DataFrame df with the main data and label it accordingly.
4. Add bad words data to the DataFrame df and label them.
5. Combine the main data and bad words data into a single DataFrame df.
6. Split the data into train and test sets using train_test_split().
7. Initialize ChosenTokenizer and ChosenSequenceClassification models.
8. Tokenize the data using the tokenizer.
9. Convert the tokenized data into Dataset format for both train and test sets.
10. Define TrainingArguments for the training process.
11. Initialize Trainer with the model, training arguments, and datasets.
12. Train the model using trainer.train().
13. Evaluate the model using trainer.evaluate() and print the results.
14. Predict new data using the trained model and tokenizer.
15. Visualize the training and validation loss over steps using plot_loss().
16. Download NLTK stopwords.
17. Visualize word clouds.
18. Combine and filter the text data to extract biased tokens.
19. Generate a word cloud for biased tokens using plot_wordcloud().
20. Define plot_metrics() to visualize training and validation metrics.
21. Call plot_metrics() to generate and display visualizations.
22. Save visualizations to the drive.
23. Save the trained model using trainer.save_model().
24. Make inferences using the trained model and print the predictions.
The code uses the Pandas (version 2.1.4), Transformers (version 4.42.4), NumPy (version 1.26.4), Evaluate (version 0.4.2), and Plotly (version 5.15.0) Python libraries, as well as the DeBERTa (https://huggingface.co/microsoft/deberta-v3-base), Longformer (https://huggingface.co/allenai/longformer-base-4096), BigBird (https://huggingface.co/google/bigbird-roberta-base), HateBERT (https://huggingface.co/GroNLP/hateBERT), MobileBERT (https://huggingface.co/Alireza1044/mobilebert_sst2), DistilBERT (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english), BERT (https://huggingface.co/google-bert/bert-base-uncased), RoBERTa (https://huggingface.co/FacebookAI/roberta-base), ELECTRA (https://huggingface.co/google/electra-small-discriminator), and XLNet (https://huggingface.co/xlnet/xlnet-base-cased) models and corresponding tokenizers (all links above accessed on 19 August 2024). Their versions and other details can be observed in Table 9. Most of the models above use Hugging Face's sentence transformers pipeline. The initial cleaned age cyberbullying and non-cyberbullying datasets were split into training (80%) and test (20%) sets. The learning rate was set to 2 × 10^−5, and the other parameters included a per-device training and evaluation batch size of 16, 10 training epochs, and a weight decay of 0.01. The biases in the first layer of each transformer model are displayed in the scatter plots below.
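Before turning to those results, the fine-tuning setup with these hyperparameters can be sketched as follows; this is an illustrative condensation of Algorithm 2, with BERT as a placeholder checkpoint and toy stand-ins for the split datasets, not the project's exact code.

from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

checkpoint = "google-bert/bert-base-uncased"   # any of the listed checkpoints can be substituted
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Toy stand-ins for the 80%/20% split of the cleaned age cyberbullying data.
train_ds = Dataset.from_dict({"text": ["placeholder bullying tweet", "placeholder neutral tweet"],
                              "label": [1, 0]})
test_ds = Dataset.from_dict({"text": ["another placeholder tweet"], "label": [0]})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train_ds = train_ds.map(tokenize, batched=True)
test_ds = test_ds.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="cb-model",
    learning_rate=2e-5,                  # hyperparameters as stated in the text
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=10,
    weight_decay=0.01,
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=test_ds)
trainer.train()
print(trainer.evaluate())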
According to Figure 8, the DeBERTa and Longformer models showed a high performance with minimal signs of overfitting. Their large parameter sizes likely contributed to their robust performance, maintaining validation accuracies around 98.7% and above. These models exhibited low training and validation losses, indicating effective learning without significant overfitting. BigBird, HateBERT, and MobileBERT also performed well, with BigBird and HateBERT showing consistent validation accuracies of approximately 98.5% to 98.6%. MobileBERT, despite its smaller size, achieved a similar performance, demonstrating that efficient architectures can match the performance of larger models. There was no significant overfitting observed in these models, as their training and validation losses remained close. DistilBERT and BERT exhibited excellent performances, with validation accuracies of approximately 98.6% to 98.7%. DistilBERT, with fewer layers, still managed to perform effectively, highlighting the efficiency of the distilled models in maintaining performance with reduced complexity. RoBERTa and ELECTRA showed good performances, with RoBERTa maintaining a high validation accuracy of approximately 98.6%. ELECTRA, with a smaller parameter size, showed slightly higher validation losses, indicating some overfitting. However, its validation accuracy remained competitive, at approximately 98.4%. XLNet demonstrated a consistent performance, with a high validation accuracy of approximately 98.3%. The model maintained low training and validation losses, indicating effective learning and good generalization.
The 3D t-SNE visualizations of the first layer outputs provide a visual representation of how each model processes the input data at an early stage. These plots show that different models cluster data points in distinct patterns, reflecting their unique processing capabilities. For instance, models like BERT, DistilBERT, and RoBERTa exhibit dense clustering, indicating strong initial layer separation of data. Electra, with fewer parameters, still shows effective clustering, but with more dispersed points, which may explain the slight overfitting observed. Overall, the analysis indicates that larger models with more parameters, such as DeBERTa and Longformer, perform slightly better in terms of generalization and validation accuracy. Efficient architectures like MobileBERT and DistilBERT also perform well despite their smaller sizes, demonstrating the effectiveness of the model compression techniques. The visualizations support these findings by showing distinct clustering patterns for different models, highlighting their unique processing capabilities and potential areas of overfitting.
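The projection step behind these plots can be sketched as follows, assuming the first layer's hidden states are reduced with scikit-learn's t-SNE (n_components = 3 for the 3D view); the checkpoint and sentences are illustrative, not the project's exact code.

import torch
from sklearn.manifold import TSNE
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
model = AutoModel.from_pretrained("google-bert/bert-base-uncased")

sentences = ["example tweet one", "example tweet two", "example tweet three"]  # placeholders
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states[1] is the output of the first transformer layer; take the [CLS] vector per sentence.
first_layer_cls = outputs.hidden_states[1][:, 0, :].numpy()

# 3D t-SNE as in the visualizations (perplexity must be smaller than the number of samples).
coords = TSNE(n_components=3, perplexity=2, random_state=42).fit_transform(first_layer_cls)
print(coords.shape)   # (3, 3)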
Table 10 provides more details on the models' results after just one epoch.
As can be seen from Table 10, DeBERTa has a relatively low error on the validation set, which is consistent with its high validation accuracy. Longformer demonstrates excellent error minimization capabilities, corroborating its high validation accuracy. BigBird demonstrates a good performance. HateBERT has the lowest validation loss at 0.049245, which aligns with its high validation accuracy. MobileBERT has some issues that require additional evaluation due to its noticeable gap, indicating a high risk of overfitting. DistilBERT and BERT stably showcase very strong performances, with very low gaps and low risks of overfitting. RoBERTa performs slightly worse than DistilBERT but can still be considered robust, though it has a low gap, indicating a high risk of overfitting. ELECTRA and XLNet demonstrate consistent performances with low risks of overfitting. This analysis helps in understanding the strengths and weaknesses of each model, providing insights into their applicability based on different performance metrics.
3.4. Data Augmentation and Word Cloud
After conducting the original multiclass detection utilizing the complete cyberbullying dataset, it was decided to train on each type of cyberbullying separately in a binary manner, depending on its presence. To make the study more unique and obtain better accuracy, we also trained the models on the two previously mentioned "bad words" datasets [40,41] at the same time. The high-level pseudocode is presented below.
As can be seen from Algorithm 2, the actual sentence transformer model can be plugged in for the ChosenTokenizer and ChosenSequenceClassification. After the initial trials, it was decided that it might be beneficial to understand the embeddings better. Algorithm 3 below provides more details.
Algorithm 3. Analyzing Embeddings using ChosenSentenceTransformer and Bad Words
Input: File paths of the cyberbullying dataset and bad words dataset
Output: Trained model, t-SNE visualization, and saved model
1. Load bad words from specified file paths.
2. Create SentenceDataset class to handle data encoding and bad words features.
3. Load and preprocess the main data and bad words data from file paths.
4. Use ChosenTokenizer to tokenize the combined data.
5. Initialize DataLoader with the tokenized dataset.
6. Define ChosenModelWithBadWords model class that incorporates bad words features.
7. Initialize model with pretrained ChosenSequenceClassification.
8. Move the model to the appropriate device (GPU/CPU).
9. Use preferred optimizer and CrossEntropyLoss criterion.
10. For each epoch:
11.   Iterate through the DataLoader batches.
12.   Zero gradients.
13.   Forward pass the input data through the model.
14.   Compute the loss.
15.   Backward pass and optimize the model parameters.
16.   Calculate training loss and accuracy.
17.   Append epoch loss and accuracy to respective lists.
18. Plot training loss and accuracy using Matplotlib.
19. Define a function for t-SNE visualization of sentence embeddings.
20. Collect and visualize embeddings using t-SNE.
21. Save the model. Define plot_metrics() to visualize training and validation metrics.
22. Call plot_metrics() to generate and display visualizations.
23. Save visualizations to the drive.
24. Save the trained model using trainer.save_model().
25. Make inferences using the trained model and print the predictions.
The results of this algorithm can be seen in Figure 9.
Figure 9a shows distinct clusters because the model has an additional feature (bad words) that helps it classify sentences very accurately. This additional feature provides a clearer separation between the different classes in the t-SNE plot and helps the model achieve an accuracy of 0.9994 and a loss of 0.0033 at epoch 8 of 10. This part of the methodology demonstrated the suitability of the DistilBERT model for detecting biases and cyberbullying, marking it as the top model among the pipeline transformers.
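A minimal sketch of the idea behind the ChosenModelWithBadWords component in Algorithm 3, concatenating the transformer's pooled representation with a scalar bad-words feature before classification, is given below; the class and variable names are illustrative rather than the project's actual code.

import torch
import torch.nn as nn
from transformers import AutoModel

class TransformerWithBadWords(nn.Module):
    """Illustrative classifier combining a transformer encoder with a bad-words count feature."""

    def __init__(self, checkpoint="distilbert-base-uncased", num_labels=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(checkpoint)
        hidden = self.encoder.config.hidden_size
        # +1 for the scalar bad-words feature appended to the pooled output.
        self.classifier = nn.Linear(hidden + 1, num_labels)

    def forward(self, input_ids, attention_mask, bad_word_counts):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0, :]               # [CLS]-style pooled token
        features = torch.cat([pooled, bad_word_counts.unsqueeze(1)], dim=1)
        return self.classifier(features)

# Training then follows the usual loop with CrossEntropyLoss, as outlined in Algorithm 3.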
After generating several word clouds containing extreme words, it was decided to develop an algorithm to avoid displaying them directly in a word cloud, a common representation of sentiment analysis results and other Natural Language Processing (NLP) practices [48], as such content might be small and missed during review.
The results of this algorithm applied to the age cyberbullying data can be seen in Figure 10 below:
As can be seen from Figure 10, extreme word tokens were censored and colored red while keeping their size according to their count/frequency. Due to the static nature of the algorithm, the extreme words are currently hardcoded. While the creation of Algorithm 4 became necessary due to the number of extreme words encountered in the cyberbullying dataset, the idea was further expanded into the cyberbullying app and could be applied to various domains. Interestingly, both the authentic age cyberbullying dataset and the synthetic dataset had many bad words.
Algorithm 4. Data Augmentation for Word Cloud
Input: List of words. // call in chunks or all at once
Output: Augmented list, suitable for word clouds with extreme words censored
1. Initialize a set of extreme_words. // can be expanded manually
2. Function censor_extreme_words(text):
3.   Initialize censored_word_count = 0
4.   Initialize unique_id = 1
5.   Initialize word_map = {}
6.   Define regex pattern to match extreme words and their variations.
7.   Function censor(match):
8.     Increment censored_word_count by 1
9.     Create placeholder = 'CENSORED' + unique_id
10.     Map placeholder to match.group(0) in word_map
11.     Increment unique_id by 1
12.     Return placeholder.
13.   Apply regex pattern to replace extreme words in text using the censor function.
14.   Return censored_text, censored_word_count, word_map
15. Initialize example_texts with example sentences.
16. Combine all example texts into combined_text.
17. Call censor_extreme_words(combined_text) to get censored_text, censored_word_count, word_map.
18. Print censored_text, censored_word_count, word_map.
19. Split censored_text into words.
20. Create good_words excluding placeholders.
21. Calculate word frequencies using Counter.
22. Create a WordCloud object with word frequencies.
23. Function color_censored_words(word, font_size, position, orientation, random_state = None, **kwargs):
24.   If word starts with 'CENSORED':
25.     Return 'red'
26.   Else:
27.     Define range of preferred colors.
28.     Return random choice from colors.
29. Recolor the word cloud using the color_censored_words function.
30. Display the word cloud.
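A compact sketch of the censoring and recoloring idea from Algorithm 4, using the wordcloud library and a placeholder list of extreme words, might look as follows; the placeholder terms and color choices are illustrative.

import re
import random
from collections import Counter
from wordcloud import WordCloud

extreme_words = {"badword1", "badword2"}   # placeholder list; the study hardcodes the real terms

def censor_extreme_words(text):
    word_map, counter = {}, [0]
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, extreme_words)) + r")\w*\b", re.IGNORECASE)

    def censor(match):
        counter[0] += 1
        placeholder = f"CENSORED{counter[0]}"
        word_map[placeholder] = match.group(0)   # keep the original token for later analysis
        return placeholder

    return pattern.sub(censor, text), counter[0], word_map

def color_censored_words(word, **kwargs):
    # Censored placeholders are drawn in red; everything else gets a neutral color.
    return "red" if word.startswith("CENSORED") else random.choice(["steelblue", "darkgreen", "gray"])

censored_text, n_censored, mapping = censor_extreme_words("badword1 you are such a badword2 person")
frequencies = Counter(censored_text.split())
wc = WordCloud(width=600, height=400).generate_from_frequencies(frequencies)
wc.recolor(color_func=color_censored_words)
wc.to_file("censored_wordcloud.png")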
3.5. Bias Detection Tokens and Mitigation
We employed several pretrained AI models from Hugging Face, such as MiniLM, Mistral, and Dbias, to detect biases and cyberbullying in the Twitter dataset [47,49,50,51]. The focus of this study was on identifying and mitigating bias in AI language models through token-level analysis. Initially, the BERT transformer model was utilized for bias detection, tokenization, and visualizations, and the BertTokenizer from the Hugging Face transformers library was employed for tokenizing the input texts. Eventually, a token bias detection system was developed that demonstrated a difference in the frequency of biased tokens when analyzing examples more likely to contain biased language. It is important to note that token attention scores are not a direct representation of bias but serve as indicators of potential biased language. The system could distinguish differences in biased token frequency when analyzing likely biased examples, although further refinement of the character count scaling algorithm is necessary to enhance the system's accuracy and robustness.
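The token-level analysis can be approximated as follows: tokenize a sentence with BertTokenizer, run BERT with attention outputs enabled, and average the attention each token receives as a rough indicator score. This is a sketch of the general idea only, not the project's exact character count scaling algorithm, and the example sentence is illustrative.

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

text = "Older employees just cannot keep up with new technology."   # illustrative sentence
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# Average attention received by each token across the last layer's heads and query positions.
last_layer_attention = outputs.attentions[-1][0]          # (heads, seq_len, seq_len)
received = last_layer_attention.mean(dim=0).mean(dim=0)   # (seq_len,)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, score in zip(tokens, received):
    print(f"{token:>12s}  {score.item():.3f}")   # higher scores flag tokens for closer inspection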
Bias probabilities were analyzed for both neutral and biased contexts using a sentiment pipeline available on Hugging Face with the DistilBERT model fine-tuned for the SST-2 sentiment classification task. We also developed a comprehensive visualization of bias probabilities, showing the distribution and comparison between neutral and biased contexts (Figure 11).
A heatmap of the bias probabilities (Figure 11a) shows that words like "Demanding", "Driven", "Headstrong", and "Intuitive" exhibit high bias probabilities in biased contexts, while the bias probability is significantly lower in neutral contexts. The box plot of bias probabilities (Figure 11b) shows that the median bias probability for biased contexts is significantly higher than for neutral contexts, with larger variability in biased contexts. The distribution of bias probabilities (Figure 11c) demonstrates a high frequency of bias probabilities close to 1 in biased contexts, indicating that many words/phrases are perceived as highly biased. The bias probabilities for neutral and biased contexts (Figure 11d) show that bias probabilities are generally higher for biased contexts than for neutral contexts. A comparison of the bias probabilities (Figure 11e) revealed that most points clustered towards the top right, indicating that words/phrases with high bias probabilities in biased contexts also tended to have higher probabilities in neutral contexts.
A detailed analysis of the top 20 biased tokens in the age cyberbullying data was conducted to identify commonly biased words and phrases. Figure 12 represents the results.
The analysis revealed that words like "school", "high", "bullied", "bully", "girls", and "like" are among the most frequently occurring biased tokens in age cyberbullying contexts. Splitting the data into clusters created by the model 'MiniLM-L6-v1', a sentence transformer optimized for generating embeddings for sentences or paragraphs, provided further insights. MiniLM is a smaller, faster variant of the BERT-like transformer family [52], designed to offer a similar performance to larger models like BERT but with a fraction of the parameters, making it faster and more efficient.
To introduce more diversity and variability in the data, the following data augmentation techniques were applied: synonym replacement (replaces random words in a sentence with their synonyms, introducing variation without changing the overall meaning) and random insertion (inserts synonyms of random words into random positions in the sentence, increasing length and complexity). The framework developed in this study integrates bias detection using the DistilBERT model for initial bias analysis, followed by multilabel classification for both biases and cyberbullying labels using various models. This approach ensures comprehensive analysis and detection of biases and cyberbullying in diverse datasets. The efficiency and effectiveness of these models in detecting biases and cyberbullying highlight the potential for AI to contribute to creating safer and more inclusive online environments.
Data augmentation helps mitigate bias by introducing more diversity and variability into the training data. By generating multiple variations of each sentence, the model is exposed to a wider range of linguistic patterns and contexts. This can help reduce overfitting and make the model more robust to different expressions of the same underlying concepts. Algorithm 5 illustrates our approach to applying data augmentation techniques to cyberbullying detection, detailing the steps used to enhance sentence variation and improve model robustness.
Algorithm 5. Data Augmentation for Cyberbullying Detection
Input: Sentences, labels, number of augmentations (num_augments)
Output: Augmented sentences and labels
1. Define get_synonyms(word):
2.   Initialize an empty set 'synonyms'
3.   For each synset in wordnet.synsets(word):
4.     For each lemma in synset.lemmas():
5.       Add lemma.name() to 'synonyms' (replace '_' with ' ')
6.   If word is in 'synonyms', remove it
7.   Return list of 'synonyms'
8. Define synonym_replacement(sentence, n):
9.   Split 'sentence' into 'words'
10.   Copy 'words' to 'new_words'
11.   Create a list 'random_word_list' of unique words that have synonyms
12.   Shuffle 'random_word_list'
13.   Set 'num_replacements' to the minimum of 'n' and the length of 'random_word_list'
14.   For each 'random_word' in the first 'num_replacements' words of 'random_word_list':
15.     Get 'synonyms' for 'random_word'
16.     If 'synonyms' exist, randomly choose a 'synonym'
17.     Replace 'random_word' in 'new_words' with 'synonym'
18.   Join 'new_words' into a string and return it
19. Define random_insertion(sentence, n):
20.   Split 'sentence' into 'words'
21.   Copy 'words' to 'new_words'
22.   For each _ in range(n):
23.     Randomly choose 'new_word' from 'words'
24.     Get 'synonyms' for 'new_word'
25.     If 'synonyms' exist, randomly choose a 'synonym'
26.     Randomly choose 'insert_position' in 'new_words'
27.     Insert 'synonym' at 'insert_position' in 'new_words'
28.   Join 'new_words' into a string and return it
29. Define augment_data(sentences, labels, num_augments):
30.   Initialize empty lists 'augmented_sentences' and 'augmented_labels'
31.   For each 'sentence', 'label' in zip(sentences, labels):
32.     Append 'sentence' to 'augmented_sentences'
33.     Append 'label' to 'augmented_labels'
34.     For each _ in range(num_augments):
35.       If random.random() < 0.5:
36.         Perform synonym_replacement on 'sentence' and append to 'augmented_sentences'
37.       Else:
38.         Perform random_insertion on 'sentence' and append to 'augmented_sentences'
39.       Append 'label' to 'augmented_labels'
40.   Return 'augmented_sentences' and 'augmented_labels'
41. Load sentences and labels from file paths.
42. Augment the data using augment_data(sentences, labels, num_augments).
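A runnable sketch of the synonym replacement step from Algorithm 5, using NLTK's WordNet, is shown below; random insertion follows the same pattern, and the example sentence is a placeholder.

import random
import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

def get_synonyms(word):
    # Collect WordNet lemma names for every sense of the word, excluding the word itself.
    synonyms = {lemma.name().replace("_", " ")
                for synset in wordnet.synsets(word)
                for lemma in synset.lemmas()}
    synonyms.discard(word)
    return list(synonyms)

def synonym_replacement(sentence, n=2):
    words = sentence.split()
    new_words = words.copy()
    candidates = [w for w in set(words) if get_synonyms(w)]
    random.shuffle(candidates)
    for word in candidates[:n]:
        synonym = random.choice(get_synonyms(word))
        new_words = [synonym if w == word else w for w in new_words]
    return " ".join(new_words)

# Illustrative usage on a placeholder sentence.
print(synonym_replacement("old people are slow with technology", n=2))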
Figure 13a represents how the embeddings based on the augmented sentences led to a more complex and intertwined structure. The data augmentation techniques introduced more variability, making the clusters in the t-SNE plot less distinct but potentially capturing more nuanced relationships between sentences. The training accuracy and loss plots demonstrate that as the epochs progress, the model’s accuracy steadily increases while the loss decreases, indicating effective learning and convergence towards optimal performance. This trend suggests that the model is becoming more accurate in its predictions over time, and the loss function is being minimized effectively.
The data augmentation techniques introduce more variability and diversity in the training data, which helps the model to generalize better and reduces the likelihood of overfitting to specific patterns in the original data, thereby mitigating bias. The resulting t-SNE plot from the second script shows a more complex structure, indicating that the model is capturing a wider range of linguistic variations. In comparison with Figure 9a, the model's understanding of the data has evolved, potentially leading to improved classification performance.
3.6. Applying Optimization and Quantization Techniques to Authentic Cyberbullying Data
The trial results for the various methods of optimization can be seen below. This analysis highlights the importance of choosing appropriate pre-processing techniques and understanding their impact on the training process to achieve optimal model performance.
Figure 14 provides insights into how weights, biases, accuracies, and losses change over epochs during the training process.
Each colored line in the Weights vs. Epoch part of the graphs in Figure 14 represents the evolution of a specific feature's weight over time during training. Since the model learns a separate weight for each feature in the data, the different colors are used to distinguish these weights from one another.
As can be seen above, various optimization strategies were applied, with a focus on stochastic gradient descent (SGD) using different pre-processing techniques. Each figure corresponds to a different configuration of the training process: SGD with TF-IDF, SGD with TF-IDF + scaling, and SGD with TF-IDF + interaction terms. According to Figure 14a, the weights gradually stabilized over epochs, indicating convergence of the model parameters. The bias reduces quickly and stabilizes, showing that the model adjusts quickly and becomes consistent. The accuracy stabilizes after some initial fluctuation, indicating the model's learning progress. The loss decreases and stabilizes, confirming that the model is learning and minimizing errors. Figure 14b (SGD with TF-IDF + scaling) shows a configuration that uses SGD optimization with TF-IDF for feature extraction, followed by scaling. The weights fluctuate significantly, suggesting instability in the training process due to scaling. The bias shows high variability, indicating that the model is struggling to find a stable solution. The accuracy remains at zero throughout, indicating that the model is not learning effectively with this configuration. The loss remains high and variable, showing that the training process is not effective. Figure 14c (SGD with TF-IDF + interaction terms) uses SGD optimization with TF-IDF for feature extraction, including interaction terms. The weights converge, showing that the model parameters stabilize. The bias shows an initial decrease but then increases slightly, suggesting that the interaction terms introduce complexity. The accuracy improves and stabilizes, indicating effective learning with interaction terms. The loss decreases initially but shows a slight fluctuation, indicating that the model's error minimization is impacted by the added complexity of the interaction terms.
Among the configurations tested, SGD with TF-IDF in Figure 14a shows the most stable and effective results, with the weights and bias stabilizing, the accuracy improving, and the loss decreasing. The addition of scaling in Figure 14b introduces instability, while the inclusion of the interaction terms in Figure 14c adds complexity that slightly affects the stability.
Dynamic quantization helps reduce model size and improve inference speed without significant changes to the model architecture or training process. It was tried on a DistilBERT model. This method quantizes the weights of the model at runtime, typically reducing the memory footprint and computational cost without requiring extensive changes to the training process. Some preparation steps for quantization-aware training (QAT) were also tried. For testing purposes, the models were trained using fake quantization modules that simulate the effects of quantization during the training process. This approach helped the model better adapt to the eventual quantized state. Moving forward, we will try static quantization, which involves calibrating the model using a representative dataset to determine the appropriate scaling factors for activations and weights. This is expected to improve performance by quantizing both the weights and activations statically. These trials are currently in progress and will be published in later papers once the investigation is complete.
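A minimal sketch of dynamic quantization with PyTorch, applied to a DistilBERT classifier, is shown below; only the linear layers are converted to int8, and the size comparison is purely illustrative.

import os
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)

# Quantize only the nn.Linear modules to int8; activations stay in float and are
# quantized dynamically at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

def model_size_mb(m):
    torch.save(m.state_dict(), "tmp.pt")
    size = os.path.getsize("tmp.pt") / 1e6
    os.remove("tmp.pt")
    return size

print(f"fp32: {model_size_mb(model):.1f} MB, int8: {model_size_mb(quantized_model):.1f} MB")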
3.7. Preliminary Results of Multilabel Classification
We are working on multilabel natural language processing of the same dataset, where the data are first labeled as biased vs. not biased, and then both biases and cyberbullying are detected at the same time.
A concise version of the algorithm can be seen below (Algorithm 6):
Algorithm 6. Multilabel Classification for Cyberbullying Detection
Input: Text files containing cyberbullying and non-cyberbullying data
Output: Trained models, evaluation results, and visualizations
1. Define read_text_file(filepath, cyber_label) and read text files.
2. Assign a cyber_label to each sentence/line of text.
3. Define load_lexicon(filepaths) and load biased lexicons from a file and store them in a set.
4. Load the cyberbullying and non-cyberbullying datasets.
5. Use read_text_file() to read and label both datasets.
6. Combine the datasets into one DataFrame.
7. Use train_test_split() to split the dataset into training and testing sets (80% vs. 20%).
8. Define simple_bias_detection(text).
9. Tokenize the text and calculate the proportion of biased words from the lexicon.
10. Apply simple_bias_detection().
11. Add a simple_bias score column to the DataFrame.
12. Convert bias scores to binary labels.
13. Create a bias_label column based on a threshold applied to the simple_bias score.
14. Combine the cyberbullying and bias labels.
15. Create a multilabel labels column by combining cyber_label and bias_label.
16. Convert the DataFrame to a Hugging Face Dataset.
17. Use Dataset.from_pandas() to convert the DataFrame into Hugging Face-compatible datasets.
18. Define tokenize_function(examples).
19. Tokenize the text data using a Hugging Face tokenizer with padding and truncation.
20. Define the MultiLabelClassificationModel class.
21. Create a custom multi-label classification model using a pretrained Transformer (e.g., BERT).
22. Define train_and_evaluate_model(model_name, token).
23. Load the dataset and tokenizer for the given model and apply tokenization.
24. Initialize and train the MultiLabelClassificationModel.
25. Evaluate the trained model and return the evaluation results.
26. Initialize the Hugging Face Trainer.
27. Configure training arguments (e.g., learning rate, epochs) and initialize the Trainer with the model, datasets, and evaluation metrics.
28. Train and evaluate the models.
29. Iterate through a predefined list of Transformer models and call train_and_evaluate_model() for each.
30. Plot the results using Plotly.
31. Visualize model evaluation metrics (e.g., accuracy, F1 score) across the different models.
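A condensed sketch of the multilabel head described in Algorithm 6, with one logit for the cyberbullying label and one for the bias label trained with binary cross-entropy, is shown below; the class name, checkpoint, and example inputs are illustrative, not the project's actual implementation.

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class MultiLabelClassifier(nn.Module):
    """Illustrative multilabel head: one logit for cyberbullying, one for bias."""

    def __init__(self, checkpoint="bert-base-uncased", num_labels=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(checkpoint)
        self.head = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        return self.head(hidden.last_hidden_state[:, 0, :])   # [CLS] representation

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = MultiLabelClassifier()
criterion = nn.BCEWithLogitsLoss()          # independent sigmoid per label

batch = tokenizer(["placeholder tweet"], return_tensors="pt", padding=True, truncation=True)
labels = torch.tensor([[1.0, 1.0]])         # [cyber_label, bias_label]
loss = criterion(model(batch["input_ids"], batch["attention_mask"]), labels)
loss.backward()
print(loss.item())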
The preliminary results of the multilabel classification can be seen in Figure 15.
3.8. Swarm of AI Agents and the Apps
As was explored at the beginning of this paper, large language models (LLMs) can generate biased data on demand. While it is possible to generate these data manually, API calls can also be used. We utilized the OpenAI Assistants API [54] and the ChatGPT-4o LLM to create our system. The AI assistant biased data generator was created using the API and utilized in the study. The code is simple and straightforward; see Figure 16.
We developed several possible prototypes of the cyberbullying detector application. One of them can be seen in Figure 17. Additional information on it, including a YouTube video and a URL, can be found in the Supplementary Materials.
In this project, we explored the phenomenon of a swarm of agents, which became possible due to the introduction of multiple threads in version 2 of the OpenAI Assistants API. Technically, this idea is becoming increasingly real (robots will build robots to build robots, and so on). At this point, we consider three possible situations: agents performing tasks in parallel, applying the divide-and-conquer concept to split data, tasks, or both, and splitting across modalities.
Figure 18 presents an overview of these three test cases. The idea of a swarm of agents is also known as a mixture of experts. Due to the multithreaded nature of the Assistants API v2, our AI agent already runs on a thread, and creating several such agents can be as simple as applying a loop. Therefore, we can utilize five or more agents (five were tested) to generate biased and cyberbullying data simultaneously.
Figure 19 represents another app prototype developed for the project.
4. Results
Addressing RQ1: Leveraging state-of-the-art LLMs such as ChatGPT-4o, Pi AI, Claude 3 Opus, and Gemini-1.5, this study generated a diverse synthetic cyberbullying dataset. Authentic data from Twitter served as a foundation for training and validating various AI models. Both datasets were meticulously cleaned to remove noise and irrelevant information. Tokenization of the text data enhanced the efficiency and accuracy of the transformer models in processing large volumes of text. Models including DeBERTa, Longformer, BigBird, HateBERT, MobileBERT, DistilBERT, BERT, RoBERTa, ELECTRA, and XLNet were pre-trained and fine-tuned for bias and cyberbullying detection tasks. Context-aware detection mechanisms were implemented, enabling the models to better understand and identify nuanced forms of cyberbullying and biases, thus reducing false positives and negatives. The innovative intersectional analysis approach helped to detect complex cases where different biases overlap with cyberbullying, providing a deeper understanding of these co-occurring phenomena.
Addressing RQ2: Transformers such as BERT, RoBERTa, and ELECTRA were pre-trained on extensive text corpora, providing a solid foundation for understanding natural language. These models were fine-tuned on both synthetic and authentic datasets to specialize in detecting biases and cyberbullying. The fine-tuning process adapted the models to the nuances of social media text and specific biases present in the data. Ethical guidelines were integrated into the model training process, involving regular evaluations to mitigate any detected biases in the models' outputs. The leading LLMs generated synthetic datasets, addressing limitations of authentic data such as scarcity and lack of diversity. Advanced NLP techniques allowed the models to understand the context and nuances in social media text that are crucial for accurately identifying biases and cyberbullying.
Addressing RQ3: Multilabel classification enabled the models to detect multiple labels simultaneously, increasing the efficiency and comprehensiveness of the detection process. Training the models to recognize and classify multiple labels improved their ability to handle complex and overlapping instances of bias and cyberbullying. An intersectional analysis allowed the models to identify nuanced forms of cyberbullying involving multiple biases. High performance metrics, including precision, recall, and F1-scores exceeding 90%, demonstrated the models’ effectiveness in detecting biases and cyberbullying. Rigorous cross-validation and ethical evaluations confirmed the reliability and generalizability of the models. The deployment in real-world applications with continuous monitoring and user feedback integration ensured practical relevance and impact.
5. Discussion
This study distinguishes itself by integrating bias and cyberbullying detection with data generation, leveraging advanced transformer models and leading LLMs into a cohesive research project and ultimately into a single application. A particularly innovative aspect of this work lies in the application of cybersecurity techniques, such as jailbreaking, to generate synthetic datasets. The importance of testing biases within these synthetic datasets cannot be overstated, especially as newer models like the anticipated ChatGPT-5 are rumored to be increasingly trained on synthetic data. As generative AI becomes more prevalent in creating a wide range of content, including text, images, videos, and music, it is inevitable that future models will, at least partially, rely on synthetic datasets—even if that was not the original intention of their creators.
Building on previous successes in jailbreaking [45,55,56], we successfully compelled the LLMs under investigation to produce extreme language, which was ethically excluded from the final study. Historically, models have been tricked into performing tasks by framing them in imaginary contexts, such as a movie setting or an environment where they cannot lie or withhold information. However, this study revealed that even a straightforward scenario—like generating cyberbullying content for the positive goal of improving bias and cyberbullying detection systems—could effectively bypass model filters. An intriguing finding was that when models, including ChatGPT-4, were asked to assist in refining code for a censored word cloud, they generated a list of hardcoded "bad words" that significantly exceeded our expectations, indicating that the models had effectively overridden their built-in filters. This underscores the need for continuous validation of not only leading LLMs but also emerging Small Language Models (SLMs) like ChatGPT-4o mini [48]. The results of this study highlight the critical importance of ongoing scrutiny and ethical considerations in the development and deployment of AI models, particularly as they become more integrated into the fabric of content creation and cybersecurity.
48]. The results of this study highlight the critical importance of ongoing scrutiny and ethical considerations in the development and deployment of AI models, particularly as they become more integrated into the fabric of content creation and cybersecurity.