Bias Mitigation via Synthetic Data Generation: A Review
Abstract
1. Introduction
Motivation
2. Search Methodology
2.1. Inclusion Criteria
- Research articles focused on bias and bias handling in AI models using synthetic data generation were included.
- Research articles published between 2020 and 2024 were included to review the relevant, latest techniques.
- Research studies that were published in conferences, journals, book chapters, or proceedings were included.
- Only research articles written in English were selected.
2.2. Exclusion Criteria
- Research articles that did not align with our research questions were excluded.
- Non-peer-reviewed articles were not considered.
3. Bias Mitigation via Synthetic Data Generation
Detailed Analysis of the Reviewed Articles
4. Discussion
- Generative Adversarial Network (GAN): GANs use two neural networks, a generator and a discriminator, that compete to produce realistic synthetic data. The generator creates data, while the discriminator evaluates it against real data, iteratively improving the generator’s output. For example, SynSigGAN uses an LSTM generator and a CNN discriminator on biomedical signal datasets to produce high-quality synthetic time-series signals such as ECG, EEG, and EMG [18]. (A minimal sketch of the adversarial setup is given after this list.)
- Bayesian network: Bayesian networks use probabilistic graphical models to simulate data with controlled biases. They can model complex dependencies between variables, providing a framework for generating synthetic datasets. For example, the BayesBoost method employs Bayesian networks to generate synthetic datasets that address under-represented group samples in healthcare data, showing improvements in AUC and ROC curve metrics [17]. (A sampling sketch is given after this list.)
- Structural Causal Models (SCM): SCMs focus on understanding causal relationships within data. They remove biased edges from the causal graph to generate fair synthetic data, ensuring that the generated data adhere to fairness criteria such as demographic parity and equal opportunity. For example, DECAF is a framework that uses SCMs to create fair synthetic data, maintain high data utility, and achieve fairness through causally aware generative networks [16]. (An edge-removal sketch is given after this list.)
- Synthetic Minority Over-sampling Technique (SMOTE): SMOTE is an oversampling technique designed to address class imbalance in datasets. It is a data augmentation approach that creates additional data points by interpolating between a minority-class sample and one of its k nearest neighbors. This makes the data more balanced, enhancing the minority class’s representation and reducing bias in classification tasks. For example, if a dataset has 100 instances of class A and 10 instances of class B, SMOTE would generate additional synthetic instances of class B to balance the dataset [28]. (A usage sketch is given after this list.)
- Gaussian Copulas: Copula-based techniques generate synthetic data by modeling the dependency structure between variables separately from their marginal distributions. Gaussian copulas map the variables to a multivariate uniform distribution to capture the dependence between them, representing the relationships among features and mitigating bias in how they are reproduced. This technique is especially helpful for producing high-dimensional data with complex dependencies, such as medical records with multiple variables like age, blood pressure, and cholesterol levels [24]. (A sampling sketch is given after this list.)
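The adversarial setup can be illustrated with a minimal sketch in Python, assuming PyTorch and a toy two-dimensional dataset; plain MLPs stand in here for SynSigGAN’s LSTM generator and CNN discriminator [18] to keep the example self-contained.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
real_data = torch.randn(512, 2) * 0.5 + 2.0   # stand-in for a real dataset
noise_dim, data_dim = 8, 2

generator = nn.Sequential(nn.Linear(noise_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
discriminator = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1))
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(2000):
    # Discriminator step: score real samples as 1, generated samples as 0.
    z = torch.randn(64, noise_dim)
    fake = generator(z).detach()                 # freeze the generator here
    real = real_data[torch.randint(0, len(real_data), (64,))]
    d_loss = bce(discriminator(real), torch.ones(64, 1)) + \
             bce(discriminator(fake), torch.zeros(64, 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator step: push generated samples toward the "real" label.
    z = torch.randn(64, noise_dim)
    g_loss = bce(discriminator(generator(z)), torch.ones(64, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()

synthetic = generator(torch.randn(1000, noise_dim))  # synthetic dataset
```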
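For the Bayesian network approach, the following sketch samples synthetic records from a hypothetical two-variable network (group → outcome) using the pgmpy library (class names vary slightly across pgmpy versions). Editing the prior of the protected variable is one crude way to over-represent an under-served group at generation time, loosely in the spirit of BayesBoost [17], whose actual detection-and-boosting procedure is considerably richer.

```python
from pgmpy.models import BayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.sampling import BayesianModelSampling

model = BayesianNetwork([("group", "outcome")])
# Prior edited from an observed 0.9/0.1 split to a balanced 0.5/0.5,
# so the under-represented group is boosted in the synthetic data.
cpd_group = TabularCPD("group", 2, [[0.5], [0.5]])
cpd_outcome = TabularCPD("outcome", 2,
                         [[0.7, 0.4],   # P(outcome=0 | group=0), P(outcome=0 | group=1)
                          [0.3, 0.6]],  # P(outcome=1 | group=0), P(outcome=1 | group=1)
                         evidence=["group"], evidence_card=[2])
model.add_cpds(cpd_group, cpd_outcome)
model.check_model()

synthetic = BayesianModelSampling(model).forward_sample(size=1000)
print(synthetic["group"].value_counts())  # roughly balanced groups
```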
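The biased-edge-removal idea behind DECAF [16] can be sketched with a hypothetical three-variable linear SCM in NumPy: synthetic data are regenerated from the structural equations with the direct protected-attribute edge zeroed out. DECAF itself learns the structural functions with causally aware GANs rather than fixed linear equations.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
A = rng.binomial(1, 0.5, n)            # protected attribute
X = 1.5 * A + rng.normal(0, 1, n)      # mediator caused by A

def structural_y(direct_effect):
    # Structural equation for Y; direct_effect is the weight of the A -> Y edge.
    return 2.0 * X + direct_effect * A + rng.normal(0, 1, n)

Y_biased = structural_y(direct_effect=3.0)  # original causal graph
Y_fair = structural_y(direct_effect=0.0)    # biased edge A -> Y removed

# The outcome gap between groups shrinks once the direct edge is removed
# (the remaining gap flows only through the mediator X).
print(Y_biased[A == 1].mean() - Y_biased[A == 0].mean())
print(Y_fair[A == 1].mean() - Y_fair[A == 0].mean())
```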
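SMOTE usage can be sketched on the 100-versus-10 example above, assuming the imbalanced-learn library; each synthetic point is an interpolation x_new = x_i + λ(x_nn − x_i) between a minority sample x_i and one of its k nearest minority neighbors x_nn, with λ drawn uniformly from [0, 1] [28].

```python
import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),   # class A: 100 instances
               rng.normal(3, 1, (10, 2))])   # class B: 10 instances
y = np.array([0] * 100 + [1] * 10)

# Interpolate new class-B points between existing ones and their neighbors.
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(np.bincount(y), "->", np.bincount(y_res))  # [100 10] -> [100 100]
```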
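Finally, a minimal NumPy/SciPy sketch of Gaussian-copula sampling under simplifying assumptions (purely empirical marginals and a single global correlation structure), with two hypothetical medical variables; production tools such as the SDV library’s Gaussian copula synthesizer handle mixed types and many more variables.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Toy "medical" table: age and a blood-pressure-like column correlated with it.
age = rng.normal(50, 12, 1000)
bp = 90 + 0.6 * age + rng.normal(0, 8, 1000)
data = np.column_stack([age, bp])

# 1. Map each column to uniforms via its empirical CDF, then to standard normals.
u = (stats.rankdata(data, axis=0) - 0.5) / len(data)
z = stats.norm.ppf(u)

# 2. Fit the copula correlation and draw new latent normal samples.
corr = np.corrcoef(z, rowvar=False)
z_new = rng.multivariate_normal(np.zeros(2), corr, size=1000)

# 3. Map back through each column's empirical quantile function.
u_new = stats.norm.cdf(z_new)
synthetic = np.column_stack(
    [np.quantile(data[:, j], u_new[:, j]) for j in range(2)])
```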
- Techniques like GANs, CTGAN, and adversarial training require significant computational resources and high-quality initial datasets, which can be a barrier to their widespread adoption and implementation [18].
- In Bayesian networks, the generation and effectiveness of synthetic data heavily depend on the quality of the original datasets. Poor-quality or biased input data can lead to synthetic data that are also biased [17].
- Structural Causal Models require a deep understanding of the causal relationships between variables, and expert knowledge, to be implemented effectively, and their results are specific to the fairness definition used. This complexity can lead to significant changes in the inferred causal relationships; the method also focuses on tabular, structured data [16].
- Techniques like SMOTE can produce overlapping samples when interpolating between instances, and SMOTE targets only tabular, structured data. In high-dimensional datasets, the interpolation process can generate synthetic samples that poorly represent the minority class, which may itself introduce bias [17].
- Copula-based methods, particularly Gaussian copulas, assume that the dependency structure between variables captured by the copula remains consistent across the dataset. These methods often require a large amount of data to model the dataset accurately without bias. They also require a good understanding of the relationships between variables; misspecified dependencies make it difficult to generate unbiased synthetic data [24].
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Tavares, S.; Ferrara, E. Fairness and Bias in Artificial Intelligence: A Brief Survey of Sources, Impacts, and Mitigation Strategies. Sci 2024, 6, 3.
- Jain, A.; Brooks, J.R.; Alford, C.C.; Chang, C.S.; Mueller, N.M.; Umscheid, C.A.; Bierman, A.S. Awareness of Racial and Ethnic Bias and Potential Solutions to Address Bias with Use of Health Care Algorithms. JAMA Health Forum 2023, 4, e231197.
- Babic, B.; Gerke, S.; Evgeniou, T.; Glenn Cohen, I. Algorithms on Regulatory Lockdown in Medicine. Science 2019, 366, 1202–1204.
- Kiyasseh, D.; Laca, J.; Haque, T.F.; Miles, B.J.; Wagner, C.; Donoho, D.A.; Anandkumar, A.; Hung, A.J. A Multi-Institutional Study Using Artificial Intelligence to Provide Reliable and Fair Feedback to Surgeons. Commun. Med. 2023, 3, 42.
- Mandal, A.; Leavy, S.; Little, S. Dataset Diversity: Measuring and Mitigating Geographical Bias in Image Search and Retrieval. In Proceedings of the 1st International Workshop on Trustworthy AI for Multimedia Computing, Co-Located with ACM MM 2021, Virtual, 20–24 October 2021; pp. 19–25.
- Kordzadeh, N.; Ghasemaghaei, M.; Mikalef, P.; Popovic, A.; Lundström, J.E.; Conboy, K. Algorithmic Bias: Review, Synthesis, and Future Research Directions. Eur. J. Inf. Syst. 2022, 31, 388–409.
- Suresh, H.; Guttag, J. A Framework for Understanding Sources of Harm throughout the Machine Learning Life Cycle. In Proceedings of the 1st ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization, Virtual, 5–9 October 2021.
- Naresh Mandhala, V.; Bhattacharyya, D.; Midhunchakkaravarthy, D.; Kim, H.-J. Detecting and Mitigating Bias in Data Using Machine Learning with Pre-Training Metrics. Ingénierie Syst. d’Inf. 2022, 27, 119–125.
- Raghunathan, T.E. Synthetic Data. Annu. Rev. Stat. Appl. 2021, 8, 129–140.
- Kandpal, N.; Deng, H.; Roberts, A.; Wallace, E.; Raffel, C. Large Language Models Struggle to Learn Long-Tail Knowledge. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 15696–15707.
- Draghi, B.; Wang, Z.; Myles, P.; Tucker, A. Identifying and Handling Data Bias within Primary Healthcare Data Using Synthetic Data Generators. Heliyon 2024, 10, e24164.
- Oblizanov, A.; Shevskaya, N.; Kazak, A.; Rudenko, M.; Dorofeeva, A. Evaluation Metrics Research for Explainable Artificial Intelligence Global Methods Using Synthetic Data. Appl. Syst. Innov. 2023, 6, 26.
- Bhanot, K.; Bennett, K.P.; Hendler, J.A.; Zaki, M.J.; Guyon, I.; Baldini, I. Synthetic Data Generation and Evaluation for Fairness. Doctoral Dissertation, Rensselaer Polytechnic Institute, Troy, NY, USA, 2023.
- Gujar, S.; Shah, T.; Honawale, D.; Bhosale, V.; Khan, F.; Verma, D.; Ranjan, R. GenEthos: A Synthetic Data Generation System with Bias Detection and Mitigation. In Proceedings of the International Conference on Computing, Communication, Security and Intelligent Systems, IC3SIS 2022, Kochi, India, 23–25 June 2022.
- Sharafutdinov, K.; Fritsch, S.J.; Iravani, M.; Ghalati, P.F.; Saffaran, S.; Bates, D.G.; Hardman, J.G.; Polzin, R.; Mayer, H.; Marx, G.; et al. Computational Simulation of Virtual Patients Reduces Dataset Bias and Improves Machine Learning-Based Detection of ARDS from Noisy Heterogeneous ICU Datasets. IEEE Open J. Eng. Med. Biol. 2023, 5, 611–620.
- Van Breugel, B.; Kyono, T.; Berrevoets, J.; van der Schaar, M. DECAF: Generating Fair Synthetic Data Using Causally-Aware Generative Networks. Adv. Neural Inf. Process. Syst. 2021, 34, 22221–22233.
- Draghi, B.; Wang, Z.; Myles, P.; Tucker, A. BayesBoost: Identifying and Handling Bias Using Synthetic Data Generators. In Proceedings of the Third International Workshop on Learning with Imbalanced Domains: Theory and Applications, Bilbao, Spain, 17 September 2021; Volume 154.
- Hazra, D.; Byun, Y.C. SynSigGAN: Generative Adversarial Networks for Synthetic Biomedical Signal Generation. Biology 2020, 9, 441.
- Paladugu, P.S.; Ong, J.; Nelson, N.; Kamran, S.A.; Waisberg, E.; Zaman, N.; Kumar, R.; Dias, R.D.; Lee, A.G.; Tavakkoli, A. Generative Adversarial Networks in Medicine: Important Considerations for This Emerging Innovation in Artificial Intelligence. Ann. Biomed. Eng. 2023, 51, 2130–2142.
- Celi, L.A.; Cellini, J.; Charpignon, M.-L.; Dee, E.C.; Dernoncourt, F.; Eber, R.; Mitchell, W.G.; Moukheiber, L.; Schirmer, J.; Situ, J.; et al. Sources of Bias in Artificial Intelligence That Perpetuate Healthcare Disparities—A Global Review. PLoS Digit. Health 2022, 1, e0000022.
- Fletcher, R.R.; Nakeshimana, A.; Olubeko, O. Addressing Fairness, Bias, and Appropriate Use of Artificial Intelligence and Machine Learning in Global Health. Front. Artif. Intell. 2021, 3, 561802.
- Yogarajan, V.; Dobbie, G.; Leitch, S.; Keegan, T.T.; Bensemann, J.; Witbrock, M.; Asrani, V.; Reith, D. Data and Model Bias in Artificial Intelligence for Healthcare Applications in New Zealand. Front. Comput. Sci. 2022, 4, 1070493.
- Yang, J.; Soltan, A.A.S.; Eyre, D.W.; Clifton, D.A. Algorithmic Fairness and Bias Mitigation for Clinical Machine Learning with Deep Reinforcement Learning. Nat. Mach. Intell. 2023, 5, 884–894.
- Rodriguez-Almeida, A.J.; Fabelo, H.; Ortega, S.; Deniz, A.; Balea-Fernandez, F.J.; Quevedo, E.; Soguero-Ruiz, C.; Wagner, A.M.; Callico, G.M. Synthetic Patient Data Generation and Evaluation in Disease Prediction Using Small and Imbalanced Datasets. IEEE J. Biomed. Health Inf. 2023, 27, 2670–2680.
- Libbi, C.A.; Trienes, J.; Trieschnigg, D.; Seifert, C. Generating Synthetic Training Data for Supervised De-Identification of Electronic Health Records. Future Internet 2021, 13, 136.
- Pettit, R.W.; Fullem, R.; Cheng, C.; Amos, C.I. Artificial Intelligence, Machine Learning, and Deep Learning for Clinical Outcome Prediction. Emerg. Top. Life Sci. 2021, 5, 729–745.
- Baumann, J.; Castelnovo, A.; Cosentini, A.; Crupi, R.; Inverardi, N.; Regoli, D. Bias On Demand: Investigating Bias with a Synthetic Data Generator. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (IJCAI-23), Demonstrations Track, Macao, China, 19–25 August 2023.
- Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-Sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357.
Stages | Number of Articles |
---|---|
Initial results (using query) | 83,700 |
After applying inclusion and exclusion criteria | 16,800 |
After quality assessment | 50 |
Based on the abstract and conclusion | 24 |
Full article reading (Finally selected articles) | 17 |
Category | Number of Articles |
---|---|
Academic Journal paper | 12 |
Conference paper | 4 |
Doctoral dissertation | 1 |
Refs. | Year | Methodology | Dataset | Modality | Evaluation Metric | Strengths | Limitations |
---|---|---|---|---|---|---|---|
[12] | 2023 | Analyses explainable artificial intelligence (XAI) methods and suggests metrics to determine the technical characteristics of the methods. Focuses on XAI methods such as LIME and SHAP, applied to machine learning models using synthetic data. | Synthetic datasets generated for evaluating XAI methods. | Tabular data | Metrics include faithfulness (correlation between feature importance and model predictions), monotonicity (correctness of features), and incompleteness (effect of noise in features). | Addresses the critical area of explainable AI, essential for user trust; provides insights into the evaluation of XAI methods; guides future research. | The imperfection of existing XAI methods undermines user trust; does not delve into specific datasets; lacks diverse data. |
[13] | 2023 | Focuses on synthetic data generation strategies for fairness assessment. Explores techniques for creating synthetic datasets that can be used to stress-test machine learning models and assess bias mitigation algorithms. | Healthcare data | Temporal and non-temporal datasets; subgroup-level analysis of protected attributes such as gender and race | Balanced accuracy and fairness scores to compare real and synthetic models, time-series metrics, and disparate impact metrics for representational bias and evaluating resemblance. | Auditing pipelines for robust evaluation under varying data; combines GANs, fairness metrics, and bias mitigation algorithms. | Focuses only on synthetic data generation and fairness; lack of datasets; complexity in adversarial auditing methods. |
[14] | 2022 | Utilized GANs. Interactive GUI tool to generate synthetic data. Integrated bias detection. Evaluated with Learning Fair Representations (LFR). Uses statistical fairness metrics including Statistical Parity Difference (SPD). | German Credit dataset, Adult dataset | Tabular data | LFR improved fairness by 62% and 17.5%; reduced SPD by 93%. | Interactive GUI, effective bias mitigation, improved fairness, comprehensive bias mitigation. | Specific to the datasets used, generalizability issues, reliance on specific metrics, limited to specific algorithms. |
[15] | 2022 | Employs mechanistic virtual patient (VP) modeling to capture specific features of patients’ states and dynamics while reducing biases introduced by heterogeneous datasets. | Observational data of mixed origin, including data pooled from different hospitals. ICU datasets like MIMIC or HiRID | Text and tabular data | VP model-based clustering reduced biases introduced by pooling data from different hospitals with ARDS patients, compared to clustering based solely on original patient data. | VP modeling approach mitigates biases from heterogeneous datasets and improves cluster discovery. | Depends on the availability of relevant observational data, and the complexity of use. |
[16] | 2021 | DECAF framework. Structural causal model. Biased edge removal. Evaluated fairness. SCMs focus on understanding causal relationships within data. | Tabular data | Tabular data | High-quality synthetic data that maintained the utility of real data; fairness evaluated via demographic parity and equal opportunity. | Compatible with multiple fairness definitions, high-quality data, effective debiasing, and theoretical guarantees. | Requires causal relationship understanding, definition-specific results, focused on tabular data, and expert knowledge needed. |
[17] | 2021 | Data size reduction. Simulation of data biases. BayesBoost approach for bias handling. Probabilistic models and synthetic data generation. Comparison with SMOTE and ADASYN for generating synthetic datasets. | CPRD-based synthetic datasets (synthetic CVD dataset) with 499,344 patients and 21 variables. | Tabular data | AUC, ROC, and Precision–Recall curves; improved representation of under-represented groups. | Privacy preservation, effective bias identification, maintaining high data quality, and versatility for various medical datasets. | Dependence on initial dataset quality, methodological complexity, and parameter sensitivity; may need adjustments for different datasets. |
[18] | 2020 | GANs for generating synthetic biomedical signals using LSTM generator and CNN discriminator. Comparison with real signals to assess quality. Training using signal augmentation techniques. | Biomedical signal datasets like ECG, EEG, EMG, and PPG (17 types of ECG signals). | Time-series data (signal data) | Signal fidelity, noise levels, and usability. Signal similarity: high accuracy approx. 92%. | High-quality synthetic signals, reduce data scarcity issues, high accuracy, and versatility. | Computationally intensive, requires high-quality initial datasets, and model complexity. |
[19] | 2023 | Review of GAN applications in medicine. Discussion of ethical, security, and privacy considerations. Comparative analysis of different GAN architectures with a visual review. | Various medical datasets such as MRI, CT scans, and retinal images. | Structured and unstructured data | Accuracy, precision, recall, F1-score, and visual quality assessments; GAN metrics vary by use case. | GANs generate synthetic medical images for datasets and new data patterns. | Focus on theoretical aspects, limited empirical data, and ethical concerns. |
[20] | 2022 | Review of clinical papers from PubMed on AI-assessed disparities in dataset country sources and clinical demographics (nationality, sex, expertise). Manually tagged a subsample of articles to train a prediction model using transfer-learning techniques. Studied transfer learning with the BioBERT model and automated tools like Entrez Direct and Gendarize. | -- | -- | Country metrics: U.S. 40.8%, China 13.7%. Clinical specialties: radiology 40.4%, pathology 9.1%. | A comprehensive analysis of 300,000 articles; highlights disparities in data; uses transfer learning and automated tools to analyze the dataset. | Focuses on U.S. and China data only. |
[21] | 2021 | Discusses the critical issues of fairness, bias, and the use of AI and ML in global health, specifically in Low- and Middle-Income Countries. Proposes a framework for appropriateness, bias, and fairness. | Diagnosis clinical records of 200 consecutive patients at a clinic. | Text data | Disparate impact scores, fairness of predictions across different groups, and fairness measured via true positive rates across groups. | A robust framework assessing fairness, bias, and appropriateness in AI/ML applications; offers clear guidelines for AI/ML fairness and reducing bias. | Heavily dependent on the quality and diversity of the training data, practical challenges in deploying AI/ML, and bias inherent in biological differences. |
[22] | 2022 | Analyzes data and algorithmic bias related to data collection and model development, training, and testing using health data collected in New Zealand. Measures fairness using disparate impact (DI) scores, equal opportunity, and equalized odds. | Health data collected in New Zealand, including Māori populations. NZ-GP Harms and NZ-GDF Diabetes Mellitus (Kaggle), PIMA, SACardio, MNCD-RED | Tabular data (NZ-GP Harms), free text | Accuracy was approximately 89.2% for both genders; models showed varying ROC for different groups. | Addresses AI biases in under-represented Indigenous populations; employing fairness metrics enhances transparency. | Data specific to New Zealand; dependent on available health data. |
[23] | 2023 | Deep reinforcement learning framework for bias mitigation. Adversarial training to reduce biases during model development. Applied to COVID-19 screening and patient discharge prediction. | eICU Collaborative Research Database clinical data | Text and structured data with multiple attributes such as patient demographics, diagnoses, treatments, and outcomes | Improved fairness and reduced bias in clinical prediction; AUC-ROC improved from 0.818 to 0.875 using the XGBoost and Random Forest (RF) models. | Effective bias mitigation, application to real-world datasets, enhances clinical decision support. | Requires extensive resources; dependent on the quality of initial data; limited dataset information. |
[24] | 2022 | Synthetic data generation to address the issues of small and imbalanced medical datasets. Tests tabular datasets of different sizes, balancing data with SMOTE and ADASYN and augmenting data via Gaussian copulas and CTGANs. | Eight medical datasets including MNCD, Bangladesh Diabetes | Tabular data | Synthetic data maintained the statistical properties of the real data and enhanced machine learning model training without the original dataset. Metrics: PCD, KLD, and MMD for statistical similarity, and F1-score. | Evaluated multiple datasets; advanced synthetic data generation techniques such as CTGAN; effectiveness of Gaussian copulas in preserving data structure. | Can be time-consuming; errors depend on the quality of the original datasets; challenges with CTGANs in maintaining balance and performance. |
[25] | 2021 | Neural language models (LSTM and GPT-2) for generating synthetic EHR text. Joint generation of synthetic text and annotations for NER with in-text PHI tags. User study for privacy assessment; privacy evaluated using ROUGE n-gram and BM25 scoring. Combines real and synthetic data to improve recall without manual annotation. | A large and heterogeneous corpus of one million Dutch EHR notes. | Text-based (EHR data) | The accuracy of de-identification is 95%; the LSTM method produces synthetic text with higher utility than GPT-2. | Addresses privacy concerns, reduces manual annotation effort, and compares the utility of synthetic data to real data. | Privacy risks with synthetic text; challenges in evaluating privacy preservation; the study focuses on Dutch EHR data. |
[26] | 2021 | Discusses various AI models and algorithms for clinical outcome prediction, including decision trees, linear models, regression models, ensemble learning, and neural networks. | Multiple clinical datasets including patient records (EHR), imaging, and genetic data. | Structured and unstructured data (mixed: text, tabular, images) | -- | High predictive accuracy, robust model validation, personalized medicine, and effective data integration. | Model interpretability and generalizability to diverse populations; requires large datasets; raises privacy concerns. |
[27] | 2023 | Synthetic data generation with controlled bias. Qualitative analysis of bias impact. Open-source toolkit. Controlled scenarios for bias studies. | Synthetic datasets with predefined bias | Synthetic data generation (open-source toolkit) | Qualitative analysis of bias impact. | Precise control over bias types, open source availability, flexibility to model various biases, contribution to fairness research | May not capture real-world complexities, generalization issues, requires expertise, scope of bias types might be limited. |
[11] | 2021 | Synthetic data generation using various techniques. Comparative study. Analysis of bias reduction | Anonymized CPRD data. Large-scale datasets including real and synthetic data. Request-based access. | Tabular EHR data | Bias reduction by 15–20%, accuracy improved by 10–12%, privacy with <5% re-identification risk, and data utility retained by 90–95%. | Effective bias mitigation ensures privacy, scalable, and versatile. | Varying results and quality depend on models, resource-intensive, and metric reliance. |