Synthetic Boosted Resampling Using Deep Generative Adversarial Networks: A Novel Approach to Improve Cancer Prediction from Imbalanced Datasets
Simple Summary
Abstract
1. Introduction
2. Materials and Methods
2.1. Data Description
2.2. Data Acquisition and Preprocessing
2.3. Addressing Class Imbalance with GANs
2.4. GAN Training Process
The generator network consists of three layers:
- The first layer has 256 neurons with a ReLU activation function.
- The second layer consists of 128 neurons, also using a ReLU activation function.
- The final output layer has a number of neurons matching the input feature dimension of the dataset, with a sigmoid activation function to produce normalized synthetic data.

The discriminator network consists of four layers:
- The input layer has a size equal to the number of input features.
- The first hidden layer contains 256 neurons with a ReLU activation function.
- The second hidden layer has 128 neurons, also using a ReLU activation function.
- The output layer has a single neuron with a sigmoid activation function to predict whether the input is real or generated.
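The layer sizes listed above can be illustrated with a minimal NumPy forward pass. This is only a shape-level sketch, not the authors' implementation: the latent noise size (`latent_dim`), the feature count (`n_features`), the weight initialization, and the helper names `init_dense` and `forward` are all assumptions introduced for illustration, and no training step is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_dense(n_in, n_out):
    """Return (weights, bias) for one fully connected layer (small random init)."""
    return rng.normal(0.0, 0.05, size=(n_in, n_out)), np.zeros(n_out)

def forward(x, layers, activations):
    """Apply each dense layer followed by its activation function."""
    for (w, b), act in zip(layers, activations):
        x = act(x @ w + b)
    return x

n_features = 15   # assumption: placeholder for the dataset's feature count
latent_dim = 100  # assumption: the noise dimension is not given in the text

# Generator: noise -> 256 (ReLU) -> 128 (ReLU) -> n_features (sigmoid)
generator = [init_dense(latent_dim, 256), init_dense(256, 128),
             init_dense(128, n_features)]
# Discriminator: n_features -> 256 (ReLU) -> 128 (ReLU) -> 1 (sigmoid)
discriminator = [init_dense(n_features, 256), init_dense(256, 128),
                 init_dense(128, 1)]
acts = [relu, relu, sigmoid]

noise = rng.normal(size=(8, latent_dim))
synthetic = forward(noise, generator, acts)       # one row per synthetic sample
p_real = forward(synthetic, discriminator, acts)  # probability each row is real
print(synthetic.shape, p_real.shape)
```

Because the generator ends in a sigmoid, its outputs lie in (0, 1), so a setup like this presupposes that the real features were scaled to that range before training.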
2.5. Classification Models and Evaluation
2.6. Performance Assessment and Results Compilation
3. Results and Discussion
3.1. Scenario 1: Baseline Scenario at an Imbalance Ratio of 5.5 Without Resampling
3.2. Scenario 2: Minority Oversampling Scenario with an Imbalance Ratio of 3
3.3. Scenario 3: Minority Oversampling Scenario with an Imbalance Ratio of 2
3.4. Scenario 4: Minority Oversampling Scenario with an Imbalance Ratio of 1
3.5. Scenario 5: Majority and Minority Oversampling Scenario with an Imbalance Ratio of 1
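The imbalance ratios named in the scenario headings above can be related to the number of synthetic samples with a short calculation. The sketch below assumes the usual definition of imbalance ratio as the majority count divided by the minority count; the class counts and the function name `synthetic_needed` are hypothetical and chosen only so that the starting ratio is 5.5.

```python
import math

def synthetic_needed(n_majority, n_minority, target_ratio):
    """Synthetic minority samples needed so majority/minority <= target_ratio."""
    required_minority = math.ceil(n_majority / target_ratio)
    return max(0, required_minority - n_minority)

# Hypothetical class counts giving a baseline imbalance ratio of 5.5.
n_majority, n_minority = 3300, 600
print("baseline ratio:", n_majority / n_minority)

for ratio in (3, 2, 1):
    extra = synthetic_needed(n_majority, n_minority, ratio)
    achieved = n_majority / (n_minority + extra)
    print(f"target {ratio}: add {extra} synthetic minority samples -> ratio {achieved}")
```

Under the same assumption, a scenario that oversamples both classes at a ratio of 1 would add synthetic samples to the majority and minority classes in equal final proportion, so the ratio stays at 1 while the total sample size grows.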
4. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest

Classifier performance in Scenario 1 (baseline, imbalance ratio of 5.5, no resampling):

Classifier | Type | Accuracy | Precision | Recall | F1 Score | ROC AUC |
---|---|---|---|---|---|---|
GradientBoosting | Boosting | 0.9068 | 0.8573 | 0.5262 | 0.6327 | 0.8667 |
CatBoost | Boosting | 0.9046 | 0.8566 | 0.5051 | 0.6175 | 0.8628 |
LogisticRegression | Linear | 0.8954 | 0.8406 | 0.4484 | 0.5642 | 0.8626 |
LinearSVC | Linear | 0.8926 | 0.8441 | 0.4093 | 0.5353 | 0.8605 |
Bagging | Bagging | 0.9073 | 0.8735 | 0.4938 | 0.6181 | 0.8583 |
RandomForest | Bagging | 0.9043 | 0.8616 | 0.4905 | 0.6086 | 0.8483 |
ExtraTrees | Bagging | 0.8897 | 0.8300 | 0.4093 | 0.5294 | 0.8394 |
XGBoost | Boosting | 0.8954 | 0.8240 | 0.4986 | 0.5925 | 0.8298 |
SVC | Non-linear | 0.8892 | 0.8560 | 0.3541 | 0.4921 | 0.8215 |
PassiveAC | Linear | 0.8551 | 0.8670 | 0.0695 | 0.1208 | 0.8031 |
KNeighbors | Non-linear | 0.8760 | 0.7897 | 0.3540 | 0.4621 | 0.7784 |
DecisionTree | Non-linear | 0.8332 | 0.6841 | 0.5065 | 0.4804 | 0.6994 |

Classifier performance in Scenario 2 (minority oversampling, imbalance ratio of 3):

Classifier | Type | Accuracy | Precision | Recall | F1 Score | ROC AUC |
---|---|---|---|---|---|---|
GradientBoosting | Boosting | 0.9158 | 0.9039 | 0.7424 | 0.8143 | 0.9276 |
CatBoost | Boosting | 0.9153 | 0.9133 | 0.7309 | 0.8113 | 0.9244 |
Bagging | Bagging | 0.9138 | 0.9187 | 0.7186 | 0.8056 | 0.9216 |
RandomForest | Bagging | 0.9136 | 0.9161 | 0.7204 | 0.8057 | 0.9174 |
ExtraTrees | Bagging | 0.8984 | 0.9056 | 0.6641 | 0.7646 | 0.9134 |
XGBoost | Boosting | 0.9081 | 0.8816 | 0.7309 | 0.7982 | 0.9111 |
SVC | Non-linear | 0.8997 | 0.9390 | 0.6403 | 0.7608 | 0.8990 |
KNeighbors | Non-linear | 0.8896 | 0.8845 | 0.6429 | 0.7438 | 0.8722 |
LogisticRegression | Linear | 0.7576 | 0.5282 | 0.2867 | 0.3709 | 0.8407 |
LinearSVC | Linear | 0.7794 | 0.6652 | 0.2410 | 0.3523 | 0.8402 |
DecisionTree | Non-linear | 0.8511 | 0.6899 | 0.7371 | 0.7119 | 0.8131 |
PassiveAC | Linear | 0.7616 | 0.6024 | 0.1346 | 0.2165 | 0.8043 |

Classifier performance in Scenario 3 (minority oversampling, imbalance ratio of 2):

Classifier | Type | Accuracy | Precision | Recall | F1 Score | ROC AUC |
---|---|---|---|---|---|---|
GradientBoosting | Boosting | 0.9253 | 0.9384 | 0.8304 | 0.8809 | 0.9538 |
CatBoost | Boosting | 0.9251 | 0.9466 | 0.8216 | 0.8795 | 0.9493 |
Bagging | Bagging | 0.9263 | 0.9573 | 0.8152 | 0.8802 | 0.9485 |
RandomForest | Bagging | 0.9257 | 0.9524 | 0.8181 | 0.8798 | 0.9462 |
ExtraTrees | Bagging | 0.9102 | 0.9470 | 0.7747 | 0.8516 | 0.9432 |
XGBoost | Boosting | 0.9173 | 0.9214 | 0.8222 | 0.8687 | 0.9391 |
SVC | Non-linear | 0.9110 | 0.9659 | 0.7600 | 0.8500 | 0.9321 |
KNeighbors | Non-linear | 0.9032 | 0.9396 | 0.7588 | 0.8389 | 0.9168 |
LogisticRegression | Linear | 0.8652 | 0.7864 | 0.8181 | 0.8016 | 0.8946 |
LinearSVC | Linear | 0.8641 | 0.7829 | 0.8198 | 0.8006 | 0.8915 |
PassiveAC | Linear | 0.8686 | 0.8360 | 0.7558 | 0.7928 | 0.8858 |
DecisionTree | Non-linear | 0.8746 | 0.7999 | 0.8334 | 0.8158 | 0.8643 |

Classifier performance in Scenario 4 (minority oversampling, imbalance ratio of 1):

Classifier | Type | Accuracy | Precision | Recall | F1 Score | ROC AUC |
---|---|---|---|---|---|---|
GradientBoosting | Boosting | 0.9425 | 0.9722 | 0.9111 | 0.9406 | 0.9764 |
Bagging | Bagging | 0.9449 | 0.9789 | 0.9094 | 0.9428 | 0.9734 |
CatBoost | Boosting | 0.9428 | 0.9737 | 0.9102 | 0.9409 | 0.9733 |
RandomForest | Bagging | 0.9434 | 0.9770 | 0.9082 | 0.9413 | 0.9721 |
ExtraTrees | Bagging | 0.9328 | 0.9752 | 0.8882 | 0.9296 | 0.9698 |
XGBoost | Boosting | 0.9383 | 0.9645 | 0.9099 | 0.9364 | 0.9695 |
SVC | Non-linear | 0.9341 | 0.9853 | 0.8815 | 0.9304 | 0.9637 |
KNeighbors | Non-linear | 0.9277 | 0.9729 | 0.8800 | 0.9240 | 0.9558 |
PassiveAC | Linear | 0.9045 | 0.9110 | 0.8973 | 0.9037 | 0.9423 |
LinearSVC | Linear | 0.9212 | 0.9336 | 0.9070 | 0.9201 | 0.9420 |
LogisticRegression | Linear | 0.9127 | 0.9110 | 0.9149 | 0.9129 | 0.9415 |
DecisionTree | Non-linear | 0.9033 | 0.8944 | 0.9149 | 0.9045 | 0.9033 |

Classifier performance in Scenario 5 (majority and minority oversampling, imbalance ratio of 1):

Classifier | Type | Accuracy | Precision | Recall | F1 Score | ROC AUC |
---|---|---|---|---|---|---|
GradientBoosting | Boosting | 0.9611 | 0.9812 | 0.9402 | 0.9602 | 0.9890 |
Bagging | Bagging | 0.9610 | 0.9847 | 0.9366 | 0.9600 | 0.9880 |
CatBoost | Boosting | 0.9603 | 0.9810 | 0.9388 | 0.9594 | 0.9870 |
RandomForest | Bagging | 0.9617 | 0.9845 | 0.9382 | 0.9608 | 0.9869 |
ExtraTrees | Bagging | 0.9541 | 0.9834 | 0.9238 | 0.9526 | 0.9859 |
XGBoost | Boosting | 0.9568 | 0.9739 | 0.9388 | 0.9560 | 0.9857 |
KNeighbors | Non-linear | 0.9512 | 0.9808 | 0.9204 | 0.9496 | 0.9756 |
SVC | Non-linear | 0.9549 | 0.9892 | 0.9198 | 0.9532 | 0.9744 |
PassiveAC | Linear | 0.9268 | 0.9200 | 0.9354 | 0.9275 | 0.9634 |
LogisticRegression | Linear | 0.9371 | 0.9312 | 0.9440 | 0.9375 | 0.9569 |
LinearSVC | Linear | 0.9442 | 0.9473 | 0.9408 | 0.9440 | 0.9563 |
DecisionTree | Non-linear | 0.9312 | 0.9283 | 0.9348 | 0.9314 | 0.9312 |
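
The Accuracy, Precision, Recall, and F1 Score columns in the tables above follow the standard binary classification definitions, which can be sketched from confusion-matrix counts as follows. The counts used here are hypothetical and are not taken from the study; the function name `binary_metrics` is likewise an illustration. ROC AUC is omitted because it requires predicted scores rather than counts.

```python
def binary_metrics(tp, fp, fn, tn):
    """Standard binary metrics from confusion-matrix counts (positive = minority class)."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical imbalanced test set: 100 positives, 550 negatives.
metrics = binary_metrics(tp=50, fp=10, fn=50, tn=540)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these hypothetical counts, accuracy is about 0.91 even though only half of the positive cases are recovered, which is the pattern that makes recall and F1 more informative than accuracy on imbalanced data.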
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Gurcan, F.; Soylu, A. Synthetic Boosted Resampling Using Deep Generative Adversarial Networks: A Novel Approach to Improve Cancer Prediction from Imbalanced Datasets. Cancers 2024, 16, 4046. https://doi.org/10.3390/cancers16234046