1. Introduction
Cancer is one of the leading health problems worldwide and is associated with some of the highest mortality rates. It can arise in various organs and tissues, with each type possessing unique characteristics and treatment approaches. Moreover, cancer cells often spread rapidly, making early diagnosis and effective treatment challenging. A major contributor to cancer’s high mortality is that the disease is often diagnosed at advanced stages; late diagnosis limits treatment options and frequently leads to poorer outcomes. In modern medicine, diagnosing and prognosing cancer typically requires the analysis of large and complex datasets [1,2]. These datasets may include a wide range of information, such as patients’ genetic data, clinical test results, imaging data, and biomarkers [3,4]. Effectively diagnosing, managing, treating, and predicting the course of cancer necessitates thorough analysis of complex data from various contexts [1,4,5]. This complexity mandates the use of robust data analysis and modeling techniques to achieve accurate and reliable results [6,7,8]. Labeled data associate each instance with a specific target value or class and are used to train supervised learning algorithms [9,10,11]. Unlabeled data, by contrast, carry no class or target information and are typically used with unsupervised learning methods [12,13,14]. The selection of suitable analysis techniques is crucial for both types of data, as it directly impacts the reliability of the results and the performance of the model [6,8]. Proper and comprehensive data analysis enables early-stage diagnosis, personalization of treatment methods, and more accurate prognoses [1,4,6,14,15].
However, the imbalances commonly encountered in these complex datasets significantly hinder the effectiveness of analysis [16,17]. In particular, minority classes (e.g., rare cancer types) are frequently underrepresented relative to majority classes (e.g., common cancer types) [11,15,18]. Data imbalance arises when there is a substantial difference in the number of samples between classes, and the underrepresentation of minority classes complicates the models’ ability to learn them accurately [8,19]. This can result in low sensitivity and high false negative rates for minority classes. Models that place more weight on the majority class may not adequately learn the minority classes, leading to increased misclassification rates [6,7,11]. Such imbalanced data distributions can degrade the performance of traditional classification algorithms, particularly hindering the accurate prediction of minority classes, which are critical in cancer diagnosis and prognosis [20,21].
To overcome these data imbalances, it is essential to develop specific strategies and techniques within the data analysis and modeling process [6,7]. Resampling and balancing techniques play a crucial role in this context: they are applied to ensure better representation of minority classes in the dataset and to support classification algorithms in producing fairer and more accurate results [21,22,23]. Resampling strategies include reducing the number of majority class samples (under-sampling), increasing the number of minority class samples (over-sampling), and hybrid methods that combine these two approaches [7,16,24]. These approaches balance the dataset, allowing classification algorithms to generate more accurate and reliable results [6,11,23].
The aim of this study is to thoroughly examine how resampling techniques are utilized in cancer diagnosis and prognosis and to compare the performance of these techniques in conjunction with various classifiers. In this context, while evaluating the effectiveness of the resampling methods, we compared classification techniques based on various machine learning approaches designed to improve performance on imbalanced datasets. The findings can contribute to understanding the effects of resampling methods on medical data, improving clinical decision support systems, enhancing the overall quality of healthcare services, and addressing imbalanced datasets more effectively.
2. Materials and Methods
This study aims to identify the most effective resampling methods and classifiers for cancer diagnosis and prognosis, addressing the challenge of imbalanced data that can distort model performance. To achieve this, we have structured our methodology in a detailed and systematic manner, which is outlined in the following sequential subsections:
Description of the datasets: An overview of the datasets used in this study, including their characteristics and relevance to cancer diagnosis and prognosis.
Adopted data resampling strategies: Overview of techniques used to address data imbalance and their implementation.
Adopted machine learning classifiers: Summary of algorithms applied, including their purpose and usage.
Performance evaluation metrics: The criteria and methods used to assess the effectiveness of the classifiers and resampling methods.
Proposed methodology and experimental setup: An outline of the overall methodological framework and experimental design.
2.1. Description of the Datasets
In this study, five datasets focused on diagnosis and prognosis have been used: three diagnostic and two prognostic. The diagnostic datasets are the Wisconsin Breast Cancer Database [25], which classifies tumors as benign or malignant; the Cancer Prediction Dataset [26], which includes clinical and biological data for various cancer types; and the Lung Cancer Detection Dataset [27], which assesses lung cancer risk using demographic and clinical variables. The prognostic datasets are the Seer Breast Cancer Dataset [28], which predicts long-term survival and cancer progression, and the Differentiated Thyroid Cancer Recurrence Dataset [29], which estimates the likelihood of cancer recurrence in patients with thyroid cancer.
All of these datasets are open-source and publicly available, ensuring transparency and accessibility for research purposes. Researchers can freely access and download these datasets via the Kaggle data repository, using the provided links. These datasets exhibit significant class imbalance problems, with minority classes being underrepresented. This imbalance can negatively affect the performance of machine learning models. Resampling methods, which are used to correct such imbalances, help balance class distribution and improve the accuracy of the models. Each of these datasets will now be explained in detail under individual subsections.
Wisconsin Breast Cancer Database (Diagnostic): The Wisconsin Breast Cancer Database (WBCD) is a widely recognized dataset in medical research and machine learning, particularly used for breast cancer diagnosis [25]. Each sample is described by 10 real-valued features, including clump thickness, uniformity of cell size and shape, marginal adhesion, epithelial cell size, and other characteristics critical for distinguishing between benign and malignant breast masses. These features are derived from digitized images of fine-needle aspirates (FNAs) of breast tissue. The primary objective when using this dataset is to predict the class of the tumor (benign or malignant) from the given features. The dataset comprises 699 samples: 458 benign tumors (65.5%) and 241 malignant tumors (34.5%) [6,24,25].
Cancer Prediction Dataset (Diagnostic): This dataset contains health and lifestyle information for 1500 patients, with 9 variables (Age, Gender, BMI, Smoking, GeneticRisk, PhysicalActivity, AlcoholIntake, CancerHistory, Diagnosis), aimed at predicting cancer diagnosis [26]. It presents a realistic challenge for predictive modeling in the medical field by encompassing a range of variables that influence cancer diagnosis and prognosis. The dataset includes both categorical and continuous features, such as age, BMI, smoking status, and genetic risk factors, which collectively contribute to the complexity of predicting cancer presence. Each variable captures a different aspect of patients’ health and lifestyle, adding intricacy to the modeling process [17,23]. The dataset also records patients’ cancer history and current diagnosis status: 943 patients (62.9%) have not been diagnosed with cancer, while 557 patients (37.1%) have been [26].
Lung Cancer Detection Dataset (Diagnostic): This dataset contains information on lung cancer risk factors and symptoms, with the aim of detecting the presence of lung cancer [27]. It comprises 309 samples with 16 variables capturing demographic, lifestyle, and medical factors linked to lung cancer risk: Gender, Age, Smoking, Yellow_Fingers, Anxiety, Peer_Pressure, Chronic_Disease, Fatigue, Allergy, Wheezing, Alcohol_Consuming, Coughing, Shortness_Of_Breath, Swallowing_Difficulty, Chest_Pain, and Lung_cancer. Each variable provides insight into potential indicators of lung cancer, contributing to the overall analysis. The target variable, Lung_cancer, represents the diagnostic outcome: 270 individuals (87.4%) are diagnosed with lung cancer and 39 (12.6%) are not [27]. This ratio means that for every individual without lung cancer, there are about 7 individuals with lung cancer [7,11,20].
Seer Breast Cancer Dataset (Prognostic): The Seer Breast Cancer Dataset offers comprehensive clinical and demographic information crucial for evaluating breast cancer outcomes [28]. The dataset encompasses 4024 patients, with exclusions made for cases with unknown tumor size, unexamined regional lymph nodes, positive regional lymph nodes, and survival times of less than one month. It includes 16 variables, providing a broad spectrum of information: Age, Race, Marital Status, T Stage, N Stage, 6th Stage, Differentiation, Grade, A Stage, Tumor Size, Estrogen Status, Progesterone Status, Regional Node Examined, Regional Node Positive, Survival Months, and Status. The target variable, Status, represents the patient’s survival outcome and is categorized as “Alive” or “Dead”: 84.4% (3398 patients) are classified as “Alive” and 15.6% (607 patients) as “Dead” [28]. This pronounced class imbalance highlights a key challenge in prognosis prediction [4,6,7].
Differentiated Thyroid Cancer Recurrence Dataset (Prognostic): This dataset encompasses the key features needed to assess prognosis in well-differentiated thyroid cancer. It contains a total of 17 variables and 383 patient records [29]. Sixteen of these are clinicopathologic features: Age, Gender, Smoking, Hx Smoking (history of smoking), Hx Radiotherapy (history of radiotherapy), Thyroid Function, Physical Examination, Adenopathy, Pathology, Focality, Risk, T (tumor stage), N (nodal stage), M (metastasis stage), Stage (overall stage), and Response (treatment response). The target variable, Recurred, shows that 275 patients (71.8%) are classified as “No”, meaning they did not experience recurrence, while 108 patients (28.2%) are classified as “Yes”, indicating that their cancer recurred [29]. This substantial imbalance between the two classes presents a significant challenge for predictive modeling [7,20].
2.2. Adopted Data Resampling Strategies
In this study, we employ a total of 19 resampling methods, categorized into three types: over-sampling (6 methods), under-sampling (10 methods), and hybrid sampling (2 methods). Additionally, a Baseline scenario, which involves no resampling, is included for comparative purposes [30,31]. For over-sampling, aimed at increasing the number of minority class samples, we use KMSMOTE, SMOTE, ADASYN, ROS, BSMOTE, and SVMSMOTE [7,32,33]. The under-sampling methods, which reduce the number of majority class samples to balance the dataset, are IHT, RENN, EditedNN, NCR, TomekLinks, RUS, NearMiss, ClusterC, OSS, and CNN [1,23,30]. Hybrid sampling techniques, which combine over- and under-sampling, are represented by SMOTEENN and SMOTETomek [1,34]. The Baseline scenario, with no resampling, serves as a reference for evaluating the effectiveness of the resampling methods [24]. Each approach is selected to address class imbalance and enhance the performance of machine learning models.
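As an illustration, the sketch below shows one way this pool of methods can be assembled with the imbalanced-learn library, which provides all 19 techniques under the abbreviations used here. The default parameters and the fixed random_state are illustrative assumptions rather than the study’s reported settings.

```python
# A minimal sketch of the adopted resampling pool using imbalanced-learn.
# Parameters are library defaults; random_state=42 is an illustrative choice.
from imblearn.over_sampling import (KMeansSMOTE, SMOTE, ADASYN,
                                    RandomOverSampler, BorderlineSMOTE, SVMSMOTE)
from imblearn.under_sampling import (InstanceHardnessThreshold,
                                     RepeatedEditedNearestNeighbours,
                                     EditedNearestNeighbours,
                                     NeighbourhoodCleaningRule, TomekLinks,
                                     RandomUnderSampler, NearMiss,
                                     ClusterCentroids, OneSidedSelection,
                                     CondensedNearestNeighbour)
from imblearn.combine import SMOTEENN, SMOTETomek

RESAMPLERS = {
    "Baseline": None,  # no resampling, reference scenario
    # Over-sampling (6 methods)
    "KMSMOTE": KMeansSMOTE(random_state=42),
    "SMOTE": SMOTE(random_state=42),
    "ADASYN": ADASYN(random_state=42),
    "ROS": RandomOverSampler(random_state=42),
    "BSMOTE": BorderlineSMOTE(random_state=42),
    "SVMSMOTE": SVMSMOTE(random_state=42),
    # Under-sampling (10 methods)
    "IHT": InstanceHardnessThreshold(random_state=42),
    "RENN": RepeatedEditedNearestNeighbours(),
    "EditedNN": EditedNearestNeighbours(),
    "NCR": NeighbourhoodCleaningRule(),
    "TomekLinks": TomekLinks(),
    "RUS": RandomUnderSampler(random_state=42),
    "NearMiss": NearMiss(),
    "ClusterC": ClusterCentroids(random_state=42),
    "OSS": OneSidedSelection(random_state=42),
    "CNN": CondensedNearestNeighbour(random_state=42),
    # Hybrid sampling (2 methods)
    "SMOTEENN": SMOTEENN(random_state=42),
    "SMOTETomek": SMOTETomek(random_state=42),
}

# Every resampler exposes the same interface:
# X_res, y_res = RESAMPLERS["SMOTE"].fit_resample(X_train, y_train)
```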
2.3. Adopted Machine Learning Classifiers
In this study, we utilize 10 different machine learning classifiers, categorized into four main groups: Balancing Ensemble Classifiers, Linear Classifiers, Standard Ensemble Classifiers, and Deep Learning Classifiers [1,7,45]. This diverse set of classifiers allows for a thorough exploration of machine learning approaches, especially in the context of imbalanced data in cancer diagnosis and prognosis, where accurate predictions are critical for patient outcomes [46,47]. By combining traditional linear models, advanced ensemble techniques, and deep learning classifiers, we aim to assess the effectiveness of each method in addressing the challenges posed by class imbalance. This holistic approach provides valuable insight into the performance and adaptability of different algorithms in predicting cancer progression and aiding early diagnosis. The four groups are listed below (a brief instantiation sketch follows the list).
Balancing Ensemble Classifiers:
BRF (Balanced Random Forest): Modifies the traditional Random Forest by using balanced bootstrap samples during training to handle class imbalance [48,49].
EE (Easy Ensemble): Creates multiple balanced subsets by under-sampling the majority class, trains independent models on them, and combines their results for a more balanced classification [16,31].
RB (Random Under-Sampling Boost): Applies random under-sampling followed by AdaBoost to provide a balanced classification [24,50].
BB (Balanced Bagging): Employs balanced resampling in each bootstrap sample to enhance the performance of bagging on imbalanced data [48,51].
Linear Classifiers:
LR (Logistic Regression): A linear model that predicts class probabilities using a sigmoid function [1,44].
SVC (Support Vector Classifier): Finds the optimal separating hyperplane and supports linear and non-linear decision boundaries through different kernels [17,52].
Standard Ensemble Classifiers:
RF (Random Forest): A popular bagging method that builds multiple decision trees and combines their outputs to improve robustness and reduce overfitting [9,48].
XGB (XGBoost): An optimized gradient boosting method that builds trees iteratively to correct previous errors, known for its high performance on large datasets [9,51].
Deep Learning Classifiers:
MLP (Multi-Layer Perceptron): A feedforward neural network with fully connected layers, effective for non-linear classification problems [4,53].
DNN (Deep Neural Network): An extension of the MLP with multiple hidden layers, allowing more advanced feature learning and higher capacity for large and complex datasets [46,54].
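The sketch below shows one way to instantiate these ten classifiers with scikit-learn, imbalanced-learn, and XGBoost. All hyperparameters shown are illustrative assumptions, and representing the DNN as a deeper MLPClassifier is our simplification; the exact architectures used in the study are not specified in this section.

```python
# A minimal sketch of the ten classifiers, grouped as in the text above.
# Hyperparameters (hidden-layer sizes, random_state, etc.) are illustrative.
from imblearn.ensemble import (BalancedRandomForestClassifier,
                               EasyEnsembleClassifier, RUSBoostClassifier,
                               BalancedBaggingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

CLASSIFIERS = {
    # Balancing ensemble classifiers
    "BRF": BalancedRandomForestClassifier(random_state=42),
    "EE": EasyEnsembleClassifier(random_state=42),
    "RB": RUSBoostClassifier(random_state=42),
    "BB": BalancedBaggingClassifier(random_state=42),
    # Linear classifiers
    "LR": LogisticRegression(max_iter=1000),
    "SVC": SVC(probability=True),  # probability=True enables ROC-AUC scoring
    # Standard ensemble classifiers
    "RF": RandomForestClassifier(random_state=42),
    "XGB": XGBClassifier(eval_metric="logloss"),
    # Deep learning classifiers (the DNN is modeled here as a deeper MLP)
    "MLP": MLPClassifier(hidden_layer_sizes=(100,), max_iter=500),
    "DNN": MLPClassifier(hidden_layer_sizes=(128, 64, 32), max_iter=500),
}
```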
2.4. Performance Evaluation Metrics
In this study, we utilize a comprehensive set of performance evaluation metrics to assess our machine learning models. Accuracy provides a straightforward measure of overall prediction correctness, while the F1 Score balances precision and recall, offering a nuanced evaluation that is especially important for imbalanced datasets [9,53,55]. ROC-AUC evaluates the model’s ability to distinguish between classes, which is crucial for tasks such as differentiating between benign and malignant tumors [20,56]. The Mean, which averages Accuracy, F1 Score, and ROC-AUC, offers a consolidated view of model performance, reflecting overall effectiveness [50,57,58]. Together, these metrics ensure a thorough evaluation of model performance, addressing the aspects crucial for effective cancer diagnosis and prognosis [9,50,59,60]. In addition, two descriptors characterize each dataset. The Class Count refers to the number of instances in each class, which is crucial for understanding the distribution and balance of classes [20,33]. The Imbalance Ratio is calculated by dividing the count of the minority class by the count of the majority class and multiplying by 100 to express it as a percentage; it indicates how prevalent the minority class is relative to the majority class [11,16].
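A minimal sketch of these quantities is given below. Treating the Mean as an equal-weight average of the three metrics, and using the weighted F1 variant, are our assumptions; the section does not state these choices explicitly.

```python
# The evaluation metrics and dataset descriptors described above.
from collections import Counter
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def evaluate(y_true, y_pred, y_score):
    acc = accuracy_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred, average="weighted")  # weighted F1: assumption
    auc = roc_auc_score(y_true, y_score)  # y_score: positive-class probability
    return {"Accuracy": acc, "F1": f1, "ROC-AUC": auc,
            "Mean": (acc + f1 + auc) / 3.0}  # equal-weight average: assumption

def class_counts(y):
    return dict(Counter(y))  # e.g., {0: 943, 1: 557}

def imbalance_ratio(y):
    counts = Counter(y).values()
    return 100.0 * min(counts) / max(counts)  # e.g., 557 / 943 * 100 ≈ 59.07
```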
2.5. Proposed Methodology and Experimental Setup
This section describes the methodology of this study, which comprises six key components: Experimental Setup; Data Loading, Encoding, and Scaling; Application of Resampling Methods; Application of Classifiers; Performance Evaluation Metrics; and Results Analysis. The sequential stages of this methodology are detailed below under the following subheadings.
Experimental Setup: The experiments were carried out in a stable and efficient computing environment. The system configuration included an Intel i7-12650H processor, an Nvidia RTX 3060 GPU, and 16 GB of RAM, running 64-bit Windows 10. Python 3.12 (64-bit) was used as the development language, with Jupyter Notebook version 7.1.3 serving as the primary environment [59]. Each phase of our methodology was supported by specific tools and libraries [49,50]. Python was selected for its flexibility, scalability, and wide use in machine learning and data science; its comprehensive library ecosystem enabled efficient management of every stage of the machine learning workflow, from data preparation to model evaluation and deployment [8,49,61,62,63]. The source code used in this study is publicly available on Figshare [64].
Data Loading, Encoding, and Scaling: The datasets used in this study were loaded from CSV files and prepared for analysis. Invalid values were replaced with NaN and imputed with the most frequent value of each respective column. The feature and target variables were identified and extracted based on the specific objectives of this study [50]. Non-numeric target variables were converted into numerical format using Label Encoding, while features were scaled using Standard Scaling [3,16]. Data splitting was performed using 5-Fold Cross-Validation, in which the dataset was divided into five equal parts [50,65]. Each fold served once as the test set while the remaining four folds were used for training, ensuring that every instance was used for both training and testing. This method preserves the class distribution in each fold and provides a comprehensive evaluation of the model’s performance [49,50,65].
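The following sketch illustrates this preparation pipeline. The file name, target column, and "?" invalid-value marker are placeholders, the sketch assumes numeric features, and fitting the scaler on the training folds only is our choice of practice rather than a detail stated above.

```python
# A sketch of the loading, imputation, encoding, scaling, and 5-fold
# splitting steps described above. File name and column names are placeholders.
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import StratifiedKFold

df = pd.read_csv("dataset.csv").replace("?", np.nan)  # invalid values -> NaN
df = df.fillna(df.mode().iloc[0])                     # fill with column modes

X = df.drop(columns=["target"])                       # assumes numeric features
y = LabelEncoder().fit_transform(df["target"])        # non-numeric labels -> integers

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in skf.split(X, y):
    scaler = StandardScaler().fit(X.iloc[train_idx])  # fit on training folds only
    X_train = scaler.transform(X.iloc[train_idx])
    X_test = scaler.transform(X.iloc[test_idx])
    y_train, y_test = y[train_idx], y[test_idx]
    # resampling and classifier training follow on (X_train, y_train)
```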
Application of Resampling Methods: To reduce class imbalance and enhance model performance, a total of 19 different resampling techniques were applied across three categories: over-sampling, under-sampling, and hybrid methods [1,4,20,38,44]. A Baseline model without resampling was also used for performance comparison. Each method was applied to the training data to create balanced training sets [18,20,34], which were then used to train the various classifiers whose performance we evaluate [49,50,65].
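Putting the previous sketches together, the experimental loop can be organized as below, with resampling applied to the training folds only, as described above. RESAMPLERS, CLASSIFIERS, evaluate(), X, y, and skf refer to the earlier sketches; this is a minimal reconstruction, not the study’s verbatim code.

```python
# The overall experimental loop: every resampler-classifier pair is evaluated
# on the same 5-fold splits, with resampling applied to training folds only.
import numpy as np
from sklearn.base import clone
from sklearn.preprocessing import StandardScaler

results = {}
for rs_name, resampler in RESAMPLERS.items():
    for clf_name, clf in CLASSIFIERS.items():
        fold_scores = []
        for train_idx, test_idx in skf.split(X, y):
            scaler = StandardScaler().fit(X.iloc[train_idx])
            X_tr = scaler.transform(X.iloc[train_idx])
            X_te = scaler.transform(X.iloc[test_idx])
            y_tr, y_te = y[train_idx], y[test_idx]
            if resampler is not None:  # Baseline keeps the original distribution
                X_tr, y_tr = resampler.fit_resample(X_tr, y_tr)
            model = clone(clf).fit(X_tr, y_tr)
            fold_scores.append(evaluate(y_te, model.predict(X_te),
                                        model.predict_proba(X_te)[:, 1]))
        # average each metric over the five folds
        results[(rs_name, clf_name)] = {
            m: np.mean([s[m] for s in fold_scores]) for m in fold_scores[0]}
```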
Application of Classifiers: Ten classifiers were applied to the resampled datasets, systematically categorized into four key groups (Balancing Ensemble Classifiers [48], Linear Classifiers [54], Standard Ensemble Classifiers [9], and Deep Learning Classifiers [17,58]), each contributing to a comprehensive performance evaluation. Each classifier was tested on the resampled datasets to evaluate its performance [66]. Additionally, Early Stopping was used for the deep learning models to prevent overfitting [49,50,61,65].
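For instance, with scikit-learn’s MLPClassifier, early stopping can be enabled as sketched below; the architecture, patience, and validation fraction are illustrative assumptions rather than the study’s reported configuration.

```python
# Early stopping for the deep learning models, shown with MLPClassifier.
from sklearn.neural_network import MLPClassifier

dnn = MLPClassifier(
    hidden_layer_sizes=(128, 64, 32),  # a deeper MLP standing in for the DNN
    early_stopping=True,               # hold out part of the training data
    validation_fraction=0.1,           # 10% of the training set for validation
    n_iter_no_change=10,               # stop after 10 epochs without improvement
    max_iter=500,
    random_state=42,
)
```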
Performance Evaluation and Result Analysis: The performance of the classifiers was assessed using accuracy, F1 score, ROC-AUC, and the mean of these metrics [11,60]. All experiments were conducted using Stratified 5-Fold Cross-Validation [1,9,67]. Performance results for each resampling method and classifier combination were collected and analyzed, and the results were summarized in terms of accuracy, F1 score, ROC-AUC, Mean, imbalance ratio, and class counts [6,7,52].
3. Results and Discussion
3.1. Results for Wisconsin Breast Cancer Database (Diagnostic)
In this study, the impact of class imbalance on classification performance and the effectiveness of various resampling methods were assessed using the Wisconsin Breast Cancer (Diagnostic) dataset. Initially, the dataset had an imbalance ratio of 52.62%, limiting model performance; the resampling methods significantly improved these results, as detailed in Table 1. The IHT method yielded the highest performance, with a Mean value of 99.73%, 99.59% Accuracy, 99.59% F1 Score, and 99.99% ROC AUC. SMOTEENN followed closely, with the same Mean (99.73%), slightly higher Accuracy and F1 scores of 99.61%, and 99.96% ROC AUC. RENN and EditedNN also performed well, with Mean values of 99.41% and 99.20%, respectively. The Baseline approach lagged behind with a 97.20% Mean, reflecting the impact of the imbalance. As shown in Table 1, resampling methods, especially IHT and SMOTEENN, significantly enhanced performance; however, under-sampling methods such as OSS and CNN reduced accuracy, highlighting the need for carefully chosen resampling techniques to mitigate class imbalance.
Table 2 presents the performance of ten different classification algorithms applied to the Wisconsin Breast Cancer Database (Diagnostic) using the resampling methods mentioned above. Multi-Layer Perceptron (MLP) achieved the best performance with a mean value of 97.83%, excelling with 97.41% Accuracy, 97.39% F1 Score, and 98.68% ROC AUC, effectively distinguishing both classes. Random Forest ranked second with a mean of 97.80%. Logistic Regression placed third with a mean of 97.79%, particularly standing out with a 98.94% ROC AUC, highlighting its ability to distinguish positive classes. Balanced Bagging, at the bottom of the list, showed relatively lower performance with a mean of 96.60%, but still achieved strong classification with a 98.08% ROC AUC. This analysis demonstrates that MLP, Random Forest, and Logistic Regression provided the best performance, with strong class separation shown by their ROC AUC values.
3.2. Results for Cancer Prediction Dataset (Diagnostic)
The classification performance of various resampling methods on the Cancer Prediction Dataset (Diagnostic) is compared in Table 3. Initially, the dataset exhibited a notable class imbalance, with an imbalance ratio of 59.07% (0: 943, 1: 557). In the Baseline results, where no resampling was applied, the accuracy was 89.59%, the F1 score 89.54%, and the ROC AUC 93.97%, indicating that the class imbalance negatively impacted the model’s performance. The RENN method achieved the highest performance, with a Mean value of 98.67%. The hybrid method SMOTEENN also demonstrated strong performance, with 97.90% accuracy, a 97.90% F1 score, and 99.65% ROC AUC. Hybrid methods such as SMOTEENN and SMOTETomek effectively managed class imbalance by combining over-sampling and under-sampling techniques. The IHT method yielded a Mean value of 95.07%, whereas under-sampling methods such as NearMiss, ClusterC, OSS, and CNN showed lower Mean values. Overall, the results highlight that resampling methods can significantly improve classification performance on imbalanced datasets; RENN and SMOTEENN provided the best performance, underscoring the importance of resampling techniques in handling imbalanced data.
Table 4 presents a performance evaluation in which a number of classification algorithms are ranked based on their mean values. The Easy Ensemble model achieved the highest performance, with a mean score of 94.35%. This model provided balanced and effective classification, reaching 93.27% Accuracy, 93.28% F1 Score, and 96.48% ROC AUC. Following this, Balanced Random Forest exhibited strong performance with a mean score of 93.93%, while Random Forest ranked third with a mean score of 93.82%, also delivering robust results. The lowest-performing model was Logistic Regression, which ranked last with a mean score of 88.04%. This model particularly struggled in terms of Accuracy (85.85%) and F1 Score (85.79%). Overall, this evaluation highlights that the Easy Ensemble and Balanced Random Forest models provided the best results, while the other models displayed lower performances in comparison.
3.3. Results for Lung Cancer Detection Dataset (Diagnostic)
The experiments conducted on the Lung Cancer Detection Dataset (Diagnostic) highlight the impact of class imbalance on classification performance. The imbalanced distribution within the dataset was addressed through various resampling methods, and the effects of these methods on model performance are presented in Table 5. Among hybrid sampling methods, SMOTEENN stood out with the highest Mean value of 99.64%. Similarly, IHT, an under-sampling method, performed well with a Mean value of 98.20%. Other methods, such as RENN and KMSMOTE, also delivered notable results, achieving Mean values of 97.01% and 96.14%, respectively. However, the Baseline, which involved classification without any resampling, showed lower performance, with a Mean value of 89.13%, underscoring the limits on model accuracy and overall performance when class imbalance is not addressed. Notably, some under-sampling methods, such as CNN and NearMiss, demonstrated lower performance, with Mean values of 71.50% and 70.63%, respectively. In conclusion, SMOTEENN, IHT, and RENN emerged as the most effective approaches for addressing class imbalance and improving classification accuracy.
Table 6 presents the performance of various classification algorithms ranked by their Mean values. The Random Forest model demonstrated the best performance with a Mean value of 93.49%. This model achieved 92.61% Accuracy, 92.44% F1 Score, and 95.43% ROC AUC, delivering strong classification results. Balanced Random Forest ranked second with a high performance, achieving a Mean value of 93.10%. The Support Vector Classifier (SVC) ranked third with a Mean value of 92.77%, demonstrating strong class distinction with a ROC AUC of 95.03%. Among the lower-performing models, Easy Ensemble achieved a Mean value of 91.02%, while Balanced Bagging and RUSBoost occupied the last positions with Mean values of 89.47% and 89.29%, respectively. These results highlight that the Random Forest and Balanced Random Forest models achieved the best performance, while other models performed slightly lower in comparison.
3.4. Results for Seer Breast Cancer Dataset (Prognostic)
The impact of various resampling methods on classification performance is examined using the Seer Breast Cancer Dataset (Prognostic), with results presented in Table 7. The dataset revealed a remarkable class imbalance, with an imbalance ratio of 18.08% (0: 3408, 1: 616), and the effectiveness of various resampling methods in addressing this imbalance was investigated. In the baseline scenario, where no resampling is applied, a Mean value of 84.61% is achieved, indicating lower performance and clearly demonstrating the value of resampling in addressing class imbalance. The hybrid SMOTEENN method stands out with a Mean value of 94.49%. Among under-sampling methods, IHT exhibits the second-best performance, with a Mean value of 94.40%. KMSMOTE is also noteworthy, with a Mean value of 93.50%, demonstrating effective handling of class imbalance. By contrast, some under-sampling methods, such as CNN and RUS, exhibit lower performance; CNN, with a Mean value of 72.24%, ranks at the bottom (see Table 7). The results emphasize that the effective use of resampling methods can substantially improve model performance; in particular, SMOTEENN, IHT, and KMSMOTE achieved high performance and successful outcomes in the classification tasks.
Table 8 presents the performance of various classification algorithms. The Random Forest model achieved the best performance, topping the list with a Mean value of 91.14%. Balanced Random Forest ranked second with a Mean value of 90.40%, particularly excelling in ROC AUC. XGBoost followed closely in third place, showing similar success with a Mean value of 90.34%. The weakest model was RUSBoost, which ranked last with a Mean value of 84.56%. Overall, Random Forest and Balanced Random Forest were the top-performing models, while the other models achieved lower results.
3.5. Results for Differentiated Thyroid Cancer Recurrence Dataset (Prognostic)
The experiments conducted on the Differentiated Thyroid Cancer Recurrence Dataset (Prognostic) comprehensively demonstrate the impact of various resampling methods on classification performance. Class imbalance played a significant role in this dataset, and the results presented in Table 9 show how the various methods managed it. The dataset exhibited an imbalance ratio of 39.27% in the baseline scenario, reflecting a notable disparity between classes. For the baseline case, where no resampling method was applied, a Mean value of 94.69% was obtained, indicating that the model’s performance was constrained by the imbalanced data. The best-performing method was SMOTEENN, with a Mean value of 98.60%; IHT followed closely, ranking second with a Mean value of 98.59%. Among the other under-sampling methods, RENN stood out with a Mean value of 97.20%. KMSMOTE performed well with a Mean value of 97.18%, while ROS and SMOTETomek also showed strong performances, with Mean values of 97.15% and 96.63%, respectively. By contrast, under-sampling methods such as ClusterC and RUS performed less effectively, with Mean values of 92.06% and 92.03%, demonstrating limited impact on the classification tasks. The CNN method recorded the lowest performance, with a Mean value of 87.65%.
Table 10 presents the performance of various classification algorithms ranked by their Mean values. The best-performing model is Random Forest, topping the list with a Mean value of 97.21%. XGBoost ranks second with a Mean of 97.06%, notably excelling in ROC AUC with a score of 99.12%. Balanced Random Forest also achieved a Mean of 97.06%, placing third with balanced results across metrics. Lower-performing models include RUSBoost, Support Vector Classifier (SVC), and Multi-Layer Perceptron (MLP), which scored Mean values of 95.29%, 94.25%, and 93.99%, respectively, below the average. Deep Neural Network (DNN) and Logistic Regression showed even lower performance, with Logistic Regression ranking last with a Mean of 91.82%. Overall, Random Forest, XGBoost, and Balanced Random Forest demonstrated the highest performance, while Logistic Regression performed the worst.
3.6. Overall Performance Evaluation of the Classifiers and Resampling Methods
At this stage of our analysis, we conducted a comprehensive evaluation of the average performance of the resampling methods across the five datasets: three diagnostic and two prognostic. The Mean value was calculated for each resampling method across all datasets. This approach provides a holistic view of model performance, allowing us to compare the effectiveness of the various resampling techniques; averaging these metrics helps evaluate each method’s performance under different data conditions, enhancing the generalizability of the results. The performance overview in Figure 1 offers a unified perspective on how the resampling methods manage class imbalance across all datasets. As shown, SMOTEENN, a hybrid sampling method, achieved the highest performance, with a Mean value of 98.19%. IHT, an under-sampling method that focuses on identifying and prioritizing the more challenging samples, ranked second with a Mean value of 97.20%. RENN, another under-sampling technique, ranked third with a Mean value of 96.48%. KMSMOTE (an over-sampling method) and SMOTETomek (a hybrid sampling method) followed, with Mean values of 95.52% and 95.01%, respectively. At the lower end of the ranking, RUS, NearMiss, and CNN (all under-sampling methods) recorded Mean values of 88.63%, 87.84%, and 80.78%, showing comparatively lower performance. Overall, the figure highlights the effectiveness of the various resampling methods, with SMOTEENN, IHT, and RENN emerging as the top performers. Calculating Mean values facilitated a comprehensive assessment of classification success across different conditions, offering valuable insight into the most effective methods for managing class imbalance.
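This aggregation amounts to a simple group-wise average, sketched below with a few of the Mean values reported in Sections 3.1 and 3.3; the column names are hypothetical, and the full computation covers all 19 methods over the five datasets.

```python
# A group-wise average of per-dataset Mean values, as used for Figure 1.
# Only four of the 95 resampler-dataset results are shown here, taken from
# the text above; column names ("dataset", "resampler", "Mean") are ours.
import pandas as pd

rows = [
    {"dataset": "WBCD", "resampler": "SMOTEENN", "Mean": 99.73},
    {"dataset": "Lung", "resampler": "SMOTEENN", "Mean": 99.64},
    {"dataset": "WBCD", "resampler": "IHT", "Mean": 99.73},
    {"dataset": "Lung", "resampler": "IHT", "Mean": 98.20},
]
overall = (pd.DataFrame(rows)
           .groupby("resampler")["Mean"].mean()
           .sort_values(ascending=False))
print(overall)  # SMOTEENN 99.685, IHT 98.965 on this subset
```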
The results of the ANOVA and Kruskal–Wallis tests indicate that the performance differences in resampling methods are statistically significant. The ANOVA test produced an F-statistic of 4.3223 and a p-value of 0.0181, suggesting that there is a significant difference in performance among at least some of the resampling methods. Similarly, the Kruskal–Wallis test yielded an H-statistic of 10.0332 and a p-value of 0.0066. These results emphasize that the performance differences are not due to random variation but rather stem from the effects of different resampling methods on specific datasets and classification tasks.
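Such tests can be run with SciPy as sketched below. How the Mean scores are grouped for the tests is not stated above, so the grouping and the numbers here are placeholders, not the study’s actual data.

```python
# Significance testing with SciPy, analogous to the tests reported above.
# Each list would hold the Mean scores of one group of resampling methods;
# the values below are placeholders for illustration only.
from scipy.stats import f_oneway, kruskal

group_a = [98.2, 97.2, 96.5]   # e.g., the strongest hybrid/under-sampling methods
group_b = [95.5, 95.0, 94.5]
group_c = [88.6, 87.8, 80.8]   # e.g., the weakest under-sampling methods

f_stat, p_anova = f_oneway(group_a, group_b, group_c)
h_stat, p_kw = kruskal(group_a, group_b, group_c)
print(f"ANOVA:          F = {f_stat:.4f}, p = {p_anova:.4f}")
print(f"Kruskal-Wallis: H = {h_stat:.4f}, p = {p_kw:.4f}")
```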
Figure 2 provides a summary view of each classifier’s performance across the five datasets. As shown, the Random Forest classifier achieved the highest performance, with a Mean value of 94.69%, indicating its robustness and adaptability across different dataset characteristics. Following closely, the Balanced Random Forest, which is designed to handle imbalanced datasets by re-sampling the data during training, recorded a Mean value of 94.43%. XGBoost, another powerful ensemble method, ranked third with a Mean value of 94.04%. At the lower end of the rankings, RUSBoost achieved a Mean value of 91.73%, still showing respectable performance. Lastly, Logistic Regression, a traditional but often reliable classifier, recorded a Mean of 91.19%. Overall, this figure emphasizes the strength of ensemble methods like Random Forest, Balanced Random Forest, and XGBoost, as they consistently ranked highest in terms of average performance.
The results obtained from the ANOVA and Kruskal–Wallis tests clearly indicate that the performance differences among the classifiers are statistically significant. The ANOVA test yielded an F-statistic of 26.6848 with a p-value below 0.0001, suggesting that at least one classifier performs significantly differently from the others. Additionally, the Kruskal–Wallis test provided an H-statistic of 19.3617 and a p-value of 0.0001, reinforcing the finding of significant rank-based differences. These findings emphasize that the classifiers’ performances vary and that the differences are not due to random chance but represent a meaningful effect. Analyzing such performance differences is therefore a crucial step in selecting the most effective classifier.
The results of this study demonstrate that the implemented resampling methods offer an effective strategy for addressing the imbalance between classes, significantly enhancing the performance of the classification models. Resampling methods have facilitated the balancing of representation ratios among classes, ensuring that the minority class is adequately represented. Consequently, this has allowed the models to learn the characteristics of the minority class during the training process, enabling more accurate identification of samples belonging to this class. Moreover, these techniques have reduced the risk of overfitting, leading to the attainment of more generalizable and robust results. By eliminating the excessive representation of the majority class, these methods have prevented the model from focusing solely on a specific class during training, thus allowing for an equitable learning of the characteristics of both classes.
3.7. Limitations and Validity
This study presents several limitations that must be considered when interpreting the results. First, the performance of classifiers depends heavily on the quality and diversity of the datasets used. While the selected cancer datasets are reputable, differences in data collection, sample sizes, and demographics may affect generalizability. The datasets primarily feature binary outcome variables and a limited number of predictors, representing only a subset of broader data modeling challenges. Expanding to more diverse datasets with multi-class outcomes and additional features is necessary for broader applicability. Addressing this will be crucial for ensuring more comprehensive and adaptable machine learning solutions in healthcare.
The study focuses on specific resampling techniques and a limited range of classifiers (Balancing Ensemble, Linear, Standard Ensemble, and Deep Learning), which could restrict the findings. Future research should explore a wider array of algorithms, including newer deep learning and hybrid models, to provide a more comprehensive analysis. Additionally, the lack of attention to model interpretability and transparency is a limitation, as understanding model predictions is crucial in clinical settings. Future work should prioritize both model diversity and interpretability to ensure clinical trust and adoption.
To strengthen validation, future work should involve more diverse datasets reflecting different cancer types, stages, and patient demographics. Cross-validation and external validation cohorts mirroring real-world settings would enhance generalizability. Furthermore, incorporating interpretability tools like SHAP or LIME can increase the practical relevance of models in healthcare, ensuring they meet clinical standards. These efforts are key to making machine learning models reliable and applicable in real-world clinical environments.
4. Conclusions
In this study, the performance of various classification algorithms and resampling methods was comprehensively evaluated across three diagnostic and two prognostic datasets. The results obtained demonstrate the effectiveness of both machine learning classifiers and resampling techniques in managing the issue of class imbalance. A total of 19 different resampling methods from three distinct categories were employed in this study. SMOTEENN emerged as the most successful method with a mean value of 98.19%, followed by IHT at 97.20%. RENN ranked third with a mean value of 96.48%. By contrast, the baseline (no resampling) method achieved only 91.33%, underscoring the importance of resampling methods in enhancing model performance on imbalanced datasets. The findings of this study illustrate the effectiveness of implementing resampling techniques in addressing the challenges of imbalance and improving classification success.
A total of 10 algorithms from four different categories were utilized as classifiers in this study. According to the obtained findings, Random Forest achieved the highest mean value of 94.69%, demonstrating the effectiveness of the bagging method. Balanced Random Forest followed closely with a mean value of 94.43%, showcasing its capability to handle imbalanced data. XGBoost ranked third with a mean value of 94.04%, recognized for its optimization and high performance as a gradient boosting method. These results emphasize the role of bagging and boosting techniques in enhancing model performance on classification problems with imbalanced datasets. This study provides significant contributions by comprehensively evaluating the performance of various classification algorithms and resampling methods across five different cancer datasets.
The findings highlight the crucial role of resampling methods in improving model performance on imbalanced datasets, offering a foundation for future research. Future studies should assess different classification algorithms and resampling techniques across various datasets, including both diagnostic and prognostic data, to tackle class imbalance comprehensively. Developing hybrid or optimized versions of ensemble and deep learning models could lead to higher accuracy and faster learning. Testing these methods in clinical applications is key to proving their effectiveness in real-world healthcare settings. Additionally, focusing on model interpretability and reliability is essential for ensuring healthcare professionals can safely utilize and trust model outcomes. These recommendations provide a roadmap for enhancing machine learning’s role in healthcare and its integration into clinical practice.