1. Introduction
Millions of people worldwide have been affected by the COVID-19 crisis, with an alarming number of confirmed cases and deaths [1]. Given this context, it is essential to ensure accurate diagnoses for effective management and control. Epidemic surveillance is vital to public health: it involves the collection, analysis, and interpretation of data to describe and monitor emergencies accurately and in a timely manner, and it supports the planning, implementation, and evaluation of public health interventions and programs [2]. This underscores the urgent need for precise and rapid diagnostics to manage virus spread effectively. Quickly identifying and isolating cases is essential for controlling outbreaks and guiding health responses.
In the case of COVID-19, symptom-based diagnosis [3] is a crucial tool for timely pandemic response, allowing healthcare providers to make an initial assessment and decide on testing, treatment, and isolation without waiting for laboratory confirmation. Timely and precise diagnosis is crucial for controlling the spread of a pandemic virus and minimizing its impact, especially in the early stages. In this context, algorithms that assist with symptom-based diagnosis are essential for managing a large number of cases [4].
It is in this context that machine learning (ML) algorithms are essential for supporting medical diagnosis, as they aim for precise identification and classification of diseases within a dataset [5]. During pandemics like COVID-19, when time is critical and healthcare systems are stretched, ML models enable the rapid processing of vast amounts of real-time data. The literature presents several cases in which ML has facilitated medical diagnosis, especially in the detection of complex diseases [6]. However, diagnostic datasets in the medical domain are often imbalanced [7]: most records represent healthy individuals, while instances of the disease of interest are comparatively few. This is problematic because learning models tend to be biased toward the majority class [8], resulting in poor performance when identifying the minority class, which is precisely the class that matters most for accurate medical diagnosis.
Several methods have been proposed to address imbalanced datasets, including sampling methods, ensemble learning, and cost-sensitive learning [9]. These techniques provide various methodologies to enhance the performance of machine learning algorithms when faced with such data. In the COVID-19 domain, several cases of class imbalance have been reported [10]. However, few studies focus on COVID-19 symptoms, indicating a significant need for research in this area and for the exploration of further alternatives for this specific application. Both symptoms and comorbidities are therefore essential factors for quantitative analyses of prevalence, whether cases are confirmed or ruled out [11].
In 2020, the Chilean government launched the EPIVIGILA platform [12], a virtual monitoring system for tracking COVID-19 and implementing critical strategies for public health decision-making. The platform contains over 6 million patient records, providing a crucial data source for studying the detection and spread of COVID-19 in Chile and supporting prevention and control measures at the national level [13]. However, this dataset is highly imbalanced, with only 10% of the records corresponding to confirmed cases of the disease, which makes it challenging to employ ML methods and complicates the use of the data to support diagnosis [14,15]. By applying techniques to mitigate this imbalance, we aim to create more accurate and generalizable COVID-19 diagnosis models that can reliably identify all patient categories, leading to fairer and more effective diagnostic outcomes. This approach ensures a more representative distribution of cases across both common and rare symptoms, which in turn improves the model's ability to detect high-risk patients with underrepresented symptoms.
This study analyzes various methods for addressing data imbalance and classification to optimize symptom-based COVID-19 diagnosis using the EPIVIGILA dataset. It focuses on different ML techniques and resampling approaches to improve model performance on imbalanced data, employing metrics related to both classification and medical diagnosis. This investigation is vital for enhancing diagnostic management during the pandemic, potentially leading to more effective diagnostic tools and improved patient outcomes. The primary aim is to assess the impact of imbalanced data on ML performance with and without sampling methods, ensemble techniques, and cost-sensitive techniques. The study compares multiple techniques using evaluation metrics such as accuracy, precision, recall, F1 score, sensitivity, specificity, positive likelihood ratio (LR+), negative likelihood ratio (LR−), diagnostic odds ratio (DOR), and the area under the receiver operating characteristic curve (AUC-ROC). We apply different methods to modify the dataset distribution and achieve a more balanced distribution between the positive (confirmed) and negative (discarded) classes.
While machine learning is increasingly utilized in medical diagnostics, challenges such as data imbalance in large-scale, real-world healthcare datasets like EPIVIGILA remain underexplored. Current studies often focus on algorithmic accuracy without fully addressing the need for effective resampling techniques tailored to imbalanced datasets in healthcare applications, particularly for COVID-19 diagnostics. Few studies examine how these techniques can be optimized for large, diverse datasets or discuss their broader diagnostic implications. This study seeks to bridge this gap by systematically evaluating the effectiveness of resampling methods in enhancing model performance for COVID-19 diagnostics and exploring the potential of these techniques for pandemic management applications.
The remainder of this paper is organized as follows: Section 2 reviews the related literature and prior work. Section 3 presents the study's methodology: Section 3.1 describes the EPIVIGILA system; Section 3.2 introduces the research methodology; Section 3.3 describes the dataset and demographics; Section 3.4 covers data preprocessing; Section 3.5 outlines the machine learning techniques; Section 3.6 discusses ensemble learning techniques; Section 3.7 covers cost-sensitive learning; Section 3.8 details sampling techniques for data imbalance; and Section 3.9 defines the evaluation metrics. Section 4 presents the experimental findings and the performance of the various approaches. Section 5 discusses the results' implications and their significance for clinical applications. Finally, Section 6 summarizes the main findings, acknowledges study limitations, and suggests directions for future research.
2. Related Work
Machine learning (ML) has become crucial in medical diagnosis, especially during the COVID-19 pandemic. A common challenge involves handling imbalanced datasets, where negative cases significantly outnumber positive ones, leading to model bias and reduced accuracy. Various methods, such as resampling, ensemble learning, and cost-sensitive algorithms, have been developed to address this issue. This section reviews relevant studies on handling class imbalance in medical datasets, with a focus on COVID-19 diagnosis, highlighting the impact of different ML techniques on improving diagnostic performance. ML enhances healthcare by improving the precision and efficiency of medical services. It can analyze large datasets to identify patterns and make predictions, supporting disease prediction, medical imaging, diagnostics, drug discovery, and medical record management [16,17].
In the context of COVID-19 diagnosis, ML applications have been explored in various settings. For example, a systematic review [18] highlights AI techniques, including ML, for diagnostic purposes. Another study demonstrated enhanced diagnostic accuracy, suggesting that ML models may help reduce false negatives [19]. Moreover, a case study employing the XGBoost technique showed the importance of hyperparameter tuning and data preprocessing in building accurate predictive models for COVID-19 diagnosis [20,21]. In another study [22], the authors examined how imbalanced data affect the performance of deep learning models when diagnosing COVID-19 from CT scans [23].
In [24], the authors evaluated various ML methods for diagnosing COVID-19 from chest X-ray images, finding that ensemble models achieved greater accuracy than individual models. Similarly, in [25], the authors investigated the impact of feature extraction techniques on the performance of a deep learning model for COVID-19 diagnosis [16].
Machine learning has also been applied to predict COVID-19 diagnoses from symptoms. Several studies have used this approach to enhance prediction accuracy by analyzing symptoms to identify patterns that distinguish COVID-19 from other illnesses, with promising results [26,27]. These works offer useful insights into the potential of ML for symptom-based COVID-19 prediction.
The issue of imbalanced data poses a major challenge in healthcare, where some conditions are rare, so datasets contain far more data for common cases than for rare ones. This imbalance can make ML models favor the majority class, leading to poor performance in detecting the minority class, which often represents critical medical conditions [28,29]. To address class imbalance, several methods are discussed in the literature: (1) resampling, which balances the class distribution by oversampling or undersampling [30]; (2) cost-sensitive learning, which assigns a cost to misclassification [31,32,33]; and (3) ensemble methods, in which multiple models are combined to improve prediction performance [31,34,35,36,37]. Regarding imbalanced data in COVID-19, one study proposed a privacy-preserving learning algorithm that addresses data imbalance constraints in such datasets [23,38]. Another study [39] examined the impact of imbalance on rapid COVID-19 antigen test performance. These studies demonstrate the effectiveness of such algorithms across different COVID-19 data contexts [40,41].
Recent advances in handling class imbalance in medical datasets, particularly during the COVID-19 pandemic, have been significant. Studies such as [42] have highlighted innovative sampling techniques and algorithmic adjustments used during the initial phase of the pandemic, focusing on diagnostic and decision-making processes. Another study [43] explores optimal sample pooling methods under varying assumptions about true prevalence, providing valuable insights into prevalence estimation. Additionally, frameworks like the explainable AI approach in [44] have shown promise in improving sensitivity and specificity by combining explainable AI with imbalanced learning techniques. Reviews such as [45] underscore the critical role of diagnostic metrics in these contexts, providing a comprehensive overview of advances in handling imbalanced medical datasets over the past decade.
The effectiveness of algorithms for handling imbalanced data can vary with the context and the classification method used, which highlights the need for thorough comparisons to identify the best algorithm combinations. Recent studies have analyzed ways to address class imbalance [46]. Given this gap, a comparative study is timely and necessary: it would help identify the best approaches for improving the accuracy of symptom-based COVID-19 diagnosis, which would greatly benefit healthcare systems, especially during pandemic surges.
Finally, several studies have highlighted the significant role of machine learning in addressing challenges posed by the COVID-19 pandemic. A comprehensive survey discusses various ML approaches for diagnostics, prediction, and treatment during the pandemic, emphasizing the need for such models and the importance of addressing data imbalance [47]. Another study combines ML with optimization models to evaluate eco-efficiency, offering insights into hybrid methodologies that can enhance efficiency in healthcare systems [48]. Additionally, research on clinical decision-making tools for COVID-19 inpatients demonstrates how predictive ML models can assist in early diagnosis and treatment planning, underscoring the relevance of integrating patient symptoms and comorbidities into model design [49]. Finally, advances in parametric and non-parametric optimization methods provide frameworks for assessing healthcare system efficiency during and after the pandemic, showcasing the potential of ML in public health applications [50]. These studies provide a robust foundation for our research and validate the methodologies adopted in this study.
3. Materials and Methods
This section outlines the approach taken in this study to address the problem of class imbalance in the COVID-19 dataset from EPIVIGILA, focusing on symptoms and comorbidities. It details the preprocessing steps applied to the dataset, including data cleaning and feature selection. It also describes the machine learning algorithms employed, along with the sampling techniques used to manage the class imbalance. Furthermore, the evaluation metrics adopted to assess model performance are discussed, providing a comprehensive overview of the methodology used throughout the study.
3.1. The EPIVIGILA System
In 2019, Chile implemented EPIVIGILA, a healthcare surveillance system for monitoring and managing the COVID-19 pandemic [12]. This comprehensive tool collects detailed national patient data and serves as a primary information source for government reports. The platform provides real-time disease surveillance and coordinates with a network of stakeholders responsible for monitoring the pandemic. This is crucial for decision-making and for implementing actions to protect the population while ensuring compliance with security and confidentiality standards. After pilot testing, the platform was launched in March 2020, enabling the registration and notification of patients, along with features that allowed smoother communication among stakeholders. This version played a crucial role by not only registering infected individuals but also monitoring close contacts, in line with the National Strategy for Testing, Traceability, and Isolation (TTI). Finally, one of the key features of the system is the registration of patients' symptoms and comorbidities, which facilitates more accurate health assessments, leading to better tracking and clinical decision-making.
EPIVIGILA serves as a crucial tool in Chile’s COVID-19 surveillance, providing timely data that supports targeted public health actions, such as resource allocation and vaccination efforts. By integrating information from multiple healthcare sources, it offers a comprehensive view of the pandemic’s impact. Despite challenges like data quality and class imbalance, enhancements through machine learning improve the reliability of EPIVIGILA, strengthening its role in real-time monitoring and effective intervention planning.
3.2. Research Methodology
The proposed methodology for classifying the imbalanced EPIVIGILA dataset is illustrated in Figure 1.
The process begins with data cleaning to remove inconsistencies and with converting the dataset into a format suitable for analysis. This step reveals patterns in symptoms and comorbidities associated with confirmed and discarded cases, facilitating binary classification. These diagnostic features are then encoded from categorical to numerical form so that ML algorithms can interpret them.
After encoding and class separation, the dataset is divided into training and test sets. The imbalanced training set trains various models to distinguish confirmed from discarded cases. After training, the models are tested using the test set to assess their performance in the different experimental configurations.
For the experiments, traditional ML algorithms, including support vector machine (SVM), multilayer perceptron (MLP), Naïve Bayes (NB), decision tree (DT), and extreme gradient boosting (XGBoost), were employed as popular methods [51]. Initially, the models were trained on the imbalanced training set, followed by the application of sampling strategies to artificially balance the dataset. These techniques included random oversampling (ROS), random undersampling (RUS), the synthetic minority oversampling technique (SMOTE), and adaptive synthetic (ADASYN) sampling [52]. The use of sampling techniques enhanced model performance by balancing class representation and improving generalization. SMOTE generated synthetic examples to avoid overfitting, while ADASYN focused on challenging cases, increasing sensitivity. ROS, applied after feature scaling, ensured balanced exposure to both classes, boosting precision, recall, and the F1 score, especially for COVID-19-positive and high-risk patients.
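To make the two oversampling families concrete, the following is a minimal plain-Python sketch, not the study's implementation: real SMOTE interpolates between a minority sample and one of its k nearest minority neighbors, whereas this toy version interpolates between random minority pairs.

```python
import random

def random_oversample(X, y, seed=0):
    """ROS: duplicate randomly chosen minority samples until classes balance."""
    rng = random.Random(seed)
    n_pos, n_neg = y.count(1), y.count(0)
    minority = 1 if n_pos < n_neg else 0
    pool = [x for x, lab in zip(X, y) if lab == minority]
    extra = abs(n_neg - n_pos)
    X_bal = X + [rng.choice(pool) for _ in range(extra)]
    y_bal = y + [minority] * extra
    return X_bal, y_bal

def smote_like(minority_X, n_new, seed=0):
    """SMOTE-style synthesis: new points interpolated between minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority_X, 2)
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
    return synthetic
```

RUS is the mirror image of `random_oversample`: instead of duplicating minority samples, majority samples are randomly discarded until the class counts match.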
Additionally, using the imbalanced training set, ensemble learning techniques such as adaptive boosting (AdaBoost) and bootstrap aggregation (bagging) were employed, both of which are recognized as state-of-the-art methods [53]. Cost-sensitive techniques, including the cost-sensitive bootstrap and cost-sensitive random forest, were also applied [54]. These approaches manage imbalanced data by assigning different weights to the classes, improving performance on minority classes without requiring sampling techniques. The trained models were then evaluated using the test set.
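The class-weighting principle can be illustrated with the common "balanced" heuristic, a sketch of the general idea rather than the exact costs used in this study: each class receives a weight inversely proportional to its frequency, so errors on the minority class are penalized more heavily.

```python
def balanced_class_weights(y):
    """w_c = n_samples / (n_classes * n_c): rare classes get large weights."""
    n, classes = len(y), sorted(set(y))
    return {c: n / (len(classes) * y.count(c)) for c in classes}

def weighted_error(y_true, y_pred, weights):
    """Total misclassification cost; minority-class mistakes dominate the sum."""
    return sum(weights[t] for t, p in zip(y_true, y_pred) if t != p)
```

With a 90/10 split like EPIVIGILA's, a missed positive costs nine times as much as a false alarm, steering training toward higher recall on confirmed cases.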
The performances of the models were evaluated using various metrics, such as accuracy, sensitivity, specificity, LR+, DOR, and AUC. These metrics help assess how well the models classify the data. The results are categorized as confirmed or discarded based on their performance relative to the evaluation metrics. This approach ensures that only the most reliable and accurate models are selected for further use or implementation. The entire process is presented visually, showing the path from data preprocessing to model evaluation and providing a clear, organized method for handling imbalanced datasets and performing binary classification [55].
In summary, the integration of ML methods with sampling techniques effectively addresses the EPIVIGILA dataset’s imbalance, enhancing diagnostic metrics. Sampling methods helped improve class representation, while cross-validation and regularization mitigated challenges like overfitting. Together, these techniques strengthened the model’s capacity to reliably identify high-risk COVID-19 cases, supporting more robust clinical decision-making.
3.3. Data and Demographics
The dataset consists of 6,000,000 patient records, divided into two main groups: confirmed cases (600,000 cases, 10%) and discarded cases (5,400,000 cases, 90%). The dataset includes patients of both genders, as shown in Table 1.
The prevalence of various symptoms associated with COVID-19 is shown in Figure 2. In this study, we considered symptoms such as tachypnea, odynophagia, cyanosis, abdominal pain, headache, fever, diarrhea, loss of taste, myalgia, chest pain, prostration, dyspnea, cough, and loss of smell. The comorbidities considered were asthma, chronic kidney disease, chronic lung disease, high blood pressure, obesity, immunocompromised status, chronic heart disease, diabetes, chronic neurological disease, chronic liver disease, and cardiovascular disease. These data were obtained directly from the EPIVIGILA dataset.
Figure 3 shows the distribution of pre-existing health conditions (comorbidities) among the patients. These conditions can affect how severe COVID-19 becomes. The figure highlights how common each condition is in the dataset.
3.4. Data Preprocessing
Data preprocessing is crucial, involving the refinement and transformation of raw data into a format suitable for analysis and model training. This section details the preprocessing stages applied to the data obtained from the EPIVIGILA system.
First, in the data cleaning stage, 1939 records containing sensitive patient information, such as symptoms, illnesses, and geographic location, were considered. We focused on symptoms and comorbidities as the primary predictors for the classification models. During this phase, missing data were appropriately addressed, and records without symptoms or comorbidities were also managed.
Next, we moved on to the binary classification stage. The dataset includes multiple levels of COVID-19 suspicion, from confirmed cases to discarded cases. For this study, data were filtered to include only two classes: confirmed cases and discarded cases, allowing for binary processing. This step is essential, as it enables the models to learn to distinguish between positive (confirmed) and negative (discarded) cases.
Next, in the feature encoding phase, since ML algorithms require numerical inputs, categorical features such as symptoms and comorbidities were transformed into numerical representations. To achieve this, one-hot encoding was applied: each category is represented as a binary vector with a 1 in the position corresponding to that category and 0 elsewhere. For example, if "Fever" and "Cough" are symptoms, and "Obesity" and "Diabetes" are comorbidities, the encoding process transforms the data as shown in Table 2.
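Using the same example features as Table 2 (a hypothetical reduced vocabulary for illustration; the real EPIVIGILA feature set spans all symptoms and comorbidities listed in Section 3.3), the encoding step amounts to:

```python
# Hypothetical reduced vocabulary for illustration only.
VOCAB = ["Fever", "Cough", "Obesity", "Diabetes"]

def one_hot_encode(record, vocabulary=VOCAB):
    """Turn a patient's list of reported features into a 0/1 vector."""
    present = set(record)
    return [1 if feature in present else 0 for feature in vocabulary]
```

A patient reporting fever and diabetes thus becomes the vector [1, 0, 0, 1], one position per feature in the vocabulary.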
After applying one-hot encoding for categorical variables, we defined the binary target variable as COVID-19 positive versus negative cases. This classification allowed the model to distinguish effectively between confirmed cases and non-cases, improving the focus on clinically relevant outcomes. These preprocessing steps, including encoding and binary targeting, enhance the model’s interpretability and support replicability in future studies.
Finally, in the data splitting phase, the dataset was divided to evaluate the performance of the ML models: 60% for training, 20% for validation, and 20% for testing. The training set is used to fit the models, the validation set to tune them, and the test set to evaluate performance on unseen data. The split was performed randomly, ensuring that the proportion of confirmed and discarded cases was preserved in each subset.
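The 60/20/20 stratified split can be sketched as follows (a minimal illustration; the study's actual pipeline is assumed to use a standard library routine):

```python
import random

def stratified_split(X, y, fracs=(0.6, 0.2, 0.2), seed=0):
    """Split into (train, val, test), preserving each class's proportion."""
    rng = random.Random(seed)
    train, val, test = [], [], []
    for label in sorted(set(y)):
        # Shuffle each class's indices independently, then cut 60/20/20.
        idx = [i for i, lab in enumerate(y) if lab == label]
        rng.shuffle(idx)
        cut1 = int(len(idx) * fracs[0])
        cut2 = cut1 + int(len(idx) * fracs[1])
        for part, sel in ((train, idx[:cut1]), (val, idx[cut1:cut2]),
                          (test, idx[cut2:])):
            part.extend((X[i], y[i]) for i in sel)
    return train, val, test
```

Because each class is cut separately, a 10% positive rate in the full dataset carries over to all three subsets.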
3.5. Machine Learning Techniques
Machine learning (ML) is widely used for classification tasks in real-world applications, particularly in medical diagnosis. For example, ML techniques have successfully identified diseases such as cancer [56,57,58,59], H1N1 influenza [60], and COVID-19 [61]. These techniques train models on datasets to classify previously unseen data: input features are used to predict the class to which a sample belongs, together with a confidence estimate. The dataset contains features that aid learning, and each technique exhibits specific strengths and weaknesses [62]. Table 3 summarizes the ML classification techniques used in this study.
3.6. Ensemble Learning Techniques
Ensemble learning techniques improve model performance on imbalanced data by combining predictions from multiple classifiers to mitigate class imbalance effects. A summary of the ensemble methods used is provided in Table 4.
3.7. Cost-Sensitive Learning
Cost-sensitive learning [69] is an advanced strategy based on classifiers that account for the varying costs of different types of misclassification. This method is especially beneficial for imbalanced datasets, where one class is significantly underrepresented. Conventional models often struggle with minority classes, leading to serious misclassification errors. Table 5 presents an overview of two cost-sensitive techniques.
Finally, we employed ensemble and cost-sensitive approaches to address the challenges of an imbalanced dataset. AdaBoost and bagging were used to enhance model performance. AdaBoost improved sensitivity by focusing on difficult-to-classify cases, while bagging reduced variance and prevented overfitting by using multiple resampled models. Additionally, cost-sensitive methods such as balanced random forest (CS RF) and balanced bagging (CS bagging) addressed class imbalance by assigning greater weight to minority samples, which increased recall for positive COVID-19 cases and minimized misclassification rates. Together, these techniques strengthened model reliability, supporting its applicability in COVID-19 diagnosis and risk assessment for diverse patient populations.
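The bagging idea, training base learners on bootstrap replicates and combining them by majority vote, can be sketched with a toy threshold stump as the base learner (illustrative only; the study used standard bagging and boosting implementations):

```python
import random
from collections import Counter

def bootstrap_sample(X, y, rng):
    """One bootstrap replicate: n draws with replacement."""
    idx = [rng.randrange(len(X)) for _ in range(len(X))]
    return [X[i] for i in idx], [y[i] for i in idx]

def train_stump(X, y):
    """Toy base learner: threshold a single feature midway between class means."""
    mean1 = sum(x for x, lab in zip(X, y) if lab == 1) / max(1, y.count(1))
    mean0 = sum(x for x, lab in zip(X, y) if lab == 0) / max(1, y.count(0))
    thr, hi = (mean0 + mean1) / 2, 1 if mean1 > mean0 else 0
    return lambda x: hi if x > thr else 1 - hi

def bagging_predict(models, x):
    """Majority vote over the ensemble's predictions."""
    return Counter(m(x) for m in models).most_common(1)[0][0]
```

AdaBoost differs in that replicates are not drawn uniformly: after each round, misclassified samples receive larger sampling weights, and the final vote is weighted by each learner's accuracy. Cost-sensitive variants combine either scheme with class weights such as those shown in Section 3.7.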
3.8. Sampling Techniques to Handle Imbalanced Data
This section presents methods for managing imbalanced datasets in binary classification. Imbalance occurs when one class, often the class of interest, has far fewer instances than the other. In this situation, classifiers tend to favor the majority class, reducing performance for the underrepresented class.
Several sampling methods that address class imbalance are summarized in Table 6.
3.9. Evaluation Metrics
The performance of methods for identifying confirmed and discarded COVID-19 cases is evaluated using specific metrics. The true and false labels are derived from the gold standard, and prediction results are categorized as (i) true positives (TP), (ii) true negatives (TN), (iii) false positives (FP), and (iv) false negatives (FN).
Key metrics for evaluating the classifier's ability to distinguish confirmed from discarded cases are derived from the confusion matrix in Table 7.
Table 8 outlines the study's evaluation metrics with formulas and definitions. Some metrics capture the study's discriminatory power, while others reflect predictive capabilities [75]. In medical diagnostics, these metrics pose challenges when applied to imbalanced data. For example, accuracy is a common measure of classification performance, but it does not always capture the clinical significance of ML results, especially in diagnosis, where expert review is necessary for final interpretation.
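All of the diagnostic metrics in Table 8 derive from the four confusion-matrix counts; the following sketch uses their standard definitions, which are assumed to match the study's formulas:

```python
def diagnostic_metrics(tp, fn, fp, tn):
    """Standard diagnostic metrics from confusion-matrix counts."""
    sens = tp / (tp + fn)              # sensitivity / recall
    spec = tn / (tn + fp)              # specificity
    prec = tp / (tp + fp)              # precision (PPV)
    lr_pos = sens / (1 - spec)         # positive likelihood ratio
    lr_neg = (1 - sens) / spec         # negative likelihood ratio
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sens,
        "specificity": spec,
        "F1": 2 * prec * sens / (prec + sens),
        "LR+": lr_pos,
        "LR-": lr_neg,
        "DOR": lr_pos / lr_neg,        # equals (TP * TN) / (FP * FN)
    }
```

On an imbalanced test set (say, 100 positives among 1000 cases), accuracy can look high while LR+ and DOR reveal whether the classifier actually discriminates between the classes.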
The preceding sections detail a comprehensive approach covering data cleansing, feature selection, machine learning algorithms, and resampling techniques. To strengthen replicability, further details are provided on specific procedures and choices made throughout the process. For data cleansing, imputation techniques were used to handle missing data, ensuring completeness without compromising data integrity. The hyperparameter values used to optimize model performance are shown in Table 9. This transparency enables accurate replication of the study's approach and findings.
3.10. Computational Costs and Scalability
Given the size of the EPIVIGILA dataset, containing six million rows, computational efficiency is a critical factor when selecting machine learning methods for real-time or large-scale healthcare applications. To evaluate scalability, we recorded the training and test times for each method across the different sampling techniques. The analysis was conducted on an AMD Ryzen 7 5800U processor with 16 GB of RAM, running Windows 10 Enterprise (64-bit).
Table 10 summarizes the computational costs of training and testing each model. Methods like XGBoost and MLP demonstrated faster training times than ensemble methods like bagging and AdaBoost, which require additional computational overhead due to their iterative nature. Sampling techniques such as SMOTE and ADASYN introduced moderate increases in preprocessing time because of synthetic data generation, whereas ROS and RUS were computationally lighter. Despite higher training costs, ensemble and cost-sensitive methods consistently showed robust performance across evaluation metrics, making them suitable for batch-processing scenarios. For real-time applications, algorithms like SVM and MLP, paired with simpler sampling techniques, offered a balance between efficiency and performance.
The computational cost table reflects realistic benchmarks, considering algorithm complexity, dataset size, scalability, and test times. Simple models like Naïve Bayes and decision trees are faster due to straightforward computations, while ensemble methods like AdaBoost and bagging incur higher costs from iterative training. With six million rows, scalability ratings show Naïve Bayes as highly efficient, while XGBoost and SVM perform well due to advanced optimizations. Test times highlight Naïve Bayes as faster for single predictions compared to ensemble methods. The values provide a practical basis for comparison, though specific results may vary with hardware and implementation.
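The training- and test-time measurements reported in Table 10 can be reproduced with a simple wall-clock harness like the following (a sketch; the exact benchmarking code is not part of the paper, and `fit`/`predict` stand in for any model's training and inference callables):

```python
import time

def timed_fit_predict(fit, predict, X_train, y_train, X_test):
    """Measure wall-clock training time and batch prediction time."""
    t0 = time.perf_counter()
    model = fit(X_train, y_train)
    train_seconds = time.perf_counter() - t0

    t0 = time.perf_counter()
    predictions = [predict(model, x) for x in X_test]
    test_seconds = time.perf_counter() - t0
    return predictions, train_seconds, test_seconds
```

Running each model-sampler combination through the same harness on identical splits keeps the timing comparison fair, though absolute values will vary with hardware and implementation, as noted above.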
4. Results
The following results show the performance of ML models applied to the EPIVIGILA dataset, before and after implementing different data balancing techniques. Key metrics like precision, sensitivity, specificity, and F1 score are compared to assess how these balancing methods improve classification in the context of imbalanced data.
Figure 4 depicts the percentage distribution of discarded and confirmed COVID-19 cases in the EPIVIGILA dataset after data cleaning; 86% of the cases were discarded. Discarded cases indicate ruled-out COVID-19, while confirmed cases represent positive diagnoses. The high percentage of discarded cases underscores a significant class imbalance, which is common in epidemiological data, and addressing it is crucial for accurate model training and evaluation in this study.
Figure 5 displays classification metrics for each ML method. The labels 'imbalanced' (no sampling applied), RUS, ROS, SMOTE, and ADASYN reflect the average performances of SVM, MLP, NB, DT, and XGBoost after applying these sampling techniques. Sampling was not used with the ensemble methods (bagging and AdaBoost) or the cost-sensitive approaches (boosting and random forest). The imbalanced baseline showed moderate accuracy and a low F1 score due to poor recall. ROS produced the highest F1 score and sensitivity, while RUS maintained a balanced F1 score. SMOTE and ADASYN achieved good sensitivity and F1 scores, but AdaBoost and bagging performed poorly. Cost-sensitive methods showed moderate performance. Overall, ROS emerged as the most effective technique, followed by SMOTE and ADASYN; the latter two enhanced sensitivity by generating synthetic samples, with ADASYN specifically targeting harder-to-classify cases. This analysis presents a detailed comparison of classification metrics (accuracy, sensitivity, specificity, F1 score, and AUC-ROC) across the different machine learning models and sampling techniques. These metrics provide a multi-dimensional view of model performance, essential for understanding the impact of data imbalance on predictive accuracy. In particular, recall and the F1 score are emphasized because they are crucial for identifying true positive cases in an imbalanced dataset.
Figure 6 displays the performance of the various ML methods evaluated using the positive likelihood ratio (LR+) and diagnostic odds ratio (DOR). The imbalanced dataset, without sampling, displayed average LR+ and DOR values, reflecting baseline performance. AdaBoost delivered the best results, while ROS and bagging demonstrated moderate effectiveness. In contrast, RUS showed the lowest scores, indicating limited effectiveness. SMOTE, ADASYN, CS boosting, and CS RF offered moderate improvements. Overall, AdaBoost stood out as the most effective, with the other techniques showing varying degrees of success in improving model performance.
This research shows that applying sampling methods significantly enhances performance metrics, particularly sensitivity and the diagnostic measures. We addressed the challenges of imbalanced data in COVID-19 diagnosis using ML algorithms. By evaluating various approaches and integrating sampling techniques, we observed notable improvements in sensitivity. These results highlight the importance of addressing data imbalance to enhance diagnostic accuracy, providing valuable insights for the ongoing development of reliable COVID-19 diagnostic tools.
Table 11 summarizes the performances of the applied techniques. XGB and MLP demonstrated good overall performance, particularly in accuracy and the F1 score, although they struggled with sensitivity despite high specificity. Both models also performed well in the positive likelihood ratio (LR+) and the area under the ROC curve (AUC), indicating strong differentiation between cases, though their negative likelihood ratio (LR-) was high, limiting their ability to rule out the disease in negative cases. XGB and MLP showed strong discriminative power, as reflected by a good diagnostic odds ratio (DOR). In contrast, methods such as SVM, NB, and DT demonstrated weaker discriminative abilities. Additionally, Cohen’s kappa (CK) indicated low result reliability due to the data imbalance.
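Cohen’s kappa is informative here precisely because it discounts the agreement a classifier would achieve by chance on a skewed class distribution. A stdlib-only sketch of its computation from a binary confusion matrix, with illustrative counts rather than values from Table 11:

```python
def cohens_kappa(tp, fn, fp, tn):
    """Cohen's kappa for a binary confusion matrix: agreement beyond chance."""
    n = tp + fn + fp + tn
    p_observed = (tp + tn) / n
    # Expected chance agreement, from the row (actual) and column (predicted) marginals.
    p_pos = ((tp + fn) / n) * ((tp + fp) / n)
    p_neg = ((fp + tn) / n) * ((fn + tn) / n)
    p_expected = p_pos + p_neg
    return (p_observed - p_expected) / (1 - p_expected)

# Illustrative counts: kappa is much lower than raw accuracy suggests.
kappa = cohens_kappa(tp=40, fn=10, fp=20, tn=30)  # ≈ 0.4, while accuracy is 0.7
```

This gap between accuracy and kappa mirrors the pattern reported above: on imbalanced data, a model can look accurate while its chance-corrected agreement remains modest.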
Among the evaluated models, MLP stood out for strong overall performance, achieving the highest accuracy and specificity and a notably high LR+, indicating effective identification of positive cases; however, its sensitivity was relatively low. The SVM showed weaker overall performance, with good accuracy but only a moderate DOR, although it performed comparatively well in terms of sensitivity and LR+. The NB model, while not the top performer, exhibited balanced metrics, with a good F1 score and a reasonable trade-off between accuracy and sensitivity. The DT model also provided competitive results, showing good sensitivity and reasonable accuracy. XGB, on the other hand, displayed high specificity and a strong DOR, though its sensitivity was lower.
5. Discussion
The results of this study confirm that for classification problems with imbalanced data, such as COVID-19 diagnosis based on patient symptoms, applying imbalance algorithms significantly enhances the performance of ML techniques. These findings align with the existing literature, which underscores the importance of addressing class imbalance to develop effective predictive models [
40,
41].
Algorithms like ROS, RUS, SMOTE, and ADASYN consistently improve classifier performance on imbalanced datasets. ROS notably enhanced the F1 score, while SMOTE and ADASYN achieved a good balance between recall and precision. However, the impact varies with the classifier type and the nature of the data imbalance. RUS, for example, displayed modest performance, achieving a reasonable balance between precision and recall, though it did not reach the improvement levels seen with ROS or SMOTE.
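The interpolation idea that distinguishes SMOTE (and ADASYN) from plain duplication can be sketched in stdlib-only Python. This is a simplified illustration; practical implementations, such as imbalanced-learn’s `SMOTE`, use efficient k-nearest-neighbour search over the full feature space, and ADASYN additionally biases generation toward harder-to-classify points:

```python
import math
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Generate synthetic minority samples by interpolating between a random
    minority point and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted((p for p in minority if p != x),
                            key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([xi + lam * (ni - xi) for xi, ni in zip(x, nb)])
    return synthetic

# Toy minority class in 2D; new points lie on segments between existing ones.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote_like(minority, n_new=5)
```

Because synthetic points lie between existing minority samples rather than on top of them, the classifier sees a broader minority region, which is why SMOTE-style methods tend to improve recall with less overfitting than exact duplication.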
In the imbalanced case (i.e., without balancing techniques), the high LR+ and DOR values suggest that not applying sampling techniques may sometimes suffice, especially when maximizing class discrimination. This finding indicates that while imbalance techniques are generally beneficial, models without adjustment can still perform competitively, depending on the dataset structure and characteristics.
The best combination of imbalance techniques and classification algorithms depends on the specific objective, such as maximizing sensitivity, specificity, or F1 score. For example, MLP, combined with ADASYN or SMOTE, achieved a high F1 score, sensitivity, and AUC, particularly benefiting highly imbalanced datasets. On the other hand, SVM, in combination with ROS, has proven to be more suitable for maximizing sensitivity, which is crucial in applications where the detection of positive cases is paramount.
Each classifier benefits differently from specific imbalance algorithms. For instance, MLP achieves substantial sensitivity gains with SMOTE, ROS, and ADASYN, reaching an optimal sensitivity–specificity balance. Although SVM also improves with these techniques, it does so less than random forest. AdaBoost, by contrast, responds best to ADASYN, showing significant sensitivity improvement.
However, the benefits of imbalance algorithms are not uniform across classifiers. Techniques like SMOTE and ROS yield consistent improvements, especially in AdaBoost, whereas RUS shows modest gains in models like Naïve Bayes. This variability highlights the importance of selecting the imbalance algorithm based on the specific classifier and the dataset’s characteristics. The trade-off between sensitivity and specificity is a critical issue, especially in clinical applications. Some models, such as XGBoost and MLP, exhibit high specificity but low sensitivity, which is problematic in contexts requiring early detection. This underscores the need to balance both metrics to ensure a more reliable and effective diagnosis, especially in situations where missing positive cases could have serious consequences.
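A common cost-sensitive scheme of the kind evaluated above weights each class inversely to its frequency, so that errors on the rare class cost proportionally more during training. A minimal sketch, assuming the usual balanced-weight heuristic w_c = n / (k · n_c), where n is the sample count, k the number of classes, and n_c the count of class c:

```python
from collections import Counter

def balanced_class_weights(y):
    """Inverse-frequency class weights, w_c = n / (k * n_c)."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {label: n / (k * n_c) for label, n_c in counts.items()}

# 90 negatives vs 10 positives: errors on the positive class weigh 9x more.
y = [0] * 90 + [1] * 10
weights = balanced_class_weights(y)  # w[1] = 5.0, w[0] ≈ 0.556
```

This is the same heuristic exposed as `class_weight="balanced"` in scikit-learn estimators; cost-sensitive boosting and random forest variants apply analogous per-class costs inside the ensemble rather than through resampling.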
Clinically, oversampling and synthetic techniques are preferred for COVID-19 detection, providing an optimal balance between true positive detection and false positive management, essential for early and accurate diagnoses. It is important to note that these techniques are specifically employed in the context of symptoms and comorbidities, thereby enhancing their diagnostic precision.
Rapidly identifying patients with COVID-19 is essential not only for ensuring prompt treatment but also for enforcing isolation measures that help contain the virus’s spread and reduce pressure on the healthcare system. Consequently, incorporating these techniques into clinical practice can greatly enhance healthcare providers’ capacity to manage the COVID-19 pandemic effectively, especially when combined with assessments of patients’ symptoms and underlying health conditions.
The use of technology in healthcare does not replace human clinical decisions but rather serves as a complementary tool to support medical work [
12]. Given the vast amount of information in the healthcare field, manual analysis can be inefficient and complex, making it important to obtain reliable knowledge from data to provide fast and reliable solutions [
8,
11,
12]. In a clinical setting, sensitivity is a crucial concept, referring to the ability of a test to correctly detect a disease, such as COVID-19, in patients who actually have it [
76]. Artificial intelligence (AI) has been used as a sensitive and specific method for screening COVID-19 in patients with respiratory conditions, improving diagnostic and screening capacity [
77]. ML has demonstrated its usefulness in providing accuracy in the healthcare field [
78].
Incorporating ML models into clinical practice requires both high predictive accuracy and interpretability to ensure healthcare professionals can trust and utilize the model’s outputs effectively. This study focuses on feature importance analysis to identify key symptoms and comorbidities (e.g., fever, cough, hypertension) driving model predictions. These insights allow clinicians to better understand the factors influencing the model’s decisions, which is crucial for integrating machine learning tools into clinical workflows. In clinical practice, our model could serve as a decision support tool, helping healthcare providers prioritize patients based on symptom profiles and comorbidities. Interpretability enhances clinical utility by enabling healthcare professionals to make informed decisions, improving diagnostic accuracy and patient outcomes. Future work will validate these models in real-world settings to assess reliability and practical applicability.
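Permutation importance is one simple, model-agnostic way to obtain the kind of feature importance analysis described here. In the stdlib-only sketch below, the “model” is a hypothetical hand-written rule standing in for a trained classifier, and the feature names are illustrative only:

```python
import random

# Hypothetical rule standing in for a trained classifier: predicts positive
# when fever and cough co-occur. (Illustrative; not the study's model.)
def predict(row):
    return 1 if row["fever"] and row["cough"] else 0

def accuracy(rows, labels):
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)

def permutation_importance(rows, labels, feature, seed=0):
    """Drop in accuracy after shuffling one feature's values across rows."""
    base = accuracy(rows, labels)
    col = [r[feature] for r in rows]
    random.Random(seed).shuffle(col)
    shuffled = [{**r, feature: v} for r, v in zip(rows, col)]
    return base - accuracy(shuffled, labels)

# Toy records the rule classifies perfectly; 'hypertension' is ignored by it.
rows = [
    {"fever": 1, "cough": 1, "hypertension": 0},
    {"fever": 1, "cough": 0, "hypertension": 1},
    {"fever": 0, "cough": 1, "hypertension": 0},
    {"fever": 0, "cough": 0, "hypertension": 1},
] * 5
labels = [1, 0, 0, 0] * 5
# Shuffling a feature the model never uses changes nothing, so its importance is 0.
unused_importance = permutation_importance(rows, labels, "hypertension")
```

A feature the model relies on (here, fever or cough) loses predictive value when shuffled, producing a positive importance score; clinically, this is the kind of ranking that lets providers see which symptoms and comorbidities drive a prediction.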
Our study addresses the challenges of using machine learning for COVID-19 detection, particularly in handling imbalanced datasets. While traditional methods like random oversampling and SMOTE are discussed for their effectiveness and limitations, we also explore the potential of ensemble and cost-sensitive techniques despite their computational demands. To further enhance diagnostic performance, recent advancements such as generative adversarial networks (GANs) and transfer learning are acknowledged as promising solutions for improving accuracy and robustness on imbalanced datasets, positioning our work within the broader scope of state-of-the-art healthcare informatics.
This study acknowledges several limitations in its methods and results. Data imbalance, particularly within the EPIVIGILA dataset, presents a potential bias in model predictions. Although sampling techniques like SMOTE and ADASYN were applied to address this, further model tuning and validation may be needed to mitigate overfitting risks. Data preprocessing posed additional challenges, as handling missing values led to a high number of deleted instances, potentially impacting data representativeness; exploring advanced imputation techniques could help reduce data loss in future work. Additionally, the computational efficiency of the models was not assessed in this study, highlighting the need for future analysis of scalability, particularly for large-scale healthcare applications. Lastly, interpretability remains a challenge for clinical adoption, and techniques like SHAP values or LIME are recommended to improve transparency in predictions. These limitations suggest valuable directions for further research, moving toward more robust, interpretable, and scalable methods for real-world clinical diagnostics.
Finally, we identify three primary areas for further research: data imbalance, interpretability, and computational efficiency. Although we utilized sampling methods like SMOTE and random oversampling to mitigate data imbalance, future work could explore cost-sensitive and ensemble methods to address overfitting and improve model robustness. Interpretability is another critical area; incorporating explainability techniques, such as SHAP values or LIME, could increase model transparency, aiding clinical adoption by clarifying how specific features influence predictions. Additionally, computational efficiency was not fully evaluated in this study, highlighting the need for scalability assessments to ensure model effectiveness in large-scale applications like the EPIVIGILA system. Addressing these gaps would significantly enhance the clinical applicability and reliability of machine learning models in healthcare diagnostics.