Article

Machine Learning-Based Stacking Ensemble Model for Prediction of Heart Disease with Explainable AI and K-Fold Cross-Validation: A Symmetric Approach

1 Department of Mathematics, COMSATS University Islamabad, Islamabad 44000, Pakistan
2 ComSens Lab, International Graduate School of Artificial Intelligence, National Yunlin University of Science and Technology, Douliou 64002, Taiwan
3 Department of Biomedical Technology, College of Applied Medical Sciences, King Saud University, Riyadh 11633, Saudi Arabia
4 Department of Computer Science, Aberystwyth University, Aberystwyth SY23 3FL, UK
* Author to whom correspondence should be addressed.
Symmetry 2025, 17(2), 185; https://doi.org/10.3390/sym17020185
Submission received: 1 October 2024 / Revised: 12 December 2024 / Accepted: 30 December 2024 / Published: 25 January 2025
(This article belongs to the Section Computer)

Abstract

One of the most complex and prevalent diseases is heart disease (HD). It is among the main causes of death around the globe, and with changes in lifestyles and the environment, its prevalence is rising rapidly. The prediction of the disease in its early stages is crucial, as delays in diagnosis can cause serious complications and even death. Machine learning (ML) can be effective in this regard. Many researchers have used different techniques for the efficient detection of the disease and to overcome the drawbacks of existing models; several ensemble models have also been applied. We propose a stacking ensemble model named NCDG, which uses Naive Bayes, Categorical Boosting, and Decision Tree as base learners, with Gradient Boosting serving as the meta-learner classifier. We perform preprocessing using a factorization method to convert string columns into integers. We employ the Synthetic Minority Oversampling TEchnique (SMOTE) and BorderLineSMOTE balancing techniques to address the issue of data class imbalance. Additionally, we implement hard and soft voting using a voting classifier and compare the results with those of the proposed stacking model. For the eXplainable Artificial Intelligence (XAI)-based interpretation of our proposed NCDG model, we use the SHapley Additive exPlanations (SHAP) technique. The outcomes show that our suggested stacking model, NCDG, performs better than the benchmark existing techniques. Our proposed stacking model achieved the highest accuracy, F1-Score, precision and recall of 0.91, 0.91, 0.91 and 0.91, respectively, with an execution time of 653 s. Moreover, we utilized the K-Fold Cross-Validation method to validate our predicted results. Notably, our prediction results and their validation strongly coincide with each other, which demonstrates that our approach is symmetric.

1. Introduction

The heart is one of the primary organs and serves several vital purposes for the wellness of the body. If the heart does not work properly, the entire circulatory system of the body will fail. Currently, diseases related to the heart are the leading cause of death worldwide. High cholesterol, obesity, elevated triglyceride levels and hypertension are some of the factors that raise the risk of heart disease (HD). A World Health Organization report estimates that Cardiovascular Disease (CVD) causes roughly 17.9 million deaths worldwide each year [1]. Artificial Intelligence and Neural Networks help to prevent CVD or lessen its effects, which can lead to a lower mortality rate. Conventional techniques for diagnosing HD, such as reviewing the patient’s medical history and the results of their physical examination and analyzing relevant systems, can be expensive and computationally demanding, particularly in situations where cutting-edge technology and medical professionals are not available.
The accurate and precise detection of HD is crucial to avoid HD-related complications and to improve heart safety: a correct prediction can prevent life threats, while an incorrect prediction can be fatal. Various medical organizations around the world collect data on different health-related issues, and these data are used to gain useful insights through various machine learning (ML) techniques. ML can be used to diagnose, detect and predict HD at an early stage, and various studies have aimed to improve prediction accuracy and save lives [2]. In this study, we investigate the effectiveness of various ML algorithms and apply different ensemble techniques to make the prediction of HD more accurate. We use four ML techniques, namely Naive Bayes (NB), Categorical Boosting (CatBoost), Decision Tree (DT) and Gradient Boosting (GB), to train and evaluate the models, find useful insights, visualize the results, interpret these results using SHapley Additive exPlanations (SHAP) and predict HD. Our proposed model achieved higher accuracy than all the standalone classifiers.
The following are the key findings of this research:
  • We propose a stacking ensemble model NCDG using NB, CatBoost, DT and GB for the prediction of HD.
  • SMOTE and BorderLineSMOTE balancing procedures are utilized to obtain the model’s consistent accuracy.
  • We performed hard and soft voting using the VotingClassifier and compared the results to our proposed stacking model NCDG.
  • An eXplainable Artificial Intelligence (XAI) technique, SHAP is used to find the contribution of the features found in the heart disease predictions of our proposed model.
  • We implemented the K-Fold Cross-Validation method to validate our results, proving that our applied approach is symmetric.
The remaining sections are structured as follows: Section 2 reviews the work related to this research study; Section 3 details the ML classifiers used in our proposed model; Section 4 presents the proposed methodology; Section 5 describes the performance metrics and discusses the results of the proposed model; and Section 6 concludes the study. The Abbreviations part contains all the abbreviations and their definitions.

2. Related Work

Diseases related to the heart are among the most common and life-threatening diseases found globally. Numerous studies have been conducted on HD to make its prediction progressively more accurate. In this section, the literature related to the prediction of HD using ML and Deep Learning (DL) techniques is presented. The authors in [3] proposed a system that can track and monitor a patient’s present cardiac condition. They classified data using three ML techniques, and Random Forest (RF) yielded the best results in terms of accuracy.
Chandrasekhar, N., et al. [4] used six algorithms—RF, KNN, LR, NB, GB and Adaptive Boosting (AdaBoost)—with two different datasets, from Cleveland and IEEE DataPort. Soft voting of these models was also performed, which enhanced the accuracy on both datasets. GridSearch with 5-Fold Cross-Validation (FCV) was used for hyperparameter optimization. To evaluate the model’s performance on both benchmark datasets, the accuracy loss for each fold was also examined.
The authors in [5] proposed a Swarm Artificial Neural Network (Swarm-ANN)-based architecture for CVD prediction. For the training and evaluation of the framework, initially, random neural networks were generated by the Swarm-ANN technique. Then, after two stages of weight modification, a newly developed heuristic formulation technique was used for the modification of the weights of NN. Finally, the global optimal weight was distributed among neurons and predicted CVD. The proposed strategy achieved 95.78% accuracy.
The authors of another study demonstrated the efficacy of data-driven approaches, particularly DL techniques, in improving the accuracy of HD diagnosis. Alongside feature selection techniques like the Point Biserial Correlation Coefficient, strategies such as Adaptive Synthetic Sampling have been used to address the difficulties presented by imbalanced datasets. Two new DL models are presented in that work: the Ensemble-based Cardiovascular Disease Detection Network (EnsCVDD-Net), which combines LeNet and a Gated Recurrent Unit (GRU), and the Blending-based Cardiovascular Disease Detection Network (BlCVDD-Net), which combines LeNet, GRU and a Multilayer Perceptron (MLP). With BlCVDD-Net surpassing current state-of-the-art models with 91% accuracy and 96% precision, both models exhibit superior performance metrics. Furthermore, by utilizing SHAP, the model’s interpretability is further improved by offering insights into the impact of different factors on CVD diagnoses.
In [6], a model for the detection of CVD and identification of the patient’s severity level was proposed. Six different ML techniques were used with the SMOTE balancing technique on two different datasets. Moreover, hyperparameter optimization was applied to find the best hyperparameters for the ML classifiers. Extra Trees (ET) outperformed the others by achieving accuracies of 99.2% and 98.52% on the two datasets. A hybrid decision support system based on the clinical parameters of the patients was proposed in [7] for the early detection of HD. The Cleveland HD dataset was used for this study. The RF classifier outperformed all the others.
By creating a thorough framework that makes use of cutting-edge boosting strategies and machine learning methodologies such as CatBoost, RF, GB, Light Gradient Boosting Machine (LightGBM) and AdaBoost, the authors in [8] tackle the crucial problem of early HD prediction. The study utilizes a sizable dataset from the UCI ML Repository, which includes 26 feature-based numerical and categorical variables and 8763 samples. According to the results, AdaBoost is the best model, with a remarkable 95% accuracy rate and good performance indicators, such as a negative predictive value of 0.83, a false positive rate of 0.04 and a false negative rate of 0.04. These findings highlight the model’s superiority in predicting risks to cardiovascular health, making a substantial contribution to the field of CVD prediction and emphasizing the significance of reliable predictive models.
The authors in [9] perform a comparative analysis of different ML classifiers to identify the classifier with the highest accuracy for HD prediction. A publicly available dataset containing 1025 instances and 14 features is used for the simulations. A replace-missing-values filter is applied for missing values and the Inter Quartile Range (IQR) method for outliers. The feature importance score is calculated for every classifier except KNN and MLP, for which RF betas are used instead. An ML framework for Atherosclerotic CVD risk assessment is proposed in [10]. For this study, the data were collected from 500 patients among Tabriz University of Medical Sciences employees during 2020. Different ML techniques, such as NB, ANN, SVM, KNN, LR, Regression Tree (RT) and the Generalized Additive Model (GAM), were used. The study showed that ANN outperformed all other classifiers.
The authors in [11] presented a model for the early stage detection of CVD. In the preprocessing stage, missing values were filled through mean imputation. Firstly, a GB model containing all the predictors was trained, then SHAP was applied to check which feature contributes most to the outcome; then, based on SHAP values greater than 0.1, the final predictors were selected. They used different ML classifiers for the study. Hyperparameter optimization is used to optimize the parameters of the ML classifiers.
A Cloud RF (C-RF) model is proposed in [12] to predict coronary HD by combining the cloud model and RF. The proposed model’s results are compared with ML classifiers using different evaluation metrics. The proposed method outperforms all others in terms of classification performance and impact on coronary HD risk assessment.
The authors of the study in [13] used different optimization techniques, such as Bayesian Optimization (BO), Optuna Optimization (OO) and GASearchCV, utilizing 5 and 10 generations, with the RF and SVM ML techniques. Default RF, BO with RF and BO with SVM achieved the highest accuracies of 86.6%, 89% and 90%, respectively. One-hot encoding and SS were used in the preprocessing stage. In comparison with standalone default ML models, the Gaussian Algorithm (GA), LR and the classification models predicted with higher accuracy than the dummy classifiers.
Dalal, S., et al. [14] proposed an ML model for CVD risk prediction using a publicly available dataset containing 70,000 patient records and 11 features. In the preprocessing stage, missing values were handled using mean imputation, IQR was used for outliers and Pearson’s correlation technique was used for FS. Then, different ML techniques were used, such as QUEST, RF, Neural Network, Bayesian Network and C5.0. In [15], HD patients were selected using random sampling from Khyber Teaching Hospital and Lady Reading Hospital, Pakistan. The results show that RF performed best, with an accuracy, sensitivity, ROC, specificity and misclassification error of 85.01%, 92.11%, 87.73%, 43.48% and 8.70%, respectively.
In [16], the authors proposed an effective stacked ensemble model named SPFHD for the prediction of HD. A Conditional Variational Autoencoder (CVAE) method was proposed for class balancing. BO was used for hyperparameter optimization. For the interpretation of the model, SHAP was used. The results show that the SPFHD model performed well on four different datasets. Ref. [17] proposed a model referred to as Logistic-AdaBoost (Logistic-AB) for the prediction of stroke. CatBoost was utilized for FS. BorderLineSMOTE was used to balance the dataset. The performance was evaluated using ten similar models.
The study in [18] proposed a hybrid DL model, CNN-BiLSTM, for the prediction of CVD illness. Recursive Feature Elimination (RFE) was used for the selection of the best features. The proposed hybrid model achieved 94.507% accuracy and a 94% F1-Score. The study in [19] focused on the detection of HD using a GB algorithm. RFE was utilized for dimensionality reduction. The outcomes of the suggested model were compared with various ML methods, and the proposed GB with RFE performed best.
A decision support system for the prediction of HD was designed by the authors of [20] using machine learning. Firstly, the missing values were replaced with actual values. Then, they normalized and standardized the data, detected and eliminated outliers and removed duplicates. They used different ML classifiers, such as Gaussian Processes, DT, NB, NN, QDA, Linear SVM, Bagging, Boosting and AdaBoost, and proposed a Dense Neural Network (DNN) with 3 to 9 layers, 100 neurons per layer and ReLU as the activation function. A 10-FCV was also applied. In [21], Yongcharoenchaiyasit, K., et al. focused on elderly HF, aortic stenosis and dementia using a GB model for multiclass classification. The Optuna framework was applied for hyperparameter optimization. The proposed model outperforms all other classifiers after feature engineering.
According to the review of the related studies, significant progress has been made in applying ML to the prediction of HD. Ensemble approaches, data balancing strategies and model interpretability tools like SHAP are highlighted. Even though traditional classifiers like DT, NB and GB have demonstrated efficacy, recent research emphasizes the significance of ensemble methods for attaining higher predictive accuracy and robustness. Data balancing techniques like SMOTE and BorderLineSMOTE have also proven crucial in handling the imbalanced datasets frequently found in HD data, ensuring more dependable model outputs.
Our research expands on these results by applying a stacking ensemble model (NCDG) that integrates GB, DT, CatBoost and NB. By utilizing the advantages of each classifier, this model outperforms individual classifiers and achieves a significant improvement in accuracy, precision, recall and AUC–ROC scores. The model is further made interpretable by the use of SHAP, which makes it possible to clearly understand the significance of features and how they contribute to predictions. The findings support and advance the current state of research in this area by demonstrating that our ensemble approach offers a reliable and understandable solution for heart disease prediction.

3. Material and Methods for Heart Disease Prediction

In ML and DL, a single algorithm may not be able to learn and accurately predict behavior. Therefore, we use three different classifiers as base learners, and their output is fed to the meta learner to improve accuracy. The whole process of our suggested model is described in Figure 1. The specifics of the classifiers used in stacking are as follows:

3.1. Role of Decision Tree (DT) for Heart Disease Prediction

DT is a tree-like structure that involves a sequential decision-making process. It starts with a root node, and the recursive partitioning continues until all the training examples belong to the same class. Outliers in the data are not an issue because DT is non-parametric. There are various types of DT, including ID3, C4.5, C5.0 and CART, which are based on various Attribute Selection Measures (ASMs). The primary issue with DT is determining which attribute is optimal for both the root node and its sub-nodes. ID3 is the most common type and is based on Information Gain (IG), which itself is based on the concept of entropy.
However, the problem with IG is that it can select attributes that are meaningless from an ML point of view. To tackle this problem, C4.5 was introduced. Similarly, C5.0 introduced the concept of pruning, which involves deleting unnecessary nodes from the tree. C4.5 and C5.0 use the Gain Ratio as the ASM, calculated as shown in Equations (1) and (2):
$\mathrm{GainRatio}(A) = \dfrac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}(A)}$   (1)
where
$\mathrm{SplitInfo}_A(D) = -\sum_{j=1}^{n} \dfrac{|D_j|}{|D|} \times \log_2\!\left(\dfrac{|D_j|}{|D|}\right)$   (2)
In the above equations, S is the root node, A is the attribute and D is the dataset. The workflow of DT is shown in Figure 2.
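To make the ASM computation concrete, the following is a minimal Python sketch of Equations (1) and (2) for categorical attributes; the function names and the pandas-based layout are our own illustrative choices, not code from the paper:

```python
import numpy as np
import pandas as pd

def entropy(labels: pd.Series) -> float:
    """Shannon entropy of a label column."""
    probs = labels.value_counts(normalize=True)
    return float(-(probs * np.log2(probs)).sum())

def gain_ratio(df: pd.DataFrame, attribute: str, target: str) -> float:
    """Gain Ratio of `attribute` w.r.t. `target`, per Equations (1) and (2)."""
    base = entropy(df[target])                       # entropy of the whole dataset D
    weights = df[attribute].value_counts(normalize=True)  # |D_j| / |D| per value
    cond = sum(w * entropy(df.loc[df[attribute] == v, target])
               for v, w in weights.items())          # conditional entropy after the split
    gain = base - cond                               # Information Gain, Gain(A)
    split_info = float(-(weights * np.log2(weights)).sum())  # SplitInfo, Equation (2)
    return gain / split_info if split_info > 0 else 0.0
```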

3.2. Role of Naive Bayes (NB) for Heart Disease Prediction

The name NB consists of two terms: Naive and Bayes. Naive is used because the classifier assumes that the attributes are independent, i.e., the occurrence of one attribute does not affect the others; Bayes is used because it uses the Bayes Theorem. The name Bayes comes from the British mathematician Thomas Bayes, who lived during the 18th century. NB computes the posterior probability of each class, and the final prediction belongs to the class with the highest posterior probability.
The NB classifier comes in different types, based on the difference in calculating the probability distribution. Since NB assumes that attributes are independent, it requires less execution time and works well with complex real-world problems. However, it faces the zero-frequency problem [22]. The workflow of NB is shown in Figure 3.
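As a minimal illustration of how NB forms its posterior, the sketch below implements Bayes’ theorem for categorical features with Laplace smoothing (the alpha term), which is one standard way to avoid the zero-frequency problem mentioned above; the function name and data layout are our own assumptions:

```python
import numpy as np
import pandas as pd

def naive_bayes_predict(train: pd.DataFrame, features: list,
                        target: str, sample: dict, alpha: float = 1.0):
    """Return the class with the highest (log-)posterior for one sample,
    under the naive independence assumption."""
    posteriors = {}
    for c, group in train.groupby(target):
        log_p = np.log(len(group) / len(train))      # prior P(class)
        for f in features:
            counts = group[f].value_counts()
            n_values = train[f].nunique()
            # Laplace-smoothed likelihood P(feature value | class)
            log_p += np.log((counts.get(sample[f], 0) + alpha) /
                            (len(group) + alpha * n_values))
        posteriors[c] = log_p
    return max(posteriors, key=posteriors.get)       # highest posterior wins
```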

3.3. Role of Gradient Boosting (GB) for Heart Disease Prediction

The ensemble method known as GB combines several weak learners to create a strong model, which is then used to address a specific problem. It works in a sequential manner; with each iteration, the successor learns from the mistakes of its predecessor, and in this way the error is reduced. This process involves updating weights at each step, and the iterations continue until the loss function is minimized. GB consists of three main components: the additive model, the loss function and weak learners. Using the available data, the loss function is employed to determine the accuracy in predicting HD. Weak learners are the ones that classify data poorly, e.g., DT, and the additive model means that GB works sequentially and iteratively by adding one weak learner at a time [23]. Let us understand the working of GB mathematically. Let $x_i$ be the input variable and $y_i$ the target variable. Denoting the likelihood by $N$, the log-likelihood of the data is given by Equation (3), where $p$ is the predicted probability and $y_i$ is the observed value, which may be 0 or 1.
$\log(N) = \left[ y_i \times \log(p) + (1 - y_i) \times \log(1 - p) \right]$   (3)
To convert the log(likelihood) from a function of the predicted probability $p$ into a function of the predicted log(odds), Equations (4)–(8) can be used:
$-\left[ y_i \times \log(p) + (1 - y_i) \times \log(1 - p) \right]$   (4)
$= -y_i \times \log(p) - (1 - y_i) \times \log(1 - p)$   (5)
$= -y_i \times \log(p) - \log(1 - p) + y_i \times \log(1 - p)$   (6)
$= -y_i \times \left[ \log(p) - \log(1 - p) \right] - \log(1 - p)$   (7)
$= -y_i \times \log\!\left(\dfrac{p}{1-p}\right) - \log(1 - p)$   (8)
where
$\log\!\left(\dfrac{p}{1-p}\right) = \log(\mathrm{odds})$   (9)
Substituting Equation (9) into (8), we obtain
$-y_i \times \log(\mathrm{odds}) - \log(1 - p)$   (10)
Now, since
$p = \dfrac{e^{\log(\mathrm{odds})}}{1 + e^{\log(\mathrm{odds})}}$   (11)
$\log(1 - p) = \log\!\left(1 - \dfrac{e^{\log(\mathrm{odds})}}{1 + e^{\log(\mathrm{odds})}}\right) = \log\!\left(\dfrac{1}{1 + e^{\log(\mathrm{odds})}}\right) = -\log\!\left(1 + e^{\log(\mathrm{odds})}\right)$   (12)
Substituting Equation (12) into (10), the loss function becomes $-y_i \log(\mathrm{odds}) + \log\!\left(1 + e^{\log(\mathrm{odds})}\right)$. It must be demonstrated that this is differentiable:
$\dfrac{d}{d\,\log(\mathrm{odds})}\left[ -y_i \log(\mathrm{odds}) + \log\!\left(1 + e^{\log(\mathrm{odds})}\right) \right] = -y_i + \dfrac{e^{\log(\mathrm{odds})}}{1 + e^{\log(\mathrm{odds})}} = -y_i + p$   (13)
The workflow of GB is shown in Figure 4.
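A simplified sketch of the update this derivation implies is shown below: each regression tree is fitted to the negative gradient $y_i - p$ from Equation (13) and added to the running log(odds). Production GBM implementations additionally rescale leaf outputs; the tree depth, learning rate and number of trees here are illustrative assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gb_log_odds(X, y, n_trees=50, lr=0.1):
    """Gradient boosting for binary labels y in {0, 1}, in log(odds) space."""
    p0 = y.mean()
    F = np.full(len(y), np.log(p0 / (1 - p0)))   # initial guess: log(odds) of class 1
    trees = []
    for _ in range(n_trees):
        p = 1.0 / (1.0 + np.exp(-F))             # Equation (11): log(odds) -> probability
        residual = y - p                          # negative gradient of the loss, Equation (13)
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
        F = F + lr * tree.predict(X)              # additive update, scaled by the learning rate
        trees.append(tree)
    return trees, F                               # final F gives log(odds) per training sample
```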

3.4. Categorical Boosting (CatBoost) for Heart Disease Prediction

Categorical Boosting (CatBoost) is a gradient boosting algorithm that is specifically built to deal with categorical data. It uses ordered boosting to process categorical data directly, resulting in faster training and improved model performance [24]. In contrast to other boosting methods, CatBoost produces symmetric trees, which reduces prediction time, improves accuracy and manages overfitting through regularization. CatBoost can handle any type of feature, including numeric, categorical and text.
Let us understand the mathematics behind CatBoost. Suppose we have a training dataset with M variables and N samples. If $x_i$ is a vector of M variables and $y_i$ is the corresponding target variable, then each sample is denoted by $(x_i, y_i)$. Learning a function $F(x)$ that predicts the target variable y is the goal of CatBoost [25], i.e.,
$F(x) = F_0(x) + \sum_{m=1}^{M} \sum_{i=1}^{N} f_m(x_i)$   (14)
where:
$F_0(x)$ is the initial guess, often capturing the target variable’s average behavior; in the training dataset, this is usually set to the mean of the target variable.
$f_m(x_i)$ is the prediction of the m-th tree for the i-th training sample; each tree makes its own prediction for each training sample and contributes to the overall prediction.
m = 1 to M and i = 1 to N represent the summation over the ensemble of trees and the training samples, respectively. The number of training samples is N, and the total number of trees is M.
$F(x)$ is the overall prediction of CatBoost, which takes x as an input vector and predicts y as the corresponding target variable.
The workflow of CatBoost is shown in Figure 5 [26].
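A minimal usage sketch with the catboost package is given below; following Algorithm 1 we keep default parameters, and X_train, y_train, X_val and categorical_idx are placeholder names for data produced by our preprocessing step:

```python
from catboost import CatBoostClassifier

# Default parameters, as in Algorithm 1; verbose=0 silences per-iteration logs.
model = CatBoostClassifier(verbose=0)

# cat_features tells CatBoost which columns to treat as categorical directly;
# it is optional here because our pipeline factorizes strings to integers first.
model.fit(X_train, y_train, cat_features=categorical_idx)

proba = model.predict_proba(X_val)[:, 1]   # sigmoid-transformed ensemble score
labels = model.predict(X_val)              # class labels from the probabilities
```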

4. Proposed Methodology for Heart Disease Prediction

The workflow of the proposed methodology is explained in this section. Our proposed stacking model contains four classifiers: NB, CatBoost and DT at level 0 (the base layer) and GB at level 1 (the meta layer). It is further explained in the following subsections.

4.1. Data Preprocessing

Before passing data to an ML classifier, the data need to be preprocessed. The categorical features of our dataset need to be converted into integers, as the techniques used do not understand categorical data. Therefore, we used a factorization method, pd.factorize(), to convert string columns into integers. Our dataset does not contain any missing values. A minimal sketch of this factorization step is given below; the dataset itself is described in the next subsection.
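The following sketch applies pd.factorize() to every string column; the CSV file name is a hypothetical placeholder for the dataset described in Section 4.2:

```python
import pandas as pd

# Hypothetical file name for the Kaggle "Personal Key Indicators of Heart
# Disease" dataset referenced in Section 4.2.
df = pd.read_csv("heart_2020_cleaned.csv")

# pd.factorize() maps each distinct string category to an integer code.
for col in df.select_dtypes(include="object").columns:
    df[col], uniques = pd.factorize(df[col])   # `uniques` keeps the original labels
```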

4.2. Heart Disease Dataset Description

To detect HD, the Personal Key Indicators of HD Dataset is used [27]. The dataset gathers yearly data on the health status of United States (US) citizens through telephone surveys. It consists of 319,795 instances and 18 features. The data splitting ratio we consider is 70:30, with a random state of 42 for ease of result comparison. We remove the unnecessary feature SkinCancer because it has no direct relation to HD. Some of the other features of this dataset include Asthma, Stroke, BMI, Smoking, Alcohol Drinking, Difficulty in Walking, etc. The dataset does not contain any missing values, but there is a data class imbalance problem, which is handled using SMOTE and BorderLineSMOTE and is discussed further in the section below.

4.3. Data Balancing Using BorderLineSMOTE

In ML, class imbalance is a serious issue: the model becomes skewed towards the majority class when observations in one class outnumber those in the other classes. We can use resampling to deal with imbalanced datasets, and this can be of two types: oversampling and undersampling. Removing observations from the majority class is known as undersampling, and it can result in information loss, whereas oversampling involves randomly duplicating minority-class observations, which can cause overfitting. Different balancing techniques have been proposed in the scientific literature to overcome these problems. One of these techniques is SMOTE, which synthesizes new data samples from minority classes through linear interpolation, as shown in Figure 6. The method selects a sample $P_i$ at random from a minority class. Then, the K-Nearest Neighbors (KNN) around it are computed and a random instance, say $P_j$, is picked from among its neighbors. To create a fresh data sample from the minority class, the distance between $P_i$ and $P_j$ is computed and multiplied by a random value between 0 and 1, referred to as the $gap$. The mathematical formulation of SMOTE is given by Equation (15).
$\text{New Instance} = P_i + gap \times \big(\mathrm{distance}(P_i, P_j)\big)$   (15)
BorderLineSMOTE is a variation of SMOTE in which minority instances are classified as safe, noise and danger instances. A minority instance is deemed safe if its closest neighbors are also members of the minority class. When a minority instance’s closest neighbors are all members of the majority class, it is treated as noise. When the number of majority instances found among the nearest neighbors of a minority instance lies between M/2 and M, it is referred to as a danger instance (a borderline minority instance $P_i$), which is employed to synthesize new instances. Following that, these borderline minority instances $P_i$ and other minority instances are linearly interpolated to create new synthetic samples. The mathematical formulation of this is given in Equation (16). The imbalanced dataset and the BorderLineSMOTE-balanced dataset are shown in Figure 7.
$\text{New Instance} = P_i + gap \times \mathrm{distance}(P_i, P_j)$   (16)
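The following minimal sketch applies both techniques with the imbalanced-learn library, using the hyperparameters reported later in Table 3; X_train and y_train are assumed to be the 70% training split described in Section 4.2:

```python
from imblearn.over_sampling import SMOTE, BorderlineSMOTE

# SMOTE with sampling_strategy='auto' resamples the minority class up to
# the size of the majority class (Table 3).
X_sm, y_sm = SMOTE(sampling_strategy="auto").fit_resample(X_train, y_train)

# BorderlineSMOTE with a fixed random_state for reproducibility (Table 3);
# it synthesizes samples only around borderline ("danger") minority instances.
X_bsm, y_bsm = BorderlineSMOTE(random_state=42).fit_resample(X_train, y_train)
```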

4.4. NCDG Stacking Model for Heart Disease Prediction

The stacking model is an ensemble method that uses a two-layer framework. The base layer’s predictions are made based on the input features; then, the combined predictions from the base layer are used as input to the meta layer, whose classifier produces the final prediction. The main purpose of building a stacking model is to combine the predictions of all the base-layer classifiers and to produce more accurate results than a single classifier. Our proposed stacking model consists of four classifiers, DT, NB, GB and CatBoost, utilizing 5-FCV. The K-FCV procedure divides the training data into k folds, trains the models on k − 1 folds and tests them on the remaining kth fold; this operation is repeated k times. Figure 8 illustrates how K-FCV operates. All of the classifiers used in stacking have their own pros and cons; therefore, these classifiers are combined for the efficient prediction of HD. It is difficult to handle high-dimensional data with DT, whereas NB works well with high-dimensional data due to its naive assumptions. GB has an overfitting issue due to the target-leakage problem, and CatBoost can handle this problem by using several permutations of the training dataset. Thus, the stack of these classifiers gives more accurate results. The algorithm of our proposed stacking model is given in Algorithm 1, and a minimal code sketch follows it.
Algorithm 1: NCDG Stacking Model for Heart Disease Prediction
Require: Dataset D, number of folds K
Ensure: Trained stacking model S
1: Split the dataset D into K folds
2: for each fold k from 1 to K do
3:   Split the data into training set D_train^k and validation set D_val^k
4:   for each base learner B_i in B do
5:     Base model training and prediction:
6:     if B_i is Naive Bayes then
7:       1. Calculate prior probabilities for each class in D_train^k
8:       2. Calculate the likelihood of the features given each class
9:       3. Use Bayes’ theorem to calculate posterior probabilities for D_val^k
10:      4. Assign the class with the highest posterior probability to each instance in D_val^k
11:      5. Get predictions P_i^k
12:     else if B_i is Categorical Boosting (CatBoost) then
13:       1. Initialize the CatBoost model with default parameters
14:       2. Train the model on D_train^k using boosting iterations
15:       3. For each instance in D_val^k, compute the weighted sum of the trees’ predictions
16:       4. Apply the sigmoid function to get probabilities
17:       5. Assign class labels based on the probabilities
18:       6. Get predictions P_i^k
19:     else if B_i is Decision Tree then
20:       1. Build the decision tree on D_train^k by selecting the best feature splits
21:       2. Prune the tree to avoid overfitting
22:       3. For each instance in D_val^k, traverse the tree from root to leaf based on feature values
23:       4. Assign the class label at the leaf node
24:       5. Get predictions P_i^k
25:     end if
26:   end for
27:   Concatenate predictions P^k = {P_1^k, P_2^k, …, P_n^k}
28:   Meta model training:
29:   1. Use the concatenated predictions P^k as input features
30:   2. Train the meta learner M (Gradient Boosting) on P^k
31:   3. Get the meta model prediction M^k
32: end for
return Final predictions M^k
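The following is a minimal sketch of how Algorithm 1 can be approximated with scikit-learn’s StackingClassifier. The GaussianNB variant and the variable names (X_train, y_train, X_test) are our illustrative assumptions, not the exact code used in the experiments:

```python
from sklearn.ensemble import StackingClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from catboost import CatBoostClassifier

base_learners = [
    ("nb", GaussianNB()),                      # NB base learner (Gaussian variant assumed)
    ("cat", CatBoostClassifier(verbose=0)),    # CatBoost base learner, default parameters
    ("dt", DecisionTreeClassifier()),          # DT base learner
]

ncdg = StackingClassifier(
    estimators=base_learners,
    final_estimator=GradientBoostingClassifier(),  # GB meta learner
    cv=5,                                          # 5-FCV for the base-layer predictions
)
ncdg.fit(X_train, y_train)
y_pred = ncdg.predict(X_test)
```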

4.5. Voting Classifier for Heart Disease Prediction

An ensemble ML technique called the Voting Classifier is trained using many models and produces an output based on the class with the highest probability across the models [28]. Instead of using a single model, we can combine the predictions from multiple models on the basis of the majority of votes for each output class. Hard and soft voting are the two types of voting.
In the hard voting process, the output is the class predicted by the majority of the classifiers. Suppose we have four classifiers and we want to predict the class label y using hard voting, and the classifiers predict the output classes as { 0 , 0 , 0 , 1 } . Then, 0 will be our output because it is the class predicted by the majority of the classifiers.
The result of soft voting is the class determined by averaging the probabilities assigned to each class. Suppose that we have three classifiers with the probabilities 0 = { 0.37 , 0.43 , 0.58 } and 1 = { 0.28 , 0.47 , 0.67 } . The average for class 0 is 0.46 and the average for class 1 is 0.47. Hence, the output class will be 1, having the highest averaged probability.
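To make the two schemes concrete, the following is a minimal sketch with scikit-learn’s VotingClassifier; the estimator choices mirror our four classifiers, GaussianNB is an assumed NB variant, and the variable names are placeholders:

```python
from sklearn.ensemble import VotingClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from catboost import CatBoostClassifier

estimators = [
    ("dt", DecisionTreeClassifier()),
    ("nb", GaussianNB()),
    ("cat", CatBoostClassifier(verbose=0)),
    ("gb", GradientBoostingClassifier()),
]

# Hard voting: majority of predicted class labels.
hard = VotingClassifier(estimators=estimators, voting="hard").fit(X_train, y_train)

# Soft voting: average of per-class predicted probabilities.
soft = VotingClassifier(estimators=estimators, voting="soft").fit(X_train, y_train)
```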

5. Results and Discussion of Proposed Models for Heart Disease Prediction

5.1. Performance Metrics

The different performance metrics used for the evaluation of our proposed models are as follows:
1. Accuracy;
2. Precision;
3. F1-Score;
4. Recall;
5. AUC–ROC;
6. Confusion Matrix.
The concept of these metrics comes from the False Negative (FN), False Positive (FP), True Negative (TN) and True Positive (TP) counts. The complete details of the performance metrics used for our proposed model are discussed in the following subsections:
TP: an HD patient is classified as an HD patient.
TN: a healthy individual is classified as healthy.
FP: a healthy individual is classified as an HD patient.
FN: an HD patient is classified as healthy.
Accuracy
Accuracy is defined as the ratio of the number of accurate predictions to the total size of the dataset. It helps us to analyze the model’s overall effectiveness. Accuracy is calculated using Equation (17).
$\text{Accuracy} = \dfrac{TN + TP}{FN + FP + TN + TP}$   (17)
F1-Score
As a single score that combines recall and precision, the F1-Score provides more information about the model’s performance. It is computed using Equation (18).
$\text{F1-Score} = \dfrac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}}$   (18)
Precision
Out of all the positive predictions made, precision quantifies the proportion that are correctly positive. Precision can be calculated using Equation (19).
$\text{Precision} = \dfrac{TP}{TP + FP}$   (19)
Recall
The number of correctly predicted positive cases divided by the number of actual positive cases gives the recall. It is calculated using the formula given in Equation (20).
$\text{Recall} = \dfrac{TP}{TP + FN}$   (20)
AUC–ROC
The AUC–ROC curve, which plots the TPR versus the FPR, illustrates the classification model’s performance at every classification threshold. The higher the AUC (the summary of the ROC curve), the better the model discriminates between positive and negative classes.
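As a convenience, the sketch below computes all of these metrics with scikit-learn; y_test, y_pred and y_proba are assumed to come from a fitted model such as the NCDG sketch in Section 4.4:

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score, confusion_matrix)

print("Accuracy :", accuracy_score(y_test, y_pred))
print("F1-Score :", f1_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("AUC-ROC  :", roc_auc_score(y_test, y_proba))   # needs class-1 probabilities
print(confusion_matrix(y_test, y_pred))               # layout: [[TN, FP], [FN, TP]]
```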

5.2. Simulation Results for Heart Disease Prediction

To demonstrate the benefits of our suggested stacking model, we examine the experimental findings in this section. The results we provide are the averages across all simulations. Google Colab notebooks and the Python language are used for evaluation. We conducted various experiments based on (1) different balancing techniques and (2) using different classifiers as meta learners. The findings presented in Table 1 and Figure 9 demonstrate that BorderLineSMOTE with GB as the meta learner yields the best results. The results of our proposed stacking model demonstrate that using a stacking model can be the best choice, as it outperforms all the individual classifiers. Although its computational cost is higher than that of the individual classifiers, it is more accurate and reliable. Individually, CatBoost performs best because it can handle categorical features, large datasets and imbalanced data, but it is computationally more expensive than all the other classifiers used in stacking. On the other hand, NB performs the worst because it does not consider the dependency of attributes, which is not the case in real-world problems, but it is computationally more efficient than all the others. DT is a rule-based classifier, and GB is also a good choice for large datasets, but it can be computationally expensive. However, a powerful model is built by combining all these classifiers in a stacking model.

5.3. Comparison of Different Stacking Models with the Proposed Stacking Model

We use different classifiers in the meta layer without balancing, with BorderLineSMOTE and with SMOTE, and then compare the results. Table 2 demonstrates how each classifier performs when balancing techniques are not used. The main reason DTs perform poorly is that they produce hard splits in the feature space, which makes them susceptible to overfitting and noise. Recall and F1-Score decrease when balancing is not performed because the overwhelming presence of majority-class samples makes it difficult for the model to detect minority-class patterns. NB makes the assumption that features are independent, which is rarely true in real-world datasets; that is why NB performs poorly as well. This presumption produces inaccurate posterior probabilities, particularly in imbalanced datasets that under-represent the minority class. The fact that its basic probabilistic framework is unable to capture complex feature interactions further decreases the algorithm’s predictive accuracy.
Even though GB is a strong ensemble technique, it has trouble with unbalanced data. It iteratively improves the model by concentrating on misclassified samples; however, if the dataset is imbalanced, this approach frequently overemphasizes the majority class, which lowers recall for the minority class. Because of this imbalance, the boosting process is skewed and is less effective overall at detecting instances of minority classes. CatBoost also faces issues with handling imbalanced data, although it performs marginally better than some other models. Its ordered boosting and regularization are unable to address the underlying class imbalance, which results in poor recall. Furthermore, the absence of balancing hinders the model’s ability to effectively utilize its advantages in handling categorical data.
RF is an ensemble of DTs and has a tendency to favor the majority class in the case of an imbalanced dataset. This is due to its sampling of training subsets: the bootstrapping process is likely to be biased toward the majority class, as it may not adequately sample the minority class. Therefore, the model fails to classify the minority-class instances correctly, and the recall and F1-Scores end up being low. AdaBoost, a boosting algorithm, concentrates on reweighting samples in order to rectify the errors committed by previous weak learners within the model. Since this process is iterative in nature, the bias in an imbalanced dataset ensures that the majority samples always end up dominating. This bias affects the boosting such that the model devotes itself to classifying the majority-class instances at the expense of the minority class. Finally, the stacking models NCDG, GNCD, DGNC and CDGN perform poorly because of the compounded effect of the class imbalance on the base classifiers. The limited generalization of base classifiers like NB or DT leads to the accumulation of their errors in the meta-classifier, whereby prediction integration is poorly achieved. The meta-classifier, which is usually a GB model, performs poorly due to the class imbalance. These factors explain why the classifiers perform poorly in the absence of balancing methods. The hyperparameters used for SMOTE and BorderLineSMOTE are given in Table 3. For SMOTE, we employed sampling_strategy = ‘auto’, which creates synthetic samples until the minority class matches the size of the majority class.
By guaranteeing an adequate representation of the minority class in the data, this default setting improves model generalizability and helps to avoid bias toward the majority class. We used random_state = 42 for BorderLineSMOTE to guarantee the stability and reproducibility of results across several runs. BorderLineSMOTE focuses on samples close to the decision boundary, which are most susceptible to misclassification. By creating synthetic samples in these crucial areas, this method helps the model differentiate between classes more effectively.
Table 1 and Figure 9 show that our proposed model, which consists of DT, NB and CatBoost at the base layer and GB at the meta layer with 5-FCV, performs best, with an accuracy of 91%, a recall of 91%, a precision of 91%, an F1-Score of 91%, an AUC–ROC of 97% and a TPR of 91%. In the meta layer, GB performs the best because it is an ensemble model and can accurately handle large datasets. On the other hand, DT performs worst in the meta layer because it is a rule-based classifier: small changes can lead to significant alterations in the structure of the tree, which makes it sensitive to noise and not ideal for large datasets. CatBoost performs better than DT because it is an ensemble model with many built-in features; for instance, it can handle categorical features and missing values, prevent overfitting and work well on both smaller and larger datasets. NB performs better than DT and CatBoost; however, NB assumes that all features are independent, which can lead to incorrect predictions on complex problems. The performance of RF and AdaBoost is lower than that of the proposed model, as shown in Table 1 and Table 4. Even though RF manages data variability well, it may have trouble with unbalanced data, which could reduce its ability to distinguish between positive and negative cases. Compared to the suggested method, AdaBoost may have lower accuracy and recall due to its sensitivity to noisy data and outliers, which can impair its capacity for generalization. Table 1 and Table 4 and Figure 9, Figure 10, Figure 11, Figure 12, Figure 13, Figure 14, Figure 15 and Figure 16 validate the above discussion. As illustrated in Figure 17, we have also tested the proposed model at 10, 20, 30, 50, 80 and 100 folds of cross-validation.
The AUC–ROC of the proposed model and the base classifiers is shown in Figure 18. It clearly shows that our proposed model is better able to identify the TPR while minimizing the FPR. Figure 19 shows a heatmap of our proposed model’s confusion matrix, which consists of the TP, TN, FP and FN values and clearly describes the effectiveness of our proposed model. The stacking approach takes the longest execution time compared with the base classifiers, as shown in Figure 20. GB and CatBoost are also computationally expensive. However, the choice of classifier depends on the specific need: depending on whether accuracy or execution time is more important, we can choose an appropriate classifier.
The results of our proposed stacking model with BorderLineSMOTE are better than those of the same model with the SMOTE balancing technique, because SMOTE uses linear interpolation for every instance in the minority class, regardless of its position, while BorderLineSMOTE uses linear interpolation to create new data points only for the minority instances that lie close to the majority class (the borderline instances).
We have also performed hard and soft voting of DT, NB, CatBoost and GB. Soft voting averages the probabilities assigned to each class, whereas hard voting bases its result on the class predicted by the majority of the classifiers; soft voting produced 1% better results than hard voting.
The results with stacking are higher than those of voting because stacking involves additional stages of learning for the classifiers, and the meta model gives predictions based on the predictions of the base classifiers, while voting involves only the parallel operation of the classifiers, with predictions based on the probabilities assigned to each class.
Figure 21 displays a force plot that is used to observe how each feature affects the model’s prediction for a single instance. Blue arrows indicate that a feature’s effect is negative, red arrows indicate that a feature’s effect is positive and the length of an arrow represents the strength of the feature’s influence on the prediction. Figure 22 contains the dependence plot of AgeCategory and Figure 23 that of SleepTime, while Figure 24 represents the dependence of the Asthma feature in the proposed model. These plots show the dependence of each feature for each data point: after choosing a feature, we plot a point for each data instance. The extent of the SHAP values on the vertical axis indicates how much each feature influences the prediction.
Figure 25 shows a waterfall plot that provides the complete information for a single instance, showing how each input feature contributes towards the predicted HD. Red and blue arrows indicate the positive and negative contributions of each feature, while the length of the arrows indicates the degree of each feature’s contribution. At the base of the waterfall plot is the expected value of the model’s output.
Figure 26 shows a summary plot that provides a broad overview and is more complex than the dependence plot. It not only provides the relative importance of features but also their actual relationships with the predicted value. The x-axis shows the SHAP values, while the y-axis shows the input features ranked from top to bottom. Every dot represents a single instance, where red indicates that a feature’s value for that instance is relatively high and blue indicates that it is relatively low.
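A minimal sketch of how such plots can be produced with the shap library is given below. Since the stacking ensemble is not a single tree model, a model-agnostic explainer such as KernelExplainer over predict_proba is one workable (if slow) option; the subset sizes and variable names are our own assumptions:

```python
import shap

# Small background sample keeps KernelExplainer tractable.
background = shap.sample(X_train, 100)
explainer = shap.KernelExplainer(ncdg.predict_proba, background)
shap_values = explainer.shap_values(X_test[:200])      # explain a subset of test rows

# Depending on the SHAP version, shap_values is a list with one array per class;
# index 1 corresponds to the positive (HD) class.
shap.summary_plot(shap_values[1], X_test[:200])                    # cf. Figure 26
shap.dependence_plot("AgeCategory", shap_values[1], X_test[:200])  # cf. Figure 22
```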

6. Conclusions

This paper presents an ensemble stacking model for HD prediction. Before passing data to our proposed model, we perform preprocessing by first converting string columns into integers using the factorization method. Then, we use the SMOTE and BorderLineSMOTE data balancing techniques to address the data class imbalance problem. Our proposed stacking model consists of four classifiers: three at the base layer (DT, NB and CatBoost) and one at the meta layer (GB). We also performed hard and soft voting using the VotingClassifier and compared the results with those of our proposed stacking model. SHAP is used for model interpretation. Various evaluation metrics are used for the validation of our proposed model. Moreover, simulations are performed, and our proposed model outperforms all the base classifiers, achieving 91% accuracy, 91% F1-Score, 91% precision, 91% recall, 97% AUC–ROC, 91% TPR and a 653 s execution time. Furthermore, we compared the results of different stacking models by using different classifiers at the meta layer and different balancing techniques. Among the alternative meta-layer classifiers, GB performs best, while DT performs worst.

Author Contributions

Conceptualization, N.A.; Methodology, M.A.; Software, S.Q.S.; Formal analysis, N.A.; Investigation, N.J.; Resources, M.A.; Writing—original draft, S.Q.S., N.J. and N.A. All authors have read and agreed to the published version of the manuscript.

Funding

This project is funded by Researchers Supporting Project number (RSPD2025R648), King Saud University, Riyadh, Saudi Arabia.

Data Availability Statement

The dataset is publicly available at: https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease, accessed on 29 December 2024.

Acknowledgments

Researchers Supporting Project number (RSPD2025R648), King Saud University, Riyadh, Saudi Arabia.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

AI: Artificial Intelligence
ANN: Artificial Neural Network
ASM: Attribute Selection Measures
AUC–ROC: Area Under the Curve–Receiver Operating Characteristics Curve
CatBoost: Categorical Boosting
CVD: Cardiovascular Disease
DT: Decision Tree
DL: Deep Learning
FCV: Fold Cross-Validation
GA: Gaussian Algorithm
GB: Gradient Boosting
HD: Heart Disease
KNN: K-Nearest Neighbors
ML: Machine Learning
NB: Naive Bayes
NCDG: Stacking of Naive Bayes, CatBoost, Decision Tree, Gradient Boosting
RF: Random Forest
SHAP: SHapley Additive exPlanations
SMOTE: Synthetic Minority Oversampling TEchnique
TPR: True Positive Rate
XAI: eXplainable Artificial Intelligence
A: Attribute
D: Dataset
$y_i$: Observed value
L: Loss function
p: Predicted probability
S: Root Node
$\hat{y}_i$: Predicted value

References

  1. Martin, S.S.; Aday, A.W.; Almarzooq, Z.I.; Anderson, C.A.; Arora, P.; Avery, C.L.; Baker-Smith, C.M.; Barone Gibbs, B.; Beaton, A.Z.; Boehme, A.K.; et al. 2024 heart disease and stroke statistics: A report of US and global data from the American Heart Association. Circulation 2024, 149, e347–e913. [Google Scholar] [PubMed]
  2. Yashudas, A.; Gupta, D.; Prashant, G.C.; Dua, A.; AlQahtani, D.; Reddy, A.S.K. DEEP-CARDIO: Recommendation System for Cardiovascular Disease Prediction Using IOT Network. IEEE Sens. J. 2024, 24, 14539–14547. [Google Scholar] [CrossRef]
  3. Hannan, A.; Cheema, S.M.; Pires, I.M. Machine learning-based smart wearable system for cardiac arrest monitoring using hybrid computing. Biomed. Signal Process. Control 2024, 87, 105519. [Google Scholar] [CrossRef]
  4. Chandrasekhar, N.; Peddakrishna, S. Enhancing Heart Disease Prediction Accuracy through Machine Learning Techniques and Optimization. Processes 2023, 11, 1210. [Google Scholar] [CrossRef]
  5. Nandy, S.; Adhikari, M.; Balasubramanian, V.; Menon, V.G.; Li, X.; Zakarya, M. An intelligent heart disease prediction system based on swarm-artificial neural network. Neural Comput. Appl. 2023, 35, 14723–14737. [Google Scholar] [CrossRef]
  6. Abdellatif, A.; Abdellatef, H.; Kanesan, J.; Chow, C.O.; Chuah, J.H.; Gheni, H.M. An effective heart disease detection and severity level classification model using machine learning and hyperparameter optimization methods. IEEE Access 2022, 10, 79974–79985. [Google Scholar] [CrossRef]
  7. Rani, P.; Kumar, R.; Ahmed, N.M.S.; Jain, A. A decision support system for heart disease prediction based upon machine learning. J. Reliab. Intell. Environ. 2021, 7, 263–275. [Google Scholar] [CrossRef]
  8. Nissa, N.; Jamwal, S.; Neshat, M. A Technical Comparative Heart Disease Prediction Framework Using Boosting Ensemble Techniques. Computation 2024, 12, 15. [Google Scholar] [CrossRef]
  9. Ali, M.M.; Paul, B.K.; Ahmed, K.; Bui, F.M.; Quinn, J.M.; Moni, M.A. Heart disease prediction using supervised machine learning algorithms: Performance analysis and comparison. Comput. Biol. Med. 2021, 136, 104672. [Google Scholar] [CrossRef] [PubMed]
  10. Esmaeili, P.; Roshanravan, N.; Mousavi, S.; Ghaffari, S.; Mesri Alamdari, N.; Asghari-Jafarabadi, M. Machine learning framework for atherosclerotic cardiovascular disease risk assessment. J. Diabetes Metab. Disord. 2023, 22, 423–430. [Google Scholar] [CrossRef] [PubMed]
  11. Baghdadi, N.A.; Farghaly Abdelaliem, S.M.; Malki, A.; Gad, I.; Ewis, A.; Atlam, E. Advanced machine learning techniques for cardiovascular disease early detection and diagnosis. J. Big Data 2023, 10, 144. [Google Scholar] [CrossRef]
  12. Wang, J.; Rao, C.; Goh, M.; Xiao, X. Risk assessment of coronary heart disease based on cloud-random forest. Artif. Intell. Rev. 2023, 56, 203–232. [Google Scholar] [CrossRef]
  13. Rimal, Y.; Sharma, N. Hyperparameter optimization: A comparative machine learning model analysis for enhanced heart disease prediction accuracy. In Multimedia Tools and Applications; Springer: Berlin/Heidelberg, Germany, 2023; pp. 1–17. [Google Scholar]
  14. Dalal, S.; Goel, P.; Onyema, E.M.; Alharbi, A.; Mahmoud, A.; Algarni, M.A.; Awal, H. Application of Machine Learning for Cardiovascular Disease Risk Prediction. Comput. Intell. Neurosci. 2023, 2023, 9418666. [Google Scholar] [CrossRef]
  15. Khan, A.; Qureshi, M.; Daniyal, M.; Tawiah, K. A Novel Study on Machine Learning Algorithm-Based Cardiovascular Disease Prediction. Health Soc. Care Community 2023, 2023, 1406060. [Google Scholar] [CrossRef]
  16. Abdellatif, A.; Mubarak, H.; Abdellatef, H.; Kanesan, J.; Abdelltif, Y.; Chow, C.O.; Chuah, J.H.; Gheni, H.M.; Kendall, G. Computational detection and interpretation of heart disease based on conditional variational auto-encoder and stacked ensemble-learning framework. Biomed. Signal Process. Control 2024, 88, 105644. [Google Scholar] [CrossRef]
  17. Rao, C.; Li, M.; Huang, T.; Li, F. Stroke Risk Assessment Decision-Making Using a Machine Learning Model: Logistic-AdaBoost. CMES-Comput. Model. Eng. Sci. 2024, 139, 699–724. [Google Scholar] [CrossRef]
  18. Ahmad, S.; Asghar, M.Z.; Alotaibi, F.M.; Alotaibi, Y.D. Diagnosis of cardiovascular disease using deep learning technique. Soft Comput. 2023, 27, 8971–8990. [Google Scholar] [CrossRef]
  19. Albert, A.J.; Murugan, R.; Sripriya, T. Diagnosis of heart disease using oversampling methods and decision tree classifier in cardiology. Res. Biomed. Eng. 2023, 39, 99–113. [Google Scholar] [CrossRef]
  20. Almazroi, A.A.; Aldhahri, E.A.; Bashir, S.; Ashfaq, S. A Clinical Decision Support System for Heart Disease Prediction Using Deep Learning. IEEE Access 2023, 11, 61646–61659. [Google Scholar] [CrossRef]
  21. Yongcharoenchaiyasit, K.; Arwatchananukul, S.; Temdee, P.; Prasad, R. Gradient Boosting Based Model for Elderly Heart Failure, Aortic Stenosis, and Dementia Classification. IEEE Access 2023, 11, 48677–48696. [Google Scholar] [CrossRef]
  22. Anusha, K.; Archana, M.; Janardhan, G. Heart Disease Prediction Using ML and DL Approaches. In Proceedings of the International Conference on Communications and Cyber Physical Engineering, Hyderabad, India, 28–29 February 2018; Springer Nature: Singapore, 2018; pp. 113–123. [Google Scholar]
  23. Kumar, J.; Pandey, V.; Tiwari, R.K. Predictive Modeling for Heart Disease Detection: A Machine Learning Approach. In Proceedings of the 2024 5th International Conference on Recent Trends in Computer Science and Technology (ICRTCST), Online, 9–10 April 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 668–675. [Google Scholar]
  24. John, B. When to Choose CatBoost over XGBoost or LIGHTGBM [Practical Guide]. 2023. Available online: https://neptune.ai/blog/when-to-choose-catboost-over-xgboost-or-lightgbm (accessed on 20 June 2024).
  25. Jabeur, S.B.; Gharib, C.; Mefteh-Wali, S.; Arfi, W.B. CatBoost model and artificial intelligence techniques for corporate failure prediction. Technol. Forecast. Soc. Chang. 2021, 166, 120658. [Google Scholar] [CrossRef]
  26. Shahani, N.M.; Kamran, M.; Zheng, X.; Liu, C.; Guo, X. Application of gradient boosting machine learning algorithms to predict uniaxial compressive strength of soft sedimentary rocks at Thar Coalfield. Adv. Civ. Eng. 2021, 2021, 1–19. [Google Scholar] [CrossRef]
  27. Pardede, J.; Pamungkas, D.P. The Impact of Balanced Data Techniques on Classification Model Performance. Sci. J. Inform. 2024, 11, 401–412. [Google Scholar]
  28. Trivedi, S.; Patel, N. The Determinants of AI Adoption in Healthcare: Evidence from Voting and Stacking Classifiers. ResearchBerg Rev. Sci. Technol. 2021, 1, 69–83. [Google Scholar]
Figure 1. NCDG stacking model for heart disease prediction.
Figure 2. Workflow of decision tree for heart disease prediction.
Figure 3. Workflow of Naive Bayes for heart disease prediction.
Figure 4. Workflow of gradient boosting for heart disease prediction.
Figure 5. Workflow of Categorical Boosting for heart disease prediction [26].
Figure 6. Generating a new sample in SMOTE.
Figure 7. Comparison of imbalanced and balanced dataset.
Figure 8. K-Fold Cross-Validation for heart disease prediction.
Figure 9. Results of stacking model NCDG and state-of-the-art classifiers with BorderLineSMOTE balancing technique for heart disease prediction.
Figure 10. Results of stacking model GNCD and state-of-the-art classifiers with BorderLineSMOTE balancing technique for heart disease prediction.
Figure 11. Results of stacking model DGNC and state-of-the-art classifiers with BorderLineSMOTE balancing technique for heart disease prediction.
Figure 12. Results of stacking model CDGN and state-of-the-art classifiers with BorderLineSMOTE balancing technique for heart disease prediction.
Figure 13. Results of stacking model NCDG and state-of-the-art classifiers with SMOTE balancing technique for heart disease prediction.
Figure 14. Results of stacking model GNCD and state-of-the-art classifiers with SMOTE balancing technique for heart disease prediction.
Figure 15. Results of stacking model DGNC and state-of-the-art classifiers with SMOTE balancing technique for heart disease prediction.
Figure 16. Results of stacking model CDGN and state-of-the-art classifiers with SMOTE balancing technique for heart disease prediction.
Figure 17. Proposed NCDG stacking model results on 10, 20, 30, 50, 80 and 100 CV for heart disease prediction.
Figure 18. AUC–ROC of stacking model NCDG and base classifiers for heart disease prediction.
Figure 19. Confusion matrix of NCDG stacking model for heart disease prediction.
Figure 20. Execution time of stacking model NCDG and base classifiers for heart disease prediction.
Figure 21. Force plot of stacking model NCDG for heart disease prediction.
Figure 22. Dependence plot of AgeCategory with Diabetes in the stacking model NCDG for heart disease prediction.
Figure 23. Dependence plot of SleepTime with Diabetes in the stacking model NCDG for heart disease prediction.
Figure 24. Dependence plot of Asthma with AgeCategory in the stacking model NCDG for heart disease prediction.
Figure 25. Waterfall plot of stacking model NCDG for heart disease prediction.
Figure 26. Summary plot of stacking model NCDG for heart disease prediction.
Table 1. Comparison results of proposed model with base classifiers using BorderLineSMOTE for heart disease prediction.

Classifier | Accuracy | F1-Score | Precision | Recall | Time (s) | AUC–ROC | TPR
Naive Bayes | 0.7401 | 0.7345 | 0.739 | 0.7338 | 0.5 | 0.8134 | 0.7306
CatBoost | 0.8832 | 0.8821 | 0.8814 | 0.8809 | 114 | 0.9612 | 0.8817
Decision Tree | 0.8815 | 0.8796 | 0.8789 | 0.8875 | 7 | 0.8779 | 0.8903
Gradient Boosting | 0.8421 | 0.8437 | 0.860 | 0.8035 | 101 | 0.9236 | 0.8612
NCDG | 0.9120 | 0.9123 | 0.9121 | 0.9115 | 653 | 0.9732 | 0.9118
AdaBoost | 0.8105 | 0.8198 | 0.805 | 0.8361 | 20 | 0.8011 | 0.8725
Random Forest | 0.8542 | 0.8526 | 0.8541 | 0.8543 | 99 | 0.9623 | 0.8802
GNCD | 0.8342 | 0.8325 | 0.8403 | 0.8230 | 972 | 0.8319 | 0.8217
DGNC | 0.8944 | 0.8928 | 0.8892 | 0.9134 | 512 | 0.9546 | 0.9115
CDGN | 0.9031 | 0.9025 | 0.8987 | 0.9109 | 995 | 0.9617 | 0.9107
Hard Voting | 0.8712 | 0.8624 | 0.8821 | 0.8537 | 291 | 0.8715 | 0.8518
Soft Voting | 0.8833 | 0.8817 | 0.8640 | 0.9116 | 203 | 0.8832 | 0.9101
Table 2. Comparison results of proposed model with base classifiers without balancing for heart disease prediction.

Classifier | Accuracy | F1-Score | Precision | Recall | Time (s) | AUC–ROC | TPR
Decision Tree | 0.8644 | 0.2398 | 0.2325 | 0.2475 | 4.8 | 0.3454 | 0.5412
Naive Bayes | 0.8479 | 0.3249 | 0.2635 | 0.4235 | 0.3 | 0.4324 | 0.4325
Gradient Boosting | 0.9147 | 0.1428 | 0.0823 | 0.5404 | 47 | 0.2343 | 0.3421
CatBoost | 0.9142 | 0.1606 | 0.5208 | 0.0949 | 58 | 0.3476 | 0.5623
AdaBoost | 0.9032 | 0.2247 | 0.3221 | 0.0743 | 675 | 0.4974 | 0.6743
Random Forest | 0.9143 | 0.2612 | 0.3192 | 0.0843 | 43 | 0.5346 | 0.6238
NCDG | 0.9144 | 0.0956 | 0.5522 | 0.0524 | 352 | 0.5423 | 0.5234
GNCD | 0.8672 | 0.2381 | 0.2361 | 0.2402 | 451 | 0.4356 | 0.6238
DGNC | 0.9144 | 0.1365 | 0.5328 | 0.0783 | 205 | 0.4823 | 0.4571
CDGN | 0.8756 | 0.3734 | 0.3307 | 0.4288 | 461 | 0.6234 | 0.4352
Table 3. Hyperparameters for SMOTE and BorderLineSMOTE.

Technique | Hyperparameter | Value
SMOTE | sampling_strategy | auto
BorderLineSMOTE | random_state | 42
Table 4. Comparison results of proposed model with base classifiers using SMOTE for heart disease prediction.

Classifier | Accuracy | F1-Score | Precision | Recall | Time (s) | AUC–ROC | TPR
Naive Bayes | 0.7236 | 0.7215 | 0.7240 | 0.7153 | 66 | 0.7912 | 0.7127
CatBoost | 0.8721 | 0.8718 | 0.8734 | 0.8721 | 122 | 0.9537 | 0.8749
Decision Tree | 0.8742 | 0.8741 | 0.8645 | 0.8793 | 7 | 0.8769 | 0.8824
Gradient Boosting | 0.8315 | 0.8357 | 0.8134 | 0.8517 | 83 | 0.9105 | 0.8428
NCDG | 0.9032 | 0.9023 | 0.9015 | 0.9037 | 455 | 0.9615 | 0.8938
AdaBoost | 0.7912 | 0.8034 | 0.7826 | 0.8254 | 17 | 0.7632 | 0.7835
Random Forest | 0.8543 | 0.8543 | 0.8547 | 0.8523 | 94 | 0.9628 | 0.8821
GNCD | 0.8242 | 0.8221 | 0.8329 | 0.8156 | 1005 | 0.8234 | 0.8137
DGNC | 0.8842 | 0.8815 | 0.8739 | 0.8876 | 522 | 0.9536 | 0.8937
CDGN | 0.8951 | 0.8914 | 0.8821 | 0.9043 | 948 | 0.9623 | 0.9024
Hard Voting | 0.8532 | 0.8547 | 0.8834 | 0.8225 | 182 | 0.8542 | 0.8231
Soft Voting | 0.8744 | 0.8726 | 0.8624 | 0.8941 | 187 | 0.8715 | 0.8946