Enhancing the Early Detection of Chronic Kidney Disease: A Robust Machine Learning Model

Arif, Muhammad Shoaib; Mukheimer, Aiman; Asif, Daniyal

doi:10.3390/bdcc7030144


by Muhammad Shoaib Arif ¹,²,*, Aiman Mukheimer ¹ and Daniyal Asif ³

¹ Department of Mathematics and Sciences, College of Humanities and Sciences, Prince Sultan University, Riyadh 11586, Saudi Arabia

² Department of Mathematics, Air University, PAF Complex E-9, Islamabad 44000, Pakistan

³ Department of Mathematics, COMSATS University Islamabad, Park Road, Islamabad 45550, Pakistan

* Author to whom correspondence should be addressed.

Big Data Cogn. Comput. 2023, 7(3), 144; https://doi.org/10.3390/bdcc7030144

Submission received: 11 July 2023 / Revised: 27 July 2023 / Accepted: 10 August 2023 / Published: 16 August 2023

(This article belongs to the Special Issue Big Data in Health Care Information Systems)


Abstract:

Clinical decision-making in chronic disorder prognosis is often hampered by high variance, leading to uncertainty and negative outcomes, especially in cases such as chronic kidney disease (CKD). Machine learning (ML) techniques have emerged as valuable tools for reducing randomness and enhancing clinical decision-making. However, conventional methods for CKD detection often lack accuracy due to their reliance on limited sets of biological attributes. This research proposes a novel ML model for predicting CKD, incorporating various preprocessing steps, feature selection, a hyperparameter optimization technique, and ML algorithms. To address challenges in medical datasets, we employ iterative imputation for missing values and a novel sequential approach for data scaling, combining robust scaling, z-standardization, and min-max scaling. Feature selection is performed using the Boruta algorithm, and the model is developed using ML algorithms. The proposed model was validated on the UCI CKD dataset, achieving outstanding performance with 100% accuracy. Our approach, combining innovative preprocessing steps, the Boruta feature selection, and the k-nearest neighbors algorithm, along with a hyperparameter optimization using grid-search cross-validation (CV), demonstrates its effectiveness in enhancing the early detection of CKD. This research highlights the potential of ML techniques in improving clinical support systems and reducing the impact of uncertainty in chronic disorder prognosis.

Keywords:

chronic kidney disease; machine learning; artificial intelligence; data science; healthcare; bioinformatics; big data

1. Introduction

CKD presents a significant global health challenge, affecting approximately 850 million people worldwide [1]. The kidneys, vital organs situated on both sides of the spine just below the ribcage, play a crucial role in maintaining the body’s internal environment by filtering the blood and removing waste products, excess fluids, and toxins through urine. Additionally, they regulate electrolyte levels, blood pressure, and the acid–base balance, while producing hormones that control calcium metabolism and stimulate red blood cell production [2,3].

CKD is characterized by a progressive and long-term decline in kidney function, leading to an inability to effectively filter waste and maintain fluid and electrolyte balance, resulting in the accumulation of waste products and fluid retention. The burden of CKD is immense, contributing to complications like electrolyte imbalances, bone disorders, anemia, and cardiovascular diseases [4,5,6]. If left untreated, CKD can progress to end-stage renal disease, necessitating dialysis or kidney transplantation [7,8]. Early detection and proper management of CKD are pivotal to preserving kidney function, slowing down the disease progression, and improving patient outcomes [9].

Despite its global prevalence and impact on public health, detecting CKD early and ensuring access to quality kidney care pose significant challenges, particularly in low- and middle-income countries with limited resources [10,11,12]. Traditional methods for CKD detection, such as blood tests and urinalysis, may have limitations in identifying the early stages of kidney damage and might not capture fluctuations in kidney health over time. Invasive procedures like kidney biopsy are unsuitable for routine screening, and imaging tests can be both expensive and time-consuming [13,14,15].

ML methods offer promising solutions to these challenges. ML algorithms can analyze large and complex datasets, improving the accuracy in CKD detection by identifying subtle patterns and trends that may go unnoticed with traditional methods. These models can incorporate various variables, enabling personalized risk assessments and tailored treatment plans. The efficiency of ML algorithms allows for quick processing of new patient data, facilitating timely diagnosis and intervention. Moreover, ML can predict CKD development in high-risk individuals, enabling early preventive measures [16,17,18].

In this paper, we investigate the feasibility and potential benefits of using ML for early CKD diagnosis. Our objective is to develop an ML model that incorporates data imputation, data scaling methods, split ratio, and optimal parameters, while evaluating classifiers based on their classification accuracy. The goal is to effectively detect CKD using ML algorithms such as the k-nearest neighbor and naive Bayes. Missing values are handled using iterative imputation, and a novel sequential data scaling method is introduced by combining robust scaling, z-standardization, and min–max scaling. Boruta feature selection is applied to identify important features, and the hyperparameters are tuned using grid-search CV. The testing accuracy of our proposed work is evaluated by comparing it to the results of various other studies.

The remaining sections of this paper are structured as follows: In Section 2, we conduct a comprehensive review of the existing literature and highlight the novelty of our work. Section 3 outlines the methodologies employed and presents the proposed system model. The experimental results are analyzed in Section 4. In Section 5, we engage in a discussion and compare our proposed model with other studies. Finally, the paper concludes in Section 6 by exploring potential avenues for future research.

2. Literature Review

In recent times, there has been a notable advancement in applying ML techniques to the field of healthcare, with a specific focus on early diagnosis and preventive measures [19,20,21]. This progress has also extended to the field of CKD, where numerous noteworthy studies have contributed to advancements in CKD research [17,22]. In this literature review, we provide a comprehensive overview of the current state of CKD research by thoroughly discussing the relevant studies. Our analysis includes a detailed examination of the methodologies employed, the findings obtained, and the limitations identified in each study. By doing so, we aim to present a comprehensive and unbiased understanding of the progress and challenges in CKD research.

A study by Debabrata et al. (2023) aimed to develop an ML model for early CKD detection using the UCI CKD dataset. The researchers employed imputation techniques, a sampling technique for data balancing, and data normalization. They selected nine features based on the chi-square test and used support vector machines for classification. However, the study had limitations, such as the exclusion of advanced imputation algorithms and the potential information loss from reducing the feature set [23].

In a study by Z. Ullah and M. Jamjoom (2023), the researchers aimed to predict CKD progression using a DT-based missing value imputation method. They performed feature selection using the filter method and employed the k-nearest neighbor algorithm for classification. However, the study did not utilize data scaling methods or hyperparameter optimization techniques [24].

A study conducted by A. Farjana et al. (2023) focused on CKD prediction using ML algorithms on the UCI CKD dataset. The researchers filled the missing data with mean values and employed hold-out validation. Light GBM demonstrated superior performance, but the study lacked advanced imputation techniques, outlier handling, data scaling, feature selection, and model optimization [25].

In a study by M. A. Islam et al. (2023), the researchers predicted CKD using ML algorithms. They used mean and mode techniques for missing data imputation and employed recursive feature elimination and principal component analysis for feature selection. However, the study did not utilize scaling methods or hyperparameter optimization techniques [26].

A study by M. M. Hassan (2023) focused on CKD prediction using ML on patients’ clinical records. The researchers used predictive mean matching for missing data imputation and performed data clustering using K-means. They employed the XGBoost approach with SHAP value analysis for feature selection. However, the study did not incorporate scaling methods or hyperparameter optimization [27].

In a study conducted by C. Kaur et al. (2023), the researchers utilized machine learning for CKD prediction. They employed Little’s MCAR test for missing data analysis and the Ant Colony Optimization algorithm for feature selection. They used ensemble methods and found that bagging produced the best results. However, the study did not employ scaling methods, cross validation, or hyperparameter optimization techniques [28].

Through the review of these studies, it is evident that several research gaps and limitations need to be addressed to further improve the field of CKD prediction. This study aims to specifically target these limitations and contribute novel approaches to the existing body of research. The key novelties of our work are as follows:

An advanced imputation method is employed to iteratively estimate missing values in the dataset. By implementing this technique, the completeness and quality of the dataset can be improved, leading to enhanced accuracy in the CKD prediction models.
A sequential approach to scaling the variables in the dataset is proposed. Robust scaling is initially used to adjust for outliers, ensuring that their influence is minimized. Subsequently, z-standardization is applied to further normalize the variables. Finally, min–max scaling is utilized to bring all features within a similar range.
To ensure the inclusion of only relevant and informative features, a robust feature selection algorithm called Boruta is utilized.
Various ML models are explored and evaluated using grid-search CV to identify the most suitable algorithm for accurately classifying CKD.
The performance of the proposed model is rigorously validated using a range of evaluation metrics, including accuracy, precision, recall, F1-score, and curve analysis.

By addressing these limitations and incorporating these novel approaches, we aim to contribute to the advancement of CKD prediction models and provide more accurate and reliable predictions for the early detection and prevention of CKD.

3. Methodology

This work presents a precise system for the detection of CKD through the utilization of a robust model. The proposed approach leverages ML techniques to construct a prediction model that is both effective and accurate. To visually depict the various stages of the proposed system, Figure 1 provides a schematic representation.

3.1. Data Collection

In order to validate our proposed ML model, we obtained the CKD dataset from the UCI ML Repository. The dataset contains a total of 400 samples, which we used for evaluating and validating our ML model in this study [29]. Each sample comprises 24 predictive variables, including 11 numerical variables and 13 categorical (nominal) variables. The dataset also includes a categorical response variable called ‘class’, which indicates the presence or absence of CKD. The ‘class’ variable has two distinct values: ‘ckd’ for samples diagnosed with CKD and ‘notckd’ for samples without CKD. To provide additional insights, a descriptive summary of the attributes involved in our comprehensive analysis is presented in Table 1.

3.2. Preprocessing

Medical datasets are prone to various issues that can have a negative impact on the performance of ML models. Therefore, it is crucial to address these challenges to improve the quality of the data. The preprocessing stage plays a vital role in enhancing data quality by tackling key issues such as data encoding, missing values, and outliers [30].

3.2.1. Data Encoding

To handle the combination of categorical and numeric features in the dataset, the label encoder module from the Scikit-learn library was used. This module transformed the categorical features into numeric representations, allowing for the improved performance of the machine learning model.
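As a minimal illustration of label encoding, the sketch below maps each category to an integer index. It is written in plain Python rather than with Scikit-learn, and the helper name `label_encode` is hypothetical; the sorted ordering of the categories is chosen to mirror how Scikit-learn's `LabelEncoder` orders its classes.

```python
def label_encode(values):
    """Map each distinct category to an integer (hypothetical helper)."""
    # Sorting makes the mapping deterministic: 'ckd' -> 0, 'notckd' -> 1.
    classes = sorted(set(values))
    index = {category: i for i, category in enumerate(classes)}
    return [index[v] for v in values], classes


codes, classes = label_encode(["ckd", "notckd", "ckd"])
print(codes, classes)
```

The returned `classes` list allows the integer codes to be mapped back to the original labels when results are reported.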

3.2.2. Data Imputation

Handling missing data requires choosing appropriate statistical methods based on the extent of missing data and the significance of the missing feature. Traditional techniques like mean, maximum, and mode work well with a low proportion of missing values [31]. In our study, we encountered a substantial amount of missing data, as illustrated in Figure 2.

To tackle this issue, we utilized iterative imputation, a statistical approach that iteratively estimates the missing values based on the observed data while considering the relationships between variables. This iterative process progressively refines the imputed values over multiple iterations, leading to a comprehensive and accurate estimation [32]. Algorithm 1 outlines the steps involved in constructing the iterative imputation process.

Algorithm 1 The iterative imputation pseudocode.

Input:
    Dataset X
    Features with missing values: F_{missing}
    Maximum iterations: η
    Convergence threshold: ϵ
Output: Imputed dataset X_{imputed}

procedure IterativeImputation(X, F_{missing}, η, ϵ)
    Initialize X_{imputed} ← X
    for each feature f in F_{missing} do
        Initialize missing mask M_{f} for feature f
        Initialize model \mathcal{M}_{f} (Linear Regression) for feature f
        Initialize convergence ← False
        Initialize iterations ← 0
        while not convergence and iterations < η do
            Fit model \mathcal{M}_{f} on X_{imputed}
            Predict missing values using \mathcal{M}_{f}
            Update X_{imputed} with predicted values
            Check for convergence using mean absolute change
            if CheckConvergence(X_{imputed}, f, ϵ) then
                convergence ← True
            end if
            Increment iterations
        end while
    end for
    return X_{imputed}
end procedure
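The idea behind Algorithm 1 can be sketched in plain Python for a toy table with two numeric columns. This is an illustrative simplification, not the implementation used in the study: the helper names `fit_line` and `iterative_impute` are hypothetical, each column is regressed on the single other column, the regression is fitted only on rows where the target was observed, and rounds form the outer loop with features inside. Scikit-learn provides an experimental `IterativeImputer` for real datasets.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x


def iterative_impute(rows, max_iter=10, eps=1e-6):
    """Impute None entries in a two-column table (hypothetical helper)."""
    mask = [[value is None for value in row] for row in rows]
    filled = [list(row) for row in rows]  # work on a copy of the input
    # Initial guess: mean of the observed values in each column.
    for col in range(2):
        observed = [row[col] for row in rows if row[col] is not None]
        col_mean = sum(observed) / len(observed)
        for r, row in enumerate(filled):
            if mask[r][col]:
                row[col] = col_mean
    for _ in range(max_iter):
        changes = []
        for col in range(2):
            other = 1 - col
            # Fit only on rows where the target value was actually observed.
            train = [(row[other], row[col])
                     for r, row in enumerate(filled) if not mask[r][col]]
            slope, intercept = fit_line([t[0] for t in train],
                                        [t[1] for t in train])
            for r, row in enumerate(filled):
                if mask[r][col]:
                    new_value = slope * row[other] + intercept
                    changes.append(abs(new_value - row[col]))
                    row[col] = new_value
        # Stop once the mean absolute change of the imputed cells is small.
        if not changes or sum(changes) / len(changes) < eps:
            break
    return filled


table = [[1.0, 2.0], [2.0, 4.0], [3.0, None], [4.0, 8.0]]
print(iterative_impute(table))
```

In this toy table the second column is exactly twice the first, so the missing entry is refined from the column mean towards the value predicted by the fitted line.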

3.2.3. Data Scaling

To address outliers and achieve data normalization, a sequential approach of scaling techniques was employed, as outlined in Algorithm 2. The process began with robust scaling, which reduces the impact of extreme values and enhances robustness. It involved subtracting the median (Q_{2}) and dividing by the interquartile range (Q_{3} - Q_{1}). This can be represented by the following equation:

Robust Scaling (x) = \frac{x - Q_{2}}{Q_{3} - Q_{1}} .

(1)

Next, z-score standardization was applied, resulting in a standardized distribution by subtracting the mean (μ) and dividing by the standard deviation (σ). This can be represented by the following equation:

Z-score Standardization (x) = \frac{x - μ}{σ} .

(2)

Finally, to bring the features within a specific range (typically 0 to 1), min–max scaling was performed by subtracting the minimum value (x_{\min}) and dividing by the range (x_{\max} - x_{\min}). This can be represented by the following equation:

Min-Max Scaling (x) = \frac{x - x_{\min}}{x_{\max} - x_{\min}} .

(3)

Algorithm 2 The sequential approach of scaling techniques.

Input: Dataset X_{imputed}
Output: Scaled dataset X_{scaled}

procedure SequentialScaling(X)
    Initialize X_{scaled}
    Apply Robust Scaling to X and store the result in X_{scaled}
    Apply Z-score Standardization to X_{scaled} and update X_{scaled}
    Apply Min–Max Scaling to X_{scaled} and update X_{scaled}
    return X_{scaled}
end procedure
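The three steps of Algorithm 2 can be written out for a single feature using only the Python standard library. The sketch below is illustrative and rests on assumptions not stated in the text: quartiles are computed with the inclusive method of `statistics.quantiles`, the population standard deviation is used in Equation (2), and the function names are hypothetical.

```python
import statistics


def robust_scale(xs):
    """Equation (1): subtract the median, divide by the interquartile range."""
    q1, q2, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return [(x - q2) / (q3 - q1) for x in xs]


def z_standardize(xs):
    """Equation (2): subtract the mean, divide by the standard deviation."""
    mu = statistics.fmean(xs)
    sigma = statistics.pstdev(xs)  # population standard deviation (assumption)
    return [(x - mu) / sigma for x in xs]


def min_max_scale(xs):
    """Equation (3): map the smallest value to 0 and the largest to 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def sequential_scale(xs):
    """Apply the three scalers in the order given in Algorithm 2."""
    return min_max_scale(z_standardize(robust_scale(xs)))


feature = [1.0, 2.0, 3.0, 4.0, 100.0]  # made-up values with one outlier
print(sequential_scale(feature))
```

Each step is an increasing linear rescaling of the feature, so the relative spacing of the values is preserved from one step to the next; comparing the output with `min_max_scale(feature)` is a quick way to inspect what the chain does to a given column.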

3.2.4. Feature Selection

Feature selection is a crucial step in ML, as it helps extract a subset of important features from the dataset. This process offers several benefits, including improved prediction accuracy, reduced model complexity, and enhanced interpretability.

In this study, we utilized the Boruta feature selection technique, which leverages random shadow features and an ML model. Boruta compares the importance of each feature to that of the shadow features iteratively, categorizing features as confirmed, tentative, or rejected based on their significance. Ultimately, Boruta provides a subset of the most significant features from the dataset. We implemented the technique using a random forest classifier as the base model to evaluate the feature importance. This classifier was trained on the dataset, including both original and shadow features, using measures such as the mean decrease in accuracy. The combination of the Boruta algorithm and the random forest classifier enabled us to identify the most relevant features for our analysis [33,34].

Algorithm 3 provides a concise overview of the Boruta feature selection algorithm, outlining the steps of initialization, iteration, feature evaluation, and the selection of confirmed features.

Algorithm 3 The Boruta feature selection pseudocode.

Input:
    Dataset X_{scaled} with n samples and m features
    Target variable y with n labels
    Random forest classifier with n_estimators and n_jobs
    Number of iterations max_iterations for Boruta algorithm
Output: Selected features selected_features

Initialize a set of tentative features T with all m features
Initialize an empty set of confirmed features C
Initialize an empty set of rejected features R
for iteration ← 1 to max_iterations do
    Fit the random forest classifier on X_{scaled} using features from T
    Perform a permutation test for each feature in T to evaluate its importance
    for each feature f in T do
        if the feature importance of f is significantly higher than random then
            Move f from T to C
        else
            Move f from T to R
        end if
    end for
    if T is empty then
        break
    end if
end for
selected_features ← C
return selected_features
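The shadow-feature comparison at the core of Boruta can be illustrated with a deliberately simplified sketch. It is not the full algorithm and not the study's implementation: the absolute Pearson correlation with the target stands in for random forest importance, the statistical test on hit counts is omitted, and the names `abs_correlation` and `shadow_hits` are hypothetical.

```python
import random


def abs_correlation(xs, ys):
    """Absolute Pearson correlation, used here as a stand-in importance score."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def shadow_hits(columns, y, n_rounds=20, seed=0):
    """Count how often each feature beats the best shuffled (shadow) feature."""
    rng = random.Random(seed)
    hits = {name: 0 for name in columns}
    for _ in range(n_rounds):
        # Shadow features: shuffled copies that keep each column's
        # distribution but break any relationship with the target.
        shadow_best = 0.0
        for values in columns.values():
            shadow = list(values)
            rng.shuffle(shadow)
            shadow_best = max(shadow_best, abs_correlation(shadow, y))
        for name, values in columns.items():
            if abs_correlation(values, y) > shadow_best:
                hits[name] += 1
    return hits


# Made-up data: one informative column and one constant column.
target = [0] * 10 + [1] * 10
features = {"signal": [float(i) for i in range(20)], "constant": [1.0] * 20}
print(shadow_hits(features, target))
```

A feature that repeatedly beats the best shadow feature is a candidate for confirmation, whereas one that rarely does is a candidate for rejection; the full algorithm decides between the two with a statistical test over the accumulated hits.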

The Boruta feature selection technique was applied to the UCI CKD dataset, resulting in the selection of 19 features, while 5 features were rejected. The features that were rejected include pus cell clumps, bacteria, potassium, coronary artery disease, and anemia. The selected 19 features were considered important for the classification task and were used for further analysis and model building. These selected variables are also clinically relevant to CKD, as supported by the previous literature [23,24,26,27]. The incorporation of these relevant features enhances the model’s ability to accurately identify and predict cases of CKD, making it a valuable tool for early detection and effective management of the condition.

3.3. Data Splitting

Data splitting is a crucial step in machine learning for reliable model evaluation and generalization [35]. It involves dividing the dataset into training and testing subsets:

Dataset = training data + testing data .

(4)

In this study, we used an 80:20 split ratio, where 80% of the dataset was allocated for training and the remaining 20% for testing. This ensures that the model learns from a significant portion of the data and is then evaluated on unseen data to assess its generalization performance.
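A hold-out split of this kind can be sketched as follows. The function name and the fixed seed are illustrative assumptions; the text does not state whether the split was shuffled or stratified.

```python
import random


def train_test_split_indices(n_samples, test_ratio=0.2, seed=42):
    """Return (train_indices, test_indices) after a seeded shuffle."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_ratio)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = train_test_split_indices(400)
print(len(train_idx), len(test_idx))
```

With the 400 samples of the dataset, an 80:20 ratio leaves 320 samples for training and 80 for testing.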

3.4. Model Training

During the model training phase, we employed two highly efficient ML classifiers: naïve Bayes and k-nearest neighbor. To optimize their performance, we utilized the hyperparameter optimization technique to tune the parameters of both algorithms.

3.4.1. Hyperparameter Optimization

Hyperparameter optimization is a critical step in ML to optimize the model performance by selecting the best combination of hyperparameters. In our study, we employed the widely used technique of grid-search CV. This approach systematically explores predefined grids of the hyperparameter values, evaluating the model’s performance for each combination using CV. By exhaustively searching through the hyperparameter space, it allows for a comprehensive exploration and selection of the optimal hyperparameter configuration [36,37]. The workflow of grid search CV for the selection of the hyperparameters is illustrated in Figure 3.
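The mechanics of grid-search CV can be sketched generically in plain Python. The names `k_fold_indices`, `grid_search_cv`, and `fit_and_score` are hypothetical, the folds are contiguous and unshuffled for simplicity, and the scoring callback in the usage example is a toy stand-in rather than a trained classifier.

```python
from itertools import product


def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size


def grid_search_cv(param_grid, fit_and_score, n_samples, k=5):
    """Try every combination in the grid; keep the best mean CV score."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [fit_and_score(params, train, val)
                  for train, val in k_fold_indices(n_samples, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


def toy_score(params, train_idx, val_idx):
    # Toy stand-in for "fit on train_idx, return accuracy on val_idx".
    penalty = 0 if params["weights"] == "distance" else 1
    return -(params["n_neighbors"] - 5) ** 2 - penalty


grid = {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]}
print(grid_search_cv(grid, toy_score, n_samples=320))
```

In practice the callback would train the classifier on the training indices of each fold and return its accuracy on the validation indices, so that every hyperparameter combination is judged on data it was not fitted to.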

3.4.2. Naïve Bayes

Naïve Bayes is a supervised algorithm that assumes feature independence during classification. It is particularly useful for datasets with a high number of input features. The algorithm considers all features, including those with weak effects on the prediction. The probabilistic model is represented by the equation:

P (A | B) = \frac{P (B | A) \cdot P (A)}{P (B)} .

(5)

In this equation, A denotes the class and B the observed feature values. The equation gives the probability of A given that B has been observed. The independence assumption of naïve Bayes applies to the features, which are treated as conditionally independent given the class. By applying this model, naïve Bayes predicts the class with the highest posterior probability.

In our study, we utilized the Gaussian naïve Bayes (NB) algorithm for classification. This variant assumes a Gaussian distribution for the features. It estimates the likelihood of observing specific feature values given a class label using the Gaussian probability density function.

The step-by-step procedure and essential hyperparameter choices for constructing the Gaussian NB model in this research are outlined in the pseudocode provided in Algorithm 4. The inputs are the training data and two hyperparameters, the variance smoothing parameter and the class priors, which are utilized to build the model. The algorithm begins by calculating the prior probability for each class and then estimates the mean and variance of the features for each class. Using Bayes’ theorem, it computes the posterior probability for each class given a new data point. Finally, the algorithm assigns the class with the highest posterior probability as the predicted class for the new data point [38].

Algorithm 4 The Gaussian NB pseudocode.

Input:
    Training dataset: X_{train} = {(x_{1}, y_{1}), (x_{2}, y_{2}), \dots, (x_{n}, y_{n})}
    New data point: x
    priors: [None, [0.5, 0.5], [0.3, 0.7]]
    var_smoothing: [1 \times 10^{-9}, 1 \times 10^{-8}, 1 \times 10^{-7}]
Output:
    Predicted class label: y

procedure GaussianNB(X_{train}, x, priors, var_smoothing)
    Calculate the prior probability for each class y_{i}: P(y_{i}) = \frac{count(y_{i})}{n} using priors
    for each feature x_{j} do
        if feature x_{j} is discrete then
            Calculate the proportion of occurrences of x_{j} in class y_{i}
        else if feature x_{j} is continuous then
            for each class y_{i} do
                Estimate the mean and variance of x_{j} using examples in class y_{i}
            end for
        end if
    end for
    Calculate the posterior probability for each class y_{i} using Bayes’ theorem:
        P(y_{i} | x) = \frac{P(x | y_{i}) \cdot P(y_{i})}{P(x)}
    Assign the class label with the highest posterior probability as the predicted class:
        y = \arg \max_{y_{i}} P(y_{i} | x)
    return y
end procedure
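The continuous-feature branch of Algorithm 4 can be sketched in plain Python. The function names are hypothetical and the data are made up. Two choices are assumptions of this sketch: the smoothing term is taken as `var_smoothing` times the largest overall feature variance, following the convention documented for Scikit-learn's `GaussianNB`, and log-probabilities are compared so that the constant denominator P(x) can be dropped.

```python
import math


def _mean_var(values):
    mean = sum(values) / len(values)
    return mean, sum((v - mean) ** 2 for v in values) / len(values)


def fit_gaussian_nb(X, y, var_smoothing=1e-9):
    """Estimate class priors and per-class feature means and variances."""
    n_features = len(X[0])
    # Small constant added to every variance for numerical stability.
    eps = var_smoothing * max(
        _mean_var([row[j] for row in X])[1] for j in range(n_features))
    model = {}
    for label in set(y):
        rows = [row for row, lab in zip(X, y) if lab == label]
        stats = []
        for j in range(n_features):
            mean, var = _mean_var([r[j] for r in rows])
            stats.append((mean, var + eps))
        model[label] = (len(rows) / len(X), stats)  # (prior, feature stats)
    return model


def predict_gaussian_nb(model, x):
    """Return the class with the highest log-posterior for point x."""
    best_label, best_logp = None, float("-inf")
    for label, (prior, stats) in model.items():
        logp = math.log(prior)
        for xj, (mean, var) in zip(x, stats):
            # Log of the Gaussian density of feature value xj in this class.
            logp += -0.5 * math.log(2 * math.pi * var) - (xj - mean) ** 2 / (2 * var)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label


X_train = [[1.0, 2.0], [1.2, 1.9], [0.9, 2.1], [3.0, 0.5], [3.2, 0.4], [2.9, 0.6]]
y_train = ["notckd"] * 3 + ["ckd"] * 3
nb_model = fit_gaussian_nb(X_train, y_train)
print(predict_gaussian_nb(nb_model, [3.1, 0.5]))
```

Here the priors are estimated from class frequencies, which corresponds to the `priors: None` option in the hyperparameter grid; fixed priors such as [0.5, 0.5] would replace the estimated value stored for each class.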

3.4.3. K-Nearest Neighbor

K-nearest neighbor is a simple and widely used supervised ML algorithm. It predicts the class of an observation by considering the classes of its k nearest neighbors, determined using a distance metric such as the Euclidean, Minkowski, or Manhattan distance. The equations for these distance metrics are as follows:

d_{E u c l i d e a n} (x_{i}, x_{j}) = \sqrt{\sum_{k = 1}^{d} {(x_{i_{k}} - x_{j_{k}})}^{2}},

(6)

d_{M i n k o w s k i} (x_{i}, x_{j}) = {(\sum_{k = 1}^{d} {| x_{i_{k}} - x_{j_{k}} |}^{p})}^{\frac{1}{p}},

(7)

d_{M a n h a t t a n} (x_{i}, x_{j}) = \sum_{k = 1}^{d} | x_{i_{k}} - x_{j_{k}} | .

(8)

In these equations, x_{i_{k}} and x_{j_{k}} represent the kth features of x_{i} and x_{j}, respectively, in a d-dimensional space.

Using these distance metrics, the algorithm identifies the k nearest neighbors of a data point and determines its class based on the majority class among those neighbors. It is a straightforward and intuitive algorithm, making it applicable to various classification tasks.

The step-by-step procedure and essential hyperparameter choices for constructing the k-nearest neighbor model in this research are outlined in the pseudocode provided in Algorithm 5. The inputs are the training data and the hyperparameters: the search algorithm, leaf size, distance metric, number of neighbors, Minkowski power parameter p, and weight function. The algorithm predicts the class label of a test instance by considering the majority class among its k nearest neighbors. It accomplishes this by calculating the distances between the test instance and training instances, selecting the k nearest neighbors, and determining the predicted class label through a majority voting process [39,40].

Algorithm 5 The k-nearest neighbors pseudocode.

Input:
    Training dataset: X_{train} = {(x_{1}, y_{1}), (x_{2}, y_{2}), \dots, (x_{n}, y_{n})}
    Test instance: X_{test}
    algorithm: [‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’]
    leaf_size: [20, 30, 40]
    metric: [‘euclidean’, ‘manhattan’, ‘minkowski’]
    n_neighbors: [3, 5, 7]
    p: [3, 4]
    weights: [‘uniform’, ‘distance’]
Output:
    Predicted class label for X_{test}: \hat{y}

procedure KNN(X_{train}, X_{test}, algorithm, leaf_size, metric, n_neighbors, p, weights)
    distances ← Empty list
    for each (x_{i}, y_{i}) in X_{train} do
        distance ← Calculate the distance between x_{i} and X_{test} using metric
        Add (distance, y_{i}) to distances
    end for
    Sort distances in ascending order based on distance
    k_nearest ← First n_neighbors elements from distances
    labels ← Extract labels from k_nearest
    if weights is ‘distance’ then
        Compute weights based on the distance for labels
    end if
    \hat{y} ← Perform a weighted majority vote of labels using weights
    return \hat{y}
end procedure
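A brute-force version of Algorithm 5 fits in a few lines of plain Python. The function names and the toy data are illustrative assumptions. The Minkowski distance of Equation (7) covers the other two metrics as special cases: p = 2 gives the Euclidean distance of Equation (6) and p = 1 the Manhattan distance of Equation (8). Weighting a neighbor by the inverse of its distance is the usual reading of the ‘distance’ option.

```python
from collections import defaultdict


def minkowski(a, b, p=2):
    """Equation (7); p=2 is Euclidean (6) and p=1 is Manhattan (8)."""
    return sum(abs(u - v) ** p for u, v in zip(a, b)) ** (1 / p)


def knn_predict(X_train, y_train, x, n_neighbors=3, p=2, weights="uniform"):
    """Predict the label of x by a (weighted) vote of its nearest neighbors."""
    # Brute-force search: distance from x to every training point.
    distances = sorted(
        (minkowski(xi, x, p), yi) for xi, yi in zip(X_train, y_train))
    votes = defaultdict(float)
    for dist, label in distances[:n_neighbors]:
        if weights == "distance":
            # Closer neighbors count more; an exact match dominates the vote.
            votes[label] += 1.0 / dist if dist > 0 else float("inf")
        else:
            votes[label] += 1.0
    return max(votes, key=votes.get)


X_train = [[1.0, 2.0], [1.2, 1.9], [0.9, 2.1], [3.0, 0.5], [3.2, 0.4], [2.9, 0.6]]
y_train = ["notckd"] * 3 + ["ckd"] * 3
print(knn_predict(X_train, y_train, [3.1, 0.5], n_neighbors=3))
```

The `algorithm` and `leaf_size` hyperparameters listed in Algorithm 5 concern how neighbors are searched for in tree-based implementations; they have no counterpart in this brute-force sketch.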

3.5. Performance Metrics

The effectiveness and accuracy of the developed ML models in this research were evaluated using various performance metrics. These metrics, including the accuracy, recall, precision, and F1-score, provided valuable insights into different aspects of the classifiers’ performance. The evaluation relied on a confusion matrix, which is presented in Table 2. The confusion matrix allowed for a comprehensive examination of the classification results. True positives (TP) represented instances correctly predicted as the positive class, while true negatives (TN) represented instances correctly predicted as the negative class. False positives (FP) were instances incorrectly predicted as the positive class, and false negatives (FN) were instances incorrectly predicted as the negative class. This evaluation approach facilitated a thorough assessment of the accuracy and effectiveness of the model in the early detection of CKD.

Accuracy = \frac{TN + TP}{TN + FP + TP + FN} \times 100 %

(9)

Precision = \frac{TP}{TP + FP} \times 100 %

(10)

Recall = \frac{TP}{TP + FN} \times 100 %

(11)

F1-score = 2 \times \frac{Precision \times Recall}{Precision + Recall}

(12)
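Equations (9) to (12) translate directly into code. In the sketch below the function name and the confusion-matrix counts are made up for illustration and are not results from the study; because precision and recall are passed to Equation (12) as percentages, the F1-score is also returned as a percentage.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1-score (all in percent)."""
    def pct(numerator, denominator):
        # Guard against empty denominators, e.g. no predicted positives.
        return 100.0 * numerator / denominator if denominator else 0.0

    accuracy = pct(tp + tn, tp + tn + fp + fn)
    precision = pct(tp, tp + fp)
    recall = pct(tp, tp + fn)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for an 80-sample test set.
print(classification_metrics(tp=48, tn=30, fp=2, fn=0))
```

In this hypothetical case two false positives lower the precision to 96% while the recall stays at 100%, which shows why accuracy alone does not describe the kinds of error a classifier makes.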

4. Results

An experimental study was conducted on the UCI CKD dataset, where the categorical features were encoded. The missing values were addressed using iterative imputation. A novel sequential approach was implemented for data scaling, involving robust scaling, z-standardization, and min–max scaling in that order. To perform feature selection, we utilized the Boruta algorithm. The dataset was divided into training and testing sets using an 80:20 ratio. For constructing the models, we employed ML techniques such as k-nearest neighbor and Gaussian NB. To optimize the model parameters, a grid-search CV was utilized. All preprocessing, visualization, and analysis tasks were carried out using Python.

In Figure 4, the confusion matrices are presented, depicting the performance of the models. Table 3 provides the optimal hyperparameters obtained through the grid-search CV, along with the performance metrics including the accuracy, precision, recall, and F1-score. It shows that the k-nearest neighbors model achieved a 100% accuracy, precision, recall, and F1-score, indicating excellent performance.

Figure 5 displays the evaluation of the model through the area under the ROC curve and the precision–recall curve. The k-nearest neighbor model achieved the highest performance, indicating its superiority as the best model for the early detection of CKD.

To assess the generalization capability of the trained models, a rigorous 15-fold CV technique was employed. The results, as depicted in Figure 6, illustrate the accuracy of both models on each fold, providing valuable insights into their performance. The k-nearest neighbor algorithm demonstrated remarkable consistency across diverse folds, achieving an exceptional accuracy of 99.37%. This high score highlights the model’s impressive performance and robustness, indicating its ability to generalize well to unseen data. In contrast, the Gaussian NB achieved a slightly lower CV accuracy of 97.05%.

5. Discussion

CKD is a critical condition, and accurate diagnosis plays a pivotal role in improving patient outcomes. To address this, our paper focuses on proposing a comprehensive ML model for CKD prediction. However, in implementing ML techniques for medical diagnosis, we must be mindful of the potential risks and ethical considerations. Complex ML models may lack interpretability, raising concerns about trust and accountability in the healthcare domain. Additionally, biases in training data can lead to discriminatory outcomes, exacerbating healthcare disparities, and handling sensitive patient information raises privacy and data security issues. Despite these challenges, ML models offer benefits like accurate and personalized diagnoses, identifying rare conditions, and adapting to changing scenarios. Therefore, striking a balance between the risks and benefits is essential to harness ML’s potential for improved medical diagnosis while upholding ethical standards and patient wellbeing.

As we embark on improving CKD prediction, it is crucial to address the existing challenges in the field of ML-based medical diagnosis. Commonly used sampling techniques in existing studies to balance datasets and improve accuracy may introduce artificial data, limiting real-world applicability. Handling missing data is another significant challenge in medical datasets, with mean or mode imputation methods potentially introducing biases. Some studies focus on using a reduced set of features to improve accuracy, but this approach may not generalize well in real-world scenarios. Additionally, data scaling, often overlooked, is a critical preprocessing step that can significantly impact model performance. In our approach, we systematically address these challenges to enhance CKD prediction accuracy. By utilizing iterative imputation for missing data, introducing a novel sequential data scaling method, and employing the Boruta algorithm for feature selection, we aim to create a robust and reliable model for CKD prediction. Through grid-search CV, we optimize the k-nearest neighbor and Gaussian NB algorithms, further refining the model’s performance.

To evaluate the efficacy of our proposed model, we validated it on the UCI CKD dataset. The k-nearest neighbors model achieved 100% accuracy, precision, recall, and F1-score (Table 3). We also compared our model with existing ML models developed on the same dataset. The comparison in Table 4 shows that the proposed model matches or exceeds the accuracy reported in previous studies.

Table 5 compares our k-nearest neighbors and naïve Bayes models with other studies that also employed these algorithms. We evaluated the models’ performance using the same dataset to validate the effectiveness of our preprocessing steps. The results show that our models consistently outperformed the previous studies, highlighting the impact of our preprocessing techniques in enhancing prediction accuracy.

6. Conclusions

This study successfully developed a robust ML model for the early detection of CKD. The model’s exceptional performance, achieving 100% accuracy, precision, recall, and F1-score on the UCI CKD dataset, validates its reliability and potential for clinical application. By incorporating various preprocessing steps and the Boruta algorithm for feature selection, our proposed model demonstrates its robustness in accurately identifying CKD cases. The results obtained through multiple performance metrics further strengthen the confidence in its accuracy. The implementation of this model as a reliable and accurate tool for early CKD detection holds great promise for improving clinical decision-making and ultimately enhancing patient outcomes. The potential impact of this research in advancing early diagnosis and management of CKD highlights its significance in addressing a critical global health challenge.

Limitations

The main limitation of this study is its reliance on a single dataset, the UCI CKD dataset, which contains a substantial number of missing values. While we employed iterative imputation to estimate the missing data, it is crucial to acknowledge the uncertainty introduced by imputation methods, which may influence the model’s predictive capability. Additionally, the generalizability of our findings to other populations and real-world scenarios needs further investigation. The model’s ability to handle diverse data sources and missing data patterns should be carefully examined in future research. Furthermore, the retrospective nature of the performance evaluation raises questions about the model’s ability to predict CKD in real-time or prospective settings. Addressing these limitations will strengthen the model’s reliability and applicability for early CKD detection, making it a more effective tool for clinical use.

Author Contributions

D.A., conceptualization, data curation, methodology, software, validation, visualization, writing—original draft; M.S.A., conceptualization, methodology, validation, project administration, visualization, writing—original draft; A.M., funding acquisition, supervision, writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

The authors would like to acknowledge the support of Prince Sultan University for paying the Article Processing Charges (APC) of this publication.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this study are publicly available at https://archive.ics.uci.edu/dataset/336/chronic+kidney+disease (accessed on 10 June 2023).

Acknowledgments

We would like to extend our gratitude to the Prince Sultan University, Riyadh, Saudi Arabia, for facilitating the publication of this paper through the Theoretical and Applied Sciences Lab.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

References

1. New Global Kidney Health Report Sheds Light on Current Capacity around the World to Deliver Kidney Care. Available online: https://www.theisn.org/blog/2023/03/30/new-global-kidney-health-report-sheds-light-on-current-capacity-around-the-world-to-deliver-kidney-care/ (accessed on 20 June 2023).
2. Wadei, H.M.; Textor, S.C. The role of the kidney in regulating arterial blood pressure. Nat. Rev. Nephrol. 2012, 8, 602–609.
3. Mukoyama, M.; Nakao, K. Hormones of the kidney. Basic Clin. Princ. 2005, 353–365.
4. Webster, A.C.; Nagler, E.V.; Morton, R.L.; Masson, P. Chronic kidney disease. Lancet 2017, 389, 1238–1252.
5. Kalantar-Zadeh, K.; Jafar, T.H.; Nitsch, D.; Neuen, B.L.; Perkovic, V. Chronic kidney disease. Lancet 2021, 398, 786–802.
6. Hall, M.E.; do Carmo, J.M.; da Silva, A.A.; Juncos, L.A.; Wang, Z.; Hall, J.E. Obesity, hypertension, and chronic kidney disease. Int. J. Nephrol. Renov. Dis. 2014, 75–88.
7. Ghaderian, S.B.; Hayati, F.; Shayanpour, S.; Mousavi, S.S.B. Diabetes and end-stage renal disease; a review article on new concepts. J. Ren. Inj. Prev. 2015, 4, 28.
8. Parmar, M.S. Chronic renal disease. BMJ 2002, 325, 85–90.
9. Wagner, L.A.; Tata, A.L.; Fink, J.C. Patient safety issues in CKD: Core curriculum 2015. Am. J. Kidney Dis. 2015, 66, 159–169.
10. Luyckx, V.A.; Al-Aly, Z.; Bello, A.K.; Bellorin-Font, E.; Carlini, R.G.; Fabian, J.; Garcia-Garcia, G.; Iyengar, A.; Sekkarie, M.; Van Biesen, W.; et al. Sustainable development goals relevant to kidney health: An update on progress. Nat. Rev. Nephrol. 2021, 17, 15–32.
11. Hoste, E.A.; Kellum, J.A.; Selby, N.M.; Zarbock, A.; Palevsky, P.M.; Bagshaw, S.M.; Goldstein, S.L.; Cerdá, J.; Chawla, L.S. Global epidemiology and outcomes of acute kidney injury. Nat. Rev. Nephrol. 2018, 14, 607–625.
12. Lin, M.Y.; Chiu, Y.W.; Lin, Y.H.; Kang, Y.; Wu, P.H.; Chen, J.H.; Luh, H.; Hwang, S.J.; iH3 Research Group. Kidney Health and Care: Current Status, Challenges, and Developments. J. Pers. Med. 2023, 13, 702.
13. Chen, T.K.; Knicely, D.H.; Grams, M.E. Chronic kidney disease diagnosis and management: A review. JAMA 2019, 322, 1294–1304.
14. Ferguson, M.A.; Waikar, S.S. Established and emerging markers of kidney function. Clin. Chem. 2012, 58, 680–689.
15. Lopez-Giacoman, S.; Madero, M. Biomarkers in chronic kidney disease, from kidney function to kidney damage. World J. Nephrol. 2015, 4, 57.
16. Shehab, M.; Abualigah, L.; Shambour, Q.; Abu-Hashem, M.A.; Shambour, M.K.Y.; Alsalibi, A.I.; Gandomi, A.H. Machine learning in medical applications: A review of state-of-the-art methods. Comput. Biol. Med. 2022, 145, 105458.
17. Sanmarchi, F.; Fanconi, C.; Golinelli, D.; Gori, D.; Hernandez-Boussard, T.; Capodici, A. Predict, diagnose, and treat chronic kidney disease with machine learning: A systematic literature review. J. Nephrol. 2023, 36, 1101–1117.
18. Ibrahim, I.; Abdulazeez, A. The role of machine learning algorithms for diagnosing diseases. J. Appl. Sci. Technol. Trends 2021, 2, 10–19.
19. Ghazal, T.M.; Hasan, M.K.; Alshurideh, M.T.; Alzoubi, H.M.; Ahmad, M.; Akbar, S.S.; Al Kurdi, B.; Akour, I.A. IoT for smart cities: Machine learning approaches in smart healthcare—A review. Future Internet 2021, 13, 218.
20. Asif, D.; Bibi, M.; Arif, M.S.; Mukheimer, A. Enhancing Heart Disease Prediction through Ensemble Learning Techniques with Hyperparameter Optimization. Algorithms 2023, 16, 308.
21. Siddique, S.; Chow, J.C. Machine learning in healthcare communication. Encyclopedia 2021, 1, 220–239.
22. Krisanapan, P.; Tangpanithandee, S.; Thongprayoon, C.; Pattharanitima, P.; Cheungpasitporn, W. Revolutionizing Chronic Kidney Disease Management with Machine Learning and Artificial Intelligence. J. Clin. Med. 2023, 12, 3018.
23. Swain, D.; Mehta, U.; Bhatt, A.; Patel, H.; Patel, K.; Mehta, D.; Acharya, B.; Gerogiannis, V.C.; Kanavos, A.; Manika, S. A Robust Chronic Kidney Disease Classifier Using Machine Learning. Electronics 2023, 12, 212.
24. Ullah, Z.; Jamjoom, M. Early detection and diagnosis of chronic kidney disease based on selected predominant features. J. Healthc. Eng. 2023, 2023, 3553216.
25. Farjana, A.; Liza, F.T.; Pandit, P.P.; Das, M.C.; Hasan, M.; Tabassum, F.; Hossen, M.H. Predicting Chronic Kidney Disease Using Machine Learning Algorithms. In Proceedings of the 2023 IEEE 13th Annual Computing and Communication Workshop and Conference, Las Vegas, NV, USA, 8–11 March 2023; pp. 1267–1271.
26. Islam, M.A.; Majumder, M.Z.H.; Hussein, M.A. Chronic kidney disease prediction based on machine learning algorithms. J. Pathol. Inform. 2023, 14, 100189.
27. Hassan, M.M.; Hassan, M.M.; Mollick, S.; Khan, M.A.R.; Yasmin, F.; Bairagi, A.K.; Raihan, M.; Arif, S.A.; Rahman, A. A Comparative Study, Prediction and Development of Chronic Kidney Disease Using Machine Learning on Patients Clinical Records. Hum.-Centric Intell. Syst. 2023, 3, 92–104.
28. Kaur, C.; Kumar, M.S.; Anjum, A.; Binda, M.B.; Mallu, M.R.; Al Ansari, M.S. Chronic Kidney Disease Prediction Using Machine Learning. J. Adv. Inf. Technol. 2023, 14, 384–391.
29. Rubini, L.; Soundarapandian, P.; Eswaran, P. Chronic Kidney Disease. UCI Machine Learning Repository. 2015. Available online: https://archive.ics.uci.edu/dataset/336/chronic+kidney+disease (accessed on 10 June 2023).
30. García, S.; Luengo, J.; Herrera, F. Data preprocessing in data mining. CA Cancer J. Clin. 2015, 72, 59–139.
31. Dong, Y.; Peng, C.Y.J. Principled missing data methods for researchers. SpringerPlus 2013, 2, 222.
32. Hoque, G. A Better Way to Handle Missing Values in your Dataset: Using IterativeImputer (PART I). Towards Data Sci. 2021. Available online: https://towardsdatascience.com/a-better-way-to-handle-missing-values-in-your-dataset-using-iterativeimputer-9e6e84857d98 (accessed on 20 June 2023).
33. Kursa, M.B.; Rudnicki, W.R. Feature selection with the Boruta package. J. Stat. Softw. 2010, 36, 1–13.
34. Python Implementations of the Boruta All Relevant Feature Selection Method. Available online: https://github.com/scikit-learn-contrib/boruta_py (accessed on 20 June 2023).
35. Joseph, V.R. Optimal ratio for data splitting. Stat. Anal. Data Mining ASA Data Sci. J. 2022, 15, 531–538.
36. Agrawal, T. Hyperparameter optimization using scikit-learn. In Hyperparameter Optimization in Machine Learning: Make Your Machine Learning and Deep Learning Models More Efficient; Apress: Berkeley, CA, USA, 2021; pp. 31–51.
37. Liashchynskyi, P.; Liashchynskyi, P. Grid search, random search, genetic algorithm: A big comparison for NAS. arXiv 2019, arXiv:1912.06059.
38. Alfaiz, N.S.; Fati, S.M. Enhanced credit card fraud detection model using machine learning. Electronics 2022, 11, 662.
39. Kataria, A.; Singh, M.D. A review of data classification using k-nearest neighbour algorithm. Int. J. Emerg. Technol. Adv. Eng. 2013, 3, 354–360.
40. Cunningham, P.; Delany, S.J. k-Nearest neighbour classifiers-A Tutorial. ACM Comput. Surv. (CSUR) 2021, 54, 1–25.
41. Nishat, M.M.; Faisal, F.; Dip, R.R.; Nasrullah, S.M.; Ahsan, R.; Shikder, F.; Asif, M.A.A.R.; Hoque, M.A. A comprehensive analysis on detecting chronic kidney disease by employing machine learning algorithms. EAI Endorsed Trans. Pervasive Health Technol. 2021, 7, e1.
42. Khalid, H.; Khan, A.; Zahid Khan, M.; Mehmood, G.; Shuaib Qureshi, M. Machine Learning Hybrid Model for the Prediction of Chronic Kidney Disease. Comput. Intell. Neurosci. 2023, 2023, 9266889.
43. Chittora, P.; Chaurasia, S.; Chakrabarti, P.; Kumawat, G.; Chakrabarti, T.; Leonowicz, Z.; Jasiński, M.; Jasiński, Ł.; Gono, R.; Jasińska, E.; et al. Prediction of chronic kidney disease-a machine learning perspective. IEEE Access 2021, 9, 17312–17334.
44. Ekanayake, I.U.; Herath, D. Chronic kidney disease prediction using machine learning methods. In Proceedings of the 2020 Moratuwa Engineering Research Conference (MERCon), Moratuwa, Sri Lanka, 28–30 July 2020; Volume 9, pp. 260–265.
45. Almustafa, K.M. Prediction of chronic kidney disease using different classification algorithms. Inform. Med. Unlocked 2021, 24, 100631.
46. Poonia, R.C.; Gupta, M.K.; Abunadi, I.; Albraikan, A.A.; Al-Wesabi, F.N.; Hamza, M.A. Intelligent diagnostic prediction and classification models for detection of kidney disease. Healthcare 2022, 10, 371.

Figure 1. Proposed workflow.

Figure 2. The missing values of the dataset.

Figure 3. The workflow of grid-search CV.

Figure 4. The confusion matrix of the models. (a) Gaussian NB; (b) K-nearest neighbor.

Figure 5. The area under the ROC curve and the precision–recall curve of the models. (a) Area under the ROC curve; (b) Precision–recall curve.

Figure 6. The accuracy of each fold.

Table 1. The feature information of the UCI CKD dataset.

Features	Representative	Description
Age	age	Patient’s age in years
Blood Pressure	bp	Patient’s blood pressure in mmHg
Specific gravity	sg	The ratio between urine density and water density
Albumin	al	Protein percentage in blood plasma (0, 1, 2, 3, 4, 5)
Sugar	su	Sugar level in blood plasma (0, 1, 2, 3, 4, 5)
Red blood cells	rbc	Percentage of red blood cells in blood plasma (normal, abnormal)
Pus cell	pc	White blood cells in urine (normal, abnormal)
Pus cell clumps	pcc	Sign of bacterial infection (present, not present)
Bacteria	ba	Sign of bacterial existence in urine (present, not present)
Blood glucose random	bgr	A random test of glucose in the blood in mgs/dL
Blood urea	bu	Percentage of urea nitrogen in blood plasma in mgs/dL
Serum creatinine	sc	Creatinine level in patient muscles
Sodium	sod	Sodium mineral level in blood in mEq/L
Potassium	pot	Potassium mineral level in blood in mEq/L
Hemoglobin	hemo	Red protein responsible for oxygen transport in the blood in gms
Packed cell volume	pcv	Volume of blood cells in a blood sample
White blood cell count	wc	Count of white blood cells
Red blood cell count	rc	Count of red blood cells
Hypertension	htn	Continuously high blood pressure condition (yes, no)
Diabetes mellitus	dm	Impairment in insulin production or response (yes, no)
Coronary artery disease	cad	Heart condition affecting blood supply (yes, no)
Appetite	appet	The desire to eat food (good, poor)
Pedal edema	pe	Swelling of the patient’s body (yes, no)
Anemia	ane	Insufficient healthy red blood cells (yes, no)
Diagnosis	class	Presence of diagnosed CKD (ckd, notckd)

Table 2. Confusion matrix.

	Predict Positive	Predict Negative
Actual Positive	$TP$	$FN$
Actual Negative	$FP$	$TN$
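
As a worked illustration of how the four cells of the confusion matrix map onto the reported metrics, the sketch below uses the standard definitions of accuracy, precision, recall, and F1-score. The counts passed in are hypothetical values chosen for illustration; they are not the confusion matrices shown in Figure 4.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts (an assumption for illustration, not the paper's data).
metrics = classification_metrics(tp=25, fn=1, fp=0, tn=14)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With no false positives the precision is 1, while a single false negative lowers recall and, through it, the F1-score.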

Table 3. The optimal hyperparameter and the performance of the model with selected features.

Model	Optimal Hyperparameters	Accuracy	Precision	Recall	F1-Score
Gaussian NB	priors = None, var_smoothing = $1 \times 10^{-9}$	97.5%	100%	96.15%	98.03%
K-Nearest Neighbors	algorithm = auto, leaf_size = 20, metric = euclidean, n_neighbors = 3, weights = uniform	100%	100%	100%	100%
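
The mechanics behind a grid-search CV result such as the one in Table 3 can be sketched in a few lines. This is a simplified, dependency-free illustration, not the authors' implementation: it searches only n_neighbors, uses Euclidean distance with uniform voting, builds folds by interleaving sample indices, and runs on a hypothetical toy dataset. All function names are illustrative.

```python
from collections import Counter
from itertools import product
from math import dist


def knn_predict(train_X, train_y, x, n_neighbors):
    """Majority vote among the n_neighbors closest training points."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], x))
    votes = Counter(label for _, label in ranked[:n_neighbors])
    return votes.most_common(1)[0][0]


def cv_accuracy(X, y, n_neighbors, n_folds):
    """Mean accuracy over n_folds folds built by interleaving indices."""
    scores = []
    for fold in range(n_folds):
        test_idx = [i for i in range(len(X)) if i % n_folds == fold]
        train_idx = [i for i in range(len(X)) if i % n_folds != fold]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        hits = sum(
            knn_predict(train_X, train_y, X[i], n_neighbors) == y[i]
            for i in test_idx
        )
        scores.append(hits / len(test_idx))
    return sum(scores) / n_folds


def grid_search(X, y, grid, n_folds=5):
    """Try every combination in the grid; keep the first best mean CV accuracy."""
    best_params, best_score = None, -1.0
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        score = cv_accuracy(X, y, n_folds=n_folds, **params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical toy data: two well-separated clusters of ten points each.
X = [(0.1 * i, 0.1 * i) for i in range(10)]
X += [(5 + 0.1 * i, 5 + 0.1 * i) for i in range(10)]
y = ["notckd"] * 10 + ["ckd"] * 10

best_params, best_score = grid_search(X, y, {"n_neighbors": [1, 3, 5]})
print(best_params, best_score)
```

On real data the grid would typically also cover the other hyperparameters listed in Table 3, and the folds would usually be shuffled or stratified by class.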

Table 4. Comparison of the proposed model with other studies on the UCI CKD dataset.

Authors	Model	Accuracy
D. Swain et al. [23]	Support Vector Machine	99.33%
Z. Ullah et al. [24]	K-nearest Neighbors	99.5%
A. Farjana et al. [25]	Light GBM	99%
M. A. Islam et al. [26]	Gradient Boosting	99%
M. M. Hassan et al. [27]	Neural Network	100%
C. Kaur et al. [28]	Random Forest	96%
M. M. Nishat et al. [41]	Support Vector Machine	99.36%
Our proposed model	K-nearest Neighbors	100%

Table 5. Comparison of the k-nearest neighbor and naïve Bayes models with other studies on the UCI CKD dataset.

Authors	K-Nearest Neighbors	Naïve Bayes
Z. Ullah et al. [24]	99.5%	-
A. Farjana et al. [25]	67.05%	94.76%
M. A. Islam et al. [26]	65%	94%
C. Kaur et al. [28]	74%	-
M. M. Nishat et al. [41]	79.25%	96.5%
H. Khalid et al. [42]	-	93%
P. Chittora et al. [43]	76.10%	-
I. U. Ekanayake et al. [44]	98.09%	94.34%
K. M. Almustafa [45]	95.75%	95%
R. C. Poonia et al. [46]	66.25%	95%
This study	100%	97.5%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

