Exploring Early Prediction of Chronic Kidney Disease Using Machine Learning Algorithms for Small and Imbalanced Datasets
Round 1
Reviewer 1 Report
Chronic kidney disease (CKD) is often diagnosed at a late stage. This study aims to help diagnose CKD early. In this study, the researcher needs to overcome the problem of imbalanced dataset and its size limit. Since the objective is to benefit developing countries, the researcher collects data for Brazil. The results showed that decision tree (DT), along with manual augmentation and the over-sampling technique of SMOTE, is capable of a very high degree of accuracy (98.99%). It can help design a diagnostic prediction system using imbalanced data and data with limited size to diagnose CKD at an early stage.
- First of all, captions next to equations are missing. Formula H(A) is missing an explanation for A and the variable n of Gini(n) does not appear to the right of the equal sign. It is doubted whether is the correct formula. FP’ in Precision should be corrected as “FP” and the FN’ in Recall should be corrected as “FN”. MCC should be corrected as .
- Four contributions of this study are summarized in the last paragraph of the introduction, the first of which is to propose a method for data oversampling. However, it is unclear whether the proposed method is manual augmentation as described in Preliminaries, or whether it refers to combining manual augmentation with automated augmentation.
Author Response
Dear reviewer,
We would like to thank you for the constructive comments. We have reviewed the overall article to improve readability and grammar. We present below a point-by-point response to your comments.
Best regards.
--------------------------------------------------------------------------
- First of all, captions next to equations are missing.
Response to 1. We have fixed the missing caption problem.
- Formula H(A) is missing an explanation for A and the variable n of Gini(n) does not appear to the right of the equal sign. It is doubted whether is the correct formula.
Response to 2. We revised all the equations to correct them.
In the second paragraph of Section 2.7, we include an explanation for the variable A referring to the formula H(A).
"The rules are based on information gain, which uses the concept of entropy H to measure the randomness of a discrete random variable A (with domain a1, a2,..., an) [36]. Entropy is used to calculate the difficulty of predicting the target attribute, where the entropy of A can be calculated by:"
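The quoted passage stops before the formula itself. As a minimal sketch only, the textbook Shannon entropy, H(A) = -sum over i of p(a_i) * log2 p(a_i), can be computed from a list of observed values as shown below. The function name `entropy` and the example labels are assumptions made for illustration; the manuscript's own notation and data may differ.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    # Each distinct value a_i contributes -p(a_i) * log2(p(a_i)).
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical risk labels, for illustration only (not taken from the study).
labels = ["low", "low", "moderate", "high"]
h = entropy(labels)
print(h)
```

A pure sample (all values identical) gives an entropy of zero, and the value grows as the classes become more evenly mixed, which is why a decision tree prefers splits that reduce it.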
- FP’ in Precision should be corrected as “FP” and the FN’ in Recall should be corrected as “FN”. MCC should be corrected as .
Response to 3. We revised all the equations to correct them.
- Four contributions of this study are summarized in the last paragraph of the introduction, the first of which is to propose a method for data oversampling. However, it is unclear whether the proposed method is manual augmentation as described in Preliminaries, or whether it refers to combining manual augmentation with automated augmentation.
Response to 4. In the last paragraph of the introduction, we have stated that the method relates to the combination of manual augmentation with automated augmentation.
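The automated part of such a combination can be illustrated with a SMOTE-style interpolation step. The sketch below is an assumption-laden illustration, not the authors' implementation: the function `smote_like`, its parameters, and the sample points are hypothetical, and the manual augmentation step is not shown because this record does not describe it.

```python
import math
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE's core idea).
    Requires at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical two-feature minority class (values are illustrative only).
minority = [(1.0, 2.0), (1.5, 2.5), (2.0, 2.0), (1.2, 1.8)]
new_points = smote_like(minority, n_new=3)
print(new_points)
```

Because every synthetic point lies on a segment between two real minority samples, the new data stay inside the region already occupied by that class.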
Author Response File: Author Response.pdf
Reviewer 2 Report
This paper is prepared as a supplemental report to the experimental results of the authors' IEEE ACCESS paper.
The data used in this study are valuable and the importance of this research is understandable. Several machine learning and oversampling methods have been tried, and the experimental results are reported in detail. These are considered to be the strengths of this paper.
On the other hand, the methods employed in this paper are generic and lack technical novelty. In addition, the paper is poorly organized and it is unclear what the point is. Therefore, an overall rewrite seems necessary. Detailed comments are as follows.
- The early prediction of CKD
The phrase "early prediction of CKD" appears several times in the paper, but there is no explanation of the relation between what this study does and early prediction of CKD. The fourth paragraph on p. 2 states that the study will address three issues, but there is no explanation of how this early prediction of CKD relates to the three issues. Specifically, what the early prediction of CKD is and what will be needed to address should be explained.
- Contributions of this study
Regarding the contributions of this paper described in the third paragraph on p. 3, since the comparison with other studies which deal with data oversampling for limited size of data and imbalanced data was not sufficiently made in this paper, it is questionable to include data oversampling, comparison of oversampling methods, and comparison of ML methods as the contributions of this paper. The paper states that one of the characteristics of this study is the use of data on a limited scale. When dealing with a limited number of the data, building a highly accurate learning model is, as described in this paper, one of the major problems. Meanwhile, the paper did not discuss the issue of statistical significance which is strongly related to the limited size of the data. The paper should discuss the issue.
- Related Work
The related work on p. 8, Section 3, does not mention any studies similar to the over sampling techniques and data conditions used in this paper. Since many studies deal with the problems of limited data, imbalanced data, and oversampling, the comparison with these studies should be made even though the type of the data used in this study is unique and has not been analyzed by other studies. In the last paragraph of this related work section, the authors should discuss the relationship between the current work and the related studies.
- Notational Issues
p.7 in Section 2.8
In A(y, y^), yl^ should be yi^. Equation 1(.) is explained as I(.)
The comma (,) and period (.) at the end of the expression are used indistinguishably.
' is used in Recall and MCC expressions without definition.
There is no definition for FMI, which should be given.
- Experimental Results
Please clarify what the discrimination performance was for the four classes: low, moderate, high and very high risk. When the limited number of imbalanced data are dealt, the discriminative performance of each class is very important; if macro F1-score was not calculated, please show the macro F1-score. Also, please show if overfitting has not occurred in a small number of data class. In addition, in Tables 3 to 5, the best performed results should be highlighted in a bold font so as to distinguish them from the others.
- Figures 1, 2 and 3
Figures 1, 2, and 3 are all unclear and the texts in the figures are difficult to read. They should be replaced with clearer ones. The results of RPC should be provided as well.
- Regarding Clinical Practice Context in Section 5
The contents of this section and Figures 3 and 4 are not mentioned in the Introduction and appear out of the blue. The content of this section should be explained in the Introduction.
- Discussion in Section 6
It is not clear what results are discussed in this section. For example, the mean accuracy scores of 98.99% or 97.99% are mentioned, but they are not found in Tables 3 to 5. It is unclear how they were obtained. The second and third paragraphs should also clarify which results are being referred to.
- Explanation table of abbreviations
There are many abbreviations used in this paper. Some of them are not explained until halfway. Please make an explanation table of the abbreviations in the paper to facilitate reference to their meanings.
- Typo
Dts of AdaBoosted Dts should be DTs.
Author Response
Dear reviewer,
We would like to thank you for the constructive comments. We have reviewed the overall article to improve readability and grammar. We present below a point-by-point response to your comments.
Best regards.
--------------------------------------------------------------------------
General comment:
On the other hand, the methods employed in this paper are generic and lack technical novelty. In addition, the paper is poorly organized and it is unclear what the point is. Therefore, an overall rewrite seems necessary.
Response to general comment: We reviewed the article to improve its organization and clarity. In the last paragraph of the introduction, we have included the following sentence:
"Therefore, the main technical novelty of this article relates to the presentation and evaluation of our oversampling approach that combines manual augmentation and automated augmentation."
However, we also advanced the state of the art by providing a comparison of data oversampling techniques, a comparison of validation methods, and a comparison of ML models to assist CKD early prediction in developing countries using imbalanced and limited-size datasets (last paragraph of the introduction). For example, in our previous study, we did not apply automated oversampling techniques, dynamic classifier selection, dynamic ensemble selection, or nested cross-validation.
Detailed comments are as follows.
- The early prediction of CKD
The phrase "early prediction of CKD" appears several times in the paper, but there is no explanation of the relation between what this study does and early prediction of CKD. The fourth paragraph on p. 2 states that the study will address three issues, but there is no explanation of how this early prediction of CKD relates to the three issues. Specifically, what the early prediction of CKD is and what will be needed to address should be explained.
Response to 1. We improved the description of early prediction. For example, we have included the following explanation.
"The fourth problem is the early prediction of CKD using risk levels (low risk, moderate risk, high risk, and very high risk) and a reduced number of biomarkers. CKD datasets with risk level evaluation are very scarce and of limited size. The majority of available datasets are composed of binary classes. The analyses based on risk levels enable patients to have more detailed explanations about the evaluation results."
Besides, in the fourth paragraph of the introduction, we also included the following sentence:
"CKD early prediction is relevant to improve CKD screening and reduce public health costs."
- Contributions of this study
Regarding the contributions of this paper described in the third paragraph on p. 3, since the comparison with other studies which deal with data oversampling for limited size of data and imbalanced data was not sufficiently made in this paper, it is questionable to include data oversampling, comparison of oversampling methods, and comparison of ML methods as the contributions of this paper. The paper states that one of the characteristics of this study is the use of data on a limited scale. When dealing with a limited number of the data, building a highly accurate learning model is, as described in this paper, one of the major problems. Meanwhile, the paper did not discuss the issue of statistical significance which is strongly related to the limited size of the data. The paper should discuss the issue.
Response to 2. We improved the related works section (pages 9, 10, and 11). We defined new subsections and included new references to improve the article. Moreover, we included a subsection in Section 4 on the statistical significance analysis (page 11).
- Related Work
The related work on p. 8, Section 3, does not mention any studies similar to the over sampling techniques and data conditions used in this paper. Since many studies deal with the problems of limited data, imbalanced data, and oversampling, the comparison with these studies should be made even though the type of the data used in this study is unique and has not been analyzed by other studies. In the last paragraph of this related work section, the authors should discuss the relationship between the current work and the related studies.
Response to 3. We improved the related works section (pages 9, 10, and 11). We discuss the relationship between our study and related works.
- Notational Issues
p.7 in Section 2.8
In A(y, y^), yl^ should be yi^. Equation 1(.) is explained as I(.)
The comma (,) and period (.) at the end of the expression are used indistinguishably.
' is used in Recall and MCC expressions without definition.
There is no definition for FMI, which should be given.
Response to 4. In the last paragraph of page 8, we included the definition of the Fowlkes-Mallows index (FMI) and its equation.
"Besides, the Fowlkes-Mallows index (FMI) is used to measure the similarity between two clusters, the measure varies between 0 and 1, where a high value indicates a good similarity. FMI is defined as the geometric mean between precision and recall, given by the equation:"
$$\mathrm{FMI} = \sqrt{\frac{TP}{TP + FP} \cdot \frac{TP}{TP + FN}}$$
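As a small worked illustration of the definition above (geometric mean of precision and recall), the following sketch computes FMI from confusion counts. The function name `fmi` and the counts are assumptions chosen for easy arithmetic, not results from the study.

```python
import math


def fmi(tp, fp, fn):
    """Fowlkes-Mallows index: geometric mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return math.sqrt(precision * recall)


# Hypothetical counts: precision = 8/10 = 0.8, recall = 8/16 = 0.5.
score = fmi(tp=8, fp=2, fn=8)
print(score)
```

With no false positives and no false negatives, both factors equal 1 and the index reaches its maximum of 1.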
Moreover, we reviewed all the equations to improve quality.
- Experimental Results
Please clarify what the discrimination performance was for the four classes: low, moderate, high and very high risk. When the limited number of imbalanced data are dealt, the discriminative performance of each class is very important; if macro F1-score was not calculated, please show the macro F1-score. Also, please show if overfitting has not occurred in a small number of data class. In addition, in Tables 3 to 5, the best performed results should be highlighted in a bold font so as to distinguish them from the others.
Response to 5. We improved the discussion about the discriminative performance of classes. We also calculated the macro F1-score and included the results in the main document and in the supplementary materials (https://bit.ly/3iwcwpK).
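To make the requested metric concrete, here is a minimal sketch of the macro F1-score: the F1-score is computed separately for each class and then averaged without weighting, so a rare risk level counts as much as a frequent one. The function `macro_f1` and the labels are hypothetical and only illustrate the calculation; they are not the study's data.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (every class counts equally)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 if the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels for the four risk levels (illustrative only).
y_true = ["low", "low", "low", "moderate", "high", "very high"]
y_pred = ["low", "low", "moderate", "moderate", "high", "high"]
result = macro_f1(y_true, y_pred)
print(result)
```

In this toy example four of six predictions are correct, yet the macro F1-score is lower than the accuracy because the single "very high" case is missed entirely, which is the kind of per-class weakness the reviewer asks to see.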
Besides, based on reference [26] (which we added to the article), we used the multiple stratified CV and nested CV methods to increase confidence in the evaluation of the models. In the last paragraph of Section 2.6, we included the following sentence.
"Besides investigating whether such methods satisfactorily control overfitting (by comparison), in this article, the evaluation results are relevant to increase confidence on the ML model embedded in our developed DSS (Section 5). Therefore, they enabled us to evaluate the quality of our approach."
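The idea behind nested CV can be sketched as follows: an inner loop selects a hyperparameter using only the outer-training data, and an outer loop scores the selected configuration on folds that played no part in the selection. Everything in this sketch is an assumption made for illustration: the toy one-dimensional k-nearest-neighbour model stands in for the authors' classifiers, the folds are not stratified, and the data are synthetic.

```python
import random


def k_folds(indices, k, seed):
    """Shuffle indices and deal them into k disjoint folds."""
    idx = list(indices)
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def knn_predict(train, x, k):
    """Majority vote among the k nearest 1-D training points."""
    nearest = sorted(train, key=lambda s: abs(s[0] - x))[:k]
    votes = [label for _, label in nearest]
    return max(sorted(set(votes)), key=votes.count)


def accuracy(train, test, k):
    return sum(knn_predict(train, x, k) == y for x, y in test) / len(test)


def nested_cv(data, candidates=(1, 3, 5), outer_k=3, inner_k=3, seed=0):
    """Return one held-out accuracy per outer fold."""
    outer_scores = []
    folds = k_folds(range(len(data)), outer_k, seed)
    for i, test_idx in enumerate(folds):
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        inner = k_folds(train_idx, inner_k, seed + 1)

        def inner_score(k):
            # Mean validation accuracy of candidate k over the inner folds.
            total = 0.0
            for m, val_idx in enumerate(inner):
                fit_idx = [j for f in inner[:m] + inner[m + 1:] for j in f]
                total += accuracy([data[j] for j in fit_idx],
                                  [data[j] for j in val_idx], k)
            return total / inner_k

        best_k = max(candidates, key=inner_score)
        # The outer test fold is touched only here, after selection is done.
        outer_scores.append(accuracy([data[j] for j in train_idx],
                                     [data[j] for j in test_idx], best_k))
    return outer_scores


# Synthetic, hypothetical data: class 0 for x < 9, class 1 otherwise.
data = [(float(v), 0) for v in range(9)] + [(float(v), 1) for v in range(9, 18)]
scores = nested_cv(data)
print(scores)
```

Keeping model selection inside the outer-training portion is what limits the optimistic bias discussed in reference [26].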
[26] Bias in error estimation when using cross-validation for model selection.
- Figures 1, 2 and 3
Figures 1, 2, and 3 are all unclear and the texts in the figures are difficult to read. They should be replaced with clearer ones. The results of RPC should be provided as well.
Response to 6. We improved the quality of the figures and explanations. The results of PRC are provided in our supplementary materials (https://bit.ly/3iwcwpK).
- Regarding Clinical Practice Context in Section 5
The contents of this section and Figures 3 and 4 are not mentioned in the Introduction and appear out of the blue. The content of this section should be explained in the Introduction.
Response to 7. In the introduction, we included the following paragraph:
"Besides, to deploy our approach, we developed a decision support system (DSS) to embed the ML model with the highest performance. In this article, the development of a DSS was relevant to discuss a clinical practice context, showing how our approach can be reused in a real-world scenario."
Besides, we reviewed the whole article to improve the contextualization.
- Discussion in Section 6
It is not clear what results are discussed in this section. For example, the mean accuracy scores of 98.99% or 97.99% are mentioned, but they are not found in Tables 3 to 5. It is unclear how they were obtained. The second and third paragraphs should also clarify which results are being referred to.
Response to 8. We improved the first, second, third, and fourth paragraphs to clarify what results they refer to. More specifically, as clarified in the main article, we have provided supplementary materials that included the description of additional results (https://bit.ly/3iwcwpK).
- Explanation table of abbreviations
There are many abbreviations used in this paper. Some of them are not explained until halfway. Please make an explanation table of the abbreviations in the paper to facilitate reference to their meanings.
Response to 9. We included a table of abbreviations (Table 1).
- Typo
Dts of AdaBoosted Dts should be DTs.
Response to 10. We corrected the typo.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
The authors addressed my comments accordingly. The manuscript is recommended for publication.
Author Response
We would like to thank the reviewer for the revisions and contributions.
Reviewer 2 Report
With this revision, I believe that the requirements of the previous comment have been largely met. On the other hand, the following points still need improvement.
- Supplementary Materials:
Since some important results are taken from the supplementary materials, that makes it very difficult for the reader to read; please transfer all of Tables S2 to S4, or at least the parts of them that refer to figures in the text, to the text.
- Need for explanation.
On p.16 “This is critical for low-income populations using the sub-system.”
Please add an explanation as to why this is the case.
- Error Correction
On p.11
where the lighter the greater ..
On p.14
Figures 3 and 3 4
Supplementary Materials
Figs. S5 and S6 please change to clear diagrams.
Author Response
We would like to thank you for the revisions and contributions. Below, we provide a point-by-point response.
Supplementary Materials:
Since some important results are taken from the supplementary materials, that makes it very difficult for the reader to read; please transfer all of Tables S2 to S4, or at least the parts of them that refer to figures in the text, to the text.
Response: To summarize our findings, we present the decision tree results (from Tables S2, S3, and S4 of our Supplementary Materials) in Table 7.
Need for explanation.
On p.16 “This is critical for low-income populations using the sub-system.”
Please add an explanation as to why this is the case.
Response: We improved the explanation as follows. "This is critical for low-income populations using the sub-system because a very large number of biomarkers increases costs, that usually cannot be afforded by such people. Indeed, a reduced number of biomarkers can include more users for this type of DSS that would be possibly excluded due to their limited financial resources."
Error Correction
On p.11 where the lighter the greater ..
Response: We improved the sentence as follows: "The values are represented by means of colors. Thus, the lighter the color, the greater the correlation between the variables."
On p.14 Figures 3 and 3 4
Response: We fixed it.
Supplementary Materials Figs. S5 and S6 please change to clear diagrams.
Response: We improved the quality of the figures and included one additional figure to increase readability.