Exploring Machine Learning for Predicting Cerebral Stroke: A Study in Discovery
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
General comment:
This manuscript describes a study for predicting cerebral stroke using state-of-the-art machine learning algorithms. The work is relevant in signal processing and data science for biomedical applications. Furthermore, the proposal is well-motivated and represents an advance in the knowledge for researchers and professionals working with advanced algorithms for biomedicine. The experimental framework is clear and the results are well supported. The manuscript is interesting and well-written. I have some points that should be addressed before the manuscript can be accepted.
Comment 1:
In section 3.4, the title has a typo in the word strokes (it says stokes).
Comment 2:
The authors claim to use ML techniques. However, there is no mention of unsupervised learning algorithms. For instance, what about data clustering and dimensionality reduction?
Comment 3:
It would be better to present the performance results graphically, instead of giving numbers as in Table 2.
Comment 4:
Regarding the outlook of the work, it would be worth adding some statements about deep learning algorithms and generative AI models for stroke prediction.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
This research investigates the application of robust Machine Learning (ML) algorithms, including Logistic Regression (LR), Random Forest (RF), and K-Nearest Neighbor (KNN), to the prediction of cerebral strokes. The data is generated using the Synthetic Minority Over-sampling Technique (SMOTE), Adaptive Synthetic Sampling (ADASYN), and Random Over-Sampling Technique (ROSE) to address class imbalances to improve the accuracy of minority classes. To address the challenge of forecasting strokes from partial and imbalanced physiological data, this study introduces a novel hybrid ML approach.
The reported research is of interest to the community. Some suggestions are listed below to improve the manuscript's quality (major revision):
1. The manuscript's motivations should be further highlighted, e.g., what problems remained in previous works, and how does this work solve them?
2. The authors must clearly explain the difference(s) between the proposed method and similar works in the introduction.
3. The authors should further highlight the manuscript's innovations and contributions.
4. In Section 1 (Introduction), the main contributions of this paper should be further summarized and clearly demonstrated.
5. All figures are missing from this paper; please add them to the revised paper.
6. The literature review in this paper is weak. I hope that the authors can add some new references to improve it. For example, https://doi.org/10.1109/JIOT.2023.3296460; https://ieeexplore.ieee.org/document/8846596; http://dx.doi.org/10.1109/TCSS.2022.3152091 and so on.
7. At Lines 135 and 136: "In the pursuit of exceptional precision, the dataset is thoughtfully partitioned into two segments: the training data, comprising 80%, and the testing data, making up the remaining 20%." Why are the training data 80% and the testing data 20%? Could a 70%/30% or 60%/40% split be used instead?
8. In expression (1), what are the physical meanings of the parameters, variables, and constants? Please provide them.
Minor editing of English language required
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
The authors compared the performance of three machine learning models with three oversampling techniques on the Harvard Dataverse Repository stroke dataset.
Below are my comments.
Abstract.
The data is generated using the Synthetic Minority Over-sampling Technique (SMOTE), Adaptive Synthetic Sampling (ADASYN), and Random Over-Sampling Technique (ROSE) to address class imbalances to improve the accuracy of minority classes. – What data were you sampling? The dataset is not introduced.
To address the challenge of forecasting strokes from partial and imbalanced physiological data, this study introduces a novel hybrid ML approach. – Please elaborate on the approach.
Introduction
The scientific community places a strong emphasis on creating predictive models for stroke with the aim of prevention, considering its significant societal impact. – using what type of data?
To facilitate the application of ML models in clinical practice, we selected data that physicians can readily monitor. – such as?
Related work
The 2nd paragraph is unclear – the authors cite 12, continue discussing that paper, and then cite 13 (an entirely different study)
Table 1 provides insights into previous works and their respective methodologies and accuracies, underscoring the ongoing advancements in this critical domain. – No Table 1 with this information
Furthermore, it’s noted that addressing the 3% missing information related to BMI is essential to enhance execution assessment. – this sentence is disconnected from the rest of the paragraph
Overall, for the related work, I suggest elaborating on ML method performance and on the source/type and size of the data (e.g. EHR, -omics, etc.) used to build the models for every study cited (some have this, but many do not).
Materials and Methods
The dataset encompasses 43,400 samples, characterized as a standard class unbalanced type – What is a "standard class unbalanced type"?
What is the case/control distribution of the dataset?
No referenced figures are available
The relationship between body mass index and intermediate glucose level is so minimal that it could be considered negligible. – why is this relevant for the data analysis section?
Notably, only one conceivable outcome exists for the correlation coefficient, demonstrating a negative but statistically insignificant association between BMI and stroke. – this sentence should be in the results
The entire section 3.1 is confusing and needs to be rewritten. It looks like the authors tried to describe the distribution of missing values in the dataset; a simple table or bar plot would be more efficient.
In this study, missing values are effectively imputed by leveraging the mean of other available values. – there are better methods (kNN imputation, iterative imputer)
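As a hedged illustration of this comment, the following minimal Python sketch contrasts mean imputation with a k-nearest-neighbour fill. The toy records, the column names (`age`, `glucose`, `bmi`), the helper names, and the choice of k = 2 are all assumptions made for the example, not taken from the manuscript.

```python
import math

# Toy records (assumed for illustration); the last one has a missing BMI.
rows = [
    {"age": 25, "glucose": 85, "bmi": 21.0},
    {"age": 27, "glucose": 90, "bmi": 23.0},
    {"age": 70, "glucose": 180, "bmi": 33.0},
    {"age": 72, "glucose": 175, "bmi": 35.0},
    {"age": 26, "glucose": 88, "bmi": None},
]


def mean_impute(values):
    """Replace every None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def knn_impute(records, target, features, k=2):
    """Fill a missing target with the mean target of the k nearest complete
    records, using Euclidean distance on the given features.
    Features are left unscaled here; real data would normally be scaled first."""
    complete = [r for r in records if r[target] is not None]
    filled = []
    for r in records:
        new = dict(r)
        if r[target] is None:
            point = [r[f] for f in features]
            nearest = sorted(
                complete,
                key=lambda c: math.dist(point, [c[f] for f in features]),
            )[:k]
            new[target] = sum(n[target] for n in nearest) / len(nearest)
        filled.append(new)
    return filled


mean_bmi = mean_impute([r["bmi"] for r in rows])[-1]
knn_bmi = knn_impute(rows, "bmi", ["age", "glucose"], k=2)[-1]["bmi"]

print(mean_bmi)  # 28.0, pulled towards the older, heavier records
print(knn_bmi)   # 22.0, taken from the two most similar records
```

In this toy case the mean fill assigns the young patient a BMI typical of the whole cohort, while the neighbour-based fill uses only similar patients.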
To tackle data imbalance, three oversampling techniques are employed to refine the final output. – What is the imbalance proportion?
In addition to the oversampling techniques, I suggest using a class-imbalance-sensitive metric on the original dataset (balanced accuracy, precision-recall curve). Depending on the degree of class imbalance, this will provide a more realistic estimate of model performance.
The dataset has a little over 20 features; feature selection is unnecessary.
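To make the point about imbalance-sensitive metrics concrete, here is a minimal Python sketch. The 2% positive rate and the helper names are assumptions chosen for the example, not the manuscript's actual class distribution. A classifier that always predicts "no stroke" scores high on plain accuracy but only 0.5 on balanced accuracy.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, where 1 is the stroke class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    """Fraction of all predictions that are correct."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity (recall on class 1) and specificity (recall on class 0)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return (sensitivity + specificity) / 2


# Assumed toy labels: 2 stroke cases among 100 patients.
y_true = [1] * 2 + [0] * 98
# A trivial model that never predicts a stroke.
y_pred_majority = [0] * 100

acc = accuracy(y_true, y_pred_majority)
bal = balanced_accuracy(y_true, y_pred_majority)

print(acc)  # 0.98
print(bal)  # 0.5
```

The gap between the two numbers grows with the degree of imbalance, which is why the case/control proportion matters for interpreting the reported accuracies.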
3.4. Classification of Stokes using Machine Learning Models – I don’t think the manuscript needs this section. All three methods are well known and considered basic knowledge in the ML field.
The same applies to the Evaluation Method section: the correlation coefficient and performance metrics are common knowledge from statistics textbooks.
I don’t think that Table 3 provides a fair comparison, since it was done for different datasets (e.g. you can’t compare image and EHR models). Also, the authors used an oversampling technique that could lead to inflated performance values.
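One common way oversampling inflates performance, which may or may not apply to the manuscript, is oversampling before the train/test split, so that copies of the same minority record land on both sides. The Python sketch below is an assumption-laden illustration: it uses plain random duplication, toy `(record_id, label)` tuples, and hypothetical helper names. SMOTE and ADASYN synthesize new points rather than duplicating, but the ordering concern is the same.

```python
import random


def label_of(row):
    """Rows are (record_id, label) tuples; label 1 is the minority class."""
    return row[1]


def random_oversample(rows, rng):
    """Duplicate minority-class rows at random until both classes are equal in size."""
    pos = [r for r in rows if label_of(r) == 1]
    neg = [r for r in rows if label_of(r) == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return rows + extra


def split(rows, test_fraction, rng):
    """Shuffle a copy of rows and return (train, test)."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rng = random.Random(0)
# Assumed toy data: 20 positives and 980 negatives.
data = [(i, 1 if i < 20 else 0) for i in range(1000)]

# Problematic order: oversample first, then split.
leaky_train, leaky_test = split(random_oversample(data, rng), 0.2, rng)
train_ids = {r[0] for r in leaky_train}
leaked = sum(1 for r in leaky_test if label_of(r) == 1 and r[0] in train_ids)

# Safer order: split first, then oversample the training part only.
train, test = split(data, 0.2, rng)
train_balanced = random_oversample(train, rng)
clean_overlap = {r[0] for r in train_balanced} & {r[0] for r in test}

print(leaked)              # positive test rows whose record also appears in training
print(len(clean_overlap))  # 0
```

With the second ordering the test set keeps the original class distribution, so metrics computed on it reflect the imbalance the model would face in practice.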
All figures are missing.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
This paper can be accepted now.
Comments on the Quality of English Language
This paper can be accepted now.
Author Response
There is no review. So, I have no attachment. Thanks.