Medical Data Transformations in Healthcare Systems with the Use of Natural Language Processing Algorithms
Round 1
Reviewer 1 Report
In this manuscript, the authors aim to analyse data from different healthcare sources (experimenting with two diverse healthcare datasets) by utilizing various ML methods and NLP to combat data heterogeneity and automate medical data transformation.
Overall, the research topic is well addressed and the manuscript is well written, whilst the captured results are quite well documented. However, the manuscript lacks many scientific details about the proposed methods, and it also lacks relevant research evidence for the suggested and implemented concepts, which is a shortcoming. Beyond those major drawbacks, the authors should address the following issues to improve both the content and the quality of their work:
Abstract
* The authors state “deploy selected machine learning algorithms on the problem cardiovascular disease diagnosis”. They should explain why they decided to apply ML to this disease. In general, they should better elaborate on the existing challenges that led them to perform this research.
* The authors state “The paper assesses the efficiency of various approaches of machine learning when applied in healthcare field”. Such approaches should be stated in the Abstract part.
* The authors should also describe the overall captured results as well as their major extracted conclusions.
1. Introduction
* The Introduction part should be enhanced with additional details regarding the investigated problem. Among these details, I suggest that the authors also include relevant statistics to support the problem statement.
* The whole Introduction part lacks relevant references that support the provided statements. Based on the provided research statements, some relevant up-to-date references that could enhance the statements’ validity are the following:
(i) For the part of “Different types of approaches may not perform as accurately as desired in the field of healthcare, or perform better or worse on different types of data.”, the authors can include the works of:
- Zhang, Lida, et al. "DynEHR: Dynamic adaptation of models with data heterogeneity in electronic health records." 2021 IEEE EMBS International Conference on Biomedical and Health Informatics (BHI). IEEE, 2021.
- Pérez Benito, Francisco Javier. Healthcare data heterogeneity and its contribution to machine learning performance. Diss. Universitat Politècnica de València, 2020.
- He, Jingrui. "Learning from Data Heterogeneity: Algorithms and Applications." IJCAI. 2017.
- Satti, Fahad Ahmed, et al. "Ubiquitous Health Profile (UHPr): a big data curation platform for supporting health data interoperability." Computing 102.11 (2020): 2409-2444.
(ii) For the part of “existing solutions cannot always be applied efficiently due to the lack of common data schema for healthcare organizations to rely on.”, the authors can include the works of:
- Dhayne, Houssein, et al. "In search of big medical data integration solutions-a comprehensive survey." IEEE Access 7 (2019): 91265-91290.
- Kiourtis, Athanasios, et al. "Gaining the semantic knowledge of healthcare data through syntactic models transformations." 2017 International Symposium on Computer Science and Intelligent Controls (ISCSIC). IEEE, 2017.
- Khnaisser, Christina, et al. "Using an ontology to derive a sharable and interoperable relational data model for heterogeneous healthcare data and various applications." Methods of Information in Medicine AAM (2022).
The same notion should be followed for the rest of the statements (e.g., “As a result, the task of automating the conversion of this data still remains important to this day.”, “Different possibilities of handling this task are still being explored”, “Natural language processing might be a possible solution to the problem of heterogeneity if it could be forced to analyse column labels in the data structure itself”, “Unit conversion within the context of heterogeneous datasets is yet another issue that requires exploration.”, “Another problem in regards to medical data is that different diagnoses is the multicollinearity, which can distort the end result due to the classifiers receiving the same data twice.”), including up-to-date remarkable references across the relevant text.
2. Related Work on Machine Learning for Medical System
* The authors state “The basic used algorithms are the Support Vector Machines (SVM), variations of Neural Networks, Logistic Regression and Naive Bayes Classifiers [17]”. Such statement is not totally valid. There exists a plethora of additional algorithms that are widely used, such as Random Forest, KNN, and XGBoost. The authors should better study this part and update the text correspondingly.
* The authors state “such as data heterogeneity and lack of good enough hardware”. To support these claims, they should provide relevant references. Especially for the data heterogeneity part, which is investigated in depth in this work, the authors should consider the following key research works:
- Ganie, Shahid Mohammad, Majid Bashir Malik, and Tasleem Arif. "Machine Learning Techniques for Big Data Analytics in Healthcare: Current Scenario and Future Prospects." Telemedicine: The Computer Transformation of Healthcare. Springer, Cham, 2022. 103-123.
- Kiourtis, Athanasios, et al. "Aggregating heterogeneous health data through an ontological common health language." 2017 10th International Conference on Developments in eSystems Engineering (DeSE). IEEE, 2017.
- Pfaff, Emily Rose, et al. "Fast Healthcare Interoperability Resources (FHIR) as a meta model to integrate common data models: development of a tool and quantitative validation study." JMIR medical informatics 7.4 (2019): e15199.
- Mavrogiorgou, Argyro, et al. "The road to the future of healthcare: Transmitting interoperable healthcare data through a 5G based communication platform." European, Mediterranean, and Middle Eastern Conference on Information Systems. Springer, Cham, 2018.
* The authors state “here is a psychological problem that lies in a certain amount of mistrust towards machine learning in fields as crucial as medicine.”. To verify those cases, they should provide relevant references.
* The authors mention the “FIS structure” and the “Mamdani and the Sugeno FIS”. They should explain these concepts in the paper or at least provide relevant references.
* In rows 96-126 the authors describe a variety of ML concepts without explaining how these concepts are relevant to the rest of the content of this Section. Maybe a linking sentence between rows 95-96 would resolve this issue.
* The authors state “such as be names…”. Do they mean “such as by names…”? If so, they should update the text.
* Rows 127-151 should be a unified paragraph. Moreover, this part should be enhanced with additional research works, since currently the provided works are very limited.
3. Health Data Formats in Medical Systems
* The authors have provided a figure (Figure 1) with an example of a medical record in plain text. I suggest that the authors also add a corresponding example (via a figure) for the table and image formats.
* The authors state “Deterministic Sensitiity Analisys and Probbilitistic Sensitiity Analisys”. They should correct the spelling of these words.
* Rows 219-229 should be a unified paragraph. Also, this part should be enhanced with additional research works, since currently the provided works are very limited for a journal work.
4. Proposal for Data Transformation to Resolve Incompatibilities Between Medical Formats
* A figure of the overall architecture of the followed approach could help in better understanding the whole process. The authors are kindly requested to provide such figure in their manuscript, making sure that all the processes analyzed in all the sub-sections of Section 4 are illustrated in the figure.
* The authors state “For these reasons, the three algorithms chosen to be tested in this study are TWNFI, TR-SVM and Naive Bayes classifier, as a component to evaluate cardiovascular disease risk based on transformed data that was taken from various sources and automatically transformed.”. Why did the authors choose to study cardiovascular disease risk? They should explain the reasons behind this.
* In rows 277-284 the authors describe the proposed data transformation and representation part, but the text leaves many details unspecified (e.g., (i) the data is processed according to the type – which type?, (ii) format accepted by the system – what kind of format?). The authors should better elaborate on the technical details of this part.
* The example provided in Figure 2 should be explained in the text as well. Also, the content of Figure 2 is not clearly visible. The authors should update the quality of the figure.
* It would be easier for the reader to provide Figure 3 in a horizontal flow instead of a vertical one (the one that is provided). Also, the authors should depict in the figure the various flows that can be followed (as stated in the text).
* Figure 5 should be updated, since it currently seems to be broken. In any case, the authors should make sure that this figure depicts step by step the flow that is outlined in the text. Also, a horizontal view of the figure would be preferable (as for Figure 3).
5. Prototype of Algorithms and Data Conversion
* Among lines 414-422 the authors should also outline the specifications of the device/system where the experiments were performed.
* In sub-Section 5.1, the authors could provide a table that would depict the features of each dataset accompanied by a short description per feature. Moreover, in the same sub-Section, the authors should make clear the type of data that those datasets include (i.e., text, tables, or images).
* The authors state “The data is normalized after import”. How was this import performed? Was it manual or automatic?
* The authors state “During the testing, it was revealed that the first dataset contains outlier entries”. They should be more concrete about those outliers and provide the relevant results. Currently it seems that this is just an assumption.
* The authors must better explain the variables of the two formulas that are provided in sub-Section 5.3.
* It seems to me that sub-Section 5.4 better fits in Section 4 rather than this Section. In general, the authors should better organize the content between the Section of the proposed approach (Section 4) and the Section of the developed prototype (Section 5).
* In sub-Section 5.5, the authors state “as well as the entire set”. They should state the number of records in the 2 chosen datasets.
* In sub-Section 5.5, the authors state “The results are compared between the 4 approaches”. They should remind the reader of those 4 approaches.
6. Analysis of Experiment Results
* Figure 7 should be updated by providing its content with bigger fonts.
* Even if all the captured performance metrics are very clear, the result that was produced among the different algorithms is not made clear in this Section. The authors should better elaborate such concept in the discussion of the results.
* The authors should also include a part in this Section for summarizing the limitations of the conducted work. Maybe they could provide such information in sub-Section 6.7.
* Overall, the text should be revised completely, with grammar re-checked and the writing style revised across the whole document. Re-checking by a native English speaker is desirable.
Author Response
The authors would like to thank the reviewers for their valuable comments that helped to improve the quality of the manuscript.
Reviewer #1: In this manuscript, the authors aim to analyse data from different healthcare sources (experimenting with two diverse healthcare datasets) by utilizing various ML methods and NLP to combat data heterogeneity and automate medical data transformation. Overall, the research topic is well addressed and the manuscript is well written, whilst the captured results are quite well documented. However, the manuscript lacks many scientific details about the proposed methods, and it also lacks relevant research evidence for the suggested and implemented concepts, which is a shortcoming. Beyond those major drawbacks, the authors should address the following issues to improve both the content and the quality of their work:
- The authors state “deploy selected machine learning algorithms on the problem cardiovascular disease diagnosis”. They should explain why they decided to apply ML to this disease. In general, they should better elaborate on the existing challenges that led them to perform this research.
An explanation was added to the Abstract.
- The authors state “The paper assesses the efficiency of various approaches of machine learning when applied in healthcare field”. Such approaches should be stated in the Abstract part.
The methods were added to the Abstract.
- The authors should also describe the overall captured results as well as their major extracted conclusions.
The results were added to the Abstract.
- The Introduction part should be enhanced with additional details regarding the investigated problem. Among these details, I suggest that the authors also include relevant statistics to support the problem statement.
Thank you for your comment; however, statistical issues are not the topic of this manuscript.
- The whole Introduction part lacks relevant references that support the provided statements. Based on the provided research statements, some relevant up-to-date references that could enhance the statements’ validity are the following:
References were added.
- (i) For the part of “Different types of approaches may not perform as accurately as desired in the field of healthcare, or perform better or worse on different types of data.”, the authors can include the works of:
References were added.
- (ii) For the part of “existing solutions cannot always be applied efficiently due to the lack of common data schema for healthcare organizations to rely on.”, the authors can include the works of:
References were added.
- The same notion should be followed for the rest of the statements (e.g., “As a result, the task of automating the conversion of this data still remains important to this day.”, “Different possibilities of handling this task are still being explored”, “Natural language processing might be a possible solution to the problem of heterogeneity if it could be forced to analyse column labels in the data structure itself”, “Unit conversion within the context of heterogeneous datasets is yet another issue that requires exploration.”, “Another problem in regards to medical data is that different diagnoses is the multicollinearity, which can distort the end result due to the classifiers receiving the same data twice.”), including up-to-date remarkable references across the relevant text.
References were added.
- The authors state “The basic used algorithms are the Support Vector Machines (SVM), variations of Neural Networks, Logistic Regression and Naive Bayes Classifiers [17]”. Such statement is not totally valid. There exists a plethora of additional algorithms that are widely used, such as Random Forest, KNN, and XGBoost. The authors should better study this part and update the text correspondingly.
More algorithms are listed here.
- The authors state “such as data heterogeneity and lack of good enough hardware”. To support these claims, they should provide relevant references. Especially for the data heterogeneity part, which is investigated in depth in this work, the authors should consider the following key research works:
References were added.
- The authors state “here is a psychological problem that lies in a certain amount of mistrust towards machine learning in fields as crucial as medicine.”. To verify those cases, they should provide relevant references.
References were added.
- The authors mention the “FIS structure” and the “Mamdani and the Sugeno FIS”. They should explain these concepts in the paper or at least provide relevant references.
Two paragraphs were added.
- In rows 96-126 the authors describe a variety of ML concepts without explaining how these concepts are relevant to the rest of the content of this Section. Maybe a linking sentence between rows 95-96 would resolve this issue.
Paragraphs were added.
- The authors state “such as be names…”. Do they mean “such as by names…”? If so, they should update the text.
Corrected to “by”.
- Rows 127-151 should be a unified paragraph. Moreover, this part should be enhanced with additional research works, since currently the provided works are very limited.
Improved and references were added.
- The authors have provided a figure (Figure 1) with an example of a medical record in plain text. I suggest that the authors also add a corresponding example (via a figure) for the table and image formats.
The comment is not clear to us. We provided a few examples in the manuscript.
- The authors state “Deterministic Sensitiity Analisys and Probbilitistic Sensitiity Analisys”. They should correct the spelling of these words.
Corrected.
- Rows 219-229 should be a unified paragraph. Also, this part should be enhanced with additional research works, since currently the provided works are very limited for a journal work.
Improved and references were added.
- The authors state “For these reasons, the three algorithms chosen to be tested in this study are TWNFI, TR-SVM and Naive Bayes classifier, as a component to evaluate cardiovascular disease risk based on transformed data that was taken from various sources and automatically transformed.”. Why did the authors choose to study cardiovascular disease risk? They should explain the reasons behind this.
Two paragraphs were added at the beginning of section 4.
- In rows 277-284 the authors describe the proposed data transformation and representation part, but the text leaves many details unspecified (e.g., (i) the data is processed according to the type – which type?, (ii) format accepted by the system – what kind of format?). The authors should better elaborate on the technical details of this part.
Information has been refined.
- The example provided in Figure 2 should be explained in the text as well. Also, the content of Figure 2 is not clearly visible. The authors should update the quality of the figure.
A paragraph was added to subsection 4.2.
- It would be easier for the reader to provide Figure 3 in a horizontal flow instead of a vertical one (the one that is provided). Also, the authors should depict in the figure the various flows that can be followed (as stated in the text).
We are sorry, but a horizontal flow for this figure would exceed the width of the page.
- Figure 5 should be updated, since it currently seems to be broken. In any case, the authors should make sure that this figure depicts step by step the flow that is outlined in the text. Also, a horizontal view of the figure would be preferable (as for Figure 3).
Because this comment is not clear to us, we removed Figure 5 from the manuscript.
- Among lines 414-422 the authors should also outline the specifications of the device/system where the experiments were performed.
Information was added.
- In sub-Section 5.1, the authors could provide a table that would depict the features of each dataset accompanied by a short description per feature. Moreover, in the same sub-Section, the authors should make clear the type of data that those datasets include (i.e., text, tables, or images).
Description of both datasets was added.
- The authors state “The data is normalized after import”. How was this import performed? Was it manual or automatic?
The datasets were downloaded from the websites.
- The authors state “During the testing, it was revealed that the first dataset contains outlier entries”. They should be more concrete about those outliers and provide the relevant results. Currently it seems that this is just an assumption.
It was explained in the text.
- The authors must better explain the variables of the two formulas that are provided in sub-Section 5.3.
An explanation was added to the text.
- It seems to me that sub-Section 5.4 better fits in Section 4 rather than this Section. In general, the authors should better organize the content between the Section of the proposed approach (Section 4) and the Section of the developed prototype (Section 5).
The structure of sections 4 and 5 was changed.
- In sub-Section 5.5, the authors state “as well as the entire set”. They should state the number of records in the 2 chosen datasets.
The number of records was given in the text of the manuscript.
- In sub-Section 5.5, the authors state “The results are compared between the 4 approaches”. They should remind the reader of those 4 approaches.
The 4 approaches were given in the text of the manuscript.
- Figure 7 should be updated by providing its content with bigger fonts.
Improved.
- Even if all the captured performance metrics are very clear, the result that was produced among the different algorithms is not made clear in this Section. The authors should better elaborate such concept in the discussion of the results.
The analysis of obtained results was presented in the manuscript.
- The authors should also include a part in this Section for summarizing the limitations of the conducted work. Maybe they could provide such information in sub-Section 6.7.
A paragraph was added to subsection 6.7.
- Overall, the text should be revised completely, with grammar re-checked and the writing style revised across the whole document. Re-checking by a native English speaker is desirable.
The manuscript was improved.
The authors would like to thank the reviewer for the valuable comments that helped to improve the quality of the manuscript.
Reviewer 2 Report
[Comment 1] Novelty
The authors need to prove their manuscript's novelty points (lines 46-52) by contrasting them with other related studies, while citing the specific reference(s).
[Comment 2] Solution methodology and results
[Subcomment 2a] (Figure 2) Why is "increased" not included in the table?
[Subcomment 2b] I believe that the authors should have provided examples for important parts in Sections 5.3-5.4 to ensure the readers understand how the method works.
[Comment 3] Reference and citation
[Subcomment 3a] (lines 4-9) Please cite the supporting references.
[Subcomment 3b] The citation numbers should be started from 1. Please revise all citation numbers.
[Subcomment 3c] The authors should have cited references that support why solution methods in lines 259-260 were selected. They must also explain about two different Naive Bayes (NB) methods used in the papers as well (not only providing references for the general NB).
[Subcomment 3d] For each detailed methodology (including the one in lines 297-311), the authors should clearly state whether they proposed it or whether it was totally copied from a reference (if copied, the authors must cite the reference).
[Subcomment 3e] (Section 5.2) I think the authors need to mention the .py filenames only when they provide the files on their online repository (that could be accessed by the readers).
[Comment 4] Writing quality, clarity, and errors
[Subcomment 4a] I cannot find the first affiliation below the title. Do the authors refer to the "current address"?
[Subcomment 4b] Please revise mistyped words: "artificual" -> "artificial". There are too many typing mistakes. I do not think the authors prepared this manuscript appropriately. Please seriously check the whole manuscript and be responsible before submitting its revision.
[Subcomment 4c] The manuscript still has grammatical mistakes. Please revise them, e.g., line 46.
[Subcomment 4d] Please enlarge all text in figures, tables, and equations to be as large as (or only slightly smaller than) texts in the main part of the manuscript to ensure readability.
[Subcomment 4e] I could not see Figure 5. Please revise it.
Author Response
The authors would like to thank the reviewers for their valuable comments that helped to improve the quality of the manuscript.
Reviewer #2:
[Comment 1] Novelty
The authors need to prove their manuscript's novelty points (lines 46-52) by contrasting them with other related studies, while citing the specific reference(s).
More references were added.
[Comment 2] Solution methodology and results
[Subcomment 2a] (Figure 2) Why is "increased" not included in the table?
We are sorry, but this comment is not clear to us.
[Subcomment 2b] I believe that the authors should have provided examples for important parts in Sections 5.3-5.4 to ensure the readers understand how the method works.
Some paragraphs were added in the manuscript.
[Comment 3] Reference and citation
[Subcomment 3a] (lines 4-9) Please cite the supporting references.
References were added, e.g., in the Introduction and the following sections.
[Subcomment 3b] The citation numbers should be started from 1. Please revise all citation numbers.
References were improved.
[Subcomment 3c] The authors should have cited references that support why solution methods in lines 259-260 were selected. They must also explain about two different Naive Bayes (NB) methods used in the papers as well (not only providing references for the general NB).
References were added.
[Subcomment 3d] For each detailed methodology (including the one in lines 297-311), the authors should clearly state whether they proposed it or whether it was totally copied from a reference (if copied, the authors must cite the reference).
We are sorry, but this comment is not clear to us.
[Subcomment 3e] (Section 5.2) I think the authors need to mention the .py filenames only when they provide the files on their online repository (that could be accessed by the readers).
We listed the file names to show the level of complexity of the project.
[Comment 4] Writing quality, clarity, and errors
[Subcomment 4a] I cannot find the first affiliation below the title. Do the authors refer to the "current address"?
This form of affiliation was generated by the journal's LaTeX template; it is given in the last line.
[Subcomment 4b] Please revise mistyped words: "artificual" -> "artificial". There are too many typing mistakes. I do not think the authors prepared this manuscript appropriately. Please seriously check the whole manuscript and be responsible before submitting its revision.
The manuscript was checked.
[Subcomment 4c] The manuscript still has grammatical mistakes. Please revise them, e.g., line 46.
The manuscript was checked.
[Subcomment 4d] Please enlarge all text in figures, tables, and equations to be as large as (or only slightly smaller than) texts in the main part of the manuscript to ensure readability.
The text in the figures and tables was enlarged.
[Subcomment 4e] I could not see Figure 5. Please revise it.
The figure was removed to avoid problems with viewing it.
The authors would like to thank the reviewer for the valuable comments that helped to improve the quality of the manuscript.
Reviewer 3 Report
The paper proposed different solutions regarding automating medical data transformation. Those solutions are tested on both natural and artificially created datasets. Selected algorithms used for diagnosis were implemented and tested.
Good review of references and good explanation.
I have some issues with the paper:
Table 9 must be explained better
Section 4.4 - a machine learning algorithm for missing data imputation should be used. That is why the handling of Boolean values is unclear to me.
Tables 10-11 - why are the same results obtained each time?
Author Response
The authors would like to thank the reviewers for their valuable comments that helped to improve the quality of the manuscript.
Reviewer #3:
The paper proposed different solutions regarding automating medical data transformation. Those solutions are tested on both natural and artificially created datasets. Selected algorithms used for diagnosis were implemented and tested. Good review of references and good explanation. I have some issues with the paper:
- Table 9 must be explained better
A paragraph was added.
- Section 4.4 - a machine learning algorithm for missing data imputation should be used. That is why the handling of Boolean values is unclear to me.
A paragraph was added. However, more extensive use of such algorithms is out of the scope of this manuscript.
- Tables 10-11 - why are the same results obtained each time?
The results in both tables are similar but different.
The authors would like to thank the reviewer for the valuable comments that helped to improve the quality of the manuscript.
Round 2
Reviewer 1 Report
The authors have successfully addressed all the raised comments.
Author Response
The authors would like to thank the reviewer for the valuable comments that helped to improve the quality of the manuscript.
Reviewer 2 Report
[Comment 1] Novelty
(lines 58-64) The authors must contrast each novelty point with previous studies, while mentioning the reference number, e.g., “we consider …, while <reference> only considered…” to provide clarity about the novelty of the study.
[Comment 2] Solution methodology and results
[Subcomment 2a] (Figure 2) Why is "increased" not included in the table? -> I meant: why is the word “increased” not summarized in the figure? It is an important adjective and should be listed, rather than only the noun. If necessary, you can ignore this comment.
[Comment 3] Reference and citation
[Subcomment 3d] For each detailed methodology (lines 359-371), the authors should clearly state whether they proposed it or whether it was totally copied from a reference (if copied, the authors must cite the reference). -> When presenting the steps of the proposed algorithm, it is necessary to show clearly which steps were proposed in previous studies and which by the authors (if some steps are taken exactly from a reference, it should be cited). This is necessary to clarify the contribution of the authors.
Author Response
The authors would like to thank the reviewer for the valuable comments that helped to improve the quality of the manuscript.
[Comment 1] Novelty
(lines 58-64) The authors must contrast each novelty point with previous studies, while mentioning the reference number, e.g., “we consider …, while <reference> only considered…” to provide clarity about the novelty of the study.
An explanation was given in the Introduction. References were added.
[Comment 2] Solution methodology and results
[Subcomment 2a] (Figure 2) Why is "increased" not included in the table? -> I meant: why is the word “increased” not summarized in the figure? It is an important adjective and should be listed, rather than only the noun. If necessary, you can ignore this comment.
The word/column “increased” could be added to the tables, as could other important features, depending on the content of the tables. This figure is an example illustrating the Transformation of Text in the framework of feature extraction.
[Comment 3] Reference and citation
[Subcomment 3d] For each detailed methodology (lines 359-371), the authors should clearly state whether they proposed it or whether it was totally copied from a reference (if copied, the authors must cite the reference). -> When presenting the steps of the proposed algorithm, it is necessary to show clearly which steps were proposed in previous studies and which by the authors (if some steps are taken exactly from a reference, it should be cited). This is necessary to clarify the contribution of the authors.
The presented set of steps for feature extraction was based on the NLP and NER concepts. References to NLP and NER were added where necessary.
All added/changed paragraphs in the manuscript were marked in blue.
Reviewer 3 Report
I have no additional comments, thank you.
Author Response
The authors would like to thank the reviewer for the valuable comments that helped to improve the quality of the manuscript.
Round 3
Reviewer 2 Report
Thank you for your revisions.