Medical Data Transformations in Healthcare Systems with the Use of Natural Language Processing Algorithms
Abstract
1. Introduction
- Assesses the efficiency of various machine learning approaches applied in the healthcare field, with particular emphasis on the heterogeneity of medical data sources;
- Applies existing machine learning classification algorithms combined with NLP to medical data transformation for the problem of cardiovascular disease in healthcare systems;
- Proposes solutions to the problems of plain-text data transformation and data heterogeneity with the help of natural language processing.
2. Related Work on Machine Learning for Medical Systems
3. Health Data Formats in Medical Systems
- Syntactic heterogeneity occurs when data sources are expressed in different languages;
- Conceptual heterogeneity, alternatively known as semantic heterogeneity or logical mismatch, refers to differences in modelling the same domain of interest;
- Terminological heterogeneity represents differences in names and labels for the same entities when they are referred to from different data sources;
- Semiotic heterogeneity, or pragmatic heterogeneity, denotes different interpretations of entities by people.
4. Proposal for Data Transformation to Resolve Incompatibilities between Medical Formats
4.1. Data Transformation and Representation
4.2. Feature Extraction
- Find all entities in column names;
- For each entity, check the existing labels for similarity using stemming and/or lemmatization. If a similar label exists, move on to step 3; otherwise assume there is no equivalent feature, fill the missing values with defaults and proceed to step 6;
- If nearby words exist, perform semantic analysis, as they might contain units. If the target dataset also has a unit specified in its column name, utilize word vectors to establish a relation between them;
- Assign a dependency between the two features, then proceed to step 6;
- Repeat step 2 until all columns either have a dependency or are marked as not having one;
- Save the dependency list for possible future use and merge the tables in accordance with it.
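The label matching in steps 1–3 can be sketched as follows. This is a minimal illustration, not the paper's implementation: `SequenceMatcher` is used here as a naive string-similarity stand-in for the stemming and word-vector comparison described above, and all function names are ours.

```python
from difflib import SequenceMatcher

def normalize(label: str) -> str:
    """Crude stand-in for stemming/lemmatization: lowercase, strip separators."""
    return label.lower().replace("_", " ").strip()

def similarity(a: str, b: str) -> float:
    """Stand-in for word-vector similarity; returns a score in [0, 1]."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def match_columns(source_cols, target_cols, threshold=0.4):
    """Map each source column to its most similar target column, or None."""
    dependencies = {}
    for src in source_cols:
        best, best_score = None, 0.0
        for tgt in target_cols:
            score = similarity(src, tgt)
            if score > best_score:
                best, best_score = tgt, score
        dependencies[src] = best if best_score >= threshold else None
    return dependencies

deps = match_columns(["sbp", "age", "alcohol"], ["ap_hi", "age (days)", "alco"])
```

Lexical similarity alone matches "alcohol" to "alco" and "age" to "age (days)", but leaves "sbp" unmatched against "ap_hi"; this is exactly the kind of gap that the semantic (word-vector) analysis in step 3 is meant to close.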
4.3. Unit Discrepancy Identification and Automatic Conversion
4.4. Missing Values and Their Replacement
4.5. Multicollinearity and Its Detectability after Data Transformation
4.6. Experimental Approach to Unit Conversion
- Choose N samples from the dataset with the column that needs converting;
- For each of the N samples, choose the k most similar samples in the dataset that is being merged with, according to features that either match or have already been transformed;
- Take the values of the feature that needs conversion and calculate their average. This will be the predicted “converted” result;
- Use the predicted result from the previous step together with the sample’s original value to calculate an estimated conversion rate;
- Repeat steps 2–4 until all N samples have an estimated conversion rate;
- Calculate an average of the estimates and assign it as a final conversion rate.
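The procedure above can be sketched as follows; this is a minimal illustration that assumes numeric feature vectors and Euclidean distance as the similarity measure (the section does not fix a specific metric, and the function names are ours):

```python
import math

def estimate_conversion_rate(samples, other, feature_idx, n=3, k=2):
    """Estimate a unit-conversion rate for column `feature_idx`.

    samples: rows whose `feature_idx` column needs converting
    other:   rows from the dataset being merged with (target units),
             in the same column layout
    """
    def distance(a, b):
        # Compare only on the columns other than the one being converted.
        return math.sqrt(sum((x - y) ** 2
                             for i, (x, y) in enumerate(zip(a, b))
                             if i != feature_idx))

    estimates = []
    for row in samples[:n]:                                           # step 1
        nearest = sorted(other, key=lambda r: distance(row, r))[:k]   # step 2
        predicted = sum(r[feature_idx] for r in nearest) / k          # step 3
        estimates.append(predicted / row[feature_idx])                # step 4
    return sum(estimates) / len(estimates)                            # step 6

# Toy example: heights in metres vs centimetres, with a shared weight column.
metres = [[1.70, 70.0], [1.80, 80.0], [1.60, 60.0]]
cms    = [[170.0, 70.0], [180.0, 80.0], [160.0, 60.0]]
rate = estimate_conversion_rate(metres, cms, feature_idx=0, n=3, k=1)
```

On this toy data every sample finds its exact counterpart through the shared weight column, so the estimated rate converges to the true factor of 100.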
4.7. Boolean-Non-Boolean Discrepancy and Possible Fixes
4.8. Possible Application of Polynomial for Dependent Trait Calculation
4.9. NLP Parser Details
- Load the program’s schema stored in a file and the NLP models, as well as second instances of these models with NER disabled. Look through the feature labels and their units in the schema;
- If all labels have been processed, proceed to step 5. Otherwise, load the next label and proceed to step 3;
- Check the label name and assigned unit in the file. If a recognized feature is a collection or a Boolean, load S label synonyms from the word vector model, else check a numerical value’s unit. If there is one, select the names of all the units with compatible dimensionality;
- Create extra entities and entity recognition rules based on the findings in step 3, replace the missing NER component in the previously loaded model instances with the entity rulers based on our newly created rules;
- Perform NER and group each recognized phrase by entity. Have the models that discovered the respective entities compare it to the existing schema labels. If the similarity reaches the given lower threshold value, its values are passed on for further selection as potential entries to fill the missing values;
- Selected values are now sorted by their similarity score to a certain feature label. The similarity score is also modified by a coefficient that is unique to the model that discovered the value. Values with scores of “1” and above are automatically assigned to the features. If there is more than one value like that, the one with maximum score is always assigned;
- If all the labels have assigned values, proceed to step 10. Otherwise, proceed to step 9;
- Process values with scores below “1” by comparing the highest score among a label’s assigned values to the upper threshold values. If the score is greater than or equal to the upper threshold of model i, where i is the number of the model that discovered the value, proceed to step 9. Otherwise, repeat with the next label;
- Parse the selected value depending on whether it is a numerical value, a Boolean, or a collection of values. Numerical values (both with and without assigned units) are processed using a special function that parses the string, collections are processed by finding the collection value most similar to the discovered one, and Booleans are parsed by discovering negation in the sentences;
- Assign the parsed result to the feature currently being processed;
- Return to step 8 if there are more feature labels to be analysed, otherwise finish parsing.
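The score-based assignment in the middle of this procedure (coefficient-weighted scores, automatic assignment at “1” and above, per-model upper thresholds below that) can be sketched as follows; the coefficients and thresholds here are illustrative values of our own, not the ones used in the paper:

```python
def assign_values(candidates, model_coeff, upper_threshold):
    """Pick one value per feature label from scored NER candidates.

    candidates:      {label: [(value, raw_score, model_id), ...]}
    model_coeff:     {model_id: multiplier applied to that model's scores}
    upper_threshold: {model_id: acceptance threshold for scores below 1}
    """
    assigned = {}
    for label, found in candidates.items():
        # Modify each score by the coefficient of the model that found it.
        scored = [(value, score * model_coeff[model], model)
                  for value, score, model in found]
        scored.sort(key=lambda t: t[1], reverse=True)
        best_value, best_score, best_model = scored[0]
        if best_score >= 1.0:
            assigned[label] = best_value   # automatic assignment
        elif best_score >= upper_threshold[best_model]:
            assigned[label] = best_value   # passes the model's upper threshold
        # otherwise the label stays unassigned
    return assigned

result = assign_values(
    {"height": [("168 cm", 0.9, "generic"), ("168", 0.7, "medical")],
     "smoker": [("no", 0.5, "generic")]},
    model_coeff={"generic": 1.2, "medical": 1.1},
    upper_threshold={"generic": 0.8, "medical": 0.7},
)
```

Here "height" is assigned automatically (weighted score 1.08 ≥ 1), while the "smoker" candidate (0.6) falls below its model's upper threshold and is left unassigned.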
4.10. On the Possibility of a Medical Image Classifier
5. Prototype of Algorithms and Data Conversion
5.1. Datasets for Testing the Methods
- Age–Objective Feature–age–int (days);
- Height–Objective Feature–height–int (cm);
- Weight–Objective Feature–weight–float (kg);
- Gender–Objective Feature–gender–categorical code;
- Systolic blood pressure–Examination Feature–ap_hi–int;
- Diastolic blood pressure–Examination Feature–ap_lo–int;
- Cholesterol–Examination Feature–cholesterol–1: normal, 2: above normal, 3: well above normal;
- Glucose–Examination Feature–gluc–1: normal, 2: above normal, 3: well above normal;
- Smoking–Subjective Feature–smoke–binary;
- Alcohol intake–Subjective Feature–alco–binary;
- Physical activity–Subjective Feature–active–binary;
- Presence or absence of cardiovascular disease–Target Variable–cardio–binary.
- sbp–systolic blood pressure;
- tobacco–cumulative tobacco (kg);
- ldl–low density lipoprotein cholesterol;
- adiposity–a numeric vector;
- famhist–family history of heart disease, a factor with levels “Absent” and “Present”;
- typea–type-A behaviour;
- obesity–a numeric vector;
- alcohol–current alcohol consumption;
- age–age at onset;
- chd–response, coronary heart disease.
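For illustration, the correspondence that the transformation has to discover between these two schemas might look like the following. The mapping is our own reading of the two feature lists above, including the implied unit conversions; it is a sketch, not the paper's dependency list:

```python
# Hypothetical feature mapping from the second (heart-disease) schema to the
# first (cardiovascular-disease) schema, with unit conversions where the
# feature lists imply them.
FEATURE_MAP = {
    "sbp": ("ap_hi", None),                        # systolic blood pressure
    "age": ("age", lambda years: years * 365.25),  # years -> days
    "alcohol": ("alco", lambda v: int(v > 0)),     # consumption level -> binary
    "chd": ("cardio", None),                       # target variable in both
}

def transform_row(row: dict) -> dict:
    """Rename and convert one source row into the target schema."""
    out = {}
    for src, (dst, convert) in FEATURE_MAP.items():
        if src in row:
            out[dst] = convert(row[src]) if convert else row[src]
    return out

converted = transform_row({"sbp": 120, "age": 52, "alcohol": 14.5, "chd": 1})
```

Features without a counterpart (e.g., `famhist` or `typea`) would fall through to the missing-value handling of Section 4.4.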
5.2. Implementation Structure
5.3. Implementation Details and Problems
5.4. NLP Parser Details
5.5. Testing Methods
6. Analysis of Experiment Results
6.1. Algorithm Accuracy
6.2. TWNFI Hyperparameters and Their Effect on Performance
6.3. NLP Transformation Accuracy
6.4. Unit Conversion Evaluation
6.5. Combined Methods for Data Merging
6.6. Results Discussion
6.7. Research Potential and Future Possibilities
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Conflicts of Interest
References
| | TR-SVM | TWNFI | NB Classifier (Gaussian) | NB Classifier (Complement) |
|---|---|---|---|---|
Accuracy percentage | 75% | 50% | 42% | 60% |
Avg. fitting time (ms) | 106,353.96 | 27,047.08 | 0.6966 | 0.9453 |
Avg. prediction time (ms) | 53,244.02 | 1.4250 | 0.1470 | 0.1038 |
Mean square error (for 50 experiments) | 0.28 | 0.5 | 0.58 | 0.4 |
| | TR-SVM | TWNFI | NB Classifier (Gaussian) | NB Classifier (Complement) |
|---|---|---|---|---|
Accuracy percentage | 70% | 66% | 54% | 52% |
Avg. fitting time (ms) | 4,629,342.37 | 20,456.25 | 1.2287 | 1.2396 |
Avg. prediction time (ms) | 425,968.14 | 0.5530 | 0.1445 | 0.07388 |
Mean square error (for 50 experiments) | 0.3 | 0.34 | 0.46 | 0.48 |
| | TR-SVM | TWNFI | NB Classifier (Gaussian) | NB Classifier (Complement) |
|---|---|---|---|---|
Accuracy percentage | 70% | 64% | 46% | 58% |
Avg. fitting time (ms) | 12,611,525.95 | 5604.68 | 1.9520 | 2.070 |
Avg. prediction time (ms) | 543,569.71 | 0.4675 | 0.1426 | 0.0760 |
Mean square error (for 50 experiments) | 0.3 | 0.36 | 0.54 | 0.42 |
| | TR-SVM | TWNFI | NB Classifier (Gaussian) | NB Classifier (Complement) |
|---|---|---|---|---|
Accuracy percentage | 68% | 76% | 54% | 40% |
Avg. fitting time (ms) | 33,368,058.12 | 13,610.57 | 4.0899 | 4.1547 |
Avg. prediction time (ms) | 388,989.52 | 0.4662 | 0.1471 | 0.0814 |
Mean square error (for 50 experiments) | 0.32 | 0.24 | 0.46 | 0.6 |
| | TR-SVM | TWNFI | NB Classifier (Gaussian) | NB Classifier (Complement) |
|---|---|---|---|---|
Accuracy percentage | N/A | 64% | 58% | 48% |
Avg. fitting time (ms) | N/A | 10,901.85 | 15.105 | 14.413 |
Avg. prediction time (ms) | N/A | 0.4151 | 0.1628 | 0.0863 |
Mean square error (for 50 experiments) | N/A | 0.36 | 0.42 | 0.52 |
| | TR-SVM | TWNFI | NB Classifier (Gaussian) | NB Classifier (Complement) |
|---|---|---|---|---|
Accuracy percentage | N/A | 78% | 56% | 48% |
Avg. fitting time (ms) | N/A | 12,122.58 | 110.80 | 102.51 |
Avg. prediction time (ms) | N/A | 0.4547 | 0.2075 | 0.1328 |
Mean square error (for 50 experiments) | N/A | 0.22 | 0.44 | 0.52 |
Number of Similar Samples | Accuracy Percentage | Avg. Fitting Time (ms) | Avg. Prediction Time (ms) | Mean Square Error (for 50 Experiments)
---|---|---|---|---|
15 | 64% | 5845.22 | 0.3516 | 0.36 |
20 | 76% | 8022.30 | 0.3315 | 0.24 |
25 | 78% | 9807.30 | 0.3514 | 0.22 |
30 | 66% | 11,973.35 | 0.3356 | 0.34 |
35 | 72% | 15,024.77 | 0.3408 | 0.28 |
Dthr | Accuracy Percentage | Avg. Fitting Time (ms) | Avg. Prediction Time (ms) | Mean Square Error (for 50 Experiments) |
---|---|---|---|---|
0.15 | 62% | 20,404.17 | 6.1129 | 0.38 |
0.20 | 76% | 9877.32 | 0.3771 | 0.24
0.25 | 84% | 9601 | 0.3363 | 0.16 |
0.30 | 78% | 9464.78 | 0.3582 | 0.22 |
0.35 | 70% | 9691.50 | 0.3607 | 0.3 |
0.40 | 80% | 9561.18 | 0.3308 | 0.2 |
0.45 | 56% | 9698.76 | 0.3391 | 0.44 |
0.50 | 66% | 9483.88 | 0.3377 | 0.34 |
0.55 | 76% | 9712.92 | 0.3506 | 0.24 |
0.60 | 78% | 9490.59 | 0.3625 | 0.22 |
0.65 | 64% | 9626.10 | 0.3368 | 0.36 |
0.70 | 80% | 9321.10 | 0.3298 | 0.2 |
0.75 | 74% | 9425.70 | 0.3219 | 0.26 |
0.80 | 82% | 9703.71 | 0.3397 | 0.18 |
0.85 | 74% | 9561.92 | 0.3265 | 0.26 |
0.90 | 74% | 9360.46 | 0.3209 | 0.26 |
0.95 | 68% | 9613.46 | 0.3274 | 0.32 |
1.0 | 68% | 9658.99 | 0.3435 | 0.32 |
Sample Number | Real Sample | Actual Sample | Comp. Time (ms) | Accuracy per Sample |
---|---|---|---|---|
1 | [18,262.5, 2, 168.0, 62.0, 110.0, 80.0, 0, 0, 0, 0, 1] | [18,393, 2, 168, 62, 110, 80, 0, 0, 0, 0, 1] | 2929.77 | 100% |
2 | [14,610.0, 2, 165.0, 60.0, 120.0, 80.0, 1, 1, 0, 0, 0] | [14,791, 2, 165, 60, 120, 80, 0, 0, 0, 0, 0] | 3030.48 | 82% |
3 | [21,184.5, 1, 170.0, 75.0, 130.0, 70.0, 0, 0, 0, 0, 0] | [21,296, 1, 170, 75, 130, 70, 0, 0, 0, 0, 0] | 2992.86 | 100% |
4 | [23,010.75, 2, 151.0, 92.0, 130.0, 90.0, 0, 1, 0, 0, 0] | [23,204, 1, 151, 92, 130, 90, 0, 0, 0, 0, 0] | 2992.74 | 82% |
5 | [15,705.75, 2, 185.0, 88.0, 133.0, 89.0, 1, 1, 0, 0, 0] | [15,946, 2, 185, 88, 133, 89, 1, 1, 0, 0, 1] | 2630.13 | 91% |
6 | [20,454.0, 2, 100.0, 78.0, 140.0, 90.0, 1, 1, 1, 0, 0] | [20,627, 2, 168, 78, 140, 90, 1, 1, 1, 0, 1] | 2844.75 | 82% |
7 | [21,915.0, 2, 176.0, 74.0, 120.0, 80.0, 0, 1, 0, 1, 0] | [22,111, 1, 176, 74, 120, 80, 0, 0, 0, 0, 1] | 2955.90 | 64% |
8 | [14,244.75, 2, 167.0, 66.0, 110.0, 70.0, 0, 0, 0, 0, 1] | [14,493, 1, 167, 66, 110, 70, 0, 0, 0, 0, 1] | 2875.22 | 91% |
9 | [23,376.0, 2, 169.0, 73.0, 140.0, 90.0, 0, 1, 0, 0, −1] | [23,376, 2, 169, 73, 140, 90, 0, 0, 0, 0, 1] | 2620.89 | 82% |
10 | [18,993.0, 2, 175.0, 53.0, 140.0, −1, 1, 1, 0, 0, 1] | [19,081, 2, 175, 53, 140, 80, 0, 0, 1, 0, 1] | 2492.70 | 64% |
11 | [21,549.75, 2, 174.0, 82.0, 120.0, 80.0, 1, 1, 0, 0, 1] | [21,665, 2, 174, 82, 120, 80, 0, 0, 0, 0, 1] | 2809.15 | 82% |
12 | [16,436.25, 2, 170.0, 68.0, 150.0, 90.0, 1, 1, 0, 0, 0] | [16,608, 1, 170, 68, 150, 90, 1, 0, 0, 0, 1] | 3074.92 | 73% |
13 | [−1, 1, 157.0, −1, 1, 130.0, 1, 1, 0, 0, 1] | [22,608, 1, 157, 70, 130, 90, 0, 0, 0, 0, 1] | 3003.69 | 45% |
14 | [23,376.0, 2, 1, 1, −1, −1, 1, 1, 0, 0, −1] | [23,389, 1, 163, 63, 120, 80, 1, 1, 0, 0, 0] | 2118.30 | 45% |
15 | [19,358.25, 2, 171.0, 79.0, 80.0, −1, 0, 1, 0, 0, 0] | [19,668, 2, 171, 79, 120, 80, 0, 0, 0, 0, 1] | 2475.30 | 64% |
16 | [20,454.0, 1, 180.0, 75.0, 1, −1, 0, 1, 0, 0, 1] | [20,554, 2, 180, 75, 120, 80, 0, 0, 0, 0, 1] | 2632.36 | 64% |
17 | [14,610.0, 2, 170.0, 68.0, 120.0, −1, 0, 1, 0, 0, 0] | [14,798, 2, 170, 68, 120, 80, 0, 0, 0, 0, 0] | 2524.41 | 82% |
18 | [23,010.75, 1, 155.0, 56.0, 120.0, 80.0, 0, 1, 0, 0, −1] | [23,191, 1, 155, 56, 120, 80, 0, 0, 0, 0, 1] | 3003.80 | 82% |
19 | [21,184.5, 2, 166.0, 101.0, 140.0, 90.0, 1, 1, 0, 1, 0] | [21,270, 1, 166, 101, 140, 90, 1, 0, 0, 0, 1] | 2910.77 | 64% |
20 | [23,010.75, 1, 164.0, 82.0, 130.0, 70.0, 1, 0, 0, 0, 0] | [23,343, 1, 164, 82, 130, 70, 1, 0, 0, 0, 1] | 2980.47 | 91% |
Initialization time (ms): | 8030.26 | |||
Total accuracy percentage: | 76% |
Number of Similar Samples | Predicted Values | Actual Values |
---|---|---|
50 | 374.54, 2.55, 0.47 | 365.25, 2.54, 0.45 |
100 | 368.52, 2.56, 0.46 | 365.25, 2.54, 0.45 |
250 | 367.6, 2.54, 0.46 | 365.25, 2.54, 0.45 |
500 | 373.15, 2.54, 0.46 | 365.25, 2.54, 0.45 |
1000 | 375.35, 2.53, 0.47 | 365.25, 2.54, 0.45 |
2500 | 369.11, 2.55, 0.48 | 365.25, 2.54, 0.45 |
Number of Similar Samples | Predicted Values | Actual Values |
---|---|---|
50 | 370.20, 2.54, 0.48 | 365.25, 2.54, 0.45 |
100 | 370.82, 2.53, 0.46 | 365.25, 2.54, 0.45 |
250 | 371.17, 2.54, 0.47 | 365.25, 2.54, 0.45 |
500 | 372.60, 2.54, 0.44 | 365.25, 2.54, 0.45 |
1000 | 367.74, 2.55, 0.47 | 365.25, 2.54, 0.45 |
2500 | 370.81, 2.58, 0.47 | 365.25, 2.54, 0.45 |
Similar Samples | Before Transformation | After Transformation |
---|---|---|
50 | [23,143.48, 1.0, 163.92, 73.96, 135.0, 80.0, 1.0, 2.0, 0.0, 0.0, 0.0] | [22,431.0, 1.0, 163.0, 72.0, 135.0, 80.0, 1.0, 2.0, 0.0, 0.0, 0.0] |
100 | [23,058.68, 1.0, 158.69, 128.300, 140.0, 90.0, 2.0, 2.0, 0.0, 0.0, 1.0] | [22,601.0, 1.0, 158.0, 126.0, 140.0, 90.0, 2.0, 2.0, 0.0, 0.0, 1.0] |
250 | [20,540.0, 1.0, 170.0, 72.0, 120.0, 80.0, 2.0, 1.0, 0.0, 0.0, 1.0] | [20740.02, 1.0, 169.60, 72.238, 120.0, 80.0, 2.0, 1.0, 0.0, 0.0, 1.0] |
500 | [19,401.56, 2.0, 182.63, 106.37, 180.0, 90.0, 3.0, 1.0, 0.0, 1.0, 0.0] | [19,066.0, 2.0, 183.0, 105.0, 180.0, 90.0, 3.0, 1.0, 0.0, 1.0, 0.0] |
1000 | [21,300.94, 1.0, 170.77, 71.54, 120.0, 80.0, 2.0, 1.0, 0.0, 0.0, 1.0] | [20,540.0, 1.0, 170.0, 72.0, 120.0, 80.0, 2.0, 1.0, 0.0, 0.0, 1.0] |
2500 | [20,737.45, 1.0, 171.37, 75.71, 120.0, 80.0, 2.0, 1.0, 0.0, 0.0, 1.0] | [20,540.0, 1.0, 170.0, 72.0, 120.0, 80.0, 2.0, 1.0, 0.0, 0.0, 1.0] |
Number of Steps N | Predicted Values | Actual Values | Computation Time (ms)
---|---|---|---|
10 | 361.79, 2.57, 0.51 | 365.25, 2.54, 0.45 | 14,794.72 |
20 | 377.86, 2.56, 0.43 | 365.25, 2.54, 0.45 | 27,955.9273 |
30 | 380.79, 2.53, 0.47 | 365.25, 2.54, 0.45 | 40,537.82 |
40 | 373.96, 2.60, 0.47 | 365.25, 2.54, 0.45 | 57,885.87 |
50 | 370.09, 2.55, 0.45 | 365.25, 2.54, 0.45 | 67,754.48 |
100 | 368.08, 2.58, 0.46 | 365.25, 2.54, 0.45 | 143,455.42 |
250 | 373.50, 2.55, 0.46 | 365.25, 2.54, 0.45 | 328,075.21 |
Number of Similar Samples | Predicted Values | Actual Values |
---|---|---|
50 | 523.12 | 365.25 |
100 | 494.48 | 365.25 |
200 | 498.29 | 365.25 |
250 | 542.89 | 365.25 |
300 | 532.68 | 365.25 |
500 | 471.80 | 365.25 |
1000 | 509.07 | 365.25 |
2500 | 503.09 | 365.25 |
| | TR-SVM | TWNFI | NB Classifier (Gaussian) | NB Classifier (Complement) |
|---|---|---|---|---|
Accuracy percentage | 65% | 58% | 59% | 64% |
Avg. fitting time (ms) | 1,644,799.63 | 19,001.99 | 0.9101 | 0.9827 |
Avg. prediction time (ms) | 625,832.98 | 0.7940 | 0.1361 | 0.0728 |
Mean square error | 0.35 | 0.42 | 0.41 | 0.36 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Poniszewska-Marańda, A.; Vynogradnyk, E.; Marańda, W. Medical Data Transformations in Healthcare Systems with the Use of Natural Language Processing Algorithms. Appl. Sci. 2023, 13, 682. https://doi.org/10.3390/app13020682