A Method for Reducing Training Time of ML-Based Cascade Scheme for Large-Volume Data Analysis
Abstract
1. Introduction
- We modified the existing ML-based cascade scheme, together with its training and application algorithms, to improve the efficiency of analyzing large volumes of biomedical data by applying PCA at each level of the cascade. The number of principal components that replace the original inputs of the task was chosen to cover 95% of the variance. This approach substantially reduced the training time of the modified scheme while preserving, and in some cases improving, the accuracy of the intelligent analysis of biomedical datasets (a minimal sketch of one cascade level is given after this list).
- We compared the effectiveness of the modified ML-based cascade scheme with the existing one and found a significant reduction in its training time and a small but non-negligible improvement in accuracy. The latter can be explained by the improved generalization of the modified scheme, which results from the reduction in the number of independent attributes of the large dataset processed by the machine learning methods at each level of the cascade.
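The following is a minimal sketch of one such cascade level, assuming scikit-learn and a synthetic dataset as a stand-in for the biomedical data; the portion-wise splitting of the training data across levels used in the original scheme is omitted, and all variable names are illustrative rather than taken from the paper.

```python
# Illustrative sketch only: names and data are not from the paper.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for a large biomedical dataset.
X, y = make_classification(n_samples=3000, n_features=20, n_informative=10,
                           random_state=0)

# Level 1: Kolmogorov-Gabor (second-degree) expansion -> PCA covering 95% of
# the variance -> SGD-based linear classifier.
expand = PolynomialFeatures(degree=2, include_bias=False)
scale = StandardScaler()
pca = PCA(n_components=0.95)   # number of components is chosen automatically
clf = SGDClassifier(random_state=0)

Z = pca.fit_transform(scale.fit_transform(expand.fit_transform(X)))
clf.fit(Z, y)
print(f"expanded inputs: {expand.n_output_features_}, "
      f"principal components kept: {pca.n_components_}")

# Hand-off to level 2: the level-1 output becomes an additional input attribute.
X_next = np.column_stack([X, clf.decision_function(Z)])
```

Passing a float to `n_components` makes scikit-learn keep the smallest number of principal components whose cumulative explained variance reaches that fraction, which is what allows the 95% threshold to drive the dimensionality reduction automatically.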
2. Materials and Methods
2.1. The Training Mode Algorithm
2.2. The Application Mode Algorithm
3. Results
3.1. Dataset Description
3.2. Dataset Preprocessing
3.3. Results of the Modified ML-Based Cascade Scheme Based on Its Optimal Parameters
- The accuracy of the method’s performance.
- The speed of the method’s operation.
- The generalization properties of the method.
- Increasing the number of levels of the modified ML-based cascade scheme enhanced the accuracy in both modes, but only up to a certain point. Beyond this, the method’s accuracy began to decline.
- The peak accuracy in the application mode was achieved with a four-level cascade scheme. However, the training-mode accuracy for this configuration was not the highest (the difference was less than 0.02). This can be explained by the fact that the four-level cascade processed the data in smaller increments than the three-level cascade, whose larger portions favoured higher accuracy during training. Processing the data in smaller increments enhances the method's generalization properties but may marginally reduce the training accuracy.
- The modified ML-based cascade scheme with four levels demonstrated the best generalization properties of all the investigated configurations, as evidenced by the smallest difference between the training-mode and application-mode accuracies.
- The use of a cascade scheme with five levels significantly deteriorated the generalization properties of the method. The application accuracy dropped by almost 3%.
- The use of a cascade scheme with six levels demonstrated overfitting: here, the application accuracy surpassed the training accuracy. Therefore, further increasing the number of cascade levels did not seem appropriate (a sketch of this level-selection comparison follows this list).
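The level-selection procedure described above can be illustrated with a hedged sketch: it trains cascades of increasing depth on synthetic data and reports the training-mode and application-mode F1-scores together with their gap. The authors' exact per-level data splitting and classifier settings are not reproduced, and `run_cascade` is an illustrative helper, not a function from the paper.

```python
# Illustrative level-selection sketch: not the authors' exact algorithm or data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = make_classification(n_samples=3000, n_features=15, n_informative=8,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

def run_cascade(n_levels):
    """Train an n_levels cascade (expansion -> PCA(0.95) -> SGD classifier per
    level, with each level's output appended as an extra input) and return the
    training-mode and application-mode F1-scores of the final level."""
    A_tr, A_te = X_tr, X_te
    for _ in range(n_levels):
        expand = PolynomialFeatures(degree=2, include_bias=False)
        scale = StandardScaler()
        pca = PCA(n_components=0.95)
        clf = SGDClassifier(random_state=0)
        Z_tr = pca.fit_transform(scale.fit_transform(expand.fit_transform(A_tr)))
        Z_te = pca.transform(scale.transform(expand.transform(A_te)))
        clf.fit(Z_tr, y_tr)
        # The level output becomes an additional attribute for the next level.
        A_tr = np.column_stack([A_tr, clf.decision_function(Z_tr)])
        A_te = np.column_stack([A_te, clf.decision_function(Z_te)])
    return f1_score(y_tr, clf.predict(Z_tr)), f1_score(y_te, clf.predict(Z_te))

for n_levels in range(1, 7):
    f1_train, f1_test = run_cascade(n_levels)
    # A small train/test gap indicates better generalization.
    print(f"levels={n_levels}  train F1={f1_train:.3f}  "
          f"test F1={f1_test:.3f}  gap={f1_train - f1_test:+.3f}")
```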
4. Discussion
- The method from [33] (classical SGD using a balanced dataset) required the shortest training time. However, this method was found to have the lowest accuracy in solving the classification task, albeit with satisfactory generalization properties.
- The method from [29] achieved higher accuracy but with poorer generalization. This is explained by its use of a nonlinear expansion of the inputs based on the Kolmogorov–Gabor polynomial, which, according to Cover's theorem, enhances classification accuracy. However, such a significant expansion of the inputs entails a considerable increase in the duration of the training procedure: the method from [29] required more than thirty-five times longer to train than the method from [33].
- The initial ML-based cascade scheme from [25] demonstrated an almost threefold decrease in training duration compared with [29]. This was attributed to the significantly smaller amount of data processed by each cascade classifier. However, due to the substantial increase in the dimensionality of the task, this method showed nearly a 13-fold increase in training time compared with [33]. It is worth noting that in this case, the optimal number of cascade levels from [25] was five.
- Additionally, the method from [25] demonstrated the lowest generalization properties among all the considered methods. The difference between the training and application accuracies reached 5%. This is once again explained by the high dimensionality of the task due to the application of the Kolmogorov–Gabor polynomial at each level of the cascade scheme.
- The modified ML-based cascade scheme demonstrated a significant reduction in the training procedure duration (more than 6.6 times) compared with the base method from [25]. Thus, the goal of this article was achieved.
- Furthermore, applying PCA to the already nonlinearly expanded input data (after the Kolmogorov–Gabor polynomial) significantly reduced its dimensionality and thereby substantially improved the generalization properties of the investigated cascade scheme. Specifically, the difference between the training and application accuracy was only 1%.
- The dimensionality reduction procedure implemented in this work, which was achieved by selecting the number of principal components that accounted for 95% of the variance, enabled the automated operation of the modified ML-based cascade scheme.
- However, one of the most significant advantages of the substantial dimensionality reduction was that it not only preserved but also enhanced the accuracy of the modified scheme compared with the existing one: according to the F1-score, the accuracy of the modified ML-based cascade scheme increased by 1%. This was made possible by moving from the large number of nonlinearly expanded inputs used in [25] to the space of principal components and discarding the less significant ones according to the procedure proposed in this paper. Specifically, after applying PCA (set to cover 95% of the variance), the average number of independent attributes processed by the classifiers across all four levels of the cascade scheme was 42, whereas the average number of independent variables reaching the classifiers at each level of the initial ML-based cascade scheme from [25] was 435 (an illustrative sketch follows this list).
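The 435-versus-42 attribute counts above come from the authors' dataset; the following sketch only illustrates the underlying mechanism on synthetic data, showing how many Kolmogorov–Gabor-expanded inputs survive PCA set to cover 95% of the variance (all figures produced by this snippet are illustrative).

```python
# Illustrative only: the 435 and 42 attribute counts in the text come from the
# authors' dataset; this snippet merely shows the effect of PCA(0.95) on
# Kolmogorov-Gabor-expanded inputs for a synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, _ = make_classification(n_samples=4000, n_features=30, n_informative=12,
                           n_redundant=10, random_state=0)

expanded = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
pca = PCA(n_components=0.95).fit(StandardScaler().fit_transform(expanded))

print(f"inputs reaching the classifier without PCA: {expanded.shape[1]}")
print(f"principal components covering 95% of the variance: {pca.n_components_}")
print(f"variance actually covered: {pca.explained_variance_ratio_.sum():.3f}")
```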
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Mohammed, M.A.; Akawee, M.M.; Saleh, Z.H.; Hasan, R.A.; Ali, A.H.; Sutikno, T. The Effectiveness of Big Data Classification Control Based on Principal Component Analysis. Bull. Electr. Eng. Inform. 2023, 12, 427–434.
- Krak, I.; Barmak, O.; Manziuk, E. Using Visual Analytics to Develop Human and Machine-centric Models: A Review of Approaches and Proposed Information Technology. Comput. Intell. 2022, 38, 921–946.
- Apio, A.L.; Kissi, J.; Achampong, E.K. A Systematic Review of Artificial Intelligence-Based Methods in Healthcare. Int. J. Public Health 2023, 12, 1259.
- Krak, Y.; Barmak, O.V.; Mazurets, O. The Practice Implementation of the Information Technology for Automated Definition of Semantic Terms Sets in the Content of Educational Materials. Probl. Program. 2018, 2139, 245–254.
- Manziuk, E.; Barmak, O.; Krak, I.; Mazurets, O. Formal Model of Trustworthy Artificial Intelligence Based on Standardization. In Proceedings of the IntelITSIS’2021: 2nd International Workshop on Intelligent Information Technologies and Systems of Information Security, Khmelnytskyi, Ukraine, 24–26 March 2021; Volume 2853, pp. 190–197.
- Berezsky, O.; Pitsun, O.; Liashchynskyi, P.; Derysh, B.; Batryn, N. Computational Intelligence in Medicine. In Lecture Notes in Data Engineering, Computational Intelligence, and Decision Making; Babichev, S., Lytvynenko, V., Eds.; Lecture Notes on Data Engineering and Communications Technologies; Springer International Publishing: Cham, Switzerland, 2023; Volume 149, pp. 488–510. ISBN 978-3-031-16202-2.
- Chumachenko, D.; Piletskiy, P.; Sukhorukova, M.; Chumachenko, T. Predictive Model of Lyme Disease Epidemic Process Using Machine Learning Approach. Appl. Sci. 2022, 12, 4282.
- Liu, W.; Fan, H.; Xia, M. Tree-Based Heterogeneous Cascade Ensemble Model for Credit Scoring. Int. J. Forecast. 2023, 39, 1593–1614.
- Zomchak, L.; Melnychuk, V. Creditworthiness of Individual Borrowers Forecasting with Machine Learning Methods. In Advances in Artificial Systems for Medicine and Education VI; Hu, Z., Ye, Z., He, M., Eds.; Lecture Notes on Data Engineering and Communications Technologies; Springer Nature: Cham, Switzerland, 2023; Volume 159, pp. 553–561. ISBN 978-3-031-24467-4.
- Bilski, J.; Smoląg, J.; Kowalczyk, B.; Grzanek, K.; Izonin, I. Fast Computational Approach to the Levenberg-Marquardt Algorithm for Training Feedforward Neural Networks. J. Artif. Intell. Soft Comput. Res. 2023, 13, 45–61.
- Ji, J.; Li, J. Tri-Objective Optimization-Based Cascade Ensemble Pruning for Deep Forest. Pattern Recognit. 2023, 143, 109744.
- Bisikalo, O.V.; Kovtun, V.V.; Kovtun, O.V. Modeling of the Estimation of the Time to Failure of the Information System for Critical Use. In Proceedings of the 2020 10th International Conference on Advanced Computer Information Technologies (ACIT), Deggendorf, Germany, 16–18 September 2020; pp. 140–143.
- Mochurad, L.; Shchur, G. Parallelization of Cryptographic Algorithm Based on Different Parallel Computing Technologies. In Proceedings of the IT&AS’2021: Symposium on Information Technologies & Applied Sciences, Bratislava, Slovakia, 5 March 2021; Volume 2824, p. 2029.
- Samaan, S.S.; Jeiad, H.A. Feature-Based Real-Time Distributed Denial of Service Detection in SDN Using Machine Learning and Spark. Bull. Electr. Eng. Inform. 2023, 12, 2302–2312.
- Mochurad, L.; Hladun, Y.; Zasoba, Y.; Gregus, M. An Approach for Opening Doors with a Mobile Robot Using Machine Learning Methods. Big Data Cogn. Comput. 2023, 7, 69.
- Ali, A.H.; Alhayali, R.A.I.; Mohammed, M.A.; Sutikno, T. An Effective Classification Approach for Big Data with Parallel Generalized Hebbian Algorithm. Bull. Electr. Eng. Inform. 2021, 10, 3393–3402.
- Xu, S.; Tang, Q.; Jin, L.; Pan, Z. A Cascade Ensemble Learning Model for Human Activity Recognition with Smartphones. Sensors 2019, 19, 2307.
- Ganguli, C.; Shandilya, S.K.; Nehrey, M.; Havryliuk, M. Adaptive Artificial Bee Colony Algorithm for Nature-Inspired Cyber Defense. Systems 2023, 11, 27.
- Nehrey, M.; Hnot, T. Data Science Tools Application for Business Processes Modelling in Aviation. In Advances in Computer and Electrical Engineering; Shmelova, T., Sikirda, Y., Rizun, N., Kucherov, D., Eds.; IGI Global: Hershey, PA, USA, 2019; pp. 176–190. ISBN 978-1-5225-7588-7.
- van der Maaten, L.; Hinton, G. Visualizing Data Using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605.
- McInnes, L.; Healy, J.; Melville, J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv 2018, arXiv:1802.03426.
- Tkachenko, R.; Izonin, I. Model and Principles for the Implementation of Neural-Like Structures Based on Geometric Data Transformations. In Advances in Computer Science for Engineering and Education; Hu, Z., Petoukhov, S., Dychka, I., He, M., Eds.; Springer International Publishing: Cham, Switzerland, 2019; Volume 754, pp. 578–587. ISBN 978-3-319-91007-9.
- Wang, Y.; Zhao, Y. Arbitrary Spatial Trajectory Reconstruction Based on a Single Inertial Sensor. IEEE Sens. J. 2023, 23, 10009–10022.
- Izonin, I.; Tkachenko, R.; Gurbych, O.; Kovac, M.; Rutkowski, L.; Holoven, R. A Non-Linear SVR-Based Cascade Model for Improving Prediction Accuracy of Biomedical Data Analysis. Math. Biosci. Eng. 2023, 20, 13398–13414.
- Izonin, I.; Tkachenko, R.; Holoven, R.; Yemets, K.; Havryliuk, M.; Shandilya, S.K. SGD-Based Cascade Scheme for Higher Degrees Wiener Polynomial Approximation of Large Biomedical Datasets. Mach. Learn. Knowl. Extr. 2022, 4, 1088–1106.
- Mulesa, O.; Snytyuk, V.; Myronyuk, I. Optimal Alternative Selection Models in a Multi-Stage Decision-Making Process. EUREKA Phys. Eng. 2019, 6, 43–50.
- Mulesa, O.; Geche, F.; Batyuk, A.; Buchok, V. Development of Combined Information Technology for Time Series Prediction. In Advances in Intelligent Systems and Computing II; Shakhovska, N., Stepashko, V., Eds.; Advances in Intelligent Systems and Computing; Springer International Publishing: Cham, Switzerland, 2018; Volume 689, pp. 361–373. ISBN 978-3-319-70580-4.
- Syed Ahmad, S.S.; Azmi, E.F.; Kasmin, F.; Othman, Z. Dimentionality Reduction Based on Binary Cooperative Particle Swarm Optimization. Indones. J. Electr. Eng. Comput. Sci. 2019, 15, 1382.
- Izonin, I.; Greguš ml., M.; Tkachenko, R.; Logoyda, M.; Mishchuk, O.; Kynash, Y. SGD-Based Wiener Polynomial Approximation for Missing Data Recovery in Air Pollution Monitoring Dataset. In Advances in Computational Intelligence; Rojas, I., Joya, G., Catala, A., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2019; Volume 11506, pp. 781–793. ISBN 978-3-030-20520-1.
- Sambir, A.; Yakovyna, V.; Seniv, M. Recruiting Software Architecture Using User Generated Data. In Proceedings of the 2017 XIIIth International Conference on Perspective Technologies and Methods in MEMS Design (MEMSTECH), Lviv, Ukraine, 20–23 April 2017; pp. 161–163.
- Yakovyna, V.; Uhrynovskyi, B. User-Perceived Response Metrics in Android OS for Software Aging Detection. In Proceedings of the 2020 IEEE 15th International Conference on Computer Sciences and Information Technologies (CSIT), Zbarazh, Ukraine, 23–26 September 2020; pp. 436–439.
- CDC—2021 BRFSS Survey Data and Documentation. Available online: https://www.cdc.gov/brfss/annual_data/annual_2021.html (accessed on 4 March 2024).
- Sklearn.Linear_Model.SGDRegressor—Scikit-Learn 0.20.2 Documentation. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html (accessed on 8 February 2019).
- Kharchuk, V.; Oleksiv, I. The Intellectual Structure of Sustainable Leadership Studies: Bibliometric Analysis. In Advances in Intelligent Systems, Computer Science and Digital Economics IV; Hu, Z., Wang, Y., He, M., Eds.; Lecture Notes on Data Engineering and Communications Technologies; Springer Nature: Cham, Switzerland, 2023; Volume 158, pp. 430–442. ISBN 978-3-031-24474-2.
- Wasilczuk, J.; Chukhray, N.; Karyy, O.; Halkiv, L. Entrepreneurial Competencies and Intentions among Students of Technical Universities. Probl. Perspect. Manag. 2021, 19, 10–21.
- Duriagina, Z.A.; Tkachenko, R.O.; Trostianchyn, A.M.; Lemishka, I.A.; Kovalchuk, A.M.; Kulyk, V.V.; Kovbasyuk, T.M. Determination of the Best Microstructure and Titanium Alloy Powders Properties Using Neural Network. J. Achiev. Mater. Manuf. Eng. 2018, 1, 25–31.
- Argyroudis, S.A.; Mitoulis, S.A.; Chatzi, E.; Baker, J.W.; Brilakis, I.; Gkoumas, K.; Vousdoukas, M.; Hynes, W.; Carluccio, S.; Keou, O.; et al. Digital Technologies Can Enhance Climate Resilience of Critical Infrastructure. Clim. Risk Manag. 2022, 35, 100387.
- Fedushko, S.; Molodetska, K.; Syerov, Y. Analytical Method to Improve the Decision-Making Criteria Approach in Managing Digital Social Channels. Heliyon 2023, 9, e16828.
| Performance Indicator | Training Mode | Test Mode |
|---|---|---|
| Precision | 0.800 | 0.794 |
| Recall | 0.799 | 0.794 |
| F1-score | 0.799 | 0.794 |
| Training time (s) | 0.27 | * |