Machine Learning Algorithms Associate Case Numbers with SARS-CoV-2 Variants Rather Than with Impactful Mutations
Abstract
1. Introduction
2. Materials and Methods
2.1. Data Retrieval
2.2. Encoding Genetic Data
2.3. ML Training and Testing
2.4. Explaining Models Output
2.5. Mapping k-mers Back to the Genome
3. Results
3.1. Both Days Ahead and k-mer Lengths Affect Accuracy
3.2. Best Performances Are State-Specific
3.3. Mapping k-mers Back to the Genome
3.4. What the Models Learn
4. Discussion
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
(A) Random Forest

Hyperparameter | MN | TX | MN + TX |
---|---|---|---|
Estimators | 1960 | 1810 | 1960 |
Criterion | Absolute Error | Absolute Error | Squared Error |
Depth | False | True | False |
Maximum depth | N/A | 233 | N/A |
Minimum sample split | 2 | 78 | 2 |
Minimum sample leaf | 1 | 1 | 1 |
Maximum features | Auto | Auto | Auto |

(B) Feed-Forward Neural Network

Hyperparameter | MN | TX | MN + TX |
---|---|---|---|
Number of layers | 3 | 1 | 3 |
Activation function | ReLU | Softplus | ReLU |
Dropout | False | False | False |
Units in layer 1 | 512 | 512 | 224 |
Units in layer 2 | 512 | N/A | 8 |
Units in layer 3 | 152 | N/A | 512 |
Learning rate | 0.0001 | 0.0203 | 0.0001 |

(C) Final Results

State(s) | RF | FFNN |
---|---|---|
MN | 87.82% | 88.15% |
TX | 91.25% | 93.66% |
MN + TX | 70.20% | 75.32% |

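
The random forest settings in panel (A) can be written down as a configuration, which makes the three columns easier to compare. The sketch below is a minimal illustration, not the authors' code. It assumes that the models are scikit-learn `RandomForestRegressor` instances, that the table labels map onto the keyword arguments shown, that "N/A" for maximum depth means no depth limit (`max_depth=None`), and that "Auto" for maximum features means all features (`max_features=1.0`, the value that replaced `"auto"` for regressors in recent scikit-learn releases). The names `RF_PARAMS` and `build_random_forest` are hypothetical.

```python
# Hyperparameters transcribed from panel (A). The keyword names are an assumed
# mapping onto scikit-learn's RandomForestRegressor, not the authors' code.
RF_PARAMS = {
    "MN": dict(
        n_estimators=1960, criterion="absolute_error", max_depth=None,
        min_samples_split=2, min_samples_leaf=1, max_features=1.0,
    ),
    "TX": dict(
        n_estimators=1810, criterion="absolute_error", max_depth=233,
        min_samples_split=78, min_samples_leaf=1, max_features=1.0,
    ),
    "MN + TX": dict(
        n_estimators=1960, criterion="squared_error", max_depth=None,
        min_samples_split=2, min_samples_leaf=1, max_features=1.0,
    ),
}


def build_random_forest(dataset: str):
    """Return an unfitted regressor configured for one column of panel (A)."""
    # Imported lazily so the settings can be inspected without scikit-learn.
    from sklearn.ensemble import RandomForestRegressor

    return RandomForestRegressor(**RF_PARAMS[dataset])
```

Calling `build_random_forest("TX")` would return an unfitted model with the Texas settings. Only the TX model caps tree depth and raises the minimum split size, while the MN and MN + TX models differ only in their split criterion.
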
Variable | RF/MN | RF/TX | FFNN/MN | FFNN/TX |
---|---|---|---|---|
Intercept | 0.1698 ** | 0.1039 ** | 0.1438 ** | 0.1481 ** |
Effect of a mutation | 0.1762 ** | 0.3132 ** | 0.0686 * | 0.1138 ** |
Adjusted R² | 0.0469 | 0.0698 | 0.0133 | 0.0296 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Vilain, M.; Aris-Brosou, S. Machine Learning Algorithms Associate Case Numbers with SARS-CoV-2 Variants Rather Than with Impactful Mutations. Viruses 2023, 15, 1226. https://doi.org/10.3390/v15061226