Fast Phylogeny of SARS-CoV-2 by Compression
Abstract
:1. Introduction
2. Method
- The first 1700 bytes of the selected sequence where the compressed size alone is 445 and paired with itself (concatenated) is 460, yielding an NCD of 0.0337078651685393.
- For the full virus sequence, (approximately 28,000 bases) the compressed size alone is 7181 and when paired with itself, it has a compressed size of 7208, yielding an NCD of 0.00375992201643225.
2.1. Validation of the Method
2.2. How We Used the Method
3. Materials
Selection of a SARS-CoV-2 Virus Isolate as Standard for the NCDs
4. Results
4.1. Sorted NCDs
4.2. Thirty-Seven Viruses Close to the Selected SARS-CoV-2 Virus
5. The Pangolin Connection
6. Discussion
7. Mathematical Derivation of the NCD
7.1. Kolmogorov Complexity
7.2. Information Distance
7.3. Normalized Compression Distance
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Ksiazek, T.G.; Erdman, D.; Goldsmith, C.S.; Zaki, S.R.; Peret, T.; Emery, S.; Tong, S.; Urbani, C.; Comer, J.A.; Lim, W.; et al. A Novel Coronavirus Associated with Severe Acute Respiratory Syndrome. N. Engl. J. Med. 2003, 348, 1953–1996. [Google Scholar] [CrossRef] [PubMed]
- Paraskevis, D.; Kostaki, E.G.; Magiorkinis, G.; Panayiotakopoulos, G.; Sourvinos, G.; Tsiodras, S. Full-genome evolutionary analysis of the novel corona virus (2019-nCoV) rejects the hypothesis of emergence as a result of a recent recombination event. Infect. Genet. Evol. 2020, 79, 104212. [Google Scholar] [CrossRef] [PubMed]
- Randhawa, G.S.; Soltysiak, M.P.M.; el Roz, H.; de Souza, C.P.E.; Hill, K.A.; Kari, L. Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study. PLoS ONE 2020, 15, e0232391. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Vinga, S.; Almeida, J. Alignment-free sequence comparison—A review. Bioinformatics 2003, 19, 513–523. [Google Scholar] [CrossRef]
- Wikipedia: Alignment-Free Sequence Analysis. Available online: https://en.wikipedia.org/wiki/Alignment-free_sequence_analysis (accessed on 2 July 2020).
- Zielezinski, A.; Vinga, S.; Almeida, J.; Karlowski, W.M. Alignment-free sequence comparison: Benefits, applications, and tools. Genome Biol. 2017, 18, 186. [Google Scholar] [CrossRef] [Green Version]
- Wang, L.; Jiang, T. On the complexity of multiple sequence alignment. J. Comp. Biol. 1994, 1, 337–348. [Google Scholar] [CrossRef] [Green Version]
- Rannala1, B.; Yang, Z. Phylogenetic Inference Using Whole Genomes. Annu. Rev. Genom. Hum. Genet. 2008, 9, 217–231. [Google Scholar] [CrossRef] [Green Version]
- Wikipedia: PHYLIP. Available online: https://en.wikipedia.org/wiki/PHYLIP (accessed on 4 July 2020).
- Cilibrasi, R.L. The CompLearn Toolkit. 2003. Available online: https://complearn.orgwww.complearn.org (accessed on 2 July 2020).
- Cilibrasi, R.L.; Vitányi, P.M.B. Clustering by compression. IEEE Trans. Inf. Theory 2005, 51, 1523–1545. [Google Scholar] [CrossRef] [Green Version]
- Bennett, C.H.; Gács, P.; Li, M.; Vitányi, P.M.B.; Zurek, W. Information Distance. IEEE Trans. Inf. Theory 1998, 44, 1407–1423. [Google Scholar] [CrossRef]
- Li, M.; Chen, X.; Li, X.; Ma, B.; Vitányi, P.M.B. The similarity metric. IEEE Trans. Inf. Theory 2004, 50, 3250–3264. [Google Scholar] [CrossRef]
- Cleary, J.G.; Witten, I.H. Data compression using adaptive coding and partial string matching. IEEE Trans. Commun. 1984, 32, 396–402. [Google Scholar] [CrossRef] [Green Version]
- Cebrian, M.; Alfonseca, M.; Ortega, A. Common pitfalls using the normalized compression distance: What to watch out for in a compressor. Commun. Inf. Syst. 2005, 5, 367–384. [Google Scholar]
- Cebrian, M.; Alfonseca, M.; Ortega, A. The normalized compression distance is resistant to noise. IEEE Trans. Inform. Theory 2007, 53, 1895–1900. [Google Scholar] [CrossRef]
- Kryukov, K.; Ueda, M.T.; Nakagawa, S.; Imanishi, T. Sequence Compression Benchmark (SCB) database–A comprehensive evaluation of reference-free compressors for FASTA-formatted sequences. GigaScience 2020, 9, giaa072. [Google Scholar] [CrossRef]
- Keogh, E.J.; Lonardi, S.; Rtanamahatana, C.A.; Wei, L.; Lee, S.H.; Handley, J. Compression-based data mining of sequential data. Data Min. Knowl. Discov. 2007, 14, 99–129. [Google Scholar] [CrossRef]
- Barthel, D.; Hirst, J.D.; Blazewics, J.; Burke, E.K.; Krasnogar, N. ProCKSI: A decision support system for Protein (Structure) Comparison, Knowledge, Similarity and Information. BMC Bioinform. 2007, 8, 416. [Google Scholar] [CrossRef] [Green Version]
- Contreras, I.; Arnaldo, I.; Krasnogor, N.; Hidalgo, J.I. Blind optimisation problem instance classification via enhanced universal similarity metric. Memetic Comput. 2014, 6, 263–276. [Google Scholar] [CrossRef]
- Krasnogar, N.; Pelta, D.A. Measuring the similarity of protein structures by means of the universal similarity metric. Bioinformatics 2004, 20, 1015–1021. [Google Scholar] [CrossRef] [Green Version]
- Nykter, M.; Price, N.D.; Aldana, M.; Ramsey, S.A.; Kauffman, S.A.; Hood, L.E.; Yli-Harja, O.; Shmulevich, I. Gene expression dynamics in the macrophage exhibit criticality. Proc. Natl. Acad. Sci. USA 2008, 105, 1897–1900. [Google Scholar] [CrossRef] [Green Version]
- GISAID. Available online: www.gisaid.org (accessed on 17 July 2020).
- Peñarrubiaa, L.; Ruiza, M.; Porcoa, R.; Raob, S.N.; Juanola-Falgaronaa, M.; Manisseroc, D.; López-Fontanalsa, M.; Pareja, J. Multiple assays in a real-time RT-PCR SARS-CoV-2 panel can mitigate the risk of loss of sensitivity by new genomic variants during the COVID-19 outbreak. Int. J. Infect. Dis. 2020, 97, 225–229. [Google Scholar] [CrossRef]
- Ceraolo, C.; Giorgi, F.M. Genomic variance of the 2019-nCoV coronavirus. J. Med. Virol. 2020, 92, 522–528. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Zhang, T.; Wu, Q.; Zhang, Z. Probable Pangolin origin of SARS-CoV-2 associated with the COVID-19 outbreak. Curr. Biol. 2020, 30, 1346–1351. [Google Scholar] [CrossRef] [PubMed]
- Li, X.; Giorgi, E.E.; Marichannegowda, M.H.; Foley, B.; Xiao, C.; Kong, X.-P.; Chen, Y.; Gnanakaran, S.; Korber, B.; Gao, F. Emergence of SARS-CoV-2 through recombination and strong purifying selection. Sci. Adv. 2020, 6, eabb9153. [Google Scholar] [CrossRef] [PubMed]
- Luis, A.D.; Hayman, D.T.S.; O’Shea, T.J.; Cryan, P.M.; Gilbert, A.T.; Pulliam, J.R.C.; Mills, J.N.; Timonin, M.E.; Willis, C.K.R.; Cunningham, A.A.; et al. A comparison of bats and rodents as reservoirs of zoonotic viruses: Are bats special? Proc. R. Soc. B Biol. Sci. 2013, 280, 28020122753. [Google Scholar] [CrossRef] [Green Version]
- Turing, A.M. On computable numbers, with an application to the Entscheidungsproblem. Proc. Lond. Math. Soc. 1936, 42, 230–265, Erratum in Proc. Lond. Math. Soc. 1937, 43, 544–546. [Google Scholar] [CrossRef]
- Li, M.; Vitányi, P.M.B. An Introduction to Kolmogorov Complexity and Its Applications; Springer: New York, NY, USA, 2008. [Google Scholar]
- Kolmogorov, A.N. Three approaches to the quantitative definition of information. Probl. Inform. Transm. 1965, 1, 1–7. [Google Scholar] [CrossRef]
- Shannon, C.E. The mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423, 623–656. [Google Scholar] [CrossRef] [Green Version]
- Gács, P. On the symmetry of algorithmic information. Sov. Math. Dokl. 1974, 15, 1265, Erratum in 1974, 15, 1480. [Google Scholar]
- Vitányi, P.M.B.; Balbach, F.J.; Cilibrasi, R.L.; Li, M. Normalized information distance. In Information Theory and Statistical Learning; Emmert-Streib, F., Dehmer, M., Eds.; Springer: New York, NY, USA, 2009; pp. 45–82. [Google Scholar]
- Li, M.; Badger, J.H.; Chen, X.; Kwong, S.; Kearney, P.; Zhang, H. An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics 2001, 17, 149–154. [Google Scholar] [CrossRef] [Green Version]
- Terwijn, S.A.; Torenvliet, L.; Vitányi, P.M.B. Nonapproximability of the Normalized Information Distance. J. Comput. Syst. Sci. 2011, 77, 738–742. [Google Scholar] [CrossRef] [Green Version]
- Vitányi, P.M.B. Similarity and denoising. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 2013, 371, 20120091. [Google Scholar] [CrossRef] [PubMed]
- Cilibrasi, R.L.; Vitányi, P.M.B. A fast quartet tree heuristic for hierarchical clustering. Pattern Recognit. 2011, 44, 662–677. [Google Scholar] [CrossRef] [Green Version]
NCD | Virus Name |
---|---|
0.00362117 | selected_SARS_CoV_2_EPI_ISL_471246 |
0.0111034 | MN908947.3_alt._SARS_CoV_2 |
0.44486 | BetaCoV/bat/Yunnan/RaTG13/2013|EPI_ISL_402131_EPI_ISL_402131 |
0.788416 | MG772933.1_bat_SL_CoVZC45 |
0.791082 | MG772934.1_bat_SL_CoVZXC21 |
0.917493 | KF569996_Coronaviridae_785 |
0.917563 | KC881006_Coronaviridae_783 |
0.91801 | KC881005_Coronaviridae_782 |
0.918257 | FJ882963_Coronaviridae_726 |
0.918381 | AY278554_Riboviria_2953 |
0.918447 | EU371561_Riboviria_3205 |
0.918497 | AY278488_Riboviria_2951 |
0.918531 | AY278741_Riboviria_2954 |
0.918553 | AY278491_Riboviria_2952 |
0.918565 | NC_004718_Coronaviridae_806 |
0.918597 | EU371563_Riboviria_3207 |
0.918605 | FJ882935_Coronaviridae_722 |
0.918658 | AY357075_Riboviria_2979 |
0.918669 | EU371559_Riboviria_3203 |
0.918691 | DQ640652_Riboviria_3098 |
0.918724 | EU371562_Riboviria_3206 |
0.918796 | AY350750_Riboviria_2977 |
0.918829 | AY864805_Riboviria_3030 |
0.918945 | FJ882945_Coronaviridae_724 |
0.919072 | EU371560_Riboviria_3204 |
0.919117 | AY864806_Riboviria_3031 |
0.919182 | EU371564_Riboviria_3208 |
0.919221 | AY394850_Riboviria_2981 |
0.919244 | FJ882942_Coronaviridae_723 |
0.919486 | AY357076_Riboviria_2980 |
0.91954 | FJ882954_Coronaviridae_725 |
0.91993 | KF367457_Coronaviridae_784 |
0.920486 | AY515512_Riboviria_2987 |
0.921053 | JX993988_Coronaviridae_779 |
0.923045 | GQ153543_Coronaviridae_751 |
0.923151 | GQ153542_Coronaviridae_750 |
0.92541 | DQ648857_Riboviria_3101 |
0.925706 | GQ153547_Coronaviridae_755 |
0.925802 | JX993987_Coronaviridae_778 |
0.925844 | GQ153544_Coronaviridae_752 |
0.925951 | GQ153548_Coronaviridae_756 |
0.925982 | GQ153545_Coronaviridae_753 |
0.926013 | GQ153540_Coronaviridae_748 |
0.92613 | GQ153546_Coronaviridae_754 |
0.92615 | GQ153539_Coronaviridae_747 |
0.92615 | GQ153541_Coronaviridae_749 |
0.926681 | DQ412043_Riboviria_3074 |
0.931577 | DQ412042_Riboviria_3073 |
0.932533 | DQ648856_Riboviria_3100 |
0.952228 | NC_014470_Coronaviridae_823 |
0.994546 | NC_025217_Coronaviridae_835 |
0.994897 | NC_034440_Coronaviridae_847 |
0.994986 | FJ938057_Coronaviridae_734 |
0.994986 | AY646283_Riboviria_3003 |
0.995078 | EF065512_Riboviria_3126 |
0.995078 | EF065511_Riboviria_3125 |
0.995078 | EF065510_Riboviria_3124 |
0.995086 | EF065506_Riboviria_3120 |
0.995086 | EF065507_Riboviria_3121 |
0.995086 | EF065505_Riboviria_3119 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Cilibrasi, R.L.; Vitányi, P.M.B. Fast Phylogeny of SARS-CoV-2 by Compression. Entropy 2022, 24, 439. https://doi.org/10.3390/e24040439
Cilibrasi RL, Vitányi PMB. Fast Phylogeny of SARS-CoV-2 by Compression. Entropy. 2022; 24(4):439. https://doi.org/10.3390/e24040439
Chicago/Turabian StyleCilibrasi, Rudi L., and Paul M. B. Vitányi. 2022. "Fast Phylogeny of SARS-CoV-2 by Compression" Entropy 24, no. 4: 439. https://doi.org/10.3390/e24040439
APA StyleCilibrasi, R. L., & Vitányi, P. M. B. (2022). Fast Phylogeny of SARS-CoV-2 by Compression. Entropy, 24(4), 439. https://doi.org/10.3390/e24040439