PREGO: A Literature and Data-Mining Resource to Associate Microorganisms, Biological Processes, and Environment Types
Abstract
1. Introduction
2. Materials and Methods
2.1. Entity Types, Channels, and Associations
2.2. Text Mining of Scientific Literature
2.3. Annotated Genomes and Isolates
2.4. Environmental Samples
2.5. Sequence Search
2.6. Back-End Server and Front-End Implementation
3. Results
3.1. The PREGO Web Resource
3.2. PREGO in Action
3.2.1. Which Environments Are Related to a Taxon?
3.2.2. Which Biological Processes and Molecular Functions Are Related to a Taxon?
3.2.3. Which Taxa Are Related to a Biological Process?
3.2.4. Are There Any Associations between Environments and Biological Processes?
3.3. PREGO Contents
4. Discussion
4.1. PREGO Contents
4.2. Related Tools’ Functionality and Content
4.3. PREGO Next Steps
- prego_gathering_data https://github.com/lab42open-team/prego_gathering_data
- prego_daemons https://github.com/lab42open-team/prego_daemons
- prego_mappings https://github.com/lab42open-team/prego_mappings
- prego_statistics https://github.com/lab42open-team/prego_statistics
- tagger https://github.com/larsjuhljensen/tagger, BSD 2-Clause “Simplified” License
- mamba https://github.com/larsjuhljensen/mamba, BSD 2-Clause “Simplified” License
- tagger dictionary https://download.jensenlab.org/, specifically https://download.jensenlab.org/prego_dictionary.tar.gz, CC-BY 4.0 license.
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
HTS | High Throughput Sequencing |
MAGs | Metagenome-Assembled Genomes |
SAGs | Single Amplified Genomes |
GSC | Genomic Standards Consortium |
NMDC | National Microbiome Data Collaborative |
GO | Gene Ontology |
GOmf | Gene Ontology (molecular function) |
GObp | Gene Ontology (biological process) |
ENVO | Environment Ontology |
NCBI | National Center for Biotechnology Information |
LPSN | List of Prokaryotic names with Standing in Nomenclature |
PMC | PubMed Central |
NER | Named Entity Recognition |
API | Application Programming Interface |
FTP | File Transfer Protocol |
GTDB | Genome Taxonomy DataBase |
OTU | Operational Taxonomic Unit |
SRMs | sulfate-reducing microorganisms |
Appendix A
Mappings
Appendix B
Daemons
Appendix C
Appendix C.1. Scoring
- Which associations are more trustworthy?
- Which associations are more relevant to the user’s query?
X = x \ Y = y | Yes | No | Total |
---|---|---|---|
Yes | $c_{x,y}$ | $c_{x,0}$ | $c_{x,\cdot}$ |
No | $c_{0,y}$ | $c_{0,0}$ | $c_{0,\cdot}$ |
Total | $c_{\cdot,y}$ | $c_{\cdot,0}$ | $c_{\cdot,\cdot}$ |
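As a minimal sketch of how the cells of this contingency table could be tallied, the Python below counts, over a collection of documents or samples, how often entity x and entity y occur together, alone, or not at all. The function name, the dictionary keys, and the representation of each document as a set of entity identifiers are assumptions made for illustration; the scoring that PREGO derives from these counts is not reproduced here.

```python
from typing import Dict, Iterable, Set


def contingency_counts(documents: Iterable[Set[str]], x: str, y: str) -> Dict[str, int]:
    """Tally the 2x2 contingency table for entities x and y.

    Each element of `documents` is the set of entity identifiers found in
    one document or sample (a hypothetical input format).
    """
    c_xy = c_x0 = c_0y = c_00 = 0
    for entities in documents:
        has_x = x in entities
        has_y = y in entities
        if has_x and has_y:
            c_xy += 1      # both x and y present
        elif has_x:
            c_x0 += 1      # x present, y absent
        elif has_y:
            c_0y += 1      # y present, x absent
        else:
            c_00 += 1      # neither present
    return {
        "c_xy": c_xy, "c_x0": c_x0, "c_x.": c_xy + c_x0,   # row X = x: Yes
        "c_0y": c_0y, "c_00": c_00, "c_0.": c_0y + c_00,   # row X = x: No
        "c_.y": c_xy + c_0y, "c_.0": c_x0 + c_00,          # column totals
        "c_..": c_xy + c_x0 + c_0y + c_00,                 # grand total
    }


if __name__ == "__main__":
    toy_documents = [{"a", "b"}, {"a"}, {"b"}, {"c"}, {"a", "b"}]
    print(contingency_counts(toy_documents, "a", "b"))
```

The marginal totals are computed from the four inner cells, so each row and column always sums to its "Total" entry.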
Appendix C.2. Literature Channel
Appendix C.3. Environmental Samples Channel
Appendix D
Bulk Download
Channel | Link | md5sum | Size (in GB, Zipped) |
---|---|---|---|
Literature | https://prego.hcmr.gr/download/literature.tar.gz | literature.tar.gz.md5 | 5.4 |
Environmental samples | https://prego.hcmr.gr/download/environmental_samples.tar.gz | environmental_samples.tar.gz.md5 | 0.69 |
Annotated genomes and isolates | https://prego.hcmr.gr/download/annotated_genomes_isolates.tar.gz | annotated_genomes_isolates.tar.gz.md5 | 0.26 |
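After downloading one of the archives above, its integrity can be checked against the accompanying md5sum file. The sketch below uses only the Python standard library. It assumes that the `.md5` file follows the usual `md5sum` output layout (hex digest first, then the file name) and that both files have already been saved locally; the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so multi-gigabyte archives fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5_file(archive_path: str, md5_path: str) -> bool:
    """Compare an archive against the first token of its .md5 file.

    Assumes the md5sum layout '<hex digest>  <file name>'.
    """
    expected = Path(md5_path).read_text().split()[0].lower()
    return md5_of_file(archive_path) == expected


if __name__ == "__main__":
    # Local file names are assumptions; adjust to where the files were saved.
    archive, checksum = "literature.tar.gz", "literature.tar.gz.md5"
    if Path(archive).exists() and Path(checksum).exists():
        print("checksum OK" if matches_md5_file(archive, checksum) else "checksum MISMATCH")
```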
Source | # Items | Data Type | Metadata | License |
---|---|---|---|---|
MEDLINE and PubMed | 33 million | abstracts (text) | no | NLM Copyright |
PubMed Central OA Subset | 2.7 million | full article (text) | no | CC for Commercial, non-commercial |
JGI IMG | 9644 | Isolates Annotated genomes | yes | JGI Data Policy |
Struo | 21,276 | Annotated genomes | no | MIT, CC BY-SA 4.0 |
BioProject | 18,752 | Annotated genomes with abstracts (text) | yes | INSDC policy |
MG-RAST | 16,096 | marker gene samples | yes | CC0 |
MG-RAST | 7965 | metagenomic samples | yes | CC0 |
MGnify | 10,500 | marker gene samples | yes | CC-BY, CC0 |
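Channel | Source | Taxonomic rank | Taxa | Environments | Biological Processes | Molecular Functions |
---|---|---|---|---|---|---|
Literature | MEDLINE PubMed–PMC OA | Strains | 8929 | 1077 | 15,079 | 7318 |
Literature | MEDLINE PubMed–PMC OA | Species | 240,377 | | | |
Literature | MEDLINE PubMed–PMC OA | Total | 342,506 | | | |
Environmental samples | MG-RAST amplicon | Strains | 1392 | 162 | - | - |
Environmental samples | MG-RAST amplicon | Species | 4324 | | | |
Environmental samples | MG-RAST amplicon | Total | 5859 | | | |
Environmental samples | MG-RAST metagenome | Strains | 2522 | 258 | - | 3839 |
Environmental samples | MG-RAST metagenome | Species | 4406 | | | |
Environmental samples | MG-RAST metagenome | Total | 7157 | | | |
Environmental samples | MGnify amplicon | Strains | 2 | 216 | 11 | - |
Environmental samples | MGnify amplicon | Species | 1471 | | | |
Environmental samples | MGnify amplicon | Total | 2955 | | | |
Annotated Genomes and Isolates | JGI IMG isolates | Strains | 2398 | 241 | - | 3670 |
Annotated Genomes and Isolates | JGI IMG isolates | Species | 11,203 | | | |
Annotated Genomes and Isolates | JGI IMG isolates | Total | 13,849 | | | |
Annotated Genomes and Isolates | Struo | Strains | 6 | - | - | 2789 |
Annotated Genomes and Isolates | Struo | Species | 19,289 | | | |
Annotated Genomes and Isolates | Struo | Total | 19,325 | | | |
Annotated Genomes and Isolates | BioProject | Strains | 5754 | 309 | 626 | - |
Annotated Genomes and Isolates | BioProject | Species | 3373 | | | |
Annotated Genomes and Isolates | BioProject | Total | 9393 | | | |
Total | All | Strains | 12,840 | 1090 | 15,091 | 7971 |
Total | All | Species | 258,352 | | | |
Total | All | Total | 364,508 | | | |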
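Channel | Source | Environments–Processes | Environments–Functions | Taxonomic rank | Taxa–Environments | Taxa–Processes | Taxa–Functions |
---|---|---|---|---|---|---|---|
Literature | MEDLINE PubMed–PMC OA | 883,997 | 422,579 | Strains | 69,968 | 590,630 | 384,079 |
Literature | MEDLINE PubMed–PMC OA | | | Species | 778,877 | 3,501,635 | 1,961,920 |
Literature | MEDLINE PubMed–PMC OA | | | Total | 1,669,608 | 7,969,310 | 4,613,827 |
Environmental samples | MG-RAST amplicon | - | - | Strains | 13,645 | - | - |
Environmental samples | MG-RAST amplicon | | | Species | 39,007 | - | - |
Environmental samples | MG-RAST amplicon | | | Total | 53,439 | - | - |
Environmental samples | MG-RAST metagenome | - | 620,846 | Strains | 262,106 | - | 8,626,328 |
Environmental samples | MG-RAST metagenome | | | Species | 103,913 | - | 10,715,548 |
Environmental samples | MG-RAST metagenome | | | Total | 372,301 | - | 19,950,096 |
Environmental samples | MGnify amplicon | - | - | Strains | 18 | - | - |
Environmental samples | MGnify amplicon | | | Species | 30,122 | 351 | - |
Environmental samples | MGnify amplicon | | | Total | 111,976 | 2097 | - |
Annotated Genomes and Isolates | JGI IMG isolates | - | - | Strains | 8229 | - | 3,461,693 |
Annotated Genomes and Isolates | JGI IMG isolates | | | Species | 42,141 | - | 13,216,559 |
Annotated Genomes and Isolates | JGI IMG isolates | | | Total | 50,888 | - | 16,821,850 |
Annotated Genomes and Isolates | Struo | - | - | Strains | - | - | 1803 |
Annotated Genomes and Isolates | Struo | | | Species | - | - | 4,070,195 |
Annotated Genomes and Isolates | Struo | | | Total | - | - | 4,079,312 |
Annotated Genomes and Isolates | BioProject | - | - | Strains | 3263 | 7473 | - |
Annotated Genomes and Isolates | BioProject | | | Species | 4187 | 4294 | - |
Annotated Genomes and Isolates | BioProject | | | Total | 7641 | 12,169 | - |
Total | All | 883,997 | 1,043,425 | Strains | 357,229 | 598,103 | 12,473,903 |
Total | All | | | Species | 998,247 | 3,506,280 | 29,964,222 |
Total | All | | | Total | 2,265,853 | 7,983,576 | 45,465,085 |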
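The "All" rows of the association table above can be cross-checked against the per-source counts. As a small worked example, the sketch below sums the strain-level taxa-environment associations over the six sources that report them and compares the result with the stated total. The numbers are copied from the table; the variable and function names are illustrative only.

```python
# Strain-level taxa-environment associations per source, copied from the table.
taxa_environment_strains = {
    "MEDLINE PubMed-PMC OA": 69_968,
    "MG-RAST amplicon": 13_645,
    "MG-RAST metagenome": 262_106,
    "MGnify amplicon": 18,
    "JGI IMG isolates": 8_229,
    "BioProject": 3_263,
}

# Value given in the "Total | All" row for the same column and rank.
reported_total_strains = 357_229


def total_matches(per_source: dict, reported: int) -> bool:
    """Return True when the per-source counts add up to the reported total."""
    return sum(per_source.values()) == reported


if __name__ == "__main__":
    print(total_matches(taxa_environment_strains, reported_total_strains))
```

The same check can be repeated for the other association columns by swapping in the corresponding rows.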
Functionality | BacDive | Web of Microbes | NMDC | PREGO |
---|---|---|---|---|
manual curation | high | high | intermediate | low |
literature integration | limited | no | no | yes |
environment—taxa associations | yes | yes | yes | yes |
environment—process/function associations | no | no | no | yes |
process/function—taxa associations | yes | yes | yes | yes |
phenotypic data | yes | no | no | no |
data origin | original, integration | original | original, integration | integration |
spatial coordinates | yes | no | yes | no |
application programming interface | yes | no | yes | yes |
bulk download | limited | yes | yes | yes |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Zafeiropoulos, H.; Paragkamian, S.; Ninidakis, S.; Pavlopoulos, G.A.; Jensen, L.J.; Pafilis, E. PREGO: A Literature and Data-Mining Resource to Associate Microorganisms, Biological Processes, and Environment Types. Microorganisms 2022, 10, 293. https://doi.org/10.3390/microorganisms10020293