Predicting the Association of Metabolites with Both Pathway Categories and Individual Pathways
Abstract
1. Introduction
2. Materials and Methods
3. Results
3.1. Overall Model Performance across All Pathways
3.2. Model Performance Per Pathway in the Combined Dataset
3.2.1. Distribution of Pathway Statistics
3.2.2. Comparing Pathway Category Size to MCC
3.3. Impact of MCC When Filtering Pathways from the Training Set by Pathway Size Thresholds
4. Discussion
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
Dataset | # Metabolite Features | # Pathway Features | # Metabolites | # Pathways | # Entries |
---|---|---|---|---|---|
L2 | 14,655 | 5435 | 5683 | 12 | 68,196 |
L3 | 14,655 | 8951 | 5683 | 172 | 977,476 |
Combined | 14,655 | 8977 | 5683 | 184 | 1,045,672 |
Dataset | Scaled | Mean MCC | Standard Deviation |
---|---|---|---|
Combined | True | 0.800 | 0.021 |
Combined | False | 0.771 | 0.009 |
L2 only | True | 0.728 | 0.029 |
L2 only | False | 0.784 | 0.013 |
L3 only | True | 0.655 | 0.031 |
L3 only | False | 0.618 | 0.048 |
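The mean MCC and standard deviation columns above summarize repeated train/test evaluations. As a minimal sketch of that kind of summary, the Python below computes the Matthews correlation coefficient from confusion-matrix counts and then aggregates it over several folds. The fold counts are hypothetical placeholders, not values from this study, and the choice of the sample standard deviation (`statistics.stdev`) is an assumption, since the table does not say which estimator was used.

```python
import math
import statistics


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Returns 0.0 when any marginal total is zero (the usual convention),
    because the coefficient is undefined in that case.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Hypothetical (tp, tn, fp, fn) counts for three evaluation folds.
fold_counts = [
    (90, 880, 12, 18),
    (85, 885, 10, 20),
    (92, 878, 15, 15),
]

# One MCC per fold, then the two summary statistics reported per table row.
fold_mcc = [mcc(*counts) for counts in fold_counts]
mean_mcc = statistics.mean(fold_mcc)
sd_mcc = statistics.stdev(fold_mcc)  # sample standard deviation (assumed)

print(f"mean MCC = {mean_mcc:.3f}, standard deviation = {sd_mcc:.3f}")
```

MCC uses all four confusion-matrix cells, which is why it is a common choice for heavily imbalanced labels such as metabolite-pathway associations.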
Test Set | Scaled | MCC |
---|---|---|
L2 only | True | 0.891 |
L2 only | False | 0.850 |
L3 only | True | 0.726 |
L3 only | False | 0.703 |
Dataset | Scaled | Resource | Unit | Amount |
---|---|---|---|---|
Combined | True | Compute time | Hours | 131.8 |
Combined | True | GPU RAM | Gigabytes | 4.3 |
Combined | True | RAM | Gigabytes | 13.6 |
Combined | False | Compute time | Hours | 131.1 |
Combined | False | GPU RAM | Gigabytes | 4.6 |
Combined | False | RAM | Gigabytes | 14.1 |
L3 only | True | Compute time | Hours | 171.3 |
L3 only | True | GPU RAM | Gigabytes | 2.9 |
L3 only | True | RAM | Gigabytes | 12.6 |
L3 only | False | Compute time | Hours | 127.8 |
L3 only | False | GPU RAM | Gigabytes | 4.1 |
L3 only | False | RAM | Gigabytes | 16.6 |
Pathway Size Metric | Correlation Method | Scale | Correlation Coefficient | p-Value |
---|---|---|---|---|
# Compounds | Spearman | regular | 0.442 | 3.463 × 10⁻¹¹ |
# Non-hydrogen Atoms | Spearman | regular | 0.501 | 4.234 × 10⁻¹⁴ |
# Compounds | Pearson | log10 | 0.572 | 2.177 × 10⁻¹⁷ |
# Non-hydrogen Atoms | Pearson | log10 | 0.584 | 3.160 × 10⁻¹⁸ |
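The table above pairs a rank-based Spearman coefficient on the regular scale with a Pearson coefficient on a log10 scale. The sketch below illustrates both calculations in plain Python. The pathway sizes and per-pathway MCC values are hypothetical placeholders, and applying log10 to the size metric only is an assumption, since the table does not state which variable was transformed. p-values are omitted; in practice a library such as SciPy (`scipy.stats.spearmanr`, `scipy.stats.pearsonr`) would report them alongside the coefficient.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical pathway sizes (e.g., number of compounds) and per-pathway MCC.
pathway_sizes = [12, 35, 80, 150, 420, 900]
pathway_mcc = [0.41, 0.55, 0.52, 0.68, 0.74, 0.83]

rho = spearman(pathway_sizes, pathway_mcc)
r_log = pearson([math.log10(s) for s in pathway_sizes], pathway_mcc)

print(f"Spearman (regular scale): {rho:.3f}")
print(f"Pearson (log10 size):     {r_log:.3f}")
```

Spearman is unchanged by any monotonic transform of either variable, so it needs no log scaling, whereas Pearson measures linear association and can change when a skewed size distribution is log-transformed.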
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Huckvale, E.D.; Moseley, H.N.B. Predicting the Association of Metabolites with Both Pathway Categories and Individual Pathways. Metabolites 2024, 14, 510. https://doi.org/10.3390/metabo14090510