Evaluating Feature Impact Prior to Phylogenetic Analysis Using Machine Learning Techniques
Abstract
1. Introduction
- Perform maximum parsimony to extract a phylogenetic tree [1];
- Identify all branches and their ancestral values;
- Determine the number of mutations (or changes) for each feature as represented on the phylogenetic tree;
- Develop a model to predict feature quality before performing phylogenetic analysis.
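The third step above, counting the changes each feature needs on a given tree, can be sketched with Fitch's small-parsimony algorithm. This is an illustrative sketch only: the source does not state which counting procedure was used, and the nested-tuple tree encoding, the function names, and the toy matrix are hypothetical.

```python
from typing import Dict, List, Set, Tuple, Union

# Hypothetical encoding: a tree is a leaf name or a pair of subtrees.
Tree = Union[str, Tuple["Tree", "Tree"]]


def fitch_changes(tree: Tree, states: Dict[str, int]) -> Tuple[Set[int], int]:
    """Return (candidate ancestral states, minimum changes) for one feature."""
    if isinstance(tree, str):
        # A leaf carries its observed state and needs no change.
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_changes(left, states)
    right_set, right_cost = fitch_changes(right, states)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no extra change here.
        return common, left_cost + right_cost
    # The children disagree: keep both options and count one change.
    return left_set | right_set, left_cost + right_cost + 1


def changes_per_feature(tree: Tree, matrix: Dict[str, List[int]]) -> List[int]:
    """Minimum number of changes for every feature (column) on a fixed tree."""
    n_features = len(next(iter(matrix.values())))
    return [
        fitch_changes(tree, {taxon: row[j] for taxon, row in matrix.items()})[1]
        for j in range(n_features)
    ]


# Toy example: four taxa, three binary features (made-up data).
toy_tree: Tree = (("A", "B"), ("C", "D"))
toy_matrix = {"A": [0, 0, 1], "B": [0, 1, 1], "C": [1, 0, 1], "D": [1, 1, 1]}
per_feature = changes_per_feature(toy_tree, toy_matrix)
print(per_feature, "tree length:", sum(per_feature))  # [1, 2, 0] tree length: 3
```

A feature that needs many changes relative to its minimum is a candidate low-quality feature; the per-feature counts are what a predictive model would be trained to anticipate.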
1.1. Identify Subgroups and Their Quality
1.2. Leveraging Machine Learning for Phylogenetic Analysis of Historical Scripts
2. Background
2.1. Established Phylogenetic Methods
2.2. Machine Learning in Phylogenetics
2.3. Challenges and Future Directions
3. Methods
3.1. Data Representation
3.1.1. DS1: Direct Use of Binary Dataset
3.1.2. DS2: Feature Extraction from DS1
3.1.3. DS3: Normalization of DS2
3.2. Cross-Validation and Data Transformation
3.3. Classification Phase
3.4. Experimental Setup
4. Results
4.1. Model Performance on Original Dataset
- DNNs consistently delivered strong performance across all datasets and fold sizes, with AUC values ranging from 0.87 to 0.95 and EER values between 0.12 and 0.19.
- SVMs also demonstrated robust performance, particularly for DS1 and DS2, with AUC values as high as 0.96 and 0.95, respectively, at k = 4. However, their performance on DS3 was slightly lower, with AUC values between 0.88 and 0.92.
- RFs displayed more variable performance, with AUC values ranging from 0.82 to 0.91 across datasets. Despite this variability, RFs maintained relatively low EER values, particularly for DS2.
4.2. Validation Using External Dataset
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Semple, C.; Steel, M. Phylogenetics; Oxford University Press on Demand: Oxford, UK, 2003.
- Salman, O.A.; Hosszú, G. Cladistic Analysis of the Evolution of Some Aramaic and Arabic Script Varieties. Int. J. Appl. Evol. Comput. (IJAEC) 2021, 12, 18–38.
- Salman, O.A.; Hosszú, G. Enhanced Phylogenetic Inference through Optimized Feature Selection and Computational Efficiency Analysis. Acta Polytech. Hung. 2024. under review.
- Salman, O.A.; Hosszú, G.; Kovács, F. A new feature selection algorithm for evolutionary analysis of Aramaic and Arabic script variants. Int. J. Intell. Eng. Inform. 2022, 10, 313–331.
- Salman, O.A.; Hosszú, G. Optimised feature dimension reduction method and its impact on the search for optimal trees. In Proceedings of the Workshop on the Advances of Information Technology, Budapest, Hungary, 6–7 February 2023; BME Department of Control Engineering and Information Technology: Budapest, Hungary, 2023.
- Salman, O.A.; Hosszú, G. A Phenetic Approach to Selected Variants of Arabic and Aramaic Scripts. Int. J. Data Anal. 2022, 3, 1–23.
- Salman, O.A.; Hosszú, G. Phylogenetic Inference Using Advanced Feature Selection. In Proceedings of the 2023 14th IEEE International Conference on Cognitive Infocommunications (CogInfoCom), Budapest, Hungary, 22–23 September 2023; pp. 000173–000178.
- Salman, O.A.; Hosszú, G. Phylogenetic modelling scripts for identifying script versions. Procedia Comput. Sci. 2024, 239, 1417–1424.
- Salman, O.A.; Hosszú, G. Using distance-based methods to calculate optimal and suboptimal parsimony trees. In Proceedings of the Workshop on the Advances of Information Technology, WAIT 2024, Budapest, Hungary, 6–7 February 2023; BME Department of Control Engineering and Information Technology: Budapest, Hungary, 2024.
- Wu, C.H.; Chen, H.-L.; Chen, S.-C. Gene classification artificial neural system. Int. J. Artif. Intell. Tools 1995, 4, 501–510.
- Mo, Y.K.; Hahn, M.; Smith, M.L. Applications of Machine Learning in Phylogenetics. Mol. Phylogenetics Evol. 2024, 196, 108066.
- Zhou, Y.; Zheng, H.; Huang, X.; Hao, S.; Li, D.; Zhao, J. Graph neural networks: Taxonomy, advances, and trends. ACM Trans. Intell. Syst. Technol. (TIST) 2022, 13, 1–54.
- Available online: https://github.com/OsamaAliSalman/Extended_Arabic-Aramaic-DataSet.git (accessed on 2 August 2024).
- Halgaswaththa, T.; Atukorale, A.S.; Jayawardena, M.; Weerasena, J. Neural network based phylogenetic analysis. In Proceedings of the 2012 International Conference on Biomedical Engineering (ICoBE), Penang, Malaysia, 27–28 February 2012; pp. 155–160.
- Suvorov, A.; Schrider, D.R. Reliable estimation of tree branch lengths using deep neural networks. bioRxiv 2022.
- Philippe, H.; Zhou, Y.; Brinkmann, H.; Rodrigue, N.; Delsuc, F. Heterotachy and long-branch attraction in phylogenetics. BMC Evol. Biol. 2005, 5, 1–8.
- Sullivan, J.; Swofford, D.L. Should we use model-based methods for phylogenetic inference when we know that assumptions about among-site rate variation and nucleotide substitution pattern are violated? Syst. Biol. 2001, 50, 723–729.
- Azouri, D.; Abadi, S.; Mansour, Y.; Mayrose, I.; Pupko, T. Harnessing machine learning to boost heuristic strategies for phylogenetic-tree search. Prepr. Res. Sq. 2020.
- Bernardini, G.; van Iersel, L.; Julien, E.; Stougie, L. Constructing phylogenetic networks via cherry picking and machine learning. Algorithms Mol. Biol. 2023, 18, 13.
- Zou, Z.; Zhang, H.; Guan, Y.; Zhang, J. Deep residual neural networks resolve quartet molecular phylogenies. Mol. Biol. Evol. 2020, 37, 1495–1507.
- Layne, E.; Dort, E.N.; Hamelin, R.; Li, Y.; Blanchette, M. Supervised learning on phylogenetically distributed data. Bioinformatics 2020, 36 (Suppl. 2), i895–i902.
- Smith, M.L.; Hahn, M.W. Phylogenetic inference using generative adversarial networks. Bioinformatics 2023, 39, btad543.
- Abadi, S.; Avram, O.; Rosset, S.; Pupko, T.; Mayrose, I. ModelTeller: Model selection for optimal phylogenetic reconstruction using machine learning. Mol. Biol. Evol. 2020, 37, 3338–3352.
- Lipták, P.; Attila, K. Constructing unrooted phylogenetic trees with reinforcement learning. Studia Univ. Babeș-Bolyai Inform. 2021, 37–53.
- Kalyaanamoorthy, S.; Minh, B.Q.; Wong, T.K.; Von Haeseler, A.; Jermiin, L.S. ModelFinder: Fast model selection for accurate phylogenetic estimates. Nat. Methods 2017, 14, 587–589.
- Wang, Z.; Sun, J.; Gao, Y.; Xue, Y.; Zhang, Y.; Li, K.; Zhang, W.; Zhang, C.; Zu, J.; Zhang, L. Fusang: A framework for phylogenetic tree inference via deep learning. Nucleic Acids Res. 2023, 51, 10909–10923.
- Tang, X.; Zepeda-Nuñez, L.; Yang, S.; Zhao, Z.; Solís-Lemus, C. Novel symmetry-preserving neural network model for phylogenetic inference. Bioinform. Adv. 2024, 4, vbae022.
- Tadist, K.; Najah, S.; Nikolov, N.S.; Roose, L. Feature selection methods and genomic big data: A systematic review. J. Big Data 2019, 6, 79.
- Kaur, A.; Sarmadi, M. Comparative Analysis of Data Preprocessing Methods, Feature Selection Techniques and Machine Learning Models for Improved Classification and Regression Performance on Imbalanced Genetic Data. arXiv 2024, arXiv:2402.14980.
- Fawcett, T. An Introduction to ROC Analysis. Pattern Recognit. Lett. 2006, 27, 861–874.
- Bradley, A.P. The Use of the Area Under the ROC Curve in the Evaluation of Machine Learning Algorithms. Pattern Recognit. 1997, 30, 1145–1159.
- Jain, A.K.; Ross, A.; Prabhakar, S. An Introduction to Biometric Recognition. IEEE Trans. Circuits Syst. Video Technol. 2004, 14, 4–20.
- Daugman, J. How Iris Recognition Works. IEEE Trans. Circuits Syst. Video Technol. 2004, 14, 21–30.
- Shannon, C.E. A Mathematical Theory of Communication. Bell Syst. Tech. J. 1948, 27, 379–423.
- Felsenstein, J. Inferring Phylogenies; Sinauer Associates: Sunderland, MA, USA, 2004.
- Han, J.; Pei, J.; Kamber, M. Data Mining: Concepts and Techniques, 3rd ed.; Morgan Kaufmann: Burlington, MA, USA, 2011.
- Hoffmann, K.; Bouckaert, R.; Greenhill, S.J.; Kühnert, D. Bayesian phylogenetic analysis of linguistic data using BEAST. J. Lang. Evol. 2021, 6, 119–135.
| Notation | Description |
|---|---|
| | The total number of taxa in DS1. |
| | The total number of features in DS1. |
| | or DS1. |
| | features in DS1, a measure of randomness in the features. |
| | taxa in DS2 before normalization. |
| | and its being saved it in DS3. |
| | A small constant added to probabilities to avoid undefined log calculations during entropy computation. |
| | A value of either 0 or 1 that determines whether the equation is evaluated for the 0s or the 1s. |
| Equation | Description |
|---|---|
| | feature of DS1. |
| | the total count of K’s |
| | feature |
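The descriptions above mention a count of K's (with K equal to 0 or 1) per feature and a small constant added to probabilities before taking logarithms in the entropy computation. The sketch below shows one way these two quantities could be computed for a single binary feature column. The exact equations are not recoverable from the tables, so the function names, the value of the constant, and the base-2 logarithm are assumptions.

```python
import math
from typing import List

EPSILON = 1e-10  # assumed value; the source only calls it "a small constant"


def count_k(column: List[int], k: int) -> int:
    """Count how many taxa carry the value k (0 or 1) for one feature."""
    return sum(1 for value in column if value == k)


def binary_entropy(column: List[int], eps: float = EPSILON) -> float:
    """Shannon entropy (base 2, assumed) of one binary feature column.

    eps is added inside the logarithm so that a probability of zero
    does not produce an undefined log(0).
    """
    n = len(column)
    entropy = 0.0
    for k in (0, 1):
        p = count_k(column, k) / n
        entropy -= p * math.log2(p + eps)
    return entropy


balanced = [0, 1, 0, 1]   # made-up column: maximally mixed
constant = [1, 1, 1, 1]   # made-up column: no variation
print(count_k(balanced, 1), binary_entropy(balanced), binary_entropy(constant))
```

A balanced column yields an entropy close to 1 bit and a constant column an entropy close to 0, which is the sense in which entropy measures the randomness of a feature.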
Model | Hyperparameters |
---|---|
DNN | Three hidden layers with sizes [15, 8, 4]; mean squared error: 0.001; learning rate: 0.001; activation function for hidden layers: tansig; activation function for output node: logsig |
SVM | Kernel function: radial basis function (RBF); box constraint: 30; kernel scale: 10 |
RF | Number of trees: 300; maximum number of splits: 50; number of variables to sample: all; minimum leaf size: 5 |
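To make the DNN row concrete, the following sketch builds the shape of such a network: three hidden layers of 15, 8, and 4 units with a tansig (hyperbolic tangent) activation, and one output node with a logsig (logistic) activation. It shows a forward pass only, with random untrained weights. The input size, the weight initialization, and all function names are hypothetical and not taken from the source.

```python
import math
import random
from typing import List, Sequence

Layer = List[List[float]]  # one row of weights per neuron; last entry is the bias


def tansig(x: float) -> float:
    """Hyperbolic tangent sigmoid, equal to tanh(x); output in (-1, 1)."""
    return math.tanh(x)


def logsig(x: float) -> float:
    """Logistic sigmoid; output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def build_network(n_inputs: int, hidden: Sequence[int] = (15, 8, 4),
                  seed: int = 0) -> List[Layer]:
    """Create random weights for the hidden layers plus one output node."""
    rng = random.Random(seed)
    sizes = [n_inputs, *hidden, 1]
    return [
        [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def forward(x: Sequence[float], layers: List[Layer]) -> float:
    """Propagate one input vector; tansig in hidden layers, logsig at the output."""
    activations = list(x)
    for index, layer in enumerate(layers):
        act = logsig if index == len(layers) - 1 else tansig
        activations = [
            act(sum(w * a for w, a in zip(neuron, activations)) + neuron[-1])
            for neuron in layer
        ]
    return activations[0]


network = build_network(n_inputs=10)  # 10 inputs is an arbitrary choice
score = forward([0.0, 1.0] * 5, network)
print([len(layer) for layer in network], score)
```

Because the output node is a logistic unit, the score lies strictly between 0 and 1 and can be thresholded to obtain a binary decision.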
| | No. Features | Tree Length | Optimal Trees | CI | Time (s) |
|---|---|---|---|---|---|
| 6 | 97 | 229 | 2 | 0.424 | 369.3 |
| 5 | 95 | 217 | 3 | 0.438 | 184.7 |
| 4 | 88 | 181 | 2 | 0.486 | 28.8 |
| 3 | 78 | 140 | 3 | 0.557 | 1.33 |
| 2 | 60 | 86 | 2 | 0.698 | 0.03 |
| 1 | 32 | 32 | 1 | 1 | 0.005 |
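Reading each row of the table above as (index, number of features, tree length, number of optimal trees, CI, time), the CI values agree with the usual consistency index: the minimum possible number of changes divided by the observed tree length. The short check below assumes that every binary feature requires at least one change, so that the minimum equals the number of features. That reading is an inference from the numbers, not a statement from the source.

```python
from typing import List, Tuple

# (number of features, tree length, reported CI), copied from the table rows.
ROWS: List[Tuple[int, int, float]] = [
    (97, 229, 0.424),
    (95, 217, 0.438),
    (88, 181, 0.486),
    (78, 140, 0.557),
    (60, 86, 0.698),
    (32, 32, 1.0),
]


def consistency_index(min_changes: int, tree_length: int) -> float:
    """CI = minimum possible number of changes / observed tree length."""
    return min_changes / tree_length


for n_features, tree_length, reported in ROWS:
    computed = consistency_index(n_features, tree_length)
    # Reported values are rounded to three decimals.
    print(n_features, tree_length, round(computed, 3), reported)
```

A CI of 1 (last row) means no feature changes more than once on the tree, while lower values indicate increasing homoplasy as more features are included.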
Model | Fold (k) | DS1 Acc (%) | DS1 FRR | DS1 FAR | DS2 Acc (%) | DS2 FRR | DS2 FAR | DS3 Acc (%) | DS3 FRR | DS3 FAR |
---|---|---|---|---|---|---|---|---|---|---|
DNN | 2 | 81.12 | 0.09 | 0.32 | 79.74 | 0.08 | 0.35 | 84.26 | 0.1 | 0.24 |
DNN | 3 | 80.45 | 0.09 | 0.3 | 85.07 | 0.1 | 0.21 | 83.9 | 0.13 | 0.25 |
DNN | 4 | 83.66 | 0.06 | 0.31 | 90.3 | 0.05 | 0.18 | 86.37 | 0.09 | 0.22 |
SVM | 2 | 88.71 | 0.13 | 0 | 88.56 | 0.1 | 0.14 | 83.19 | 0.17 | 0.13 |
SVM | 3 | 87.65 | 0.15 | 0 | 83.23 | 0.13 | 0.22 | 79.56 | 0.2 | 0.11 |
SVM | 4 | 92.75 | 0.09 | 0 | 86.43 | 0.11 | 0.17 | 84.61 | 0.13 | 0.17 |
RF | 2 | 82.65 | 0.11 | 0.27 | 82.26 | 0.08 | 0.31 | 85.66 | 0.08 | 0.24 |
RF | 3 | 78.64 | 0.13 | 0.25 | 82.4 | 0.1 | 0.29 | 77.01 | 0.18 | 0.33 |
RF | 4 | 81.8 | 0.14 | 0.28 | 82.79 | 0.09 | 0.29 | 84.16 | 0.12 | 0.19 |
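For readers unfamiliar with the three metrics reported above, the sketch below computes accuracy, false rejection rate (FRR), and false acceptance rate (FAR) from binary labels and predictions using their standard definitions. Treating label 1 as the positive ("accept") class is an assumption, as are the function name and the toy data. The source does not show how its values were computed.

```python
from typing import Dict, Sequence


def acc_frr_far(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy (%), FRR and FAR, assuming label 1 is the positive class.

    Both classes must be present in y_true, otherwise a rate is undefined.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "acc": 100.0 * (tp + tn) / len(pairs),  # share of correct decisions
        "frr": fn / (fn + tp),                  # positives wrongly rejected
        "far": fp / (fp + tn),                  # negatives wrongly accepted
    }


# Made-up example: 4 positives and 4 negatives.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
predictions = [1, 1, 1, 0, 0, 0, 1, 1]
metrics = acc_frr_far(labels, predictions)
print(metrics)  # {'acc': 62.5, 'frr': 0.25, 'far': 0.5}
```

Under these definitions a FAR of 0 means that no negative sample was accepted at the chosen decision threshold.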
Model | Fold (k) | DS1 AUC | DS1 EER | DS2 AUC | DS2 EER | DS3 AUC | DS3 EER |
---|---|---|---|---|---|---|---|
DNN | 2 | 0.92 | 0.19 | 0.87 | 0.16 | 0.94 | 0.13 |
DNN | 3 | 0.94 | 0.13 | 0.91 | 0.2 | 0.88 | 0.17 |
DNN | 4 | 0.95 | 0.13 | 0.95 | 0.12 | 0.9 | 0.15 |
SVM | 2 | 0.81 | 0.25 | 0.9 | 0.16 | 0.92 | 0.16 |
SVM | 3 | 0.85 | 0.18 | 0.88 | 0.24 | 0.88 | 0.17 |
SVM | 4 | 0.96 | 0.11 | 0.95 | 0.09 | 0.91 | 0.17 |
RF | 2 | 0.91 | 0.21 | 0.82 | 0.15 | 0.93 | 0.15 |
RF | 3 | 0.91 | 0.16 | 0.86 | 0.18 | 0.87 | 0.21 |
RF | 4 | 0.91 | 0.19 | 0.91 | 0.13 | 0.91 | 0.17 |
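The AUC and EER values above summarize classifier scores across all decision thresholds. As a minimal sketch of what the two quantities measure, the code below computes AUC as the probability that a positive sample scores higher than a negative one, and approximates EER as the point where FAR and FRR are closest. The function names, the threshold sweep, and the toy scores are assumptions, and the authors' actual evaluation code is not shown in the source.

```python
from typing import Sequence


def auc(pos: Sequence[float], neg: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def eer(pos: Sequence[float], neg: Sequence[float]) -> float:
    """Approximate equal error rate over the observed score thresholds.

    Scores at or above the threshold are accepted. The result is the mean
    of FAR and FRR at the threshold where the two rates are closest.
    """
    best_gap, best_rate = None, None
    for threshold in sorted(set(pos) | set(neg)):
        frr = sum(1 for p in pos if p < threshold) / len(pos)
        far = sum(1 for n in neg if n >= threshold) / len(neg)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_rate = gap, (far + frr) / 2
    return best_rate


# Made-up scores for 4 positive and 4 negative samples.
pos_scores = [0.9, 0.8, 0.7, 0.4]
neg_scores = [0.6, 0.5, 0.3, 0.2]
print(auc(pos_scores, neg_scores), eer(pos_scores, neg_scores))  # 0.875 0.25
```

Higher AUC and lower EER both indicate better separation between the two classes, which is how the tabulated values are read.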
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Salman, O.A.; Hosszú, G. Evaluating Feature Impact Prior to Phylogenetic Analysis Using Machine Learning Techniques. Information 2024, 15, 696. https://doi.org/10.3390/info15110696