SIBILA: Automated Machine-Learning-Based Development of Interpretable Machine-Learning Models on High-Performance Computing Platforms
Abstract
1. Introduction
2. Related Work
- Auto-SKLearn [29] is an automated machine-learning tool built upon the scikit-learn library. It relieves users of the tasks of algorithm selection and hyperparameter tuning. The package also integrates feature-engineering techniques such as one-hot encoding, numerical feature standardization, and principal component analysis (PCA). It leverages scikit-learn estimators to handle both classification and regression tasks. Auto-SKLearn constructs a pipeline and optimizes it with Bayesian search. Two components refine this hyperparameter tuning: meta-learning warm-starts the Bayesian optimizer, and an ensemble is automatically assembled from the configurations evaluated during the optimization process.
- FLAML [31] identifies accurate models or configurations for common ML/AI tasks while minimizing computational resource use. It eliminates the need for users to manually choose models or hyperparameters for training and inference, while still allowing for easy customization. By automatically adapting large language models (LLMs) to specific applications, FLAML maximizes the advantages of these resource-intensive models while reducing associated costs. It allows users to create and deploy adaptive AI agents with minimal effort. FLAML also provides a rapid auto-tuning tool driven by a novel, cost-efficient approach, capable of managing large search spaces with varying evaluation costs, complex constraints, guidance, and early stopping mechanisms.
- H2O-AutoML [28] is an open-source, distributed, in-memory machine-learning platform. It is compatible with both R and Python and supports a wide range of commonly used statistical and machine-learning algorithms, such as gradient boosting machines, generalized linear models, and deep learning. H2O features an automated machine-learning module that utilizes its proprietary algorithms to build pipelines. It employs exhaustive search techniques for feature engineering and hyperparameter optimization to enhance pipeline performance. The platform automates various complex tasks in data science and machine learning, including feature engineering, model validation, tuning, selection, and deployment. Additionally, it offers automated visualization tools and machine-learning interpretation.
- AutoGluon [33] can generate models that predict the values in one column based on the other columns for standard tabular datasets (such as those stored in CSV files or extracted from databases). With a single call, it delivers high accuracy in typical supervised learning tasks, including both classification and regression, while automatically handling tasks like data cleaning, feature engineering, hyperparameter tuning, and model selection. The single-call workflow shared by these libraries is sketched for two of them immediately after this list.
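As a purely illustrative sketch of that workflow (not part of SIBILA; the dataset, time budgets, and metric below are arbitrary choices, and the auto-sklearn and flaml packages are assumed to be installed):

```python
# Illustrative single-call AutoML workflow with two of the libraries above.
# Not SIBILA's code: dataset, time budgets and metric are arbitrary examples.
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Auto-SKLearn: Bayesian optimization over scikit-learn pipelines.
import autosklearn.classification
ask = autosklearn.classification.AutoSklearnClassifier(time_left_for_this_task=120)
ask.fit(X_train, y_train)
print("Auto-SKLearn accuracy:", accuracy_score(y_test, ask.predict(X_test)))

# FLAML: cost-aware search under a user-defined time budget.
from flaml import AutoML
automl = AutoML()
automl.fit(X_train, y_train, task="classification", time_budget=60, metric="accuracy")
print("FLAML best estimator:", automl.best_estimator)
print("FLAML accuracy:", accuracy_score(y_test, automl.predict(X_test)))
```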
3. Materials and Methods
3.1. Architecture
3.2. Machine-Learning and Deep-Learning Models
3.3. Evaluation Metrics
3.4. Data Cleaning
3.5. Interpretability Algorithms
3.6. Consensus
3.7. Scalability and Performance
3.8. Containerization
4. Results
4.1. Model Search
4.2. Interpretability and Data Fusion
4.3. Parallelization and GPU Usage
5. Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
ADASYN | Adaptive Synthetic |
AI | Artificial Intelligence |
ALE | Accumulated Local Effects |
AUC | Area Under the Curve |
AutoML | Automated Machine Learning |
CSV | Comma-Separated Values |
DL | Deep Learning |
EDA | Exploratory Data Analysis |
GPU | Graphics Processing Unit |
HPC | High-Performance Computing |
ICE | Individual Conditional Expectation |
JSON | JavaScript Object Notation |
LIME | Local Interpretable Model-agnostic Explanations |
LLM | Large Language Models |
MAE | Mean Absolute Error |
MCC | Matthews Correlation Coefficient |
ML | Machine Learning |
MSE | Mean Squared Error |
PCA | Principal Component Analysis |
PDP | Partial Dependence Plot |
RMSE | Root Mean Squared Error |
SLURM | Simple Linux Utility for Resource Management |
SMOTE | Synthetic Minority Oversampling Technique |
XAI | eXplainable Artificial Intelligence |
Appendix A
Appendix A.1. Hyperparameter Values Used in the Random Search Process
Classification tasks:
Model | Hyperparameter Values |
---|---|
ANN | batch_size: 128, objective: accuracy, activate: [relu, elu, tanh, sigmoid, softmax, linear, exponential], dropout_rate: 0.15, optimizer: [Adam, RMSprop, SGD, Adagrad], loss: sparse_categorical_crossentropy, epochs: 100 |
BAG | n_estimators: [10, 20, 50], max_features: [0.25, 0.5, 1.0], oob_score: [true, false], bootstrap: [true, false], max_samples: [0.25, 0.5, 1.0] |
DT | criterion: [gini, entropy], splitter: [best, random], max_depth: [2, 4, 6, 8, 10, 12], min_samples_split: [0.1, 0.2, 0.4, 0.8, 0.9], min_samples_leaf: [1, 2, 3, 4], max_features: [auto, sqrt, log2], max_leaf_nodes: [50, 100, 200, 300], min_impurity_decrease: [0, 0.1, 0.2, 0.3, 0.4, 0.5], ccp_alpha: [0, 0.1, 0.2, 0.3, 0.4, 0.5] |
KNN | n_neighbors: [3, 4, 5, 6, 7], algorithm: [auto, ball_tree, kd_tree, brute], leaf_size: [10, 20, 30, 50], metric: [minkowski, euclidean, manhattan, chebyshev], p: [1, 2, 3] |
LR | penalty: [l1, l2, elasticnet], tol: [0.001, 0.0001, ], C: [0.25, 0.3, 0.5, 0.6, 0.75, 0.9, 1], fit_intercept: [true, false], solver: [liblinear, newton-cg, sag, saga], max_iter: [50, 100, 500, 1000], l1_ratio: [0.1, 0.25, 0.5, 0.75, 1] |
RF | n_estimators: [50, 100, 400, 800], criterion: [gini, entropy], max_depth: [25, 50, 250], min_samples_split: [2, 5, 10], min_samples_leaf: [2, 5, 10], max_features: [auto, sqrt, log2], oob_score: [true, false], bootstrap: [true, false] |
RLF | tree_size: [4, 16, 32], max_rules: [50, 100, 500, 1000], memory_par: [0.01, 0.05, 0.1], lin_trim_quantile: [0.025, 0.05, 0.1], lin_standardise: [true, false], exp_rand_tree_size: [true, false], cv: [3, 5] |
RP | n_discretize_bins: [10, 20, 50], k: [1, 2], prune_size: [0.25, 0.33, 0.5] |
SVM | C: [0.5, 1, 1.5], kernel: [linear, poly, rbf, sigmoid], degree: [1, 2, 3, 4, 5], gamma: [scale, auto], coef0: [0, 0.5, 1], shrinking: [true, false], tol: [, , , ], cache_size: [100, 200, 300], max_iter: [−1, 100, 150, 200, 500], decision_function_shape: [ovo, ovr] |
XGBOOST | n_estimators: [50, 100, 300, 600], booster: [gbtree, gblinear, dart], eta: [0.1, 0.3, 0.5], gamma: [0, 0.5], max_depth: [4, 6, 8], min_child_weight: [1, 2], max_delta_step: [0, 5, 10], subsample: [0.1, 0.5, 1], lambda: [0.5, 1, 1.5], alpha: [0, 0.5, 1], tree_method: [auto, exact, approx, hist], grow_policy: [depthwise, lossguide], max_leaves: [0, 5, 15, 25], max_bin: [128, 256], sketch_eps: [0.01, 0.03, 0.05], refresh_leaf: [0, 1], scale_pos_weight: [1, 10, 25, 50, 75, 99, 100, 1000] |
Regression tasks:
Model | Hyperparameter Values |
---|---|
ANN | batch_size: 128, objective: loss, activate: [relu, elu, tanh, sigmoid, softmax, linear, exponential], optimizer: [Adam, RMSprop, SGD, Adagrad], loss: mean_absolute_error, epochs: 120 |
BAG | n_estimators: [10, 20, 50], max_features: [0.25, 0.5, 1.0], oob_score: [true, false], bootstrap: [true, false], max_samples: [0.25, 0.5, 1.0] |
DT | splitter: [best, random], max_depth: [2, 4, 6, 8, 10, 12], min_samples_split: [0.1, 0.2, 0.4, 0.8, 0.9], min_samples_leaf: [1, 2, 3, 4], max_features: [auto, sqrt, log2], max_leaf_nodes: [50, 100, 200, 300], min_impurity_decrease: [0, 0.1, 0.2, 0.3, 0.4, 0.5], ccp_alpha: [0, 0.1, 0.2, 0.3, 0.4, 0.5] |
KNN | n_neighbors: [3, 4, 5, 6, 7], algorithm: [auto, ball_tree, kd_tree, brute], leaf_size: [10, 20, 30, 50], metric: [minkowski, euclidean, manhattan, chebyshev], p: [1, 2, 3] |
LR | penalty: [l1, l2, elasticnet], tol: [, , ], C: [0.25, 0.3, 0.5, 0.6, 0.75, 0.9, 1], fit_intercept: [true, false], solver: [liblinear, newton-cg, sag, saga], max_iter: [50, 100, 500, 1000], l1_ratio: [0.1, 0.25, 0.5, 0.75, 1] |
RF | n_estimators: [50, 100, 400, 800, 2000], max_depth: [25, 50, 250, 500], min_samples_split: [2, 5, 10, 20], min_samples_leaf: [2, 5, 10], max_features: [auto, sqrt, log2], oob_score: [true, false], bootstrap: [true, false], min_weight_fraction_leaf: [0, 0.5] |
SVM | C: [0.5, 1, 1.5], kernel: [linear, poly, rbf, sigmoid], degree: [1, 2, 3, 4, 5], gamma: [scale, auto], coef0: [0, 0.5, 1], shrinking: [true, false], tol: [, , , ], cache_size: [100, 200, 300], max_iter: [−1, 100, 150, 200, 500] |
XGBOOST | n_estimators: [50, 100, 300, 600, 1000], booster: [gbtree, gblinear, dart], eta: [0.1, 0.3, 0.5], gamma: [0, 0.25, 0.5], max_depth: [4, 6, 8, 12, 20], min_child_weight: [1, 2], max_delta_step: [0, 5, 10], subsample: [0.1, 0.5, 1], lambda: [0.5, 1, 1.5], alpha: [0, 0.5, 1], tree_method: [auto, exact, approx, hist], grow_policy: [depthwise, lossguide], max_leaves: [0, 5, 15, 25, 35, 50], max_bin: [128, 256, 512], sketch_eps: [0.01, 0.03, 0.05], refresh_leaf: [0, 1], scale_pos_weight: [1, 10, 25, 50, 75, 99, 100, 1000] |
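For illustration, a grid such as the Random Forest classification grid above could be sampled with scikit-learn's RandomizedSearchCV; the sketch below does not reproduce SIBILA's own search implementation, and the dataset, number of iterations, and scoring are arbitrary choices:

```python
# Illustrative random search over the Random Forest classification grid above.
# Not SIBILA's implementation; dataset, n_iter and scoring are arbitrary.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)

param_distributions = {
    "n_estimators": [50, 100, 400, 800],
    "criterion": ["gini", "entropy"],
    "max_depth": [25, 50, 250],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [2, 5, 10],
    "max_features": ["sqrt", "log2"],  # "auto" was removed in recent scikit-learn
    "bootstrap": [True, False],        # oob_score omitted: it requires bootstrap=True
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=25,            # number of sampled configurations (illustrative)
    cv=5,
    scoring="accuracy",
    n_jobs=-1,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```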
Appendix A.2. Best Hyperparameters of Each Model Selected for Each Dataset
XGBoost (Cancer dataset):
Hyperparameter | Value |
---|---|
N_estimators | 100 |
Booster | Dart |
Eta | 0.1 |
Gamma | 0.5 |
Max_depth | 8 |
Min_child_weight | 2 |
Max_delta_step | 10 |
Subsample | 1 |
Lambda | 1 |
Alpha | 0 |
Tree_method | Approx |
Grow_policy | Depthwise |
Max_leaves | 25 |
Max_bin | 256 |
Sketch_eps | 0.05 |
Refresh_leaf | 1 |
Scale_pos_weight | 100 |
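As an illustration, this configuration maps onto the xgboost scikit-learn wrapper roughly as follows (not SIBILA's code; eta, lambda, and alpha correspond to learning_rate, reg_lambda, and reg_alpha, and the updater-specific sketch_eps and refresh_leaf parameters are omitted from this minimal sketch):

```python
# Sketch: the XGBoost configuration from the table above, expressed with the
# scikit-learn wrapper.  Updater-specific parameters (sketch_eps, refresh_leaf)
# are left out of this minimal example.
from xgboost import XGBClassifier

model = XGBClassifier(
    n_estimators=100,
    booster="dart",
    learning_rate=0.1,     # eta
    gamma=0.5,
    max_depth=8,
    min_child_weight=2,
    max_delta_step=10,
    subsample=1.0,
    reg_lambda=1.0,        # lambda
    reg_alpha=0.0,         # alpha
    tree_method="approx",
    grow_policy="depthwise",
    max_leaves=25,
    max_bin=256,
    scale_pos_weight=100,  # strong weighting of the minority class
)
# model.fit(X_train, y_train) would then train the classifier.
```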
ANN (Spam dataset):
Hyperparameter | Value |
---|---|
Num_layers | 4 |
Units | [248, 12, 12, 184] |
Output_units | 2 |
Activation | Softmax |
Dropout | False |
Optimizer | Adam |
Learning_rate | 0.001131 |
Loss_function | Sparse_categorical_crossentropy |
Epochs | 100 |
XGBoost (Wine dataset):
Hyperparameter | Value |
---|---|
N_estimators | 600 |
Booster | Gbtree |
Eta | 0.5 |
Gamma | 0.25 |
Max_depth | 12 |
Min_child_weight | 1 |
Max_delta_step | 0 |
Subsample | 0.1 |
Lambda | 1.5 |
Alpha | 0.5 |
Tree_method | Hist |
Grow_policy | Depthwise |
Max_leaves | 25 |
Max_bin | 256 |
Sketch_eps | 0.03 |
Refresh_leaf | 1 |
Scale_pos_weight | 25 |
Random Forest (Crime dataset):
Hyperparameter | Value |
---|---|
N_estimators | 400 |
Criterion | mse |
Max_depth | 50 |
Min_samples_split | 20 |
Min_samples_leaf | 2 |
Min_weight_fraction_leaf | 0 |
Max_features | sqrt |
Oob_score | False |
Bootstrap | False |
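For illustration, this Random Forest configuration corresponds roughly to the following scikit-learn instantiation (not SIBILA's code; the "mse" criterion has been renamed "squared_error" in recent scikit-learn releases):

```python
# Sketch: the Random Forest regression configuration from the table above.
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(
    n_estimators=400,
    criterion="squared_error",   # listed as "mse" in older scikit-learn versions
    max_depth=50,
    min_samples_split=20,
    min_samples_leaf=2,
    min_weight_fraction_leaf=0.0,
    max_features="sqrt",
    oob_score=False,
    bootstrap=False,
)
# model.fit(X_train, y_train) would then fit the regressor.
```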
Appendix A.3. Metrics Obtained for Every Model with All the Datasets
Cancer dataset:
Model | Accuracy | Precision | F1 Score | Recall | Specificity | AUC | Matthews |
---|---|---|---|---|---|---|---|
ANN | 97.674 | 100.0 | 50.0 | 33.333 | 100.0 | 0.667 | 0.571 |
BAG | 98.837 | 100.0 | 80.0 | 66.667 | 100.0 | 0.833 | 0.812 |
DT | 96.512 | 0.0 | 0.0 | 0.0 | 100.0 | 0.500 | 0.0 |
KNN | 96.512 | 0.0 | 0.0 | 0.0 | 100.0 | 0.500 | 0.0 |
LR | 97.093 | 100.0 | 28.571 | 16.667 | 100.0 | 0.583 | 0.402 |
RF | 98.256 | 100.0 | 66.667 | 50.000 | 100.0 | 0.750 | 0.701 |
RLF | 99.419 | 100.0 | 90.909 | 83.333 | 100.0 | 0.917 | 0.910 |
RP | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 1.0 | 1.0 |
SVM | 96.512 | 0.0 | 0.0 | 0.0 | 100.0 | 0.500 | 0.0 |
XGB | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 1.0 | 1.0 |
Spam dataset:
Model | Accuracy | Precision | F1 Score | Recall | Specificity | AUC | Matthews |
---|---|---|---|---|---|---|---|
ANN | 94.897 | 92.818 | 93.463 | 94.118 | 95.390 | 0.948 | 0.893 |
BAG | 94.571 | 94.236 | 92.898 | 91.597 | 96.454 | 0.940 | 0.885 |
DT | 77.199 | 94.012 | 59.924 | 43.978 | 98.227 | 0.711 | 0.534 |
KNN | 87.079 | 84.0 | 83.168 | 82.353 | 90.071 | 0.862 | 0.727 |
LR | 92.725 | 92.151 | 90.442 | 88.796 | 95.213 | 0.920 | 0.846 |
RF | 95.223 | 95.101 | 93.750 | 92.437 | 96.986 | 0.947 | 0.899 |
RLF | 94.137 | 94.169 | 92.286 | 90.476 | 96.454 | 0.935 | 0.876 |
RP | 88.708 | 93.471 | 83.951 | 76.190 | 96.631 | 0.864 | 0.763 |
SVM | 78.284 | 67.253 | 75.369 | 85.714 | 73.582 | 0.796 | 0.578 |
XGB | 89.794 | 80.510 | 88.071 | 97.199 | 85.106 | 0.912 | 0.804 |
Wine dataset:
Model | Accuracy | Precision | F1 Score | Recall | Specificity | AUC | Matthews |
---|---|---|---|---|---|---|---|
ANN | 69.444 | 51.389 | 54.193 | 61.905 | 29.098 | 0.518 | 0.600 |
BAG | 94.444 | 94.103 | 94.447 | 95.055 | 32.717 | 0.535 | 0.917 |
DT | 52.778 | 43.452 | 46.963 | 56.695 | 28.123 | 0.525 | 0.402 |
KNN | 80.556 | 78.974 | 78.792 | 78.816 | 30.389 | 0.526 | 0.706 |
LR | 94.444 | 94.103 | 94.447 | 95.055 | 32.717 | 0.535 | 0.917 |
RF | 97.222 | 96.667 | 96.912 | 97.436 | 32.963 | 0.535 | 0.959 |
RLF | 97.222 | 96.667 | 96.912 | 97.436 | 32.963 | 0.535 | 0.959 |
SVM | 61.111 | 53.968 | 54.325 | 63.146 | 26.167 | 0.529 | 0.454 |
XGB | 97.222 | 97.619 | 97.531 | 97.619 | 33.095 | 0.536 | 0.959 |
Crime dataset:
Model | Pearson | R² | MAE | MSE | RMSE |
---|---|---|---|---|---|
ANN | 40.340 | 0.114 | 0.139 | 0.042 | 0.205 |
BAG | 77.408 | 0.599 | 0.091 | 0.019 | 0.138 |
DT | nan | −0.004 | 0.170 | 0.048 | 0.218 |
KNN | 44.353 | 0.162 | 0.141 | 0.040 | 0.199 |
LR | 76.354 | 0.582 | 0.100 | 0.020 | 0.141 |
RF | 79.333 | 0.628 | 0.087 | 0.018 | 0.133 |
SVM | 0.116 | −0.169 | 0.196 | 0.055 | 0.235 |
XGB | 73.440 | 0.533 | 0.100 | 0.022 | 0.149 |
References
- Misra, N.; Dixit, Y.; Al-Mallahi, A.; Bhullar, M.S.; Upadhyay, R.; Martynenko, A. IoT, big data, and artificial intelligence in agriculture and food industry. IEEE Internet Things J. 2020, 9, 6305–6324. [Google Scholar] [CrossRef]
- Duan, Y.; Edwards, J.S.; Dwivedi, Y.K. Artificial intelligence for decision making in the era of Big Data–evolution, challenges and research agenda. Int. J. Inf. Manag. 2019, 48, 63–71. [Google Scholar] [CrossRef]
- Choi, R.Y.; Coyner, A.S.; Kalpathy-Cramer, J.; Chiang, M.F.; Campbell, J.P. Introduction to machine learning, neural networks, and deep learning. Transl. Vis. Sci. Technol. 2020, 9, 14. [Google Scholar] [PubMed]
- Ching, T.; Himmelstein, D.S.; Beaulieu-Jones, B.K.; Kalinin, A.A.; Do, B.T.; Way, G.P.; Ferrero, E.; Agapow, P.M.; Zietz, M.; Hoffman, M.M.; et al. Opportunities and obstacles for deep learning in biology and medicine. J. R. Soc. Interface 2018, 15, 20170387. [Google Scholar] [CrossRef] [PubMed]
- Bai, Q.; Ma, J.; Liu, S.; Xu, T.; Banegas-Luna, A.J.; Pérez-Sánchez, H.; Tian, Y.; Huang, J.; Liu, H.; Yao, X. WADDAICA: A webserver for aiding protein drug design by artificial intelligence and classical algorithm. Comput. Struct. Biotechnol. J. 2021, 19, 3573–3579. [Google Scholar] [CrossRef]
- Mater, A.C.; Coote, M.L. Deep learning in chemistry. J. Chem. Inf. Model. 2019, 59, 2545–2559. [Google Scholar] [CrossRef]
- Goh, G.B.; Hodas, N.O.; Vishnu, A. Deep learning for computational chemistry. J. Comput. Chem. 2017, 38, 1291–1307. [Google Scholar] [CrossRef]
- Miotto, R.; Wang, F.; Wang, S.; Jiang, X.; Dudley, J.T. Deep learning for healthcare: Review, opportunities and challenges. Briefings Bioinform. 2018, 19, 1236–1246. [Google Scholar] [CrossRef]
- Zou, J.; Huss, M.; Abid, A.; Mohammadi, P.; Torkamani, A.; Telenti, A. A primer on deep learning in genomics. Nat. Genet. 2019, 51, 12–18. [Google Scholar] [CrossRef]
- Eraslan, G.; Avsec, Ž.; Gagneur, J.; Theis, F.J. Deep learning: New computational modelling techniques for genomics. Nat. Rev. Genet. 2019, 20, 389–403. [Google Scholar] [CrossRef]
- Li, H.; Tian, S.; Li, Y.; Fang, Q.; Tan, R.; Pan, Y.; Huang, C.; Xu, Y.; Gao, X. Modern deep learning in bioinformatics. J. Mol. Cell Biol. 2020, 12, 823–827. [Google Scholar] [CrossRef] [PubMed]
- Gawehn, E.; Hiss, J.A.; Schneider, G. Deep learning in drug discovery. Mol. Inform. 2016, 35, 3–14. [Google Scholar] [CrossRef] [PubMed]
- Maia, E.H.B.; Assis, L.C.; De Oliveira, T.A.; Da Silva, A.M.; Taranto, A.G. Structure-based virtual screening: From classical to artificial intelligence. Front. Chem. 2020, 8, 343. [Google Scholar] [CrossRef] [PubMed]
- Patel, L.; Shukla, T.; Huang, X.; Ussery, D.W.; Wang, S. Machine learning methods in drug discovery. Molecules 2020, 25, 5277. [Google Scholar] [CrossRef] [PubMed]
- Pérez-Gandía, C.; García-Sáez, G.; Subías, D.; Rodríguez-Herrero, A.; Gómez, E.J.; Rigla, M.; Hernando, M.E. Decision support in diabetes care: The challenge of supporting patients in their daily living using a mobile glucose predictor. J. Diabetes Sci. Technol. 2018, 12, 243–250. [Google Scholar] [CrossRef]
- Lee, Y.; Ragguett, R.M.; Mansur, R.B.; Boutilier, J.J.; Rosenblat, J.D.; Trevizol, A.; Brietzke, E.; Lin, K.; Pan, Z.; Subramaniapillai, M.; et al. Applications of machine learning algorithms to predict therapeutic outcomes in depression: A meta-analysis and systematic review. J. Affect. Disord. 2018, 241, 519–532. [Google Scholar] [CrossRef]
- Misawa, M.; Kudo, S.E.; Mori, Y.; Cho, T.; Kataoka, S.; Yamauchi, A.; Ogawa, Y.; Maeda, Y.; Takeda, K.; Ichimasa, K.; et al. Artificial intelligence-assisted polyp detection for colonoscopy: Initial experience. Gastroenterology 2018, 154, 2027–2029. [Google Scholar] [CrossRef]
- Ichimasa, K.; Kudo, S.E.; Mori, Y.; Misawa, M.; Matsudaira, S.; Kouyama, Y.; Baba, T.; Hidaka, E.; Wakamura, K.; Hayashi, T.; et al. Artificial intelligence may help in predicting the need for additional surgery after endoscopic resection of T1 colorectal cancer. Endoscopy 2018, 50, 230–240. [Google Scholar]
- Hamet, P.; Tremblay, J. Artificial intelligence in medicine. Metabolism 2017, 69, S36–S40. [Google Scholar] [CrossRef]
- Schork, N.J. Artificial intelligence and personalized medicine. In Precision Medicine in Cancer Therapy; Springer: Cham, Switzerland, 2019; pp. 265–283. [Google Scholar]
- Khan, O.; Badhiwala, J.H.; Grasso, G.; Fehlings, M.G. Use of machine learning and artificial intelligence to drive personalized medicine approaches for spine care. World Neurosurg. 2020, 140, 512–518. [Google Scholar] [CrossRef]
- Handelman, G.S.; Kok, H.K.; Chandra, R.V.; Razavi, A.H.; Lee, M.J.; Asadi, H. eDoctor: Machine learning and the future of medicine. J. Intern. Med. 2018, 284, 603–619. [Google Scholar] [CrossRef] [PubMed]
- Bahri, M.; Salutari, F.; Putina, A.; Sozio, M. AutoML: State of the art with a focus on anomaly detection, challenges, and research directions. Int. J. Data Sci. Anal. 2022, 14, 113–126. [Google Scholar] [CrossRef]
- Alsharef, A.; Aggarwal, K.; Sonia; Kumar, M.; Mishra, A. Review of ML and AutoML solutions to forecast time-series data. Arch. Comput. Methods Eng. 2022, 29, 5297–5311. [Google Scholar] [CrossRef] [PubMed]
- Karmaker, S.K.; Hassan, M.M.; Smith, M.J.; Xu, L.; Zhai, C.; Veeramachaneni, K. Automl to date and beyond: Challenges and opportunities. ACM Comput. Surv. (CSUR) 2021, 54, 1–36. [Google Scholar] [CrossRef]
- Thiyagalingam, J.; Shankar, M.; Fox, G.; Hey, T. Scientific machine learning benchmarks. Nat. Rev. Phys. 2022, 4, 413–420. [Google Scholar] [CrossRef]
- Truong, A.; Walters, A.; Goodsitt, J.; Hines, K.; Bruss, C.B.; Farivar, R. Towards automated machine learning: Evaluation and comparison of AutoML approaches and tools. In Proceedings of the 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI), Portland, OR, USA, 4–6 November 2019; pp. 1471–1479. [Google Scholar]
- LeDell, E.; Poirier, S. H2O AutoML: Scalable Automatic Machine Learning. In Proceedings of the 7th ICML Workshop on Automated Machine Learning (AutoML); ICML: San Diego, CA, USA, 2020. [Google Scholar]
- Feurer, M.; Klein, A.; Eggensperger, K.; Springenberg, J.; Blum, M.; Hutter, F. Efficient and Robust Automated Machine Learning. Adv. Neural Inf. Process. Syst. 2015, 28, 2962–2970. [Google Scholar]
- Real, E.; Liang, C.; So, D.; Le, Q. Automl-zero: Evolving machine learning algorithms from scratch. Int. Conf. Mach. Learn. 2020, 119, 8007–8019. [Google Scholar]
- Wang, C.; Wu, Q.; Weimer, M.; Zhu, E. Flaml: A fast and lightweight automl library. Proc. Mach. Learn. Syst. 2021, 3, 434–447. [Google Scholar]
- Ferreira, L.; Pilastri, A.; Martins, C.M.; Pires, P.M.; Cortez, P. A Comparison of AutoML Tools for Machine Learning, Deep Learning and XGBoost. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; pp. 1–8. [Google Scholar] [CrossRef]
- Erickson, N.; Mueller, J.; Shirkov, A.; Zhang, H.; Larroy, P.; Li, M.; Smola, A. Autogluon-tabular: Robust and accurate automl for structured data. arXiv 2020, arXiv:2003.06505. [Google Scholar]
- Team, T. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. 2015. Available online: https://tensorflow.org/ (accessed on 10 November 2024).
- O’Malley, T.; Bursztein, E.; Long, J.; Chollet, F.; Jin, H.; Invernizzi, L. KerasTuner. 2019. Available online: https://github.com/keras-team/keras-tuner (accessed on 10 November 2024).
- Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
- Molnar, C. Python Implementation of the Rulefit Algorithm. Available online: https://github.com/christophM/rulefit (accessed on 21 September 2024).
- Imoscovitz. Ruleset Covering Algorithms for Transparent Machine Learning. Available online: https://github.com/imoscovitz/wittgenstein (accessed on 21 September 2024).
- xgboost. XGBoost Documentation. Available online: https://xgboost.readthedocs.io/en/stable (accessed on 21 September 2024).
- King, G.; Zeng, L. Logistic regression in rare events data. Political Anal. 2001, 9, 137–163. [Google Scholar] [CrossRef]
- Menardi, G.; Torelli, N. Training and assessing classification rules with imbalanced data. Data Min. Knowl. Discov. 2014, 28, 92–122. [Google Scholar] [CrossRef]
- He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, 1–6 June 2008; pp. 1322–1328. [Google Scholar]
- Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
- Singh, D.; Singh, B. Investigating the impact of data normalization on classification performance. Appl. Soft Comput. 2020, 97, 105524. [Google Scholar] [CrossRef]
- Doshi-Velez, F.; Kim, B. Towards a rigorous science of interpretable machine learning. arXiv 2017, arXiv:1702.08608. [Google Scholar]
- Carter, A.; Imtiaz, S.; Naterer, G. Review of interpretable machine learning for process industries. Process Saf. Environ. Prot. 2023, 170, 647–659. [Google Scholar] [CrossRef]
- Lipton, Z.C. The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue 2018, 16, 31–57. [Google Scholar] [CrossRef]
- Gilpin, L.H.; Bau, D.; Yuan, B.Z.; Bajwa, A.; Specter, M.; Kagal, L. Explaining explanations: An overview of interpretability of machine learning. In Proceedings of the 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), Turin, Italy, 1–3 October 2018; pp. 80–89. [Google Scholar]
- ALIBI EXPLAIN, Version 0.9.5 Accumulated Local Effects. Available online: https://docs.seldon.io/projects/alibi/en/stable/methods/ALE.html (accessed on 21 September 2024).
- Apley, D.; Zhu, J. Visualizing the effects of predictor variables in black box supervised learning models. J. R. Stat. Soc. Ser. B Stat. Methodol. 2020, 82, 1059–1086. [Google Scholar] [CrossRef]
- Ribeiro, M.T.; Singh, S.; Guestrin, C. Anchors: High-precision model-agnostic explanations. Proc. AAAI Conf. Artif. Intell. 2018, 32, 1527–1535. [Google Scholar] [CrossRef]
- Mothilal, R.K.; Sharma, A.; Tan, C. Explaining machine learning classifiers through diverse counterfactual explanations. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, Barcelona, Spain, 27–30 January 2020; pp. 607–617. [Google Scholar]
- ALIBI EXPLAIN, Version 0.9.5 Integrated Gradients. Available online: https://docs.seldon.io/projects/alibi/en/latest/methods/IntegratedGradients.html (accessed on 21 September 2024).
- Ribeiro, M.T. Lime: Explaining the Predictions of Any Machine Learning Classifier. Available online: https://github.com/marcotcr/lime (accessed on 21 September 2024).
- Scikit-learn. Partial Dependence and Individual Conditional Expectation Plots. Available online: https://scikit-learn.org/stable/modules/partial_dependence.html (accessed on 21 September 2024).
- Scikit-learn. Permutation Feature Importance. Available online: https://scikit-learn.org/stable/modules/permutation_importance.html (accessed on 21 September 2024).
- Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
- SHAP. SHAP Documentation. Available online: https://shap.readthedocs.io/en/latest/index.html (accessed on 21 September 2024).
- Lundberg, S. A unified approach to interpreting model predictions. arXiv 2017, arXiv:1705.07874. [Google Scholar]
- Krishna, S.; Han, T.; Gu, A.; Pombra, J.; Jabbari, S.; Wu, S.; Lakkaraju, H. The disagreement problem in explainable machine learning: A practitioner’s perspective. arXiv 2022, arXiv:2202.01602. [Google Scholar]
- Kurtzer, G.M.; Sochat, V.; Bauer, M.W. Singularity: Scientific containers for mobility of compute. PLoS ONE 2017, 12, e0177459. [Google Scholar] [CrossRef] [PubMed]
- Merkel, D. Docker: Lightweight linux containers for consistent development and deployment. Linux J. 2014, 239, 2. [Google Scholar]
- Hu, G.; Zhang, Y.; Chen, W. Exploring the performance of singularity for high performance computing scenarios. In Proceedings of the 2019 IEEE 21st International Conference on High Performance Computing and Communications; IEEE 17th International Conference on Smart City; IEEE 5th International Conference on Data Science and Systems (HPCC/SmartCity/DSS), Zhangjiajie, China, 10–12 August 2019; pp. 2587–2593. [Google Scholar]
- Kelly, M.; Longjohn, R.; Nottingham, K. The UCI Machine Learning Repository. Available online: https://archive.ics.uci.edu (accessed on 10 November 2024).
- Fernandes, K.; Cardoso, J.; Fernandes, J. Cervical Cancer (Risk Factors). Available online: https://archive.ics.uci.edu/dataset/383/cervical+cancer+risk+factors (accessed on 10 November 2024). [CrossRef]
- Hopkins, M.; Reeber, E.; Forman, G.; Suermondt, J. Spambase. Available online: https://archive.ics.uci.edu/dataset/94/spambase (accessed on 10 November 2024). [CrossRef]
- Aeberhard, S.; Forina, M. Wine. Available online: https://archive.ics.uci.edu/dataset/109/wine (accessed on 10 November 2024). [CrossRef]
- Redmond, M. Communities and Crime. Available online: https://archive.ics.uci.edu/dataset/183/communities+and+crime (accessed on 10 November 2024). [CrossRef]
- Espinase Nandorfy, D.; Watson, F.; Likos, D.; Siebert, T.; Bindon, K.; Kassara, S.; Shellie, R.; Keast, R.; Francis, I. Influence of amino acids, and their interaction with volatiles and polyphenols, on the sensory properties of red wine. Aust. J. Grape Wine Res. 2022, 28, 621–637. [Google Scholar] [CrossRef]
- Pérez-Sánchez, H.; Banegas-Luna, A.J. 164. SIBILA: Investigación y Desarrollo en Aprendizaje Máquina Interpretable Mediante Supercomputación para la Medicina Personalizada [Audio Podcast]. In Investigando la Investigación. Spotify. Available online: https://open.spotify.com/episode/3oRXe7PLpCeK86AT3izn7W (accessed on 10 November 2024).
Model | Name | Libraries | Class./Reg. | Ref. |
---|---|---|---|---|
ANN | Artificial Neural Network | Tensorflow 2, Keras Tuner | Both | [34,35] |
BAG | Bagging | scikit-learn | Both | [36] |
DT | Decision Tree | scikit-learn | Both | [36] |
LR | Linear/Logistic Regression | scikit-learn | Both | [36] |
KNN | K-Nearest Neighbours | scikit-learn | Both | [36] |
RF | Random Forest | scikit-learn | Both | [36] |
RLF | RuleFit | rulefit | Classification | [37] |
RP | Repeated Incremental Pruning to Produce Error Reduction | wittgenstein | Classification | [38] |
SVM | Support Vector Machine | scikit-learn | Both | [36] |
XGBOOST | eXtreme Gradient Boosting Machine | xgboost | Both | [39] |
Problem | Metrics |
---|---|
Classification | Accuracy, Area Under the Curve (AUC), Confusion Matrix, F1 Score, Matthews Correlation Coefficient (MCC), Precision, Recall, Specificity |
Regression | Coefficient of Determination (R²), Mean Absolute Error (MAE), Pearson Coefficient, Root Mean Squared Error (RMSE) |
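For reference, the classification metrics listed above can be computed with scikit-learn as sketched below; this is illustrative only (the toy labels are made up), and specificity is derived from the confusion matrix because scikit-learn provides no dedicated function for it:

```python
# Sketch: computing the classification metrics listed above with scikit-learn.
# y_true / y_pred / y_score would normally come from a fitted binary classifier.
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             matthews_corrcoef, precision_score, recall_score,
                             roc_auc_score)

y_true = np.array([0, 0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 1, 1, 0, 0])
y_score = np.array([0.2, 0.6, 0.9, 0.7, 0.1, 0.4])  # predicted probabilities

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
metrics = {
    "Accuracy": accuracy_score(y_true, y_pred),
    "AUC": roc_auc_score(y_true, y_score),
    "F1 Score": f1_score(y_true, y_pred),
    "MCC": matthews_corrcoef(y_true, y_pred),
    "Precision": precision_score(y_true, y_pred),
    "Recall": recall_score(y_true, y_pred),
    "Specificity": tn / (tn + fp),  # not provided directly by scikit-learn
}
print(metrics)
```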
Algorithm | Library | Ref. |
---|---|---|
Accumulated Local Effects (ALE) | alibi | [49,50] |
Anchors (Scopes rules) | alibi | [51] |
Diverse Counterfactual Explanations (DiCE) | dice-ml | [52] |
Integrated Gradients | alibi | [53] |
Local Interpretable Model-Agnostic Explanations (LIME) | lime | [54] |
Partial Dependence Plots (PDP) + Individual Conditional Expectation (ICE) | scikit-learn | [55] |
Permutation Importance | scikit-learn | [56] |
Random Forest Feature Importance | scikit-learn | [57] |
Shapley Values | shap | [58,59] |
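As a brief illustration of two of the listed methods, the sketch below applies permutation importance and Shapley values to a Random Forest fitted on a public dataset; the model and dataset are arbitrary and this is not SIBILA's implementation:

```python
# Sketch: permutation importance (scikit-learn) and Shapley values (shap)
# applied to a fitted Random Forest.  Illustrative only.
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Permutation importance: drop in score when each feature is shuffled.
perm = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print("Permutation importances:", perm.importances_mean)

# Shapley values: per-sample, per-feature attributions (TreeExplainer is the
# efficient path for tree-based models).
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
```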
Dataset | Description | Task | Samples | Features | Ref. |
---|---|---|---|---|---|
Cancer | Prediction of indicators of cervical cancer | Binary classification | 858 | 40 | [65] |
Spam | Identification of spam emails | Binary classification | 4601 | 57 | [66] |
Wine | Classification of three types of Italian wine | Multiclass classification | 178 | 13 | [67] |
Crime | Prediction of the number of crimes in the USA | Regression | 1994 | 4091 | [68] |
Dataset | Task | Model | Specificity | Precision | Recall | AUC |
---|---|---|---|---|---|---|
Cancer | Binary classification | XGB | 100.000 | 100.000 | 100.000 | 1.000 |
Spam | Binary classification | ANN | 95.390 | 92.818 | 94.118 | 0.948 |
Wine | Multiclass classification | XGB | 33.095 | 97.619 | 97.619 | 0.536 |
Dataset | Task | Model | MAE | MSE | RMSE | R² |
---|---|---|---|---|---|---|
Crime | Regression | RF | 0.087 | 0.018 | 0.133 | 0.628 |
Cancer | | Spam | | Wine | | Crime | |
---|---|---|---|---|---|---|---|
Feature | Attribution | Feature | Attribution | Feature | Attribution | Feature | Attribution |
Dx:HPV | 0.129 | Word_freq_remove | 0.067 | Proline | 0.130 | NumStreet | −0.004 |
Dx | 0.043 | Word_freq_hp | 0.021 | Color intensity | 0.045 | PctKids2Par | −0.003 |
Smokes (years) | 0.004 | Word_freq_free | 0.019 | Flavanoids | 0.035 | racePctWhite | −0.002 |
Hinselmann | 0.001 | Char_freq_$ | 0.018 | Hue | 0.025 | pctWInvInc | −0.001 |
Biopsy | 0.001 | Char_freq_! | 0.015 | Alcohol | 0.017 | PctFam2Par | −0.001 |
Dataset | Task | Training (CPU) | Training (GPU) | Improvement | Interpretation (Sequential) | Interpretation (HPC Cluster) | Improvement |
---|---|---|---|---|---|---|---|
Cancer | Binary clf. | 17.35 | 8.64 | 50.20% | 2267.10 | 1245.87 | 45.05% |
Spam | Binary clf. | 27.53 | 986.87 | −3484.71% | 12,841.32 | 6257.19 | 51.27% |
Wine | Multiclass clf. | 19.28 | 19.25 | 0.16% | 167.60 | 49.86 | 70.25% |
Crime | Regression | 29.12 | 20.20 | 30.63% | 116,986.10 | 103,696.92 | 11.36% |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).