2. Materials and Methods
The KEGG organizes its pathways in a three-level hierarchy, available at
https://www.genome.jp/kegg-bin/show_brite?br08901.keg (accessed on 1 June 2024). The top level of the hierarchy contains the broad pathway categories, which we will call L1 pathways. The next level contains the pathway subcategories or modules, which we will call L2 pathways. The lowest level contains the individual human-defined pathways, which we will call L3 pathways. The L1 pathways are “Metabolism”, “Genetic Information Processing”, “Environmental Information Processing”, “Cellular Processes”, “Organismal Systems”, “Human Diseases”, and “Drug Development”. While past publications have only considered the L2 pathways under “Metabolism” and, more recently, the L3 metabolic pathways, we used the kegg_pull [18] Python package to download all the pathways in the hierarchy, including the L1 pathways and all the L2 and L3 pathways underneath them. We additionally used kegg_pull to download the KEGG compounds as molfiles and determined the mapping from each compound to its associated pathways using the pathway annotations provided by the KEGG.
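The three hierarchy levels can be recovered from the nested BRITE structure with a short recursive walk. The sketch below is illustrative only: the toy dictionary mimics the nested shape of the br08901 hierarchy, not the actual KEGG download, and the function name is our own.

```python
# Hedged sketch: flatten a BRITE-style nested hierarchy into L1/L2/L3 lists.
# The toy "hierarchy" dict is illustrative; the real br08901 data (retrieved
# via kegg_pull) has the same nested parent/children shape.

def flatten_hierarchy(node, level=0, levels=None):
    """Collect node names per hierarchy depth (L1 = depth 1, ..., L3 = depth 3)."""
    if levels is None:
        levels = {1: [], 2: [], 3: []}
    for child in node.get("children", []):
        if level + 1 in levels:
            levels[level + 1].append(child["name"])
        flatten_hierarchy(child, level + 1, levels)
    return levels

hierarchy = {
    "name": "br08901",
    "children": [
        {"name": "Metabolism", "children": [
            {"name": "Carbohydrate metabolism", "children": [
                {"name": "00010 Glycolysis / Gluconeogenesis", "children": []},
            ]},
        ]},
        {"name": "Human Diseases", "children": []},
    ],
}

levels = flatten_hierarchy(hierarchy)
print(levels[1])  # L1 pathway categories
print(levels[3])  # L3 individual pathways
```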
With the molfiles and pathway mappings, we used the dataset construction method described by Huckvale and Moseley [16]. This includes converting the molfiles to compound features using the atom coloring technique introduced by Jin and Moseley [10], constructing the pathway features from the features of the compounds associated with them, and performing a cross-join of the compound features and the pathway features. Each entry in the resulting dataset is a pair of pathway features and compound features concatenated together, with a binary label indicating whether the corresponding compound is associated with the corresponding pathway. To maximize the validity of the test-set evaluation in the cross-validation (CV) analysis (i.e., to prevent data leakage from the training sets into the test sets), any duplicate compound or pathway feature vectors were removed prior to the cross-join, removing 20 duplicate pathway entries and 97 duplicate compound entries. Considering the significantly larger amount of data, we were motivated to make the data loading used in model training more efficient. While both this work and the work of Huckvale and Moseley [17] use the PyTorch Python package [19] for implementing and training the multi-layer perceptron (MLP) binary classifier, we did not use PyTorch’s built-in data loader in this work. Previously, the built-in data loader retrieved each entry one at a time by selecting the current compound feature vector and pathway feature vector, concatenating them, inserting the resulting vector into the next batch, and loading the batch onto the graphics processing unit (GPU). This batching technique was used because the entire dataset could not fit into the available GPU random access memory (RAM) at once. While the cross-joined dataset could not fit on the GPU, the separate compound features and pathway features could. This enabled us to create our own data loader that samples the compound features and pathway features in batches rather than one at a time and concatenates them on the GPU. With all the data loaded onto the GPU ahead of time and all the batching performed by efficient tensor function calls on the GPU rather than numpy function calls on the central processing unit (CPU), we reduced the training time of the model by more than 20-fold.
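The core of the custom batching is to gather compound and pathway feature rows by index and concatenate them per batch, rather than materializing the full cross-join. The sketch below is a CPU/NumPy analogue with toy array sizes; in the actual implementation the feature arrays are PyTorch tensors resident on the GPU, so the gather and concatenation run as GPU tensor operations.

```python
import numpy as np

# Hedged sketch of the batched pairing: instead of building the full
# cross-join (n_compounds * n_pathways rows) up front, sample index pairs
# per batch and gather + concatenate the feature rows on the fly.
# Sizes here are toy values; the real features come from atom coloring.

rng = np.random.default_rng(0)
compound_feats = rng.random((100, 16))   # 100 compounds x 16 features
pathway_feats = rng.random((20, 8))      # 20 pathways x 8 features

def make_batch(pair_indices):
    """pair_indices: (batch_size, 2) array of (compound_idx, pathway_idx)."""
    c = compound_feats[pair_indices[:, 0]]   # gather compound rows
    p = pathway_feats[pair_indices[:, 1]]    # gather pathway rows
    return np.concatenate([c, p], axis=1)    # one row per (compound, pathway) pair

pairs = rng.integers([100, 20], size=(32, 2))  # 32 random (compound, pathway) pairs
batch = make_batch(pairs)
print(batch.shape)  # (32, 24)
```

Because only the two small feature matrices are stored, the memory footprint stays proportional to n_compounds + n_pathways rather than their product.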
Table S1 shows the difference in computational resource usage between training the final model on the full KEGG dataset with the previous data loading method and with our novel data loading method. We see a stark increase in GPU utilization from 8.8% to 93.7%, accompanied by a stark decrease in the real computation time from 978.2 min to 47.1 min. The speed improvement is over 20-fold, even though the increase in GPU utilization is only 10.6-fold. We suspect that the custom data loader avoids costly GPU wait states associated with transferring data from CPU RAM to GPU RAM, providing better efficiency than one would first expect. These results demonstrate that making better use of the GPU during data loading greatly decreases the model training time.
After constructing the dataset and implementing the novel data loading technique, we tuned the model hyperparameters using the Optuna Python package version 3.3.0 [
20]. With the best hyperparameters selected, we performed an analysis of 200 CV iterations on the entire dataset using stratified train/test splits [
21]. In these CV analyses, we tracked not only the total numbers of true positives, true negatives, false positives, and false negatives over all the dataset entries but also those of each individual compound and pathway. This enabled us not only to construct a confusion matrix and calculate overall metrics, but also to calculate the metrics per compound and per pathway. The metrics calculated included the Matthews correlation coefficient (MCC), accuracy, precision, recall, F1 score, and specificity. To ensure valid (i.e., not undefined due to division by 0) metrics for each compound and pathway, we constructed the confusion matrix of each by summing the true positives, true negatives, false positives, and false negatives across all CV iterations. This manner of calculation precludes obtaining a standard deviation. So, for the full dataset, we calculated the metrics per CV iteration and computed the mean, median, and standard deviation across the CV iterations.
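Pooling the confusion-matrix counts across CV iterations before computing a metric is what avoids undefined values for rarely predicted compounds and pathways. A minimal sketch of the MCC computed from pooled counts (the per-iteration counts below are toy values, not results from this work):

```python
import math

# Hedged sketch: sum TP/TN/FP/FN across CV iterations first, then compute
# the Matthews correlation coefficient (MCC) once from the pooled counts,
# so no single iteration with zero positives can make the metric undefined.

cv_counts = [
    {"tp": 8, "tn": 90, "fp": 1, "fn": 2},
    {"tp": 7, "tn": 88, "fp": 3, "fn": 1},
    {"tp": 9, "tn": 91, "fp": 0, "fn": 3},
]

tp = sum(c["tp"] for c in cv_counts)
tn = sum(c["tn"] for c in cv_counts)
fp = sum(c["fp"] for c in cv_counts)
fn = sum(c["fn"] for c in cv_counts)

denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
mcc = (tp * tn - fp * fn) / denom if denom else 0.0
print(mcc)
```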
To determine the impact of the chemical information content of the compounds and pathways on the model performance, we additionally created filtered datasets (from a preliminary dataset constructed prior to de-duplicating the pathway and compound entries). First, we created 15 datasets, each with a successively higher compound filter threshold. The filtering was based on the number of non-hydrogen atoms in a compound, and compounds were removed from the training set if their number of non-hydrogen atoms did not meet the filter threshold.
Figure S1 shows how the number of compounds and entries in the dataset decreases as the compound filter cutoff increases.
We performed a CV analysis of 50 CV iterations on each filtered dataset. We then filtered by pathway size, defining the pathway size as the sum of the number of non-hydrogen atoms across all the compounds associated with the pathway. The pathway filter thresholds ranged from 5 to 50 by multiples of 5, from 50 to 100 by multiples of 10, and from 100 to 200 by multiples of 20, for a total of 20 pathway filters. We then performed 50 CV iterations on the training set of each pathway filter. The motivation was to determine the ideal compound size and pathway size for the full KEGG dataset.
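The two size filters can be sketched together as follows; the sizes, thresholds, and identifiers below are toy values chosen only to illustrate the definitions (compound size = non-hydrogen atom count; pathway size = sum of compound sizes over the pathway's associated compounds).

```python
# Hedged sketch of the two size filters used to build the filtered datasets.
# Compound size is the number of non-hydrogen atoms; pathway size is the sum
# of compound sizes over the compounds associated with that pathway.
# All data below are toy values.

compound_size = {"C1": 2, "C2": 6, "C3": 12}        # non-hydrogen atom counts
pathway_compounds = {"P1": ["C1", "C2"], "P2": ["C2", "C3"]}

def filter_compounds(min_atoms):
    """Keep compounds whose non-hydrogen atom count meets the threshold."""
    return {c for c, n in compound_size.items() if n >= min_atoms}

def pathway_size(pathway):
    return sum(compound_size[c] for c in pathway_compounds[pathway])

def filter_pathways(min_size):
    """Keep pathways whose total size meets the threshold."""
    return {p for p in pathway_compounds if pathway_size(p) >= min_size}

print(sorted(filter_compounds(5)))   # compounds with >= 5 non-hydrogen atoms
print(sorted(filter_pathways(10)))   # pathways with size >= 10
```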
Figure S2 shows how the number of pathways and entries in the dataset changes as the pathway filter increases.
Figure S3 shows scatter plots of the thresholds used to filter each training set (see
Figures S1 and S2) and compares the thresholds to the MCCs of the top compounds in
Figure S3a and the top pathways in
Figure S3b. For consistent comparison, the top compounds are those remaining in the dataset at the highest compound size filter threshold (i.e., 15) and the top pathways are those remaining after the highest pathway filter threshold (i.e., 200). We cannot justify removing data from the dataset merely because the overall score is higher, since a higher overall score can result simply from removing the smaller compounds and pathways, which may be more difficult to predict. But if the smaller compounds and pathways negatively impact the predictions for the larger compounds and pathways, then removing them would be justified. However,
Table 1 shows that these Pearson correlation coefficients are very close to zero, even though their
p-values are statistically significant. Thus, these relationships are real, but very weak. Due to these negligible correlations, we decided to retain all the compounds and pathways for the final model training and evaluation. The data and results of this preliminary analysis can be found in the following Figshare item:
https://figshare.com/articles/journal_contribution/FullKEGG_Preliminary_DO_NOT_USE/27173037 (accessed on 4 October 2024).
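The correlation check amounts to computing a Pearson coefficient between filter thresholds and the MCCs of the top compounds or pathways. A NumPy-only sketch with toy values (the p-values reported in Table 1 would come from a routine such as scipy.stats.pearsonr, which np.corrcoef does not provide):

```python
import numpy as np

# Hedged sketch: Pearson correlation between filter thresholds and MCCs of
# the top entities. Values below are toy data, not the study's results; they
# merely illustrate a near-zero correlation like the ones reported in Table 1.

thresholds = np.array([1, 2, 3, 4, 5, 10, 15], dtype=float)
top_mccs = np.array([0.61, 0.60, 0.62, 0.61, 0.60, 0.62, 0.61])

r = np.corrcoef(thresholds, top_mccs)[0, 1]
print(r)  # close to zero: the threshold barely relates to top-entity MCC
```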
In addition to testing the impact of filtering entries by compound and pathway size, we also tested filtering by hierarchy level.
Table 2 shows how the number of entries and number of pathways differ between the full dataset containing the L1, L2, and L3 pathways and two other datasets, i.e., one containing only the L2 and L3 pathways and one containing only the L3 pathways. The number of compounds remains the same regardless of which pathway hierarchy levels are included. We trained on each of these three datasets to test how the inclusion of one hierarchy level impacts the performance on the others. The L2 and L3 as well as the L3-only training sets were evaluated over 50 CV iterations, with the metrics calculated by constructing a confusion matrix summing the true positives, true negatives, false positives, and false negatives across all the included pathways and all CV iterations.
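Constructing the three level-restricted datasets is a simple filter on the pathway's hierarchy level; only pathway-side entries drop out, which is why the compound count is unchanged. A sketch with toy entries (pathway names and levels are illustrative):

```python
# Hedged sketch: build the three training datasets by pathway hierarchy level.
# Each entry pairs one pathway with one compound; the toy levels below mirror
# the L1/L2/L3 structure but are not the actual KEGG data.

entries = [
    {"pathway": "Metabolism",              "level": 1, "compound": "C1"},
    {"pathway": "Carbohydrate metabolism", "level": 2, "compound": "C1"},
    {"pathway": "Glycolysis",              "level": 3, "compound": "C1"},
    {"pathway": "Glycolysis",              "level": 3, "compound": "C2"},
]

def subset(levels):
    """Keep only entries whose pathway sits at one of the given levels."""
    return [e for e in entries if e["level"] in levels]

full = subset({1, 2, 3})     # L1 + L2 + L3
l2_l3 = subset({2, 3})       # L2 + L3 only
l3_only = subset({3})        # L3 only

# Entry counts shrink with each restriction, but the set of compounds does not.
print(len(full), len(l2_l3), len(l3_only))
```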
The hardware used for this work included machines with up to 2 terabytes (TB) of RAM and CPUs with a 3.8 gigahertz (GHz) clock speed. The CPU model was “Intel(R) Xeon(R) Platinum 8480CL”, sourced from the Intel Corporation in Santa Clara, California, USA. The GPUs used had 81.56 gigabytes (GB) of GPU RAM; the GPU model was “NVIDIA H100 80GB HBM3”, sourced from the NVIDIA Corporation in Santa Clara, California, USA.
All code for this work was written in version 3 of the Python programming language [
22]. Data processing and storage were performed using the Pandas version 1.0.3 [
23], NumPy version 1.26.4 [
24], and H5Py version 3.9.0 [
25] packages. Models were constructed and trained using the PyTorch Lightning package version 2.2.1 [
26] built on the PyTorch package version 2.0.1 [
19]. The metrics and the stratified train–test splits were computed using the Sci-Kit Learn package version 1.3.0 [
27]. The results were stored in an SQL [
28] database using the DuckDB package version 1.0.0 [
29]. Data visualizations were produced using the Tableau business intelligence software version 2024.2.2 [
30] as well as the seaborn package version 0.12.2 [
31] built on the MatPlotLib package version 3.7.2 [
32]. The results were finalized in a Jupyter notebook [
33]. The computational resource usage when training the final model was collected using the gpu_tracker package version 3.0.0 [
34]. All code and data for reproducing these analyses can be accessed via the following Figshare item:
https://doi.org/10.6084/m9.figshare.27172941 (accessed on 21 October 2024).