1. Introduction
Precision therapy attempts to optimize the treatment of diseases for specific patient sub-populations defined by clinical biomarkers [1]. It is especially promising in the context of cancer therapy, where patient responses to therapy can vary considerably and often lack clinical durability, even with the advent of single-targeted therapies that address specific tumor genomic amplifications and disrupt oncogenic signaling processes [2]. While patients often respond to targeted therapy initially, resistance commonly arises from heterogeneity in the tumor cell population, which allows for the adaptive selection of sub-populations that are not affected by the drug, or from cells effectively rewiring their signaling machinery to bypass the drug inhibition [3]. This paradigm has facilitated the acceptance of targeted polypharmacology and combination therapy strategies to address drug resistance and reduce recurrence in cancer patients [4].
The success of small-molecule kinase inhibitors in cancer therapy is evident, with over 70 FDA-approved drugs in this class [5,6]. While receptor tyrosine- and cyclin-dependent kinases have been established as target groups, it is increasingly recognized that a much larger proportion of the human kinome is likely therapeutically relevant [7]. Beyond cancer, kinase drug discovery programs now span a wide range of targets and disease areas [8]. Despite their efficacy, kinase inhibitor drugs often exhibit limited response rates and short-lived effectiveness, necessitating precision therapeutic approaches informed by the integration of clinical data, such as drug responses and cancer multi-omics characterization, to learn which features of these molecules are likely to provide the best outcome [9,10,11].
In silico methods, such as ligand-based virtual screening, have been applied to kinase activity modeling and have shown that the utilization of known kinase small-molecule topological and bioactivity information can lead to the enrichment of novel kinase active compounds [12]. The advantage of ligand-based virtual screening is that compounds can be readily evaluated for interactions across all kinases for which applicable bioactivity data are available. If such models were sufficiently accurate, they could support the prioritization of compounds with a desirable polypharmacology profile while deprioritizing those compounds that may inhibit kinases that are therapeutically less relevant or lead to toxicity in a specific disease or patient [13,14].
Classification approaches for ligand-based virtual screening, also called target prediction or target fishing, include a variety of single-task machine learning algorithms, such as logistic regression, random forests, support vector machines, and naïve Bayesian classifiers, that aim to determine for each kinase target individually whether a compound belongs in the active or inactive class [15,16,17]. Although these single-task methods perform relatively well in many instances, they do not take into consideration the membership of molecules in multiple classes and therefore do not adaptively learn across different categories, limiting their applicability to predicting profiles. To achieve a better performance, machine learning methods must combine diverse sources of bioactivity data across multiple targets [18]. This is especially relevant for kinases, where there is a large degree of similarity across many different kinases and their inhibitors [19].
Motivated by previous advancements in multi-task neural network architectures, we conducted an evaluation of multi-task deep neural networks (MTDNN) for kinome-wide small-molecule activity prediction, comparing their performance to single-task methods [20]. We compiled an extensive dataset comprising over 650,000 aggregated bioactivity annotations for more than 300,000 unique small molecules, covering 342 kinase targets. We evaluated the machine learning methods using reported actives and reported inactives as well as using the reported actives and considering all other compounds in the global dataset as inactives. Our results indicate that multi-task deep learning results in a substantially better predictive performance compared to single-task machine learning methods using various cross-validation strategies. Additionally, the performance of the multi-task method continues to improve with the addition of more data, whereas single-task methods tend to plateau or diminish in performance. Our work extends the contribution of prior studies, which employed deep learning techniques for broad target classification and specific benchmarking datasets [19,21,22,23,24,25,26,27], by evaluating MTDNN on a much larger dataset than previously explored, utilizing real-world datasets from both public and private sources. Through thorough characterization and evaluation, our study provides practical insights into the applicability of kinome-wide multi-task activity predictors for virtual profiling and compound prioritization.
3. Discussion
In this study, we investigated the MTDNN approach for one of the largest protein families of drug targets, kinases, which are particularly relevant in cancer. We leveraged large amounts of available public and private small-molecule activity data, modeled the data in different ways, and compared deep learning and multi-task learning to classical single-task machine learning methodologies. Our analysis involved exploring the applicability of heterogeneous and sparse datasets of small-molecule kinase inhibitors to develop such models and how the various characteristics of these datasets impact model performance. The applicability and reusability of large public datasets are of considerable interest because of the wide diversity of screening technologies, data processing pipelines, and available metadata annotations [31]. These datasets encompass a diverse array of factors, including assay methods, detection technologies, assay kits, reagents, experimental conditions, normalization, curve fitting, and more, as described in the BioAssay Ontology [33]. It is worth noting that these details often vary considerably among the tens of thousands of protocols and thousands of laboratories that generated these datasets. In many cases, these details are not available in a standardized format and must be manually curated from publications or patents.
We compare the performance of this MTDNN method against more traditional single-task machine learning approaches in the scope of virtual screening. While it is ideal to build models only from confirmed actives and inactives (KA–KI approach), the published data extracted from journal articles and patents that underlie our datasets exhibit a strong bias towards reporting active compounds, while inactive compounds are typically not considered interesting. In practice, as evident for example in experimental high-throughput screening (HTS), even of focused libraries, most compounds are inactive. Therefore, the KA–KI approach, while highly predictive in many contexts, is of limited practical relevance for the virtual screening of very large datasets. Treating all compounds tested against unique protein kinases without reported biological activity as presumed inactives (KA–PI approach) may introduce some errors, but these are likely not overwhelming, because active compounds are typically screened against the most similar likely off-targets and the activity (including a lack thereof) is typically reported to characterize selectivity. Any errors introduced by the KA–PI assumption would be expected to degrade model performance; therefore, the reported results can be considered conservative in the context of these assumptions.
Our results demonstrated a consistent and reliable predictive performance across most kinase tasks when a sufficient number of active compounds or structure–activity data points were available. The MTDNN exhibited a strong performance across all kinase groups, which vary considerably with respect to the available data. Kinases within the same group exhibit a higher sequence similarity to one another than kinases across different groups [34]. The MTDNN demonstrates its ability to effectively leverage information from similar tasks, improving predictions for all tasks, including those that are underrepresented, by capturing shared knowledge and leveraging the similarities between closely related kinase groups. This is a capability that single-task methods cannot achieve. This characteristic of MTDNN is likely advantageous in the case of kinase-focused datasets, where the pronounced similarities of kinase ATP-binding sites and the well-established cross-activity of many small-molecule kinase inhibitors come into play.
Similarity in molecular structure plays an important role in the performance of MTDNNs [35]. Our investigation highlighted that the MTDNN performs particularly well for kinase tasks that share similar compounds with other kinase sets. One of the key factors contributing to the superior performance of MTDNN could be its ability to adaptively learn molecular feature representations for active compounds across different classes, allowing the model to better learn distinctions among all compounds. The specific reasons for this performance remain elusive, emphasizing the need for further investigation into these observations.
The primary goal of this study was to explore the applicability of this method for retrieving small numbers of actual actives from a very large pool of inactives through virtual screening. This exploration involved leveraging diverse, target family-focused datasets of small-molecule kinase inhibitors. Importantly, our focus was not on the perfect classification of tested compounds into active and inactive classes. In this context, the enrichment factor (EF) stood out among the available metrics as the best choice for evaluating model performance. The EF measures the ratio of the proportion of active compounds found within a very small fraction of the predicted most likely actives in the library to the proportion of active compounds in the overall library. In other words, it quantifies how efficiently a model enriches the active compounds at the top of the ranked list. When conducting virtual screening, the primary objective is to identify a small subset of compounds from a large library that is most likely to be active against a specific target. In this context, the EF provides a direct measurement of how well a model performs in prioritizing potentially active compounds. Enrichment is essentially a measurement of precision and recall for the initial small fraction considered. It is a useful measurement for highly imbalanced datasets and is particularly practical and established in virtual screening. The relative enrichment factor normalizes absolute enrichment by the maximum possible enrichment, which depends on the ratio of active vs. inactive compounds, and can be used to compare across different datasets. By considering unknown activity for any kinase as inactive, our approach can be considered practically relevant, as inactives vastly outnumber actives and many inactives are highly similar to actives; in some cases, presumed inactives may be active, leading to a degraded performance and more conservative results.
In this study, we utilized a pActivity threshold of 6 (corresponding to 1 μM) to distinguish between active and inactive inhibitors. The rationale behind selecting this specific cut-off value primarily originates from our comparison to previous high-throughput screening hit-finding campaigns, where hits weaker than 1 μM are often deprioritized due to assay artifacts and solubility concerns. Upon analyzing the model performance across different cut-off values (pActivity of 7, 8, and 9) in conjunction with the threshold of 6 (Figures S11 and S12), we found a modest improvement at a cutoff of 7 (100 nM). However, cutoffs of 8 (10 nM) and 9 (1 nM) did not yield a better performance. This may be attributed to insufficient data at cutoffs 8 and 9, whereas cutoff 7 may be close to optimal for minimizing false negatives in the training data while still keeping a substantial number of actives. The optimal active classification threshold requires careful consideration, and future investigations are essential to understand the trade-offs and limitations associated with varying thresholds, considering both the model performance and practical implications.
While the MTDNN performed well across the diverse small-molecule datasets investigated and appears applicable for virtual screening, limitations in predicting activities of highly dissimilar compounds, as observed in the IDG-DREAM Challenge datasets, have to be acknowledged. Specifically, the disparity in chemical structures between the training dataset and the IDG kinase tasks posed challenges to the model’s ability to make accurate predictions. This highlights the need for further studies and improvements to enhance the performance in scenarios where substantial differences exist in chemical structures between the training dataset and the target datasets. Future studies should explore strategies to address this limitation, such as incorporating additional training data that better represents the chemical space of the target datasets or employing transfer learning techniques to adapt the model to different chemical contexts.
It is also worth noting that, while the MTDNN approach yielded the best results and appeared most generalizable, it does come with considerably higher computational costs. However, with the latest generation of GPUs and as available datasets increase in size and diversity, deep neural networks are becoming increasingly relevant to fulfill the promise of virtual screening. Overall, our study contributes to the growing body of research in the field of virtual screening and underscores the potential of deep neural networks to further advance this area of drug discovery. Continued improvements and refinements in the application of these models will undoubtedly pave the way for more accurate and efficient virtual screening approaches in the future.
4. Methods and Materials
4.1. Data Aggregation
Small-molecule activity data against the human kinome were obtained and curated from ChEMBL (release 21) and Kinase Knowledge Base (KKB) (release Q12016), which are large general and kinase-specific bioactivity data sources.
To remove redundancy and inconsistency in the molecular representations, canonicalization or standardization is required to generate a unique SMILES representation for each molecule. An in-house chemical structure standardization protocol was implemented using Biovia Pipeline Pilot (version 20.1.0.2208). Salts/addends and duplicate fragments were removed so that each structure consisted of only one fragment. Stereochemistry and charges were standardized, acids were protonated and bases deprotonated, and tautomers were canonicalized. Stereochemistry and E/Z geometric configurations were removed to compute extended connectivity fingerprints of length 4 (ECFP4) descriptors (see Section 4.3) for compounds, since these are not differentiated by standard ECFP fingerprints.
After small molecules were preprocessed and normalized, UniProt identifiers were obtained by mapping domain information to UniProt using the Drug Target Ontology (DTO), resulting in 485 unique UniProt IDs. All UniProt IDs were then mapped to corresponding ChEMBL and KKB target IDs. Note that mutants were excluded from the ChEMBL and KKB datasets based on the existence of their variant labels. The bioactivity annotations obtained from ChEMBL were filtered by assay annotation confidence score ≥ 5, and only compounds with activity annotations corresponding to a standard type of Kd, Ki, IC50, or Potency were accepted. The resulting data were then aggregated by chembl_id, standard_type, and UniProt_id. The median standard_value was calculated and then transformed using a –log10-transformation.
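The median-then-negative-log10 aggregation described above can be sketched in a few lines of Python (the helper name `p_activity` is ours, not from the original pipeline, and it assumes measurements reported in nM):

```python
import math
from statistics import median

def p_activity(values_nm):
    """Aggregate replicate measurements (in nM) by their median,
    then convert to pActivity = -log10(molar concentration)."""
    nm = median(values_nm)
    return -math.log10(nm * 1e-9)
```

For example, a compound with replicate IC50 values of 50, 100, and 200 nM aggregates to a median of 100 nM, i.e., a pActivity of 7.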
Commercial kinase bioactivity data from KKB were also obtained, filtered, and aggregated by unique compounds, endpoints, and targets. The overlapping compound, target annotations, and endpoints in ChEMBL and KKB data were aggregated by their median. Compounds were then grouped by canonical SMILES and received an active label for each kinase where an aggregated pActivity value ≥ 6 was observed and received no label otherwise. Kinase labels with <15 unique active compounds were removed, leaving 668,920 measurements (315,604 unique compounds) distributed across 342 unique kinase classes for which to build models. The compounds utilized in this study predominantly represented pre-clinical compounds (>99%), featuring comprehensive bioactivity data sourced from the literature and patents (Figure S1). The aggregated dataset distribution was slightly skewed towards inactive compounds; 47% of the 668,920 training examples were active (about 315 K). The specific data curation procedure and the target and compound overlap before removing kinase labels with <15 unique active compounds are shown in Figure 1.
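The labeling and minimum-actives filtering step can be sketched as follows (a simplified illustration; the function name and record layout are ours):

```python
def active_labels(records, threshold=6.0, min_actives=15):
    """records: iterable of (smiles, kinase, pactivity) tuples.
    A compound is labeled active for a kinase when pActivity >= threshold;
    kinase tasks with fewer than min_actives unique actives are dropped."""
    actives = {}
    for smiles, kinase, pact in records:
        if pact >= threshold:
            actives.setdefault(kinase, set()).add(smiles)
    return {k: v for k, v in actives.items() if len(v) >= min_actives}
```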
4.2. Dataset Processing
The rationale behind gathering targets from public and commercial sources was to amass a collection of data that could leverage the power of both deep learning and multi-task algorithms. Given a set of related tasks, a multi-task network has the potential for higher accuracy as the learner detects latent patterns in the data across all tasks. This makes overfitting less likely and makes features accessible that may not be learnable by a single task.
We built the kinase predictors based on two different methods of handling kinase datasets. One method trained single- and multi-task classifiers from known actives and known inactives (KA–KI), which utilized only the reported active and inactive compounds for each task. The other method built single- and multi-task classifiers using all available data for unique kinase molecules, termed known actives and presumed inactives (KA–PI). The KA–PI models were built by identifying for each kinase task the active compounds and treating the remaining compounds as inactive (decoys). For this study, the KA–PI models were built and evaluated on the KA–PI set, while the KA–KI models were trained on the KA–KI set but evaluated on the larger KA–PI set. Both methods utilized all datasets and were characterized using the metrics introduced below.
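A minimal sketch of the KA–PI labeling logic, assuming a global compound set and per-task sets of reported actives (names are illustrative):

```python
def ka_pi_dataset(all_compounds, actives_by_task):
    """KA-PI labeling: for each kinase task, reported actives get label 1 and
    every other compound in the global set is a presumed inactive (0)."""
    return {task: {c: int(c in actives) for c in all_compounds}
            for task, actives in actives_by_task.items()}
```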
4.3. Small-Molecule Topological Fingerprints (Features)
Extended connectivity fingerprints (ECFP4) were calculated using the RDKit toolkit in Python. The ECFP4 algorithm assigns numeric identifiers to circular atom environments; not all identifiers are retained, as the fingerprints were hashed to a bit length of 1024, so different substructure identifiers can be assigned to the same bit (a collision). Although increasing the number of bits reduces the chance of collisions, there are diminishing returns in the accuracy gains obtainable with longer fingerprints (e.g., 2048-bit or larger). This, together with computational complexity concerns, was the pragmatic reason why we chose 1024-bit ECFP4 fingerprints.
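In RDKit, ECFP4 corresponds to a Morgan fingerprint with radius 2; a minimal sketch of the descriptor calculation (assuming RDKit is installed; the wrapper function is ours):

```python
from rdkit import Chem
from rdkit.Chem import AllChem

def ecfp4(smiles, n_bits=1024):
    """ECFP4 = Morgan fingerprint of radius 2, hashed to a fixed bit length."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)
    return list(fp)
```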
4.4. Cross-Validation Approach
To evaluate the predictive performance of the multi-task models, we implemented three different 5-fold cross-validation strategies: splitting by scaffold, by molecular weight, and randomly. In virtual screening, it is important to consider the chemical diversity of the training and test sets for domain applicability and for evaluating how well the classifier generalizes to new chemotypes. For the scaffold-based cross-validation, we performed hierarchical clustering for all compounds using Biovia Pipeline Pilot (version 20.1.0.2208), specifying approximately 300 clusters with an average of approximately 1000 compounds each. The pairwise Tanimoto similarities were calculated between all cluster centers and visualized to ensure that chemical dissimilarity was sufficient (Figure S6). Each cross-validation held out 1/5 of the scaffolds, and mean performance was calculated. Molecular weight was another distinguishing feature of compounds that could be used to estimate classifier performance. This method aims to keep the classifiers from overfitting on compound size, and molecular weight can also be considered a simple surrogate for how different compounds are. Molecular weight was calculated in Python using RDKit. Compounds were sorted by increasing molecular weight, and 1/5 of the dataset was held out during each training iteration. Randomized 5-fold cross-validation was also performed using a random stratified splitter to split the ChEMBL and KKB aggregated data into train, validation, and test sets of 80%, 10%, and 10%, respectively, for the evaluation of the multi-task model performance. By running random stratified cross-validation within one dataset, all three sets share common compounds and kinase targets.
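The sorted hold-out scheme used for the molecular-weight split can be sketched generically (our own helper, not the code used in the study):

```python
def sorted_kfold(items, key, k=5):
    """Sort items by a property (e.g. molecular weight) and hold out each
    contiguous 1/k slice in turn, yielding (train, test) pairs."""
    order = sorted(items, key=key)
    fold = -(-len(order) // k)  # ceiling division
    for i in range(k):
        test = order[i * fold:(i + 1) * fold]
        train = order[:i * fold] + order[(i + 1) * fold:]
        yield train, test
```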
4.5. Model Construction
4.5.1. Multi-Task Artificial Neural Network Architecture
Neural networks can produce impressive non-linear models for classification, regression, or dimensionality reduction and are applicable in both supervised and unsupervised learning settings. Neural networks take numerical vectors as input and transform them into output vectors through repeated linear and non-linear transformations applied at simpler components called layers. Internal layers form distributed representations of input features that are useful for specific tasks.
More specifically, a multiple-hidden-layer neural network is a vector-valued function of input vectors x, parameterized by weight matrices W_i and bias vectors b_i:

z_i = W_i x_{i−1} + b_i,    x_i = f_i(z_i),

where f_i is a nonlinear activation function, such as the rectified linear unit (ReLU, max[0, z_i]), x_i is the activation of layer i, and z_i is the net input to layer i. After traversing L layers, the final layer activation x_L is passed to a simple linear classifier (in our case, the softmax) that provides the probability that the input vector x has label t:

P(t | x) = exp(w_t^T x_L) / Σ_{m=1}^{M} exp(w_m^T x_L),

where M is the number of possible labels (M = 2 in our case of binary prediction for each task), w_t and w_m are weight vectors, and w^T x is the scalar (dot) product. Our network therefore takes a numerical chemical fingerprint descriptor of size 1024 as input, passes it through one or multiple layers of ReLU hidden units, and produces softmax output neurons for each kinase class or task. Given the known inputs and outputs of our training dataset, we optimized the network parameters θ = ({W_i}, {b_i}) to minimize a defined cost function. For our classification problem, we used the cross-entropy error function for each task:

E = −Σ_{k=1}^{T} Σ_{t=1}^{M} y_{k,t} log P_k(t | x),

where T is the total number of tasks (kinase classes in our implementation) and y_{k,t} is the known label for task k. The training objective was therefore the weighted sum of the cross-entropies over all kinase targets.
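The layer, ReLU, and softmax operations above can be illustrated with a toy pure-Python forward pass (the weights are arbitrary illustrative values, not the trained model):

```python
import math

def relu(z):
    return [max(0.0, v) for v in z]

def dense(x, W, b):
    # z_i = W_i x_{i-1} + b_i
    return [sum(w * xv for w, xv in zip(row, x)) + bv
            for row, bv in zip(W, b)]

def softmax(z):
    m = max(z)  # shift for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

# Toy network: 3 inputs -> 2 ReLU hidden units -> 2-way softmax output.
x = [1.0, 0.0, 1.0]
h = relu(dense(x, W=[[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]], b=[0.0, 0.1]))
p = softmax(dense(h, W=[[1.0, -1.0], [-1.0, 1.0]], b=[0.0, 0.0]))
```

The softmax output is a valid probability distribution over the two class labels of one task.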
The algorithm was implemented in Python using the Keras package with the Theano backend and was run on Nvidia GeForce GTX 1650 GPUs with 32 GB RAM to increase performance. Hyperparameter optimization included adjustments of momentum, batch size, learning rate, decay, number of hidden layers, number of hidden units, dropout rate, and optimization strategy. The best-performing model used a batch size of 128 with two hidden layers of size 2000 × 500, a dropout rate of 25% across each hidden layer, and a learning rate of 0.0003 for stochastic gradient descent learning. These initially tuned hyperparameters were consistently applied throughout the study, revealing a commendable level of stability and robustness across various data split strategies. Model training time varied from 1 day for all kinase classes to 2 h for the 100 kinases with the most active compounds.
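A sketch of this architecture in the modern TensorFlow/Keras functional API (the study used Keras with a Theano backend, so this is a re-expression, not the original code; the function name and task count are illustrative, while the layer sizes, dropout, and learning rate follow the text):

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_mtdnn(n_tasks, n_bits=1024):
    """Shared trunk with one 2-neuron softmax head per kinase task."""
    inp = keras.Input(shape=(n_bits,))
    h = layers.Dense(2000, activation="relu")(inp)
    h = layers.Dropout(0.25)(h)
    h = layers.Dense(500, activation="relu")(h)
    h = layers.Dropout(0.25)(h)
    outs = [layers.Dense(2, activation="softmax", name=f"task_{i}")(h)
            for i in range(n_tasks)]
    model = keras.Model(inp, outs)
    model.compile(optimizer=keras.optimizers.SGD(learning_rate=3e-4),
                  loss="sparse_categorical_crossentropy")
    return model
```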
4.5.2. Single-Task Deep Neural Networks
A single-task deep neural network (STDNN) model architecture is designed to address a specific task or singular objective. The key distinction lies in the number of output tasks: an MTDNN has multiple output tasks, each consisting of two neurons representing whether a compound is active against the target or not, whereas an STDNN has a single output task, emphasizing its focus on an individual target. The same set of hyperparameters used for the MTDNN was applied for the STDNNs. All STDNNs were executed on the Pegasus supercomputer at the University of Miami (https://idsc.miami.edu/pegasus/).
4.5.3. Other Single-Task Methods
All single-task methods were implemented in Python using the scikit-learn machine learning library. Methods included logistic regression, random forests, and Bernoulli naïve Bayes binary classifiers. Each method was implemented on the Pegasus supercomputer using stratified 5-fold cross-validation strategies and both KA–KI and KA–PI datasets, but training each class individually.
4.5.4. Multi-Task Non-DL Approach
A multi-task random forest was implemented in Python as the multi-task non-DL method to compare against the MTDNN model. The aggregated ChEMBL and KKB data were used to train the model, and the randomized cross-validation strategy was used to split the dataset into 5 folds, performing the fitting procedure five times in total, with each fit using 90% of the data as the train set and the remaining 10% as the test set. The random forest fitted a total of 100 decision tree classifiers on the train set. A two-sample t-test was used to compare the performance of this random forest model with that of the MTDNN based on their ROC scores.
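A toy sketch of a multi-task random forest with scikit-learn, which fits multi-output label matrices natively (the data here are random placeholders standing in for fingerprints and kinase labels, not the study's dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 32)).astype(float)  # toy fingerprints
Y = rng.integers(0, 2, size=(200, 3))                 # 3 toy kinase tasks

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.1, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_tr, Y_tr)              # multi-output fit: one label column per task
probs = clf.predict_proba(X_te)  # list of per-task probability arrays
```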
4.6. Metrics and Model Evaluation
To adequately evaluate the machine learning models, we used a variety of metrics. The commonly used receiver operating characteristic (ROC) classification metric is defined as the true positive rate (TPR) as a function of the false positive rate (FPR). The TPR is the number of actives at a given rank position divided by the total number of actives, and the FPR is the number of inactives at a given rank position divided by the number of inactives for a given class. The area under the curve (AUC) was calculated from the ROC curve and is the metric we report. From a set of ranked molecules, it is given by the trapezoidal sum

AUC = Σ_{i=1}^{N} [(A_{i−1} + A_i) / (2n)] · [(I_i − I_{i−1}) / (N − n)],  with A_0 = I_0 = 0,

where n is the number of active compounds, N is the total number of compounds, A_i is the cumulative count of actives at rank position i, and I_i is the cumulative count of inactives at rank position i.
Metrics including accuracy, precision, recall, and F1-score were also utilized to evaluate the model performance. The imbalanced datasets pose a significant challenge for binary classification models, as they can result in biased predictions towards the majority class. Accuracy measures the proportion of correctly classified samples over the total number of samples. However, accuracy can be misleading in the case of highly imbalanced datasets, as it can result in high scores even when the model fails to correctly classify minority class samples. Precision, recall, and F1-score are more suitable metrics for imbalanced datasets. F1-score considers both precision and recall in its calculation, where precision measures how accurate the model is in predicting the positive class, and recall measures how well the model identifies the positive class among all positive samples. The formulas for the specific evaluation indicators are as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-score = 2 · Precision · Recall / (Precision + Recall)

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively.
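These indicator definitions translate directly into code; a minimal sketch from the confusion-matrix counts (the function name is ours):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from true/false positive and false
    negative counts (denominators assumed nonzero)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```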
Although precision, recall, and F1-score are suitable for imbalanced data, for the purpose of this study, which focused on virtual screening, the enrichment factor (EF) was the preferred metric, as it is popular and established for evaluating virtual screening performance. The EF denotes the quotient of the fraction of true actives among a subset of predicted actives and the overall fraction of actives and can be calculated as follows:

EF_X = |{i : r_i ≤ X · N}| / (n · X),

where r_i indicates the rank of the ith active, n is the number of actives, N is the number of total compounds, and X is the fraction of the ranked list within which the EF is calculated. We evaluated the EF at 0.1% (X = 0.001) and 0.5% (X = 0.005) of the ranked test set. To evaluate the maximum achievable enrichment factor (EFmax), the maximum number of actives among the percentage of tested compounds was divided by the fraction of actives in the entire dataset. To more consistently quantify enrichment at 0.1% of all compounds tested, we report the ratio EF/EFmax (Figure 5). This normalized metric is useful because EF values are not directly comparable across different datasets, as the maximum possible EF is constrained by the ratio of total to active compounds and the evaluated fraction, as shown in the equation above.
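A minimal sketch of the EF calculation from a scored library (our own helper; score ties and edge cases are ignored):

```python
def enrichment_factor(scores, labels, fraction=0.001):
    """EF at a given fraction: active rate in the top-ranked fraction of the
    library divided by the active rate in the whole library."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    n_top = max(1, int(len(ranked) * fraction))
    hits = sum(label for _, label in ranked[:n_top])
    return (hits / n_top) / (sum(labels) / len(labels))
```

For instance, in a 1000-compound library with 10 actives all ranked at the top, the EF at 0.5% equals the maximum achievable value of 100.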
4.7. Kinase Task Similarities
To evaluate how similar chemical structures in one kinase task are compared to those in all other kinase tasks, 25 diverse active compounds in each kinase task were selected using the diverse molecule component in Pipeline Pilot (version 20.1.0.2208). This algorithm defines diverse molecules by maximum dissimilarity using the Tanimoto distance function and ECFP4 descriptors. For each kinase task, the average maximum Tanimoto similarity to all other kinase tasks was calculated based on the 25 diverse samples from each class. This task-based similarity refers to the average maximum similarity of the reference class to all other kinase tasks, as shown in Figure 8.
4.8. KINOMEscan and IDG Predictions
KINOMEscan datasets were obtained from the LINCS Data Portal (http://lincsportal.ccs.miami.edu/). The initial datasets contained 85 compounds with defined/known chemical structures and 387 different kinase targets. Kinase activity was screened at a 10 µM compound concentration. Kinase domain targets were curated and standardized using the Drug Target Ontology (DTO, http://drugtargetontology.org/), and the datasets were joined into a data matrix of unique kinase targets (domains) and standardized small-molecule canonical SMILES. Of the total 387 kinase targets, a subset of 291 targets shared commonality with those included in the training data and was utilized to evaluate the model's performance; null values were introduced where no data were available. Kinase activity values were binarized for each small-molecule kinase inhibition value [0, 1], where 1 indicated active (≥80% inhibition) and 0 indicated inactive. The resulting dataset was imbalanced, with ~83% inactives and ~17% actives. Accuracy was calculated as described above.
IDG-DREAM round 1 and 2 pKd kinase inhibitor datasets were collected from Drug Target Commons (https://drugtargetcommons.fimm.fi/). After mapping kinase domain information to UniProt identifiers via DTO, a total of 808 data points was obtained for 94 compounds and 288 unique kinase targets. From these, 223 kinase targets that aligned with the targets present in the training data were kept for testing purposes. Approximately 63% of the data points were labeled as actives using a cut-off pKd (−logM) of 6, where actives were defined as data points with pKd greater than or equal to 6 and inactives (~37%) were those with pKd less than 6. ROC scores were calculated as described above. For both the KINOMEscan and IDG chemical structures, the same ECFP fingerprint descriptor of size 1024 was used.