1. Introduction
Malaria is a life-threatening disease transmitted through the bites of infected female Anopheles mosquitoes. Five parasite species cause malaria in humans, and two of them, Plasmodium falciparum and Plasmodium vivax, pose the greatest threat. Although global efforts to combat malaria have advanced substantially during the last decade, nearly half of the world’s population was at risk of malaria in 2022. According to the WHO, there were an estimated 249 million malaria cases and 608,000 malaria deaths in 2022 [1].
Malaria is one of the most devastating and widespread tropical parasitic diseases and is most prevalent in developing countries. The WHO regions of South-East Asia, the Eastern Mediterranean, the Western Pacific, and the Americas reported a significant number of cases and deaths. However, the WHO African region carries a disproportionately high share of the global malaria burden: in 2022, the region accounted for 95% of malaria cases and 96% of malaria deaths [1,2,3]. According to the WHO report, the incidence of malaria cases increased consistently over the 22 years from 2000 to 2022. One of the reasons seems to be parasite resistance to existing antimalarial drugs.
Resistance has emerged to most approved antimalarial drugs, and its spread and establishment can undermine the gains seen over the last decade. Therefore, new drugs that work through differentiated modes of action are urgently needed [4]. Antimalarial drug resistance is the ability of a parasite strain to survive and/or multiply despite the administration and absorption of a medicine given in doses equal to or higher than those usually recommended [5]. Resistance is a key challenge for anti-infectives in general, but particularly for antimalarials. Resistance can emerge in pathogens through a number of mechanisms, including mutations in the target enzyme and amplification of the target enzyme [6]. Several factors have contributed to the emergence of resistance to current antimalarial drugs, including the rate of parasite mutation, the potency of the selected drug, inadequate pharmacokinetic properties, and substandard drug quality, among other elements that can contribute to and exacerbate the resistance problem [5].
With the rising resistance to frontline drugs (artemisinin-based combinations), there is a need to accelerate the discovery and development of novel antimalarial drugs [
7]. As drugs are urgently needed, the concerted effort of many in the field has resulted in the screening of millions of small molecules in phenotypic assays to assess parasite death [
8,
9,
10]. These efforts have identified thousands of new chemical scaffolds. A major goal of new drug development for malaria is the discovery of compounds that kill parasites in multiple stages of the life cycle and, thus, could be used both in disease prevention and treatment. In the realm of drug discovery, chemical biology and computational drug design methodologies are harnessed for the effective identification and refinement of lead compounds. Chemical biology focuses primarily on unraveling and elucidating the biological function of a target and understanding the mechanism of action of a chemical modulator. Conversely, the in silico-based method or computer-aided drug design leverages structural insights from the target (structure-based) or known bioactive ligands (ligand-based) to facilitate and accelerate the identification of promising candidate drugs [
11].
There has been a shift in the discovery of antimalarial drugs from phenotypic screening to target-based approaches, as more potential drug targets have been validated in Plasmodium species [
4]. Given the high attrition rate, high demand for new drugs, and enormous cost and time-consuming nature of drug discovery [
12,
13,
14,
15,
16], it is essential to select the targets that are the most likely to deliver progressable drug candidates. Target-based drug discovery is the dominant paradigm of drug discovery [
17]. In this study, we use the following three high-priority potential targets found in Plasmodium that are ready to enter the antimalarial drug discovery process.
Acetyl CoA Synthetase (PfAcAS; high priority target [
4]) is responsible for the biosynthesis of acetyl coenzyme A from coenzyme A and acetate. Researchers have recently reported on the validation of this enzyme [
18,
19].
Bifunctional Farnesyl/Geranylgeranyl Pyrophosphate Synthase (F/GGPPS; high priority target [
4]) is a key enzyme in isoprenoid biosynthesis that synthesizes C15 and C20 prenyl chains. Prenyl chains are the substrates of several prenyltransferases that result in isoprenoid products essential for the survival of the parasite [
20].
Monoacylglycerol Lipase (PfMAGL; new and emerging [
4]) [
21]. Human monoacylglycerol lipase (MAGL) catalyzes the hydrolysis of a variety of monoglycerides into fatty acids and glycerol. In Plasmodium falciparum, this enzyme has been reported to play a role in the processing of these monoglycerides, including palmitoyl and oleoyl glycerols [
21].
The criteria considered important for selecting the aforementioned targets for antimalarial drug discovery are discussed by Forte et al. [
4]. They also describe the analysis of several drug targets within the Malaria Drug Accelerator (MalDA) pipeline, allowing them to prioritize targets that are ready to enter the drug discovery phase.
In recent years, machine learning (ML) techniques such as graph neural networks (GNN) and natural language processing (NLP), exemplified by methods such as ChemBERTa, SMILES-BERT, etc. [
22,
23,
24], have shown remarkable performance in various domains such as cheminformatics and bioinformatics [
13,
25,
26]. Researchers have actively proposed data-driven approaches and have adopted computational biology perspectives based on deep learning to explore drug–target interaction or relations [
14,
15,
16,
27,
28,
29,
30,
31,
32,
33]. Accurate prediction and identification of drug–target interactions (DTIs) are pivotal elements in drug discovery [
14,
15,
16,
27,
32,
33]. The emphasis on computational methods for the prediction of DTIs has grown due to the significant costs and time associated with extensive in vivo and in vitro experiments with a wide range of potential drug chemical compounds [
13,
14,
15,
16]. In addition, advanced deep learning (DL) frameworks that use a variety of GNNs, such as graph convolutional networks (GCNs) [
34], graph attention networks (GATs) [
35], gated graph neural networks (GGNNs) [
36], and residual graph convolutional networks (RGCNs) [
37], have achieved groundbreaking performance in various research domains, including social and natural sciences, and knowledge graphs [
38,
39].
These frameworks have proven effective in applications such as molecular property prediction [40], DTI prediction [12], and protein function prediction [
41] within the realm of biochemical problems where the interactions can be represented as graph-like structured data. Specifically, GCNs have found application in addressing pharmacological similarities by considering both sequential and structural properties [
12]. The graph representations of biochemical entities have proven capable of capturing structural features akin to Euclidean ones, eliminating the need for extensive feature engineering [
42,
43]. GNNs have been shown to be powerful at modeling graph-structured data supplied as input to the DL pipeline [
44,
45,
46]. The graph-structured data are typically represented as $G = (V, E, u)$, where $G$ is a graph, $V$ denotes the vertices or nodes, $E$ denotes the edges, comprising the adjacency ($A$) with weights ($W$), and $u$ denotes the feature vectors of the nodes. All GNN layers are implemented via a message passing network (MPN) interface or framework [
47]. Quantitative Structure–Activity Relationship (QSAR) modeling investigates the relationship between the chemical structure of molecules or substances and their biological activities towards a specific target (i.e., a drug target) [
48]. By employing statistical techniques, QSAR and ML models have the capacity to predict drugs for drug repurposing or repositioning and pharmacological activity of novel compounds based on their structural attributes [
49]. This capability enables chemists to strategically modify molecules, improving the potency of a drug or alleviating side effects, resulting in a more cost-effective and time-efficient drug development process [
50].
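The graph notation and the message passing idea described above can be illustrated with a small, dependency-free sketch. It is only an illustration under stated assumptions: the `Graph` class, the `message_passing_step` function, the weighted-mean aggregation, and the equal mixing of a node's own features with the incoming message are all hypothetical choices made for readability, not the update rule of any specific GNN layer used in this study.

```python
from dataclasses import dataclass


@dataclass
class Graph:
    """G = (V, E, u): nodes, weighted edges, and per-node feature vectors."""
    nodes: list      # V: node identifiers
    edges: dict      # E: {(src, dst): weight}, i.e. adjacency A with weights W
    features: dict   # u: {node: [float, ...]}


def message_passing_step(graph):
    """One round of message passing (illustrative update rule).

    Each node aggregates the weighted mean of its incoming neighbours'
    features and averages that message with its own feature vector.
    """
    dim = len(next(iter(graph.features.values())))
    updated = {}
    for node in graph.nodes:
        total = [0.0] * dim
        weight_sum = 0.0
        for (src, dst), weight in graph.edges.items():
            if dst == node:
                for k in range(dim):
                    total[k] += weight * graph.features[src][k]
                weight_sum += weight
        own = graph.features[node]
        if weight_sum == 0.0:
            updated[node] = list(own)  # isolated node keeps its features
        else:
            updated[node] = [
                0.5 * own[k] + 0.5 * total[k] / weight_sum for k in range(dim)
            ]
    return updated


# Toy undirected graph: each undirected edge is stored in both directions.
toy_graph = Graph(
    nodes=["a", "b", "c"],
    edges={("a", "b"): 1.0, ("b", "a"): 1.0, ("b", "c"): 2.0, ("c", "b"): 2.0},
    features={"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]},
)
new_features = message_passing_step(toy_graph)
```

Stacking several such steps lets information travel along longer paths in the graph; practical GNN layers replace the fixed averaging with learned weight matrices and nonlinearities.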
This study introduces a pioneering approach that harnesses the potential of advanced ML methodologies to predict and identify antimalarial drugs. The essence of this approach is the integration of state-of-the-art NLP and GNN techniques. The study proposes a fusion of the SMILES-based bidirectional encoder representations from transformers (BERT) model, an NLP model, with the robust capabilities of a GNN model (i.e., the relational graph convolutional network (RGCN)). We adopted the DeepChem SmilesTokenizer module to generate embedding features [51], which were later used by the RGCN as node features. This tokenizer inherits heavily from the BertTokenizer class found in Hugging Face’s transformers library. It runs a WordPiece tokenization algorithm over SMILES strings using the SMILES tokenization regex developed by Schwaller et al. [
52]. Furthermore, the RGCN model was trained using multi-relational antimalarial drugs and Plasmodium potential targets graph data, where nodes represent antimalarial drugs and potential targets, and edges represent drug–target relations. The contributions of this study can be summarized as follows:
To the best of our knowledge, this is the first study to apply NLP-based methods from chemistry and biology (i.e., the BERT-based SmilesTokenizer) together with a GNN model (i.e., RGCN) to antimalarial drug prediction against Plasmodium falciparum.
Secondly, a graph-based network was designed linking antimalarial drugs and three potential new antimalarial drug targets found in Plasmodium, as shown in
Figure 1.
We further developed three independent models, BERT-RGCN, Mordred-RGCN, and BERT-Mordred-RGCN, and compared their performance to understand the contributions of the features.
Lastly, various experiments were performed using the features of individual antimalarial drugs, followed by a combination of the features of antimalarial drugs and the three potential antimalarial drug targets found in Plasmodium.
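The regex-based splitting step that the SmilesTokenizer performs on SMILES strings, as described above, can be sketched as follows. This is a minimal illustration rather than the DeepChem implementation: the pattern is assumed to match the one published by Schwaller et al., the helper names `tokenize_smiles` and `encode` are hypothetical, and the toy vocabulary stands in for the pretrained vocabulary file that the real tokenizer loads.

```python
import re

# Assumed to correspond to the SMILES tokenization regex of Schwaller et al.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>>?|\*|\$|\%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles):
    """Split a SMILES string into atom-level and bond-level tokens."""
    tokens = SMILES_PATTERN.findall(smiles)
    # Guard against characters the pattern silently skipped.
    if "".join(tokens) != smiles:
        raise ValueError(f"Unrecognised characters in SMILES: {smiles!r}")
    return tokens


def encode(tokens, vocab, unk_id=0):
    """Map tokens to integer ids, falling back to the unknown-token id."""
    return [vocab.get(token, unk_id) for token in tokens]


# Aspirin as an example molecule; the vocabulary is built on the fly here,
# whereas a real tokenizer would load a fixed, pretrained vocabulary.
example_tokens = tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O")
vocab = {"[UNK]": 0}
for token in example_tokens:
    vocab.setdefault(token, len(vocab))
example_ids = encode(example_tokens, vocab)
```

Multi-character atoms such as `Cl` and `Br`, as well as bracketed atoms such as `[NH4+]`, are kept as single tokens, which is the main reason a regex is preferred over character-level splitting.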
Literature Review
In modern drug discovery, the integration of cheminformatics and QSAR modeling has emerged as a formidable alliance, allowing researchers to harness and leverage the extensive potential of ML techniques for predictive molecular design and analysis [
50]. Previous studies have conducted thorough validation of predictive models for antimalarial drugs using ML methods and molecular descriptors. Mswahili et al. [
2] proposed various ML models, with the best models achieving accuracies exceeding 82%. Using a similar approach, Liu et al. [
53] proposed traditional ML classification models to predict antimalarial activity against Plasmodium falciparum and achieved accuracies of 87.3% and 88.9% for a support vector machine (SVM) and general regression neural network (GRNN), respectively. Additionally, Danishuddin et al. [
54] conducted a comprehensive validation of antimalarial predictive models using ML approaches. Among these, SVM and XGBoost exhibited high performance, achieving an accuracy of approximately 85% on the independent test set. While the results are encouraging and promising, the majority of the in silico-based methods proposed still focus primarily on the features of antimalarial drugs. The common limitation, clearly, is that these previous methods did not encompass the distinctive features of potential targets found in Plasmodium, which are paramount in the discovery of novel antimalarial drugs.
3. Experiment and Evaluation Criteria
The proposed framework integrates two different advanced ML methodologies: NLP (i.e., SMILES-based BERT model) and GNN (i.e., RGCN), as shown in
Figure 2. The SMILES-based BERT model, known for its proficiency in processing molecular structures, was used to encode molecular information from SMILES strings (i.e., input embeddings or feature values) for the antimalarial drugs and the three targets (i.e., PfAcAS, F/GGPPS, and PfMAGL). This encoding represents the drugs and the three potential targets in a form suitable for subsequent graph-based analysis. In this study, BERT was introduced for the first time to antimalarial drug prediction as a tool for extracting and generating features (i.e., BERT-based features) from the SMILES strings of drugs and targets, as shown in Figure 2. To further validate the contribution and effectiveness of the BERT-based features, we compared them with features generated by Mordred (i.e., Mordred-based features) [
55], the most widely used tool for feature extraction in this area of antimalarial drugs prediction [
2,
54].
Thereafter, the GNN model, namely RGCN, was deployed to handle graph-structured data, encapsulating the relationships between antimalarial drugs and potential targets found in Plasmodium, as shown in
Figure 1. The input node features used to implement and train the RGCN were computed with the BERT-based SmilesTokenizer module adopted from DeepChem and were compared with those calculated by Mordred, a well-known descriptor calculator. Training the RGCN involved iterative processes to discern and capture the intricate drug–target interactions embedded within the multi-relational data. An overview of the architecture and training strategy of the proposed RGCN model is given in
Table 4.
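To make the role of relation types concrete, the sketch below implements a single relational graph convolution in plain Python, following the commonly cited RGCN update in which each relation has its own weight matrix and messages are normalized by the number of neighbours under that relation. It is an illustrative assumption, not the architecture reported in Table 4: the relation names `interacts_with` and `targeted_by`, the two-dimensional features, and the fixed weight matrices are all hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rgcn_layer(features, edges, rel_weights, self_weight):
    """One relational graph convolution with ReLU activation.

    features:    {node: feature vector}
    edges:       list of (source, relation, destination) triples
    rel_weights: {relation: weight matrix W_r}
    self_weight: weight matrix W_0 applied to the node's own features
    """
    # Group incoming neighbours by (destination, relation).
    incoming = {}
    for src, rel, dst in edges:
        incoming.setdefault((dst, rel), []).append(src)

    output = {}
    for node, hidden in features.items():
        acc = matvec(self_weight, hidden)  # self-connection term
        for rel, weight in rel_weights.items():
            neighbours = incoming.get((node, rel), [])
            for src in neighbours:
                message = matvec(weight, features[src])
                # Normalize by the neighbour count for this relation.
                acc = [a + m / len(neighbours) for a, m in zip(acc, message)]
        output[node] = [max(0.0, a) for a in acc]  # ReLU
    return output


# Toy drug-target graph with hypothetical relation names and features.
node_features = {
    "drug_1": [1.0, 0.0],
    "drug_2": [0.0, 1.0],
    "PfAcAS": [1.0, 1.0],
}
typed_edges = [
    ("drug_1", "interacts_with", "PfAcAS"),
    ("drug_2", "interacts_with", "PfAcAS"),
    ("PfAcAS", "targeted_by", "drug_1"),
    ("PfAcAS", "targeted_by", "drug_2"),
]
relation_weights = {
    "interacts_with": [[1.0, 0.0], [0.0, 1.0]],
    "targeted_by": [[0.5, 0.0], [0.0, 0.5]],
}
identity = [[1.0, 0.0], [0.0, 1.0]]
hidden_states = rgcn_layer(node_features, typed_edges, relation_weights, identity)
```

In a trained model the weight matrices are learned by gradient descent, and the node features would be the BERT-based or Mordred-based vectors rather than the toy values used here.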
To evaluate the predictive capabilities of the integrated models, rigorous evaluation methodologies were used. The proportion of the dataset used in this study for model training and evaluation is summarized in
Table 6.
The datasets were partitioned into training, validation, and test sets to facilitate model training and subsequent assessment. Cross-validation techniques, such as 10-fold cross-validation, were used to improve robustness and mitigate bias in the model evaluation process. Evaluation metrics, including accuracy, sensitivity, specificity, the Matthews correlation coefficient (MCC), the area under the receiver operating characteristic curve (AUROC), and the area under the precision–recall curve (AUPRC), were used to quantify the predictive performance of the model. Comparative analyses against baseline models and existing approaches to antimalarial drug discovery served as benchmarks to gauge the efficacy of the proposed frameworks.
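The threshold-based metrics listed above follow directly from the confusion matrix, and 10-fold cross-validation amounts to rotating which tenth of the data is held out. The sketch below shows both using the standard textbook definitions; the function names, the example counts, and the contiguous (unshuffled) fold assignment are hypothetical simplifications and do not reproduce the evaluation code or results of this study.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, and MCC from confusion counts."""
    total = tp + tn + fp + fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,  # true positive rate
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,  # true negative rate
        "mcc": (tp * tn - fp * fn) / denominator if denominator else 0.0,
    }


def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for k contiguous folds.

    Samples are assumed to be shuffled beforehand; the first
    n_samples % k folds receive one extra sample.
    """
    start = 0
    for fold in range(k):
        size = n_samples // k + (1 if fold < n_samples % k else 0)
        test_indices = list(range(start, start + size))
        train_indices = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_indices, test_indices
        start += size


# Hypothetical confusion counts and dataset size, for illustration only.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
folds = list(k_fold_indices(25, k=10))
```

Unlike accuracy, MCC uses all four confusion counts in a single correlation-like score, which makes it more informative when active and inactive compounds are imbalanced. AUROC and AUPRC are omitted here because they require ranked prediction scores rather than hard labels.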
5. Discussion
The integration of NLP, represented by the SMILES-based BERT model, and GNN, exemplified by RGCN, presents a novel and potent approach to identify novel antimalarial drug candidates. The use of these advanced ML techniques has shown promise in overcoming the complexities associated with drug–target interactions, offering a pathway towards combating the challenges posed by drug resistance in malaria treatment. The successful prediction of potential antimalarial drugs validated the efficacy of our proposed approach. By harnessing multi-relational data that encompass critical antimalarial drug targets such as PfAcAS, F/GGPPS, and PfMAGL, our proposed models such as BERT-RGCN and Mordred-RGCN demonstrated their efficacy in capturing multi-drug–target relations, showcasing the potential of the proposed approaches in uncovering novel drug candidates with varying modes of action against malaria. The prediction accuracy, validated through rigorous testing and cross-validation methodologies, substantiates the reliability and robustness of the proposed frameworks. This capability is perhaps pivotal in uncovering novel compounds capable of disrupting the malaria parasite’s life cycle at various stages.
Comprehensive analysis of graph-structured data revealed various drug–target interactions, shedding light on potential mechanisms of action and the polypharmacology essential in addressing the complexity of malaria treatment. The ability of RGCN models such as BERT-RGCN and Mordred-RGCN to perform inference over complex relational data allowed the identification of previously unexplored antimalarial drug–target interactions, demonstrating their potential and providing a foundation for further exploration and experimentation to accelerate antimalarial drug discovery.
However, it is important to acknowledge certain limitations. The availability of comprehensive and curated datasets remains a challenge in drug discovery research. Despite the success of the models (i.e., BERT-RGCN and Mordred-RGCN) in navigating complex relational data, the quality and depth of available data profoundly influence predictive accuracy. Additionally, the need for experimental validation of predicted drug candidates remains imperative to confirm their efficacy and safety profiles. Moreover, while our approach showcases promise, further refinement and optimization are necessary. Fine-tuning the model architecture and leveraging larger and more diverse datasets could enhance the prediction accuracy and broaden the scope of potential drug candidates identified.
Overall, the effectiveness of the integrated BERT model and RGCN in predicting potential antimalarial drugs offers a promising avenue to accelerate drug discovery efforts and address the challenges posed by drug resistance in the fight against malaria. The results also showed that using new potential targets such as PfAcAS, F/GGPPS, and PfMAGL for antimalarial drugs was promising regardless of the feature extraction method, and that RGCN yielded better results when graphs were used to represent antimalarial drug molecules and potential targets in Plasmodium. The success in predicting drug candidates against Plasmodium highlights the potential impact of advanced ML in addressing global health challenges. Moving forward, continued advances in methodologies and collaborative efforts between computational and experimental research are crucial in translating predictions into tangible therapeutic interventions to combat malaria effectively.
6. Conclusions
The pursuit of novel antimalarial drugs to combat the persistent threat posed by Plasmodium demands innovative and interdisciplinary approaches. Our study leveraged the synergy between advanced ML techniques such as NLP (i.e., BERT) and GNN (i.e., RGCN) to predict antimalarial drugs against Plasmodium falciparum. Graph-structured data can effectively represent the relationships between antimalarial drugs and potential Plasmodium targets, allowing for a more holistic understanding of their interactions. Training the RGCN on multi-relational data related to potential targets such as PfAcAS, F/GGPPS, and PfMAGL is a strategic way to capture diverse malaria drug–target interactions.
Integration of the SMILES-based BERT model with the robust RGCN showcased immense potential to decipher intricate drug–target interactions and accelerate drug discovery processes, especially in combating diseases like malaria where resistance to existing treatments is a significant challenge. The model’s adeptness in navigating multi-relational data, particularly focusing on critical antimalarial drug targets like PfAcAS, F/GGPPS, and PfMAGL, underscores its capacity to uncover promising compounds capable of disrupting various stages of the malaria parasite’s life cycle. Our study’s success in predicting antimalarial drugs offers a glimpse into the transformative impact of computational methodologies in drug discovery. However, this progress does not exist in isolation; it is based on collaborative efforts across scientific disciplines, emphasizing the importance of interdisciplinary research in addressing global health challenges.
Nevertheless, several challenges persist; these include dependence on data availability and quality, the necessity for rigorous experimental validation, and the ongoing evolution of the malaria parasite. The journey towards translating computational predictions into viable therapeutic interventions demands a concerted effort from both computational and experimental researchers. The convergence of cutting-edge ML methodologies with traditional drug discovery approaches holds promise. Fine-tuning predictive models, expanding datasets, fostering collaborations, and integrating insights from diverse domains will accelerate the field towards identifying more effective and resilient antimalarial drugs. In conclusion, our study represents a crucial step towards leveraging advanced ML in the pursuit of addressing the challenges of malaria treatment. The intersection of computational methodologies and drug discovery research presents a promising path forward in the quest to alleviate the global burden of malaria, emphasizing the transformative potential of integrative approaches in shaping the future of medicine against malaria. We believe that our method, which exploits the RGCN and BERT, is applicable to other domains such as object recognition and autonomous driving, because it relies on distances in the embedding representations, which can be useful in domains where the data have local patterns.