1. Introduction
Drug repositioning is the process of identifying new therapeutic uses for existing drugs beyond their original indications. It is a direction with great opportunities and challenges, and it offers the advantages of low cost, short development time, and low risk [1, 2]. Drug-target interactions (DTIs) play an important role in drug discovery and drug repositioning. Accurate prediction of DTIs can improve the accuracy of drug clinical trials, thus greatly reducing experimental risk. Over time, the accumulation of large amounts of biological experimental data and related literature has made biological databases increasingly rich, which provides favorable conditions for the use of computational methods.
Traditional computational methods are mainly divided into two categories: ligand-based methods and structure-based methods. However, structure-based approaches are limited when the 3D structure of the target protein is unavailable, and ligand-based approaches have low accuracy when only a few binding ligands are known for the target protein [3, 4, 5, 6, 7]. In recent years, the widespread recognition of data-driven methods has led to machine learning algorithms being widely used in biomolecular association prediction [8, 9, 10, 11]. In-silico methods fall mainly into four groups: machine learning-based methods, network-based methods, matrix factorization-based methods, and deep learning-based methods [12, 13, 14]. For example, Ding et al. [15] used substructure fingerprints, physical and chemical properties of organisms, and DTIs as input features, and used SVM for classification. Chen et al. [16] employed a gradient boosting decision tree (GBDT) to predict drug-target interactions based on three properties: the IDs of the drug and target, the descriptors of the drug and target, and DTIs. Luo et al. [17] constructed a heterogeneous network to predict potential DTIs by integrating the information of multiple drugs. Chen et al. [18] and Ji et al. [19] proposed a multi-molecular network model based on network embedding to predict novel DTIs. Liu et al. [20] proposed a model called NRLMF, which calculates the score of DTIs through logistic matrix factorization, where the properties of the drug and target are represented by drug-specific and target-specific latent vectors. Zheng et al. [21] proposed mapping the drug and target into a low-rank matrix, establishing a weighted similarity matrix, and solving the problem with the least squares algorithm. Wen et al. [22] used unsupervised learning to extract representations from the original input descriptors to predict DTIs.
Recently, the extensive application of graph neural networks to non-Euclidean structured data has led to various graph-based algorithms [23, 24, 25, 26, 27, 28, 29, 30], such as graph convolutional networks (GCN), graph attention networks (GAT), graph autoencoders (GAE), graph generative networks, and graph spatial-temporal networks. Analysis of biological data suggests that biological data networks are well suited to graph neural networks. Gao et al. [31] used long short-term memory (LSTM) and graph convolutional networks (GCN) to represent protein and drug structures in order to predict DTIs. Previous work has shown the favorable performance of graph neural networks for DTI prediction [27, 32]; however, relying on a single view of the relationships in DTI data cannot fully mine the hidden information in the graph data. Therefore, it is necessary to explore deeper information about drugs and target proteins through graph neural networks.
In a real graph, the relationship between two nodes is complex, and the features of each node are usually composed of a variety of attributes. It is necessary to clearly understand the relationship between nodes, so the extraction of node features should be multi-angle and multi-dimensional. To address these challenges, we propose a novel method to predict DTIs based on large-scale graph representation learning (LGDTI). Unlike previous graph neural network-based approaches, LGDTI aims to gain an in-depth understanding of the association network of known drugs and targets through different graph-based representation learning methods. To extract hidden graph features of drugs and targets in a complex biological network, two types of graph representation learning were used.
3. Results and Discussion
3.1. Performance Evaluation of LGDTI Using 5-Fold Cross-Validation
To evaluate the stability and robustness of LGDTI, 5-fold cross-validation was adopted. Specifically, the original data set was randomly divided into 5 subsets; in each round, 4 subsets were used for training and the remaining subset was used as the test set, and this was repeated 5 times so that each subset served once as the test set. Additionally, we used five evaluation indicators: Acc. (accuracy), MCC (Matthews correlation coefficient), Sen. (sensitivity), Spec. (specificity), and Prec. (precision). Moreover, for binary classification, the receiver operating characteristic (ROC) curve reflects the capability of the model, and the AUC is the area under the ROC curve. The closer the ROC curve is to the upper left corner, the better the performance of the model and, correspondingly, the higher the AUC. The precision-recall (PR) curve plots recall on the horizontal axis and precision on the vertical axis. On highly skewed data sets, the PR curve gives a more comprehensive picture of model performance. The details of LGDTI under 5-fold cross-validation are shown in Table 1 and Figure 4. The per-fold AUC, AUPR, and other evaluation criteria show that the proposed method has good predictive ability. Moreover, the results of the individual folds are close to each other, which shows that the model has good stability and robustness.
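The evaluation protocol described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names `kfold_indices` and `binary_metrics` and the example counts are hypothetical, and the metrics follow the standard confusion-matrix definitions.

```python
import math
import random


def kfold_indices(n_samples, k=5, seed=0):
    """Randomly split sample indices into k folds and yield (train, test) pairs."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def binary_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix metrics for binary classification."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    sen = tp / (tp + fn) if tp + fn else 0.0    # sensitivity (recall)
    spec = tn / (tn + fp) if tn + fp else 0.0   # specificity
    prec = tp / (tp + fp) if tp + fp else 0.0   # precision
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Acc": acc, "Sen": sen, "Spec": spec, "Prec": prec, "MCC": mcc}


# Hypothetical counts for one test fold (not taken from the paper).
fold_metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
splits = list(kfold_indices(n_samples=20, k=5))
```

In a full run, a classifier would be trained on each `train` split, the confusion counts collected on the matching `test` split, and the five metric dictionaries averaged across folds.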
3.2. Comparison of LGDTI with Different Machine Learning Algorithms
Different machine learning algorithms have different representations of features. By comparing different classification algorithms, including logistic regression (LR), K-nearest neighbor (KNN), gradient boosting decision tree (GBDT), and random forest classifier (RF), we can see the advantages of the LGDTI features directly. To make the comparison fairer and more objective, all classification algorithms used their default parameters. The detailed evaluation results of 5-fold cross-validation are shown in Table 2 and Figure 5.
The results can be explained as follows: (i) for logistic regression, because of the depth and high complexity of the input features, it may be difficult to form a linear decision surface, so the model cannot fit the features well; (ii) for K-nearest neighbor, the sample features already fuse the attributes of neighboring nodes from the earlier representation learning stage, which makes it difficult for KNN to discriminate accurately between samples; (iii) gradient boosting decision tree and random forest classifier are both ensemble classifiers, which can better overcome the shortcomings of a single classifier; the random forest classifier in particular achieves the best results on this dataset.
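A comparison of this kind could be set up roughly as below. This is a sketch under stated assumptions, not the authors' experiment: it assumes scikit-learn is available, uses synthetic data from `make_classification` as a stand-in for the LGDTI feature vectors, and scores each classifier by AUC only. All classifiers are constructed with their default parameters, as in the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for drug-target pair features (assumption: the real
# LGDTI features and labels are not reproduced here).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Every classifier keeps its library defaults.
classifiers = {
    "LR": LogisticRegression(),
    "KNN": KNeighborsClassifier(),
    "GBDT": GradientBoostingClassifier(),
    "RF": RandomForestClassifier(),
}

# Random 5-fold split, shared by all classifiers.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
mean_auc = {}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
    mean_auc[name] = scores.mean()

for name, auc in mean_auc.items():
    print(f"{name}: mean AUC = {auc:.4f}")
```

Sharing one `KFold` object across classifiers keeps the folds identical, so differences in the scores come from the classifiers rather than from the split.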
3.3. Comparison of Different Features: Attribute, GF, and LGDTI
In summary, LGDTI constructs a graph and combines the first-order and high-order information of the nodes in the graph to represent the characteristics of each node. The first-order graph information aggregates the direct neighbor information of nodes. In graph theory, two nodes are similar if the structures of their surrounding subgraphs are similar. The high-order graph information provides a good representation of each node's indirect neighbor information. Therefore, we conducted experiments on the different node features using the random forest classifier, as shown in Table 3 and Figure 6. In Table 3, Attribute denotes the features of drug molecular structure and protein sequence; GF denotes first-order graph information only; LGDTI includes both the first-order and high-order graph information. Using only node self-attributes gives the worst results, while the self-attributes of nodes can be enhanced through GCN. Therefore, combining first-order and high-order graph information better explores the potential features of nodes.
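The two kinds of graph information can be illustrated with a small sketch. It is not the LGDTI implementation: `gcn_propagate` performs a single GCN-style neighborhood aggregation without learned weights, `random_walks` generates the truncated random walks that DeepWalk-style methods feed to a skip-gram model (the skip-gram step is omitted), and the toy graph and attribute vectors are hypothetical.

```python
import math
import random


def gcn_propagate(adj, feats):
    """One unweighted GCN-style step: D^-1/2 (A + I) D^-1/2 X."""
    n = len(adj)
    # Add self-loops so each node keeps its own attributes.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    dim = len(feats[0])
    out = [[0.0] * dim for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if a_hat[i][j]:
                # Symmetric degree normalization of the edge weight.
                w = a_hat[i][j] / math.sqrt(deg[i] * deg[j])
                for k in range(dim):
                    out[i][k] += w * feats[j][k]
    return out


def random_walks(neighbors, walk_length, walks_per_node, seed=0):
    """Truncated uniform random walks starting from every node."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        nodes = list(neighbors)
        rng.shuffle(nodes)
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length and neighbors[walk[-1]]:
                walk.append(rng.choice(neighbors[walk[-1]]))
            walks.append(walk)
    return walks


# Toy bipartite graph (hypothetical): nodes 0-1 are drugs, 2-3 are targets.
adjacency = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
]
attributes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
neighbors = {i: [j for j, a in enumerate(row) if a]
             for i, row in enumerate(adjacency)}

first_order = gcn_propagate(adjacency, attributes)  # direct-neighbor view
walks = random_walks(neighbors, walk_length=5, walks_per_node=2)  # multi-hop view
```

The aggregation step only mixes a node with its direct neighbors, whereas a walk of length 5 can reach nodes several hops away, which is why the two views carry complementary information.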
3.4. Comparison with Existing State-of-the-Art Prediction Methods
To evaluate the advantage of the proposed method, it is compared with other advanced methods. Although the methods proposed by Chen et al. [18] and Ji et al. [19] consider the network information of nodes, they do not fully express the local information of nodes in the network. In contrast, LGDTI extracts node information more thoroughly, and its AUROC, AUPR, and ACC are higher than those of the other methods, as shown in Table 4.
The variant using only node attributes (LGDTI (Only Attribute)), the variant using node first-order information (LGDTI (GF)), and the full LGDTI model all outperform the other methods. With only node attributes, the AUROC, AUPR, and ACC of our model are at least 0.031, 0.0281, and 0.0259 higher, respectively. LGDTI (GF) also retains some advantage. Specifically, the AUROC, AUPR, and ACC of the LGDTI model are at least 0.0222, 0.019, and 0.0281 higher, respectively, than those of the method of Ji et al. (Attribute+Behavior). The first-order neighborhood information aggregation enhances the node attribute characteristics. Furthermore, integrating the first-order and high-order information of each node gives our method better predictive ability.
3.5. Case Studies
To test the practical ability of our model, the drugs clozapine and risperidone were each used to predict potential targets. Clozapine can be used to treat many types of schizophrenia; it can directly inhibit the brainstem ascending reticular activating system and has a powerful sedative and hypnotic effect. Risperidone is a psychiatric drug used to treat schizophrenia. In particular, it improves the positive and negative symptoms and may also reduce the emotional symptoms associated with schizophrenia. In this case study, our method was trained on all known associations in the benchmark dataset, and we then sorted the predicted scores of the remaining candidate targets and selected the top 5 targets, as shown in Table 5. The experiment showed that 3 of the targets predicted by LGDTI for the drugs clozapine and risperidone could be confirmed in the SuperTarget database [46]. The remaining unconfirmed targets may be candidates worth exploring by medical researchers.
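The ranking step of the case study can be sketched as follows. The function name `top_k_candidates`, the target identifiers, and the scores are all hypothetical and serve only to illustrate the selection; they are not results from the paper.

```python
def top_k_candidates(scores, known_targets, k=5):
    """Rank targets not already known for the drug by predicted score."""
    candidates = [(t, s) for t, s in scores.items() if t not in known_targets]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:k]


# Hypothetical predicted scores for one drug (illustration only).
predicted_scores = {"T1": 0.91, "T2": 0.15, "T3": 0.78, "T4": 0.66,
                    "T5": 0.97, "T6": 0.43, "T7": 0.82, "T8": 0.59}
known = {"T5"}  # associations already present in the training data
top5 = top_k_candidates(predicted_scores, known, k=5)
```

Excluding the known associations before ranking matters here, because those pairs were part of the training data and would otherwise crowd out the candidates at the top of the list.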
4. Conclusions
Although accurate and efficient computational models could greatly accelerate the identification of DTIs, there is still a huge gap between academia and industry. In this study, we developed a novel method called LGDTI for predicting DTIs. Specifically, the nodes in LGDTI are represented by two kinds of features: first-order information learned by GCN and high-order information learned by DeepWalk from the graph. Molecular fingerprint technology was used to extract the attributes of drugs, and the k-mer method was used to extract the attributes of targets. Then, a random forest classifier was applied to carry out the relationship prediction task. The presented method obtained an AUC of 0.9455 and an AUPR of 0.9491 under 5-fold cross-validation, which is more competitive than several state-of-the-art methods. Moreover, our method can learn three kinds of information about each node: its attributes, local structure, and global structure. Specifically, LGDTI can integrate attribute information with structural information for learning. The experimental results show that LGDTI has strong predictive ability for DTIs. Nevertheless, due to the limitation of the benchmark dataset, the performance of LGDTI could not be demonstrated across multiple datasets. Moreover, LGDTI may be greatly improved if the two kinds of node information can be better integrated. Overall, we hope that the proposed model can be used to guide drug development and other biological wet-lab experiments.