1. Introduction
Identifying drug-target interactions (DTIs) plays a pivotal role in drug discovery. Drugs usually interact with one or more proteins to achieve their functions. Discovering novel interactions between drugs and target proteins is therefore crucial for the development of new drugs, since the aberrant expression of proteins may cause side effects of drugs [1]. In the past decades, researchers have identified drug-target interactions through clinical observation and biological experiments. However, these experimental methods remain time-consuming and expensive, and they also suffer from a high attrition rate [2]. Therefore, the use of computational techniques to predict drug-target interactions has become a hot research topic in molecular pharmacology in recent years. The US Food and Drug Administration (FDA) reports that the development of a new type of drug can cost at least billions of dollars. The approval rate of drugs for marketing has remained very low in recent years due to their uncertain side effects. There is an urgent need for new methods that can predict drug-target interactions on a large scale to reduce the development period and cost of drug discovery [3].
In recent years, more and more drug-target interactions have been discovered by clinical research and stored in public databases, offering data resources for computational methods to train reliable DTI prediction models. Several databases of drug-target relationships have been established and publicly released, such as SuperTarget and Matador [4], DrugBank [5], the Therapeutic Target Database (TTD) [6], and the Kyoto Encyclopedia of Genes and Genomes (KEGG) [7].
To date, traditional computational methods for predicting drug-target interactions include docking simulations [8], ligand-based methods [9], and literature text mining methods [10]. However, these three types of methods still have limitations. Docking simulation requires usable three-dimensional (3D) structural information of the target protein, which is only available for a small fraction of proteins, so it cannot be applied to predict DTIs on a large scale. Ligand-based methods typically do not perform well for target proteins with a limited number of known ligands. Text mining mainly relies on keyword search, so it has difficulty detecting novel interactions between drugs and target proteins. By contrast, protein meta-structure is a more recent approach in chemical and molecular biology. It effectively identifies possible chemical fragments, which can also be used for fragment-based drug design. This method is based solely on primary sequence information and does not require 3D protein structure information, so it allows a wider application for predicting drug-target interactions [11].
In recent years, researchers have developed different types of computational methods for inferring potential drug-target interactions. For example, Chen et al. [12] proposed a novel computational model combining a machine learning-based method and a network-based method. Wu et al. [13] proposed a tool called substructure-drug-target network-based inference (SDTNBI) to identify potential drug-target interactions. The model combines topological information of the DTI network and chemoinformatics features to make repositioning predictions on four benchmark datasets: kinases, GPCRs, ion channels, and nuclear receptors. Mei et al. [14] proposed an algorithm called BLM-NII, which combines the Neighbor-based Interaction-profile Inferring method and the Bipartite Local Model to identify drug-target interactions. Yamanishi et al. [15] developed a supervised method based on a bipartite graph framework to predict unknown drug-target interactions. Specifically, the method maps the genomic space and the chemical space into a unified space called the pharmacological space. Xia et al. [16] developed a regularized semi-supervised learning method (NetLapRLS), which predicts DTIs based on the genomic space, the chemical space, and the drug-protein interaction network space. Cheng et al. [17] developed three methods based on complex network theory to identify potential interactions between drugs and targets: network-based inference (NBI), target-based similarity inference (TBSI), and drug-based similarity inference (DBSI). Kuang et al. [18] proposed an efficient method based on eigenvalue transformation, combining a semi-supervised link prediction classifier (SLP) and a regularized least squares classifier (RLS) to predict drug-target interactions. More recently, Wang et al. [19] presented a computational approach to infer potential drug-target interactions. Specifically, the method converts the protein sequence into a position-specific scoring matrix (PSSM) that captures biological evolutionary information and encodes the drug molecule as a fingerprint feature vector. Features are then extracted from the PSSM using the auto-covariance (AC) algorithm.
In this work, we present a novel computational approach that uses only target protein sequences and drug substructure fingerprints to predict drug-target interactions on a large scale. It can be divided into three steps. First, we convert all the target protein sequences into PSSMs, which capture the biological evolutionary information between different types of amino acids. Meanwhile, molecular substructure fingerprints are used as the features of drugs. Second, an efficient feature extraction method based on local phase quantization (LPQ) is used to convert the PSSMs into vectors. Third, an ensemble classifier, rotation forest, is adopted to perform DTI predictions on four gold standard datasets: enzymes, ion channels, GPCRs, and nuclear receptors. We also compare the proposed method with several existing methods to evaluate its prediction performance. The experimental results indicate that the proposed method can effectively predict drug-target interactions.
3. Materials and Methods
3.1. Gold Standard Datasets
In this article, we use four gold standard datasets to evaluate the performance of the proposed method in predicting drug-target interactions. These datasets are collected from four databases: KEGG BRITE [7], SuperTarget [4], BRENDA [24], and DrugBank [5]. The data cover four drug-target families, namely enzymes, ion channels, GPCRs, and nuclear receptors. The numbers of known drugs targeting enzymes, ion channels, GPCRs, and nuclear receptors are 445, 210, 223, and 54, respectively, and the numbers of their corresponding target proteins are 664, 204, 95, and 26, respectively. The total number of DTIs in these datasets is 5127. Among them, the numbers of known DTIs for the enzyme, ion channel, GPCR, and nuclear receptor datasets are 2926, 1476, 635, and 90, respectively.
Table 7 shows the statistical information in different datasets.
In this work, we represent the network of drug-target interactions as a bipartite graph, in which the nodes are target proteins or drug molecules and the links are the interactions between them. The network is sparse, as the number of known DTIs is limited. Taking the enzyme dataset as an example, there are 295,480 (445 × 664) possible connections in the corresponding bipartite graph, but only 2926 edges exist as known drug-target interactions. In this case, the number of possible negative samples is 292,554 (295,480 − 2926), significantly larger than the number of positive samples (2926). To deal with this sample imbalance, we randomly selected negative samples from the unlabeled drug-protein pairs, in the same number as the positive samples. In general, negative sample sets obtained in this way may contain a small number of truly interacting drug-target pairs. However, given the size of the unlabeled pool, the number of real interaction pairs selected into the negative sets is expected to be quite small. As a result, the numbers of negative samples in the enzyme, ion channel, GPCR, and nuclear receptor datasets are 2926, 1476, 635, and 90, respectively.
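The balanced sampling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `sample_negatives`, the integer indexing of drugs and targets, and the fixed seed are all hypothetical.

```python
import random

def sample_negatives(n_drugs, n_targets, positives, seed=0):
    """Draw as many unlabeled (drug, target) pairs as there are known positives.

    Hypothetical helper: drugs and targets are assumed to be indexed 0..n-1.
    """
    positives = set(positives)
    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < len(positives):
        pair = (rng.randrange(n_drugs), rng.randrange(n_targets))
        if pair not in positives:      # keep only unlabeled pairs
            negatives.add(pair)        # the set removes duplicate draws
    return sorted(negatives)

# Toy bipartite graph: 5 drugs x 4 targets with 3 known interactions.
known = [(0, 1), (2, 3), (4, 0)]
negs = sample_negatives(5, 4, known)

# Arithmetic for the enzyme dataset, using the counts given in the text.
total_pairs = 445 * 664          # all drug-target pairs
unlabeled = total_pairs - 2926   # candidate negative pairs
```

Because sampling is uniform over the unlabeled pairs, the chance of drawing an undiscovered true interaction depends on how rare such pairs are in the unlabeled pool.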
3.2. Drug Substructure Feature
In previous studies, feature information for drugs includes topological, geometrical, constitutional, and quantum chemical properties. In this work, we use molecular fingerprints as the drug feature information to capture the substructures of drug compounds. Each bit in the binary fingerprint vector represents a specific substructure of a molecule [25]. Substructure fingerprints thus directly encode the structural information of a given drug compound as a series of binary bits, each indicating the presence of a specific substructure. A predefined dictionary lists the SMARTS substructure patterns. For each pattern, the corresponding bit in the fingerprint vector is set to 1 if the drug molecule contains that substructure and to 0 otherwise. In this study, we used the molecular substructure fingerprints collected from the PubChem system (available at https://pubchem.ncbi.nlm.nih.gov/). As a result, each drug molecule is represented by a binary vector of 881 dimensions.
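As a small illustration of this encoding, the sketch below builds an 881-bit vector from the indices of the dictionary patterns that a molecule matches. The actual SMARTS matching requires a cheminformatics toolkit and is not shown; the helper name `fingerprint_from_matches` and the example indices are hypothetical.

```python
N_BITS = 881  # fingerprint length stated in the text

def fingerprint_from_matches(matched_bits, n_bits=N_BITS):
    """Return a binary vector with 1 at every matched substructure index.

    Hypothetical helper: `matched_bits` is assumed to come from an external
    substructure matcher that reports which dictionary patterns are present.
    """
    vec = [0] * n_bits
    for b in matched_bits:
        if not 0 <= b < n_bits:
            raise ValueError(f"bit index {b} outside 0..{n_bits - 1}")
        vec[b] = 1
    return vec

# Toy molecule that matches three dictionary patterns (indices are made up).
fp = fingerprint_from_matches({0, 9, 880})
```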
3.3. Position-Specific Scoring Matrix
There are currently many effective ways to convert protein sequences into multidimensional feature vectors, for instance, by using the statistical distributions of amino acids [26,27,28] or the physico-chemical properties of amino acids [29,30,31]. These methods provide a powerful basis for predicting drug-target interactions. The position-specific scoring matrix (PSSM) has been widely used in previous research, including protein secondary structure prediction [32], protein binding site prediction [33], and prediction of disordered regions [34]. The PSSM also captures the evolutionary information of the 20 types of amino acids. Effective protein descriptors are crucial for the prediction of drug-target interactions. In this work, we adopted the PSSM to extract target protein features for DTI prediction. Each protein sequence was converted to a PSSM using the Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST) [35]. The PSSM of a protein sequence can be expressed as:
$$P = \left[ p_{i,j} \right]$$
where $i = 1, \ldots, L$; $j = 1, \ldots, 20$; and $L$ is the length of the amino acid sequence. Finally, a matrix of size $L \times 20$ can be constructed for each protein sequence. To obtain highly homologous sequences, we set the E-value parameter of PSI-BLAST to 0.001 and the number of iterations to 3, and left the other parameters at their default values. Details of PSI-BLAST can be accessed at https://blast.ncbi.nlm.nih.gov/Blast.cgi.
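For readers who want to reproduce the PSSM step locally, the sketch below assembles a command line for the NCBI BLAST+ `psiblast` program with the two settings named above. It is an assumption that the paper's "E-value" corresponds to the `-evalue` option (some pipelines set `-inclusion_ethresh` instead); the database name and file paths are placeholders, and the helper `psiblast_command` is hypothetical.

```python
def psiblast_command(fasta_path, pssm_path, db="swissprot",
                     evalue=0.001, iterations=3):
    """Build an argument list for BLAST+ `psiblast` that writes an ASCII PSSM.

    Assumptions: BLAST+ is installed, `db` names a locally formatted protein
    database (placeholder), and the E-value maps to `-evalue`.
    """
    return [
        "psiblast",
        "-query", fasta_path,
        "-db", db,
        "-evalue", str(evalue),
        "-num_iterations", str(iterations),
        "-out_ascii_pssm", pssm_path,
    ]

cmd = psiblast_command("target.fasta", "target.pssm")
# The list can be passed to subprocess.run(cmd, check=True) on a machine
# where BLAST+ and the chosen database are available.
```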
3.4. Local Phase Quantization
With the development of image processing techniques, many methods have emerged for extracting features from the data matrices of original images. For example, local phase quantization (LPQ), first proposed by Ojansivu et al. [36], is an effective texture descriptor. Specifically, LPQ exploits the blur invariance property of the Fourier phase spectrum of the image matrix, extracting local phase information based on the two-dimensional (2-D) short-term Fourier transform (STFT) [37]. For a given original image $f(\mathbf{x})$, its spatially invariant blurring in the observed image $g(\mathbf{x})$ can be represented by convolution:
$$g(\mathbf{x}) = (f * h)(\mathbf{x})$$
where $h(\mathbf{x})$ is the point spread function (PSF) for the blur, $*$ denotes a two-dimensional convolution, and $\mathbf{x}$ represents a vector of coordinates $[x, y]^T$ in the image. In the Fourier domain, this can be expressed as:
$$G(\mathbf{u}) = F(\mathbf{u}) \cdot H(\mathbf{u})$$
where $G(\mathbf{u})$, $F(\mathbf{u})$, and $H(\mathbf{u})$ are the discrete Fourier transforms (DFTs) of $g(\mathbf{x})$, $f(\mathbf{x})$, and $h(\mathbf{x})$, respectively. Here, $\mathbf{u}$ denotes a vector of coordinates $[u, v]^T$. Based on the characteristics of the Fourier transform, we can express the magnitude and phase as:
$$|G(\mathbf{u})| = |F(\mathbf{u})| \cdot |H(\mathbf{u})|, \qquad \angle G(\mathbf{u}) = \angle F(\mathbf{u}) + \angle H(\mathbf{u})$$
Suppose the blur $h(\mathbf{x})$ is centrally symmetric, meaning that $h(\mathbf{x}) = h(-\mathbf{x})$, in which case its Fourier transform is always real-valued, so its phase is a two-valued function. That is:
$$\angle H(\mathbf{u}) = \begin{cases} 0, & \text{if } H(\mathbf{u}) \geq 0 \\ \pi, & \text{if } H(\mathbf{u}) < 0 \end{cases}$$
For all $H(\mathbf{u}) \geq 0$, there is:
$$\angle G(\mathbf{u}) = \angle F(\mathbf{u})$$
In the LPQ method, the shape of a regular PSF $h(\mathbf{x})$ is usually similar to a sinc or Gaussian function. It should be noted that the low-frequency values of $H(\mathbf{u})$ are then positive, so the phase at low frequencies is unaffected by the blur. LPQ uses the two-dimensional DFT to extract local phase information efficiently. That is, the phase information is computed over the rectangular $M \times M$ neighborhood $N_{\mathbf{x}}$ at each pixel position $\mathbf{x}$ of a given image $f(\mathbf{x})$. These local spectra are calculated using an STFT, which can be defined as:
$$F(\mathbf{u}, \mathbf{x}) = \sum_{\mathbf{y} \in N_{\mathbf{x}}} f(\mathbf{x} - \mathbf{y})\, e^{-j 2\pi \mathbf{u}^T \mathbf{y}} = \mathbf{w}_{\mathbf{u}}^T \mathbf{f}_{\mathbf{x}}$$
where $\mathbf{w}_{\mathbf{u}}$ denotes the basis vector of the two-dimensional DFT at frequency $\mathbf{u}$ and $\mathbf{f}_{\mathbf{x}}$ represents another vector containing all the $M^2$ image sample pixels from the neighborhood $N_{\mathbf{x}}$. To improve the efficiency of the calculation, the separability of the basis functions allows the STFT at each pixel position to be computed with one-dimensional convolutions over the rows and columns. In the LPQ method, the local Fourier coefficients are calculated at four frequency points: $\mathbf{u}_1 = [a, 0]^T$, $\mathbf{u}_2 = [0, a]^T$, $\mathbf{u}_3 = [a, a]^T$, and $\mathbf{u}_4 = [a, -a]^T$. Here, $a$ is a sufficiently small frequency parameter. Thus, each pixel position can be represented by a vector:
$$F_{\mathbf{x}} = [F(\mathbf{u}_1, \mathbf{x}), F(\mathbf{u}_2, \mathbf{x}), F(\mathbf{u}_3, \mathbf{x}), F(\mathbf{u}_4, \mathbf{x})], \qquad G_{\mathbf{x}} = [\mathrm{Re}\{F_{\mathbf{x}}\}, \mathrm{Im}\{F_{\mathbf{x}}\}]^T$$
where Re and Im represent the real and imaginary parts of a complex number, respectively. Next, we can use a simple scalar quantizer to calculate the phase information, as follows:
$$q_j = \begin{cases} 1, & \text{if } g_j \geq 0 \\ 0, & \text{otherwise} \end{cases}$$
where $g_j$ refers to the $j$th component of the vector $G_{\mathbf{x}}$. After quantization, the eight quantized coefficients can be represented as an integer value between 0 and 255 by employing binary coding:
$$b = \sum_{j=1}^{8} q_j 2^{j-1}$$
As a result, we obtain the distribution of the integer values of all the pixels in the image $f(\mathbf{x})$, and these results are used as a 256-dimensional feature vector for further classification. In this work, we used the LPQ method to analyze the four target protein datasets, and finally converted the PSSM of each protein sequence into a 256-dimensional feature vector. The predicted results for a given drug and target protein based on the proposed method are displayed in Figure 6. We illustrate the prediction of the interactions between a drug, Sulfasalazine, and two target protein sequences. The length of the sequence named Arachidonate 12-lipoxygenase, 12S-type is 663, and the length of the sequence named Lipoprotein lipase is 475. According to the results, the drug Sulfasalazine is predicted to interact with the target protein Arachidonate 12-lipoxygenase, 12S-type with a probability score of 0.844, and not to interact with the target protein Lipoprotein lipase, with a probability score of 0.3200.
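The LPQ computation in this section can be sketched in NumPy as follows. This is a minimal version under stated assumptions rather than the authors' code: a uniform window of size `win`, the frequency parameter set to 1/`win`, no decorrelation step, border pixels without a full neighborhood dropped, and a histogram normalized to sum to 1. The function name `lpq_histogram` and the random toy matrix are hypothetical.

```python
import numpy as np

def lpq_histogram(matrix, win=3):
    """Return a 256-bin LPQ histogram for a 2-D array (basic LPQ sketch)."""
    f = np.asarray(matrix, dtype=float)
    r = (win - 1) // 2
    a = 1.0 / win                                   # assumed frequency parameter
    freqs = [(a, 0.0), (0.0, a), (a, a), (a, -a)]   # u1, u2, u3, u4
    out_r, out_c = f.shape[0] - 2 * r, f.shape[1] - 2 * r
    coeffs = []
    for u, v in freqs:
        F = np.zeros((out_r, out_c), dtype=complex)
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                # f(x - y) for the neighborhood offset y = (dy, dx)
                patch = f[r - dy:r - dy + out_r, r - dx:r - dx + out_c]
                F += patch * np.exp(-2j * np.pi * (u * dy + v * dx))
        coeffs.append(F)
    # Eight real-valued components per pixel: four real parts, four imaginary.
    G = [c.real for c in coeffs] + [c.imag for c in coeffs]
    codes = np.zeros((out_r, out_c), dtype=np.int64)
    for j, g in enumerate(G):
        codes += (g >= 0).astype(np.int64) << j     # q_j * 2^j, j = 0..7
    hist = np.bincount(codes.ravel(), minlength=256)
    return hist / hist.sum()

# Toy stand-in for an L x 20 PSSM (random integers, not real scores).
rng = np.random.default_rng(0)
toy_pssm = rng.integers(-5, 6, size=(30, 20))
features = lpq_histogram(toy_pssm)
```

The output length is 256 regardless of the sequence length, which is what allows PSSMs of different sizes to feed a single classifier.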
3.5. Rotation Forest
Rotation forest (RF) is a classification method that is widely used for supervised learning. RF was originally proposed by Rodriguez et al. [38], and it has outstanding prediction performance as an ensemble learning classifier. In the rotation forest algorithm, the feature set $F$ is randomly divided into $K$ subsets ($K$ is a parameter in RF), and the bootstrap sampling technique is then applied to 75% of the original training samples on each feature subset to obtain a sparse rotation matrix. Next, each classifier is built on the features projected by its rotation matrix, and the final class of a test sample is given by combining the prediction results of the multiple classifiers.
Let $X$ be the training sample set, which is an $N \times n$ matrix composed of the feature vectors $x$ of the training samples. Let the feature set be $F$ and the corresponding class labels be $Y$, denoted as $Y = [y_1, \ldots, y_N]^T$. Suppose that there are a total of $L$ decision trees in a rotation forest classifier, which are denoted as $D_1, \ldots, D_L$, respectively. Subsequently, for an individual classifier $D_i$, the training set is constructed as follows:
(I) The feature set $F$ is randomly divided into $K$ disjoint parts, and the feature number of each subset is $M = n/K$.
(II) Let $F_{i,j}$ denote the $j$th subset of features used to train classifier $D_i$. Afterwards, for each such subset, we use a bootstrap sampling technique to construct a new training set $X'_{i,j}$ from 75 percent of the original training dataset $X$.
(III) Apply principal component analysis to $X'_{i,j}$ using only the features in $F_{i,j}$. The coefficients of the principal components are stored in a matrix $C_{i,j}$; each is of size $M \times 1$, and they are denoted as $a_{i,j}^{(1)}, \ldots, a_{i,j}^{(M_j)}$.
(IV) The coefficients obtained in the matrices $C_{i,j}$ are arranged into a sparse rotation matrix $R_i$, as follows:
$$R_i = \begin{bmatrix} a_{i,1}^{(1)}, \ldots, a_{i,1}^{(M_1)} & 0 & \cdots & 0 \\ 0 & a_{i,2}^{(1)}, \ldots, a_{i,2}^{(M_2)} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & a_{i,K}^{(1)}, \ldots, a_{i,K}^{(M_K)} \end{bmatrix}$$
For a given test sample $x$ during the classification period, let $d_{i,j}(x R_i^a)$ be the probability obtained by the classifier $D_i$ that $x$ belongs to class $\omega_j$, where $R_i^a$ is $R_i$ with its columns rearranged to match the original feature order. Next, the confidence of each class is calculated by the average combination method:
$$\mu_j(x) = \frac{1}{L} \sum_{i=1}^{L} d_{i,j}(x R_i^a)$$
Finally, $x$ is assigned to the class with the largest confidence.
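Steps (I) to (IV) and the average combination rule can be sketched in NumPy as below. This is a simplified illustration and not the authors' implementation: training of the decision trees is omitted, PCA is computed through a singular value decomposition of the centered bootstrap sample, the class-subset selection of the original algorithm is skipped, and the names `rotation_matrix` and `average_confidence` are hypothetical.

```python
import numpy as np

def rotation_matrix(X, n_subsets, rng, sample_frac=0.75):
    """Build one rotation matrix, already arranged in original feature order."""
    n_samples, n_features = X.shape
    # (I) random split of the feature indices into disjoint subsets
    subsets = np.array_split(rng.permutation(n_features), n_subsets)
    R = np.zeros((n_features, n_features))
    for cols in subsets:
        # (II) bootstrap sample sized at 75% of the training set
        n_boot = int(sample_frac * n_samples)
        rows = rng.choice(n_samples, size=n_boot, replace=True)
        sub = X[np.ix_(rows, cols)]
        sub = sub - sub.mean(axis=0)
        # (III) PCA: rows of vt are the principal axes of this feature subset
        _, _, vt = np.linalg.svd(sub, full_matrices=True)
        # (IV) place the coefficients in the block belonging to these features
        R[np.ix_(cols, cols)] = vt.T
    return R

def average_confidence(probas):
    """Average per-tree class probabilities; return (mu, winning class index)."""
    mu = np.mean(np.asarray(probas, dtype=float), axis=0)
    return mu, int(np.argmax(mu))

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))           # toy training set: 40 samples, 6 features
R = rotation_matrix(X, n_subsets=3, rng=rng)
X_rotated = X @ R                      # input on which one tree would be trained

# Toy outputs d_{i,j} of L = 3 trees for one test sample and two classes.
mu, label = average_confidence([[0.6, 0.4], [0.9, 0.1], [0.3, 0.7]])
```

Each block of the matrix is an orthogonal PCA basis, so the full matrix is a rotation of the feature space: no information is discarded, and diversity among trees comes from the random feature split and the bootstrap sample.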
4. Conclusions
Considering drug substructure fingerprints, target protein sequences, and known drug-target interactions as important information for DTI prediction, we developed a novel computational method for predicting DTIs on a large scale. Specifically, the proposed method combines the position-specific scoring matrix (PSSM), local phase quantization (LPQ), and the rotation forest (RF) classifier to predict DTIs. Five-fold cross-validation was used to assess the predictive performance of the proposed method on the gold standard datasets. The average accuracy of our method reached 89.15%, 86.01%, 82.20%, and 71.67% on the enzyme, ion channel, GPCR, and nuclear receptor datasets, respectively. To better illustrate the predictive power of the proposed method, we also compared it with the support vector machine classifier as well as some previous methods on the gold standard datasets. The experimental results indicate that the proposed method can effectively predict drug-target interactions. We anticipate that the proposed model can serve as a useful tool for predicting DTIs on a large scale in future research in molecular pharmacology.