1. Introduction
The effective identification of compound-protein interactions (CPIs) plays an important role in drug design and phage biology [1]. The discovery of unknown CPIs, as in drug repositioning or drug screening [2,3], helps reveal new uses and potential side effects of drugs. This not only provides valuable insight into drug action and off-target adverse events, but also greatly reduces the time and labor required by traditional clinical trial methods [4]. Compounds can be represented by a Simplified Molecular Input Line Entry System (SMILES) string [5] or by a 2D molecular graph with atoms as nodes and chemical bonds as edges; proteins are represented by sequences of amino acids. A CPI indicates that a compound has a positive or negative effect on the function performed by a protein, thus affecting the development of disease [6].
Many methods have been proposed to predict potential CPIs. Traditional structure-based and ligand-based virtual screening methods, although very successful, are not applicable when the 3D structure of the protein is unknown or when too few ligands are known. For this reason, Bredel and Jacoby introduced a new idea called chemical genomics, which predicts compound-protein interactions without considering the 3D structure of the protein [7]. From the perspective of chemical genomics, researchers then developed prediction methods based on machine learning that consider the chemical space, the genomic space and their interactions within a unified framework. The chemical space refers to the set of all possible molecules, and the genomic space refers to the collective characterization, quantification and comparison of all genes of organisms. For example, Jacob and Vert [8] applied a support vector machine with two kernels, one over chemical substructures and one over protein families, combined through their tensor product. Yamanishi et al. [9] used a bipartite graph learning method to map compounds and proteins into a common feature space. Bleakley and Yamanishi [10] proposed the bipartite local model (BLM), which uses similarity measures between chemical structures and protein sequences.
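To make the tensor-product idea concrete, the following minimal sketch builds a kernel on (compound, protein) pairs as the product of a compound kernel and a protein kernel. It is an illustration only, not the implementation of [8]: the Tanimoto kernel on sets of substructure labels, the same-family indicator kernel, and all function names are assumptions chosen for brevity.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of substructure labels (assumed compound kernel)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def family_kernel(p, q):
    """Toy protein kernel (assumption): 1.0 if both proteins share a family label, else 0.0."""
    return 1.0 if p == q else 0.0


def pair_kernel(pair1, pair2):
    """Tensor-product kernel on (compound, protein) pairs: product of the two kernels."""
    (c1, p1), (c2, p2) = pair1, pair2
    return tanimoto(c1, c2) * family_kernel(p1, p2)


def gram_matrix(pairs):
    """Gram matrix over a list of pairs, e.g. for a kernel SVM with a precomputed kernel."""
    return [[pair_kernel(x, y) for y in pairs] for x in pairs]


# Hypothetical toy data: substructure label sets paired with protein family names.
pairs = [
    ({"C=O", "c1ccccc1"}, "kinase"),
    ({"c1ccccc1", "N"}, "kinase"),
    ({"c1ccccc1", "N"}, "GPCR"),
]
K = gram_matrix(pairs)
```

Because the pair kernel is a product, two pairs are similar only when both the compounds and the proteins are similar, which is how such a kernel couples the chemical and genomic spaces.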
Most traditional prediction methods use only simple characterizations of labeled data (such as known protein structure information and available CPIs) to assess the similarity between compounds and proteins and to infer unknown CPIs. For example, similarity kernel functions [11] and the graph-based SIMCOMP method [12] are used to compare drugs and compounds and thereby describe the drug-protein interaction profile. The normalized Smith-Waterman score [9] is used to assess the similarity between targets (proteins).
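As an illustration of the normalized Smith-Waterman score, the sketch below computes a local alignment score and divides it by the geometric mean of the two self-alignment scores, so that identical sequences score 1. The simple match/mismatch scoring and the linear gap penalty are assumptions made to keep the example short; practical pipelines typically use an amino acid substitution matrix and affine gaps.

```python
import math


def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of strings a and b (linear gap penalty, assumed scoring)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def normalized_sw(a, b):
    """SW(a, b) / sqrt(SW(a, a) * SW(b, b)); returns 0.0 if either sequence is empty."""
    denom = math.sqrt(smith_waterman(a, a) * smith_waterman(b, b))
    return smith_waterman(a, b) / denom if denom else 0.0


# Hypothetical short peptide fragments used only for demonstration.
score_same = normalized_sw("MKTAYIAK", "MKTAYIAK")
score_diff = normalized_sw("MKTAYIAK", "MKTAYLAR")
```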
In machine learning, representation learning (RL) and deep learning (DL) are two popular approaches for extracting features effectively and addressing scalability problems in large-scale data analysis. RL aims to automatically learn data representations (features) from raw data collected from reference and open platforms, so that downstream machine learning models can use them more effectively and improve learning performance [13,14]. DL is a data-driven technique that has proven to be among the best-performing approaches for predicting drug-target binding affinity. DeepDTA [15] uses convolutional neural networks (CNNs) to extract low-dimensional real-valued features of compounds and uses a vector of eight elements to represent the features of proteins. Three convolution layers are used for feature extraction of compounds and proteins, and the two feature vectors are finally concatenated and passed through the fully connected layer to compute the final output. WideDTA [16] follows a similar line of thought and also takes advantage of two additional features, ligand maximum common substructure (LMCS) and protein domains and motifs (PDM), to improve model performance. The LMCS is obtained from the pairwise comparison of 2k molecules [17]. PDM refers to the motifs and profiles of each protein obtained from the PROSITE database. Multiple sequence alignment of protein sequences reveals that specific regions within a protein sequence are more conserved than others, and these regions are usually important for folding, binding, catalytic activity or thermodynamics. These subsequences are called either motifs or profiles. A motif is a short sequence of amino acids (usually 10–30 aa), while profiles provide a quantitative measure of sequences based on the amino acids they contain. GraphDTA [18] uses graph neural networks [19], such as the graph convolutional network (GCN) [20], instead of a CNN to learn compound representations. In addition, DeepAffinity extracts the feature vectors of compounds and proteins using recurrent neural networks (RNNs), with proteins encoded as structural property sequences (SPS) [21]. The main advantage of deep learning is that, through the nonlinear transformation in each layer [22], the models can better represent the original data and thus facilitate the learning of hidden patterns in the data. DL is now being applied in many other fields, including bioinformatics, for example genomics [23] and quantitative structure-activity relationships in drug discovery [24].
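To illustrate how a graph convolutional layer operates on a molecular graph, the sketch below applies the widely used propagation rule H' = ReLU(D^(-1/2) (A + I) D^(-1/2) H W) to a three-atom chain. It is a minimal pure-Python example and not the code of any of the cited models; the one-hot atom features, the identity weight matrix and the mean readout are assumptions made for demonstration.

```python
import math


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def gcn_layer(adj, feats, weight):
    """One GCN layer: symmetric-normalized neighborhood averaging, linear map, ReLU."""
    n = len(adj)
    # Add self-loops so each atom keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    out = matmul(matmul(norm, feats), weight)
    return [[max(0.0, v) for v in row] for row in out]


def mean_readout(node_feats):
    """Average node vectors into one graph-level vector (assumed readout)."""
    n = len(node_feats)
    return [sum(row[j] for row in node_feats) / n for j in range(len(node_feats[0]))]


# Heavy-atom graph of ethanol (C-C-O); features are one-hot over [C, O].
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity weights, for readability only

hidden = gcn_layer(adj, feats, weight)
graph_vec = mean_readout(hidden)
```

After one layer, the terminal carbon already carries information about its neighbor, and the oxygen row mixes carbon and oxygen features; stacking layers widens this receptive field.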
In this paper, we develop a new deep learning framework that combines the local chemical context of the sequence with the topological structure of the molecule to predict compound-protein interactions. Specifically, proteins are represented by structural property sequences (SPS), which have lower dimensionality and carry more information than protein Pfam domains, and compounds are represented by both a SMILES string and a molecular graph. We then propose a deep learning model, SSGraphCPI, that combines recurrent neural networks and graph convolutional neural networks and uses both unlabeled and labeled data to predict CPIs. Unlabeled data refer to compound and protein representations alone and are used to pre-train RNN/RNN; labeled data refer to compound-protein interactions and are used during unified training (pre-training and unified training are described in Section 2.2.1). The inputs of RNN/RNN are the SPS sequence and the SMILES string, and the input of GraphCNN is the 2D molecular graph. During unified training, the SPS and SMILES feature representations are fed into a CNN to obtain protein and compound feature vectors, and the compound feature vector is then combined with the vector produced by GraphCNN to obtain the final compound vector. The final protein and compound vectors are fed into the fully connected layer to predict the CPI. The experimental results show that the proposed model achieves a lower root mean square error (RMSE) than previous models. In what follows, we refer to the pre-trained SPS/SMILES model as RNN/RNN, to SMILES combined with the 2D molecular graph as RNN/GCNN, and to SPS, SMILES and the 2D molecular graph together as RNN/RNN/GCNN.
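The data flow of the final prediction step can be sketched as follows. This is a schematic illustration under stated assumptions, not the SSGraphCPI implementation: the three input vectors stand in for the outputs of the sequence and graph encoders, "combined" is assumed to mean concatenation, and a single linear unit stands in for the fully connected layer. All names and numbers are hypothetical.

```python
def concat(*vectors):
    """Concatenate feature vectors end to end."""
    out = []
    for v in vectors:
        out.extend(v)
    return out


def linear(x, weights, bias):
    """Single linear output unit (stand-in for the fully connected layer)."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def predict_cpi(protein_vec, smiles_vec, graph_vec, weights, bias):
    """Fuse the three channels and map the joint vector to one predicted value."""
    # Final compound vector: SMILES channel combined with the graph channel
    # (concatenation is an assumption about how the two are combined).
    compound_vec = concat(smiles_vec, graph_vec)
    joint = concat(protein_vec, compound_vec)
    return linear(joint, weights, bias)


# Hypothetical encoder outputs and untrained weights, for shape checking only.
protein_vec = [1.0, 2.0]
smiles_vec = [3.0]
graph_vec = [4.0, 5.0]
weights = [0.1] * 5
prediction = predict_cpi(protein_vec, smiles_vec, graph_vec, weights, bias=0.5)
```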
3. Discussion
This model is the first three-channel model that takes the protein SPS sequence, the SMILES string and the 2D molecular graph of a compound as input. Together, the three channels cover physicochemical properties, sequence information and structural information, which makes the input very comprehensive. Moreover, an attention mechanism is added to each channel, which allows compound and protein features to be extracted more effectively.
The baseline models differ from our model in either their input or their deep learning framework, which makes it easier to compare the suitability of different inputs and frameworks. In this paper, the training and validation sets are obtained by random partitioning; cross-validation and hyperparameter optimization are left for future work. RMSE and R² are used as metrics to compare the models on the different datasets. The results show that the SSGraphCPI model achieves better results than the baselines on the same dataset, but its performance differs considerably between datasets, indicating that the sensitivity of the model to specific datasets needs further study.
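For reference, the evaluation protocol described above can be sketched in a few lines. The code assumes that R² denotes the coefficient of determination (some studies instead report the squared Pearson correlation), and the split fraction, the seed and the function names are illustrative choices rather than the settings used in this paper.

```python
import math
import random


def rmse(y_true, y_pred):
    """Root mean square error between measured and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (undefined for constant y_true)."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def random_split(items, valid_fraction=0.2, seed=0):
    """Shuffle with a fixed seed and split into (training, validation) lists."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_valid = int(round(len(items) * valid_fraction))
    return items[n_valid:], items[:n_valid]


# Toy values for demonstration only.
train_ids, valid_ids = random_split(range(10), valid_fraction=0.2, seed=1)
perfect = (rmse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]), r2_score([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
mean_only = (rmse([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]), r2_score([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))
```

A model that always predicts the mean of the measured values obtains R² = 0 under this definition, which gives a convenient baseline when comparing results across datasets.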
5. Conclusions
Accurately predicting CPIs is an important and challenging task in drug discovery. In this article, we present a new end-to-end deep learning framework, SSGraphCPI, for CPI prediction. The framework combines a GCNN model, which extracts molecular topological information, with a BiGRU model, which captures the local chemical context of the SMILES and SPS sequences. This approach extracts compound and protein information more effectively and comprehensively, which benefits CPI prediction. The results show that SSGraphCPI improves prediction accuracy and reduces the RMS error on most datasets.
Furthermore, we propose a second deep learning model, SSGraphCPI2, which adds protein amino acid sequence information to SSGraphCPI and also uses the BiGRU model for feature learning. The results show that the RMS error and the loss value are significantly reduced on most datasets, indicating that this model can also effectively improve the accuracy of CPI prediction.