1. Introduction
Post-translational modifications (PTMs), such as methylation, acetylation, glycosylation, ubiquitination, and phosphorylation, are chemical modifications that play a critical role in the functional diversity and complexity of proteomes. Occurring after protein biosynthesis, they regulate the localization, activity, and interactions of proteins with other cellular molecules in most biological processes [
1]. These modifications may occur at any time during the life cycle of a newly synthesized protein. Therefore, identifying and characterizing PTMs is a challenging task, yet one that is essential for a comprehensive understanding of cellular proteins and human diseases and that has extensive applications. As a prevalent and significant post-translational modification, lysine glutarylation has recently drawn a great deal of attention due to its involvement in diverse physiological and biological processes, including amino acid metabolism, fatty acid metabolism, and cellular respiration [
2]. It belongs to the lysine acyl modifications, a group that comprises malonylation, succinylation, and glutarylation. Lysine glutarylation is regulated by SIRT5, a major enzyme in cells. SIRT5 catalyzes lysine deglutarylation both in vitro and in vivo, and glutarylation of carbamoyl phosphate synthetase 1 (CPS1), which SIRT5 reverses, inhibits the activity of CPS1 [
3]. In addition, enzymes such as thiolase, 3-hydroxy-3-methylglutaryl-coenzyme A (HMG-CoA) synthase, HMG-CoA lyase, d(−)-β-hydroxybutyrate dehydrogenase (bOHB), 3-hydroxy-3-methylglutaryl-CoA synthase 2 (HMGCS2), CPS1, and manganese superoxide dismutase (MnSOD) can participate in a variety of important enzymatic reactions [
4]. Glutarylation also plays an essential role in maintaining human sperm motility [
5]. Previous studies have linked glutarylation sites to many human diseases, such as diabetes [
6], glutaric acidemia type I [
7], neuronal anaplerosis [
8], and heart disease [
9].
Prediction of PTM sites, including lysine modification sites, is common in bioinformatics, and many studies have reported promising performance [
10]. Glutarylation of lysine residues has been revealed as an important regulator of several metabolic and mitochondrial processes [
11]. However, little attention has been paid to improving glutarylation site prediction, which therefore remains a challenging task. GlutPred [
12] was the first computational predictor of glutarylation sites; its authors encoded mRNA codon triplets as features. Next, iGlu-Lys [
13] applied a support vector machine, a conventional machine learning method, to amino acid pair order and special-position information to improve on the predictive performance of GlutPred. In an effort to incorporate different features, AL-barakati et al. [
14] implemented RF-GlutarySite based on a random forest classifier to predict glutarylation sites with independent test accuracy reaching 72%. Finally, a recent predictor for this purpose has been released by Huang et al. [
15], which incorporates the intrinsic interdependence between positions in the substrate sites to improve performance. Their predictor reached an accuracy of 71% on a benchmark independent dataset. Despite these positive achievements in recent years, glutarylation prediction still needs improvement. For example, several tools proposed for this task perform poorly, showing a low correlation coefficient between true and predicted values in comparison with prediction tools for other PTM sites.
According to recent studies, biological sequences have proven useful for a broad range of bioinformatics research, such as family classification, protein visualization, structure prediction, recognition of disordered proteins, and protein interactions [
16], [
17]. One of the main challenges in protein sequence analysis is contextualizing the structural properties of the proteins of interest from amino acid sequence databases. An automated processing framework that works effectively and reduces processing time is essential for sequence data analysis. Advances in deep learning approaches applied to protein analysis have shown promising results and advantages in processing sequential data [
18]. There is growing evidence that deep learning approaches can be successfully applied to protein prediction and genomic analysis [
19]. Amino acids within a sequence are contextually related, so biological sequences, particularly protein sequences, are comparable to natural language in terms of composition. Therefore, natural language processing (NLP) techniques have been used to address biological sequence processing [
20]. Moreover, word embedding techniques widely used in NLP can be adopted to represent the contextual relationships among amino acids. Integrating embedding techniques into deep learning makes it possible to address feature representation and extraction for biological sequences. In bioinformatics, word embedding techniques have been used to analyze protein structural properties through representation learning on amino acid sequences [
21,
22]. Similarly, a CNN-BiLSTM model has been used to identify the potential contextual relationships in amino acid sequences, and it performed better than traditional models [
23]. A new way of representing protein sequences as continuous vectors was proposed as a new biological language model, which effectively captures the biophysical properties of protein sequences from unlabeled big data (UniRef50) [
24]. Based on the aforementioned studies, it is reasonable to suggest that a deep neural network approach based on embedding techniques has great potential for glutarylation site prediction.
In this study, we first conducted a thorough survey of the state-of-the-art computational prediction tools, in which the algorithms, feature selection techniques, performance evaluation methods, and so on were discussed in detail. In addition, we designed a novel Deep Neural Network framework based on word Embedding techniques (DNN-E) for protein glutarylation site prediction. The results show that our proposed framework generates better features for this problem, reducing the feature dimension while improving accuracy and confidence. The major contributions of this paper are as follows:
- (1) Develop a novel deep neural network framework for glutarylation prediction based on word embedding techniques;
- (2) Evaluate the effectiveness of conventional machine learning models and deep neural network models, the latter including long short-term memory (LSTM), stacked LSTM (S-LSTM), bidirectional LSTM (B-LSTM), convolutional neural network LSTM (CNN-LSTM), and convolutional neural network bidirectional LSTM (CNN-BLSTM), in glutarylation prediction;
- (3) Evaluate the prediction performance of different word embedding models, including the embedding layer model and pre-trained word embedding techniques such as global vectors for word representation (GloVe) and embeddings from language models (ELMo).
The rest of this paper is organized as follows. Section 2 presents the materials and methods: it explains how the glutarylation sites were extracted and gives an overview of the architecture of the DNN-E framework, including how glutarylation features are extracted and how machine learning and deep learning techniques are used to classify glutarylation sites. Section 3 provides the detailed experimental design and results, as well as the evaluation and comparison with previous studies. Section 4 expands the discussion and addresses the limitations. Finally, Section 5 draws conclusions and outlines further study.
3. Experiment and Results
This section describes the experimental settings and the selected parameters of the proposed DNN-E framework for glutarylation prediction. In addition, it presents the performance evaluation metrics used for validation and performance comparison.
3.1. Experimental Setup
All experiments were conducted using Python 3.7, the Keras library with a TensorFlow backend [
45], and the Adam optimizer. The experiments ran on a CentOS Linux machine with an Intel Core i7-7700 (3.60 GHz) CPU, 64 GB of memory, and a GeForce GTX 1080 Ti GPU with 11,176 MB of memory. A k-fold cross-validation approach [
46] was used in this study. The training set was first randomly partitioned into five equally sized portions, or folds. Subsequently, four of the five folds were used to train the model, while the remaining fold was used to validate its performance.
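As an illustration of the partitioning described above, the following minimal Python sketch splits sample indices into five folds. It is not the code used in this study: the function name `k_fold_indices` and the fixed random seed are assumptions made for illustration, and the sample count of 1290 is the training-set size reported in Section 3.3.1.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random partition, reproducible via seed
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(1290, k=5))
for fold_number, (train_idx, val_idx) in enumerate(folds, start=1):
    print(fold_number, len(train_idx), len(val_idx))
```

Each sample index appears in exactly one validation fold, so every site is used for validation once and for training in the other four iterations.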
3.1.1. Embedding Layer Parameters
The embedding layer was constructed as the first hidden layer of the DNNs, so the embedding was learned jointly with each deep learning model. Three arguments are required to construct the embedding layer. First, the vocabulary size, also known as the input dimension, is the total number of unique amino acids in the dataset. There are 20 standard amino acids and 1 rare amino acid, which combine in different ways to make proteins. Therefore, a vocabulary size of 21 would suffice. However, in order to avoid collisions, a vocabulary size of 30 was chosen. Second, the size of the vector space is the output feature dimension into which amino acids are embedded. The output vectors could have a size of 8, 16, or higher. In this case, an output dimension of 16 was found to give the best results. Finally, the input length refers to the size of the input glutarylation site. In this work, all the selected glutarylation sites have a fixed size of 21; therefore, an input length of 21 was chosen. These three required input parameters are summarized in
Table 3.
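To make the three parameters concrete, the sketch below encodes a 21-residue window as integer indices and looks up one 16-dimensional vector per residue from a 30-row table. It is a plain-Python stand-in for an embedding layer, not the Keras layer used in this study. The symbol `X` for the rare residue, the reserved index 0, the random table values, and the example window are all assumptions made for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
RARE = "X"        # assumed placeholder symbol for the rare residue
VOCAB_SIZE = 30   # input dimension reported in the text
EMBED_DIM = 16    # output dimension reported in the text
INPUT_LEN = 21    # window length reported in the text

# Index 0 is left unused here (an assumption, e.g. for padding).
TOKEN_TO_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS + RARE)}


def encode(window):
    """Map a 21-residue window to a list of integer indices."""
    if len(window) != INPUT_LEN:
        raise ValueError(f"expected {INPUT_LEN} residues, got {len(window)}")
    rare_index = TOKEN_TO_INDEX[RARE]
    # Unknown symbols fall back to the rare-residue index.
    return [TOKEN_TO_INDEX.get(aa, rare_index) for aa in window]


# Random stand-in for the weight matrix; in a trained network these
# values would be learned together with the rest of the model.
_rng = random.Random(0)
EMBEDDING_TABLE = [
    [_rng.uniform(-0.05, 0.05) for _ in range(EMBED_DIM)]
    for _ in range(VOCAB_SIZE)
]


def embed(window):
    """Look up one EMBED_DIM-sized vector per residue."""
    return [EMBEDDING_TABLE[index] for index in encode(window)]


example = "MKTAYIAKQRKISFVKSHFSR"  # hypothetical window with lysine (K) at the center
encoded = encode(example)
vectors = embed(example)
print(len(vectors), len(vectors[0]))
```

Because the largest index in use is 21, a table with 30 rows leaves spare rows, which is consistent with choosing a vocabulary size larger than the number of distinct symbols.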
3.1.2. GloVe, ELMo Embedding Parameters
The pre-trained word embeddings are provided by GloVe [
47]. They were trained on one billion tokens with a 400-thousand-word vocabulary. Several embedding vector sizes are available, including 50, 100, 200, and 300 dimensions. We used the 100-dimensional GloVe vectors to train the model. The embedded output vectors were used in the embedding layer as the input data for the CNN-LSTM deep learning model.
Unlike traditional word embedding techniques such as Word2vec and GloVe, the ELMo vector assigned to a token or amino acid is a function of the entire sequence. Therefore, the same symbol has different embedded vectors in different contexts. The ELMo embedding module is available in TensorFlow Hub. Each sequence of amino acids was tokenized into a list of characters before being fed into the ELMo embedding model.
Table 4 summarizes the detailed selected parameters used for pre-trained embedding models.
3.1.3. Deep Neural Networks Hyperparameters
Most deep learning algorithms offer various hyperparameters that control multiple aspects of the algorithm, such as time consumption, computational resources, and accuracy. Hyperparameter tuning refers to the process of searching for the setting values that achieve the lowest generalization error and adjust the effective capacity of the model subject to computational resources. In order to configure the optimal hyperparameters for each training model, we thoroughly evaluated the performance of those models under different settings. For example, the number of neurons ranged from 50 to 200, the number of epochs from 25 to 100, and the batch size from 10 to 30. Dropout was also used in each model to reduce overfitting. The final optimal hyperparameters are shown in
Table 5.
Number of neurons characterizes the dimensions of the hidden states (outputs);
Number of epochs specifies the number of times that the learning algorithm works through the entire training dataset;
Batch size defines the number of samples to work through before updating the internal model parameters. The batch size must be at least one and at most the number of samples in the training dataset;
Dropout is a regularization technique in which each network neuron or node is excluded from activation and weight updates with a given probability during training. The dropout rate sets this probability; dropout reduces overfitting and improves model performance;
Kernel size is a hyperparameter used in a convolutional neural network. It determines the size of the sliding window that convolves across the data. The number of filters, or feature detectors, defines how many sliding windows run through the data.
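Two of the definitions above reduce to simple arithmetic, sketched below in plain Python. The helper names `updates_per_epoch` and `conv1d_valid` are hypothetical, and the kernel values are arbitrary. The sample count of 1032 assumes four of the five folds of the 1290-site training set, and the batch size of 30 is the upper end of the range searched in this study. The convolution assumes a stride of one and no padding, which the text does not specify.

```python
import math


def updates_per_epoch(n_samples, batch_size):
    """Number of parameter updates needed to pass once over the training data."""
    if not 1 <= batch_size <= n_samples:
        raise ValueError("batch size must be between 1 and the number of samples")
    return math.ceil(n_samples / batch_size)


def conv1d_valid(sequence, kernel):
    """Slide one kernel across a 1-D sequence (stride 1, no padding)."""
    k = len(kernel)
    return [
        sum(sequence[i + j] * kernel[j] for j in range(k))
        for i in range(len(sequence) - k + 1)
    ]


# Four of five folds of 1290 samples, batch size 30.
print(updates_per_epoch(1032, 30))

# A kernel of size 3 over a length-21 input visits 21 - 3 + 1 positions.
feature_map = conv1d_valid(list(range(21)), [1, 1, 1])
print(len(feature_map))
```

With several filters, the same sliding procedure is repeated once per filter, each with its own kernel values, yielding one feature map per filter.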
3.2. Performance Evaluation
Different statistical scores as defined in [
48,
49] have been used to assess each classifier’s performance. In a supervised binary classification problem, each query point in the test set has a true class label.
During evaluation, the classifier assigns each query point to one of four categories: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). Each query point is treated as positive or negative with respect to a specific class, so TP, TN, FP, and FN are determined for each class. To assess the output of the classifier, the following statistical measures are used for each class:
Sensitivity (SN), specificity (SP), accuracy (ACC), and the Matthews correlation coefficient (MCC).
Used together, sensitivity and specificity are appropriate for evaluating classification models on most datasets because between them they consider all entries in the confusion matrix. Sensitivity deals with True Positives and False Negatives, while specificity deals with False Positives and True Negatives. In other words, the combination of sensitivity and specificity is a comprehensive measure when both true positives and true negatives should be considered. The Matthews correlation coefficient (MCC), in contrast, is a more reliable single statistic: it produces a high score only if the prediction obtains good results in all four confusion matrix categories (True Positives, False Negatives, True Negatives, and False Positives), proportionally to both the number of positive elements and the number of negative elements in the dataset. The Matthews correlation coefficient lies in the range [−1,1], where values of −1 and 1 indicate the worst-possible and the best-possible classifier, respectively.
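The measures discussed above can be computed directly from the four confusion matrix counts. The sketch below uses the standard textbook definitions, which we assume match those of the cited references. The function name `classification_scores` and the example counts are hypothetical and are not results from this study.

```python
import math


def classification_scores(tp, tn, fp, fn):
    """Return SN, SP, ACC, and MCC from confusion matrix counts."""
    total = tp + tn + fp + fn
    sn = tp / (tp + fn) if (tp + fn) else 0.0   # sensitivity: recall on positives
    sp = tn / (tn + fp) if (tn + fp) else 0.0   # specificity: recall on negatives
    acc = (tp + tn) / total if total else 0.0   # accuracy over all samples
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"SN": sn, "SP": sp, "ACC": acc, "MCC": mcc}


scores = classification_scores(tp=40, tn=45, fp=5, fn=10)  # hypothetical counts
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

A classifier that labels every sample as negative has a zero denominator and therefore an MCC of 0 under this convention, even though its accuracy can look high on an imbalanced dataset.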
The receiver operating characteristic (ROC) curve shows how well a classifier separates the classes and is commonly used as the main performance metric in binary classification evaluation. The curve provides a convenient diagnostic tool for investigating one classifier at different threshold values and their effect on the True Positive Rate and False Positive Rate. One might choose a threshold in order to bias the predictive behavior of a classification model. The area under the ROC curve (ROC-AUC) represents the degree or measure of separability. This single score can be used to compare binary classifier models directly. As such, it is perhaps the most commonly used score for comparing classification models on imbalanced problems. The score is a value between 0 and 1, with 1 indicating a perfect classifier.
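One way to see why the ROC-AUC measures separability is its equivalent pairwise form: the fraction of positive-negative pairs in which the positive sample receives the higher score, with ties counted as one half. The sketch below implements that form in plain Python. It is an illustration rather than the routine used in this study, and the function name `roc_auc`, the labels, and the scores are hypothetical.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of correctly ordered positive-negative pairs."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    ordered = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                ordered += 1.0   # positive ranked above negative
            elif p == n:
                ordered += 0.5   # ties count as half
    return ordered / (len(positives) * len(negatives))


auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```

Under this form, a score of 0.5 corresponds to ranking positive and negative samples no better than chance.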
3.3. Experimental Results
3.3.1. Model Selection
Our first objective was to compare the performance of conventional machine learning models and the LSTM model based on the embedding layer for model selection. The k-fold cross-validation process is illustrated in
Figure 8, in which the training dataset of 1290 glutarylation sites is split into
k equally sized subsets called folds. One of the k folds acts as the validation set, known as the holdout set, and the remaining folds are used to train the model. This process repeats until each of the folds has acted as the holdout fold. After each evaluation, a score is retained, and once all iterations have been completed, the average score represents the overall performance of the training model. The independent test set includes 138 glutarylation sites separated from the training set and is used only for testing, which ensures that no testing data have been used in training. The results of cross-validation performance on the training set are given in
Figure 8. As can be seen, the LSTM model outperformed the conventional machine learning models by obtaining the highest ACC and MCC. There is a noticeable gap in performance between the LSTM and the conventional machine learning models. The accuracy obtained by the LSTM model was approximately 0.73 on average, while the conventional machine learning models all obtained roughly the same accuracy, around 0.67 on average. Furthermore, the MCC obtained by the LSTM model was significantly higher, at 0.39, whereas some conventional machine learning models, such as the Linear Regression (LR), Naive Bayes (NB), and Random Forest (RF) classifiers, even produced negative values. The main reason for the unsatisfactory performance of the conventional machine learning models is that they were only able to predict negative instances with higher specificity. In contrast, the negative prediction rate, or specificity, obtained by the LSTM model was significantly higher than that of the machine learning models. This indicates that the results predicted by the LSTM model based on the embedding layer were more accurate and reliable than those of the conventional machine learning models. Therefore, the results provide compelling evidence that the LSTM model is able to capture the dependency relationships between amino acids in the sequence while making predictions. An LSTM shares its parameters (weights) across time steps; therefore, each prediction step can make use of the results from previous steps. The detailed results for 5-fold cross-validation are summarized in
Table 6.
The receiver operating characteristic (ROC) plot was used to evaluate classification accuracy, while the area under the ROC curve (AUC) represents the capability of distinguishing between classes. Moreover, we also evaluated different training:testing splitting ratios to select the most efficient one. The 5-fold cross-validation procedure and splitting ratio evaluation results are shown in
Figure 9. In 5-fold validation, the LSTM model obtained an AUC score of 0.69 on average. It indicates that our model achieves a reliable performance on unseen data. As shown in
Figure 10, the training:testing splitting ratio of 90:10 obtained the highest AUC score. Comparing 5-fold with 10-fold cross-validation, we found that the 5-fold validation obtained a higher AUC score on average than the 10-fold validation. The variation in training and validation accuracy and loss is shown in
Figure 11. As the number of epochs increases, the risk of overfitting grows.
3.3.2. DNNs Variance Architecture Evaluation
Next, we evaluated the performance of the proposed DNN-E framework by replacing the LSTM model with hybrid deep neural networks to reveal how effective different models are at processing amino acid sequences. The performance comparison between multiple DNN models based on the embedding layer on the independent test dataset is shown in
Figure 12. As can be seen, the S-LSTM model obtained the highest accuracy and correlation coefficient scores (0.79:0.51). The LSTM model obtained the same accuracy score but a lower
MCC compared to the S-LSTM model. Although the correlation coefficient score obtained by the B-LSTM model was roughly equivalent to those of the hybrid CNN-LSTM and CNN-BLSTM models, its accuracy score was slightly higher. The LSTM and S-LSTM models performed better than the hybrid LSTM models. One interpretation is that the recurrent network learns the order of the input more effectively than the spatial mapping of a CNN. We also reasoned that adding recurrent network layers improves the performance of the LSTM architecture. The hybrid models, CNN-LSTM and CNN-BLSTM, were less effective in prediction than the S-LSTM and LSTM models and obtained the lower
ACC and
MCC (0.74:0.37). ROC-AUC analysis is given in
Figure 13 as a comparison between these classifiers. S-LSTM obtained the highest AUC score. The detailed confusion matrix for independent testing is summarized in
Table 7. The t-Distributed Stochastic Neighbor Embedding (t-SNE) [
50] was used to visualize the prediction results, as given in
Figure 14.
3.3.3. Word Embedding Techniques Evaluation
In order to evaluate the impact of word embedding techniques on prediction results, pre-trained word embeddings, including GloVe and ELMo, were used to replace the embedding layer in the DNN.
Figure 15 shows the comparison between the embedding layer and the pre-trained word embeddings. As can be seen, the S-LSTM model based on the embedding layer outperformed the GloVe and ELMo models in both ACC and MCC. Although pre-trained word embeddings have shown a significant impact on word representation, their performance on the independent test set was unsatisfactory compared to the S-LSTM model based on the embedding layer. One interpretation is that pre-trained word embedding models pick up semantic signals that matter in text processing. However, each amino acid has a unique function in the protein chain, and the sequential positions of the amino acids are the more important attributes to capture. In other words, amino acid sequences contain hidden patterns relevant to classification that a deep neural network with an embedding layer captures more effectively. Although GloVe obtained the same accuracy as ELMo, it gave a lower MCC score in this case. The detailed results are given in
Table 8.
3.3.4. Comparison with the Previous Research
For a comparison between our model and the previously published works on the same problem, we retrieved the performance results from the previous four works [
12,
13,
14,
15]. These works used only conventional machine learning techniques to predict lysine glutarylation sites, extracting different sets of features, e.g., amino acid pair order and substrate sites. Although the LSTM model exhibited a lower SN score than integrated SVM (i-SVM) [
15] in the 5-fold cross-validation phase, it obtained much higher scores for SP, ACC, and MCC, as listed in
Table 9. The same results were observed using the independent test set. As shown in
Table 10, our optimal model, S-LSTM, improved considerably on the other published works [
12,
15]. The S-LSTM model obtained an SN score that was 9% lower than that of the i-SVM model but 9% higher than that of GlutPred, while it obtained a much higher SP score than i-SVM. Notably, our proposed model achieved the best ACC and MCC on the independent test set. This suggests that our approach finds a novel set of signatures that might be more suitable for this classification task. Furthermore, the use of deep neural networks helps us generate a hidden set of features that makes our model more robust than the machine learning algorithms.
4. Discussion
Pre-trained word embedding vectors have larger dimensions and are highly effective in natural language processing tasks such as sentiment analysis. Capturing the semantics of a word is one of the significant functions of a pre-trained embedding model. In other words, the same token will have a different meaning, or embedding value, depending on its position and context. However, this strategy has a limited effect on amino acid sequences, because each amino acid has a unique function in the chain, and the order of the amino acids is an important factor in identifying and classifying the function of the protein. Therefore, capturing semantics in this scenario has little impact on classification results. Testing on larger datasets would be needed to confirm this evaluation. Compared with the sentences processed by natural language models, proteins are longer, ranging from about 30 to 33,000 residues [
24]. Longer proteins require more GPU memory, and the underlying models can keep only a limited record of long-range dependencies. Proteins use 20 standard amino acids in most cases, plus 5 additional characters for unusual, undefined, or unknown cases, compared to a natural language vocabulary of up to two million terms. A smaller vocabulary could be problematic if protein sequences encode a complexity similar to that of sentences.
This study addressed several deep learning models for protein function prediction (e.g., glutarylation sites) based on a variety of biological data forms and analyzed the evolution of machine learning approaches used to predict protein function from training data. Although computational models are increasingly used to extract significant features and build well-performing predictors, techniques using deep learning strategies were still capable of outperforming other methods. One of the challenges that deep learning faces is that it needs a large amount of data, which possibly limits its effectiveness, at least in certain research on predicting protein function. Several methods covered in this study achieved excellent results over a diverse range of functional groups. However, several other methods that did not produce similar outcomes deserve discussion for a variety of reasons, including the following: (1) their outcomes may improve when additional data become available for training their models; (2) technological advancements may result in improved outcomes; and (3) combining these techniques with a more effective one may yield better results than using them separately.
5. Conclusions
In this study, we proposed a novel DNN-E framework for glutarylation prediction. We found that the deep neural network approach obtained higher accuracy and confidence than previous research. This is further compelling evidence that deep neural networks can work effectively with biological sequence data to handle complex problems in protein identification. The embedding layer added to the deep neural network works more effectively than pre-trained word embeddings such as GloVe and ELMo. This study, therefore, indicates that word embedding, in general, provides a mechanism to transfer the language of biology. In addition, this work provides the potential to detect new glutarylation sites and to reveal the links between glutarylation and other lysine modifications, including the well-known protein acetylation and methylation and the recently identified malonylation and succinylation. The small training dataset is considered a limitation on model performance. Extending this work to optimize the framework should be the subject of further research.
Nevertheless, scientists are now turning to large numbers of input features, especially those derived from biological sequences. To close the gap between known and unknown sequences, reliable data-driven models are essential; they would help reveal the effects of protein mutations on illnesses and support the development of novel proteins. Finally, we are convinced that an effective scientific procedure can be developed in which hypotheses are produced by applying the best function prediction method to the scientific data currently available. These hypotheses are then put to the test in the lab, resulting in confident predictions of a protein’s function. We anticipate that the results of this study will be helpful to computational and laboratory molecular biology professionals and will help them complete this mission more efficiently.