1. Introduction
Bioactive peptides are short chains of amino acids that can influence diverse biological activities due to their specific structures and functions [1]. Bioactive peptides come in various types, each with its own specific functions and roles in the body. For example, antimicrobial peptides (AMPs) act as part of the innate immune system, defending the body against pathogens [2]. Anticancer peptides (ACPs) serve several crucial functions in the context of cancer treatment, and their unique properties make them promising candidates for targeted therapeutic interventions [3]. Cell-penetrating peptides (CPPs) have the remarkable ability to interact with the negatively charged membranes of cells, making it easier for them to cross the cell's outer defenses [4]. Efficiently predicting and identifying these peptides is crucial for unraveling fundamental biological mechanisms and propelling advances in therapy.

Existing peptide prediction methods mostly focus on predicting specific types of peptides. For example, ACP-DL [5], StackACPred [6], ACP-check [7], and CACPP [8] are primarily used for identifying anticancer peptides. sAMPpred-GAT [9], DNNs [10], ENAMP [11], and AMP-EBiLSTM [12] were developed for predicting antimicrobial peptides. CPPsite 2.0 [13], CPPred-RF [14], and BChemRF-CPPred [15] are employed in cell-penetrating peptide prediction, and pLMFPPred [16] is employed in functional peptide prediction. In addition, several general-purpose tools have been designed specifically for extracting peptide sequence features, such as FusPB-ESM2 [17] and TP-LMMSG [18]. FusPB-ESM2 constructs a feature extraction model by combining two pre-trained protein models, ProtBERT and ESM2. TP-LMMSG constructs a peptide sequence predictor by assembling a graph deep neural network model. Although these methods were developed to predict different types of peptides or proteins, they share common features within their respective models.
Various methods have been proposed to extract peptide features. For instance, the Amino Acid Composition (AAC) method quantifies the relative frequencies of individual amino acids in a peptide sequence, revealing its primary composition [19]. In contrast, PseAAC incorporates local structural information, offering a deeper understanding beyond AAC [20]. Similarly, methods like Dipeptide Composition (DPC) and Tripeptide Composition (TPC) analyze the frequency of occurrence of short peptide segments, capturing local structural patterns [21,22]. The Binary Profile Features (BPF) methodology converts sequential data into a binary representation, where each segment of the sequence is depicted as a fixed-length binary vector [23]. These methods extract features from the perspective of the composition of peptide sequences and the frequency of amino acids.
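To make the composition-based descriptors concrete, the following Python sketch computes AAC and DPC for a peptide sequence. It is a minimal illustration, assuming the standard 20-letter amino acid alphabet and an arbitrary but fixed vector ordering; the exact encodings used by the cited tools may differ.

```python
from itertools import product

# Standard 20-letter amino acid alphabet; the ordering is an illustrative
# assumption that fixes the layout of the feature vectors.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(seq: str) -> list[float]:
    """Amino Acid Composition: relative frequency of each residue (20-dim)."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]

def dpc(seq: str) -> list[float]:
    """Dipeptide Composition: relative frequency of each residue pair (400-dim).

    Assumes the sequence contains only the 20 standard residues.
    """
    counts = {"".join(p): 0 for p in product(AMINO_ACIDS, repeat=2)}
    for i in range(len(seq) - 1):
        counts[seq[i:i + 2]] += 1
    total = max(len(seq) - 1, 1)
    return [c / total for c in counts.values()]

# Example: a hypothetical short peptide yields a 420-dim composition vector.
features = aac("GLFDIVKKVV") + dpc("GLFDIVKKVV")
```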
Approaches like DCGR integrate the physicochemical properties of amino acids to construct CGR curves, offering a unique perspective on peptide characteristics [24]. Another method utilizing the physicochemical properties of amino acids is the Composition of k-Spaced Amino Acid Group Pairs (CKSAAGP) [25]. Although numerous feature extraction algorithms exist, they mostly focus on local traits, neglecting peptides' global structural features. Consequently, it is crucial to develop an algorithm that can capture the entire sequence to enhance the accuracy of identification.
Machine learning models can learn patterns from data to make predictions, enabling them to accommodate diverse types and sources of bioactive peptide data. The machine learning models commonly used in peptide identification include Support Vector Machines (SVMs) [26], Naive Bayes (NB) [27], Random Forests (RFs) [28], and K-Nearest Neighbors (KNN) [29]. Although these models have demonstrated considerable effectiveness in peptide prediction tasks, they still possess limitations. The activity of bioactive peptides may involve intricate nonlinear relationships that traditional machine learning methods can struggle to capture, as they are typically based on linear models and rely on manually extracted features.
With the advancement of deep learning, researchers have begun to employ deep neural networks for peptide prediction. These include models such as Convolutional Neural Networks (CNNs) [30], Graph Convolutional Networks (GCNs) [31], Recurrent Neural Networks (RNNs) [30], Long Short-Term Memory networks (LSTMs) [32], and various network variants. Deep learning models can address the shortcomings of traditional machine learning in capturing nonlinear relationships. The activity of bioactive peptides is often influenced by long-range dependencies between different parts of the sequence. RNNs and LSTMs typically capture long-term dependencies through gating units but may overlook local patterns in peptide sequences. CNNs can effectively extract potential local relationships between amino acids but fail to capture long-term dependencies in sequences. Additionally, when sequences are too long, the issues of vanishing or exploding gradients persist [33].
Temporal Convolutional Networks (TCNs) combine the advantages of RNNs and CNNs while overcoming their drawbacks [34]. TCNs stack convolutional layers and build residual connections, enabling them to capture long-term dependencies in sequences without the gradient vanishing commonly seen in RNNs. Additionally, TCNs address the limitation of CNN models in processing only local information and exhibit good generalization to unseen sequence data. Despite the tremendous potential of TCNs for bioactive peptide identification, they have not yet been applied to peptide recognition.
In this paper, we introduce TF-BAPred, a universal bioactive peptide predictor using three-channel feature extraction (see Figure 1 for the workflow of TF-BAPred). The main contributions of this work include the following:
(i) We propose the Fixed-Scale Vector Graph (FVG) feature extraction strategy, which uses a fixed-scale vector graph to capture the global structural patterns of each peptide sequence. This approach aims to provide a more comprehensive understanding of the overall structural characteristics exhibited by peptide sequences.
(ii) We employ a TCN for automatic temporal feature extraction, facilitating the extraction of long-range dependency information among amino acids in peptide sequences while also capturing local patterns within the sequence. To the best of our knowledge, this is the first time a TCN has been used in peptide recognition.
(iii) We apply the TF-BAPred algorithm to different types of bioactive peptides, including antimicrobial peptides, anticancer peptides, and cell-penetrating peptides. The benchmarking tests demonstrate that TF-BAPred exhibits a more competitive performance across these types of peptides.
In the introduction of this paper, we first review the background and current state of research in the field, followed by a brief overview of our work. In Section 2, we provide a detailed description of the methodology and datasets, outlining the basic framework of the experimental design, the implementation steps, and specific details of certain methods. Section 3 presents the experimental results and data analysis, discussing the significance of the main findings and their potential applications. Finally, in Section 4, we summarize this paper and propose directions for future research as well as possible improvements.
2. Materials and Methodology
2.1. Overview of TF-BAPred
TF-BAPred integrates feature vectors extracted from three channels and feeds them through a series of linear transformations into a classification network composed of fully connected neural layers for predicting bioactive peptides. The first channel of TF-BAPred constructs a fixed-scale vector graph to capture the global structural patterns of peptide sequences. The completed vector graph is transformed into a feature matrix, which is then passed to a fully connected neural network for nonlinear transformation. In the second channel, the original sequences are encoded into a 730-dimensional feature vector by combining five different feature extraction methods: AAC, DPC, BPF, RSM, and CKSAAGP. The merged vector is then input into a fully connected neural network with ReLU as the activation function. The core structure of the third channel is a TCN. This channel encodes the input peptide sequences into uniformly sized discrete numerical vectors and then transforms them into fixed-length continuous low-dimensional vector representations through an embedding layer. These embedding vectors are fed into the TCN to extract sequential features with temporal information. The outputs from the TCN are passed through a dropout layer, which randomly ignores a fixed proportion of neurons to help alleviate overfitting. TF-BAPred integrates the features extracted by these three channels and passes the combined features through a series of transformations into a classification network with a sigmoid activation function. An overview of the TF-BAPred framework is illustrated in Figure 1.
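As a rough illustration of this three-channel fusion, the following Keras sketch wires placeholder versions of the three channels into a single sigmoid classifier. All layer sizes, input shapes, and the stand-in convolution for the TCN stack are hypothetical assumptions; the sketch shows only the overall wiring, not the exact architecture.

```python
from tensorflow.keras import layers, Model

# Hypothetical input shapes: a flattened FVG matrix, the 730-dim merged
# descriptor vector, and integer token indices for the embedding/TCN channel.
fvg_in = layers.Input(shape=(200,), name="fvg_features")
desc_in = layers.Input(shape=(730,), name="descriptor_features")
seq_in = layers.Input(shape=(100,), name="token_indices")

x1 = layers.Dense(128, activation="relu")(fvg_in)    # channel 1: FVG matrix
x2 = layers.Dense(128, activation="relu")(desc_in)   # channel 2: AAC/DPC/BPF/RSM/CKSAAGP

emb = layers.Embedding(input_dim=21, output_dim=64)(seq_in)  # 20 residues + padding
x3 = layers.Conv1D(64, 3, padding="causal", activation="relu")(emb)  # stand-in for the TCN
x3 = layers.GlobalAveragePooling1D()(x3)
x3 = layers.Dropout(0.2)(x3)                          # dropout after the TCN channel

merged = layers.Concatenate()([x1, x2, x3])           # fuse the three channels
hidden = layers.Dense(64, activation="relu")(merged)
output = layers.Dense(1, activation="sigmoid")(hidden)  # bioactive vs. not
model = Model([fvg_in, desc_in, seq_in], output)
```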
2.2. Fixed-Scale Vector Graph
Inspired by [35], we propose a novel approach that utilizes a fixed-scale vector graph to depict the global structural patterns of each peptide sequence. We name this method FVG. FVG can transform any peptide sequence into a fixed-size matrix as follows.

Suppose $\Sigma$ denotes the alphabet encompassing all amino acids, and its size is $m$. FVG defines a one-to-one mapping function as follows:

$$F: \Sigma \rightarrow \{1, 2, \ldots, m\} \quad (1)$$

As depicted in Equation (1), FVG converts each amino acid $x$ in the set $\Sigma$ into an integer $z$ ($1 \le z \le m$), ensuring that no two different amino acids are mapped to the same integer.
For each peptide sequence $S = s_1 s_2 \cdots s_n$, FVG constructs a directed graph $G = (V, E)$ to represent the topological structure of the sequence. Each amino acid $x$ ($x \in \Sigma$) corresponds to two vertices, $x_1$ and $x_2$, in the graph, so the set $V$ consists of $2m$ vertices. For each pair of adjacent characters $(s_i, s_{i+1})$ ($1 \le i \le n-1$, where $n$ represents the length of sequence $S$), there is a corresponding edge $(s_i, s_{i+1})$ in the set of edges $E$. FVG standardizes the format of the graph with the following rules:
(i) For the two vertices $x_1$ and $x_2$ representing the same amino acid $x$, $x_1$ is positioned to the left of $x_2$.
(ii) If vertices $x_1$ and $x_2$ represent amino acid $x$, vertices $y_1$ and $y_2$ represent amino acid $y$, and $F(x) < F(y)$, then vertices $x_1$ and $x_2$ are placed to the left of vertices $y_1$ and $y_2$.
(iii) For each edge $(x, y)$, the height of this edge is set according to the mapped values $F(x)$ and $F(y)$ of its endpoints.
(iv) For each edge $(x, y)$, the width of this edge is set to the number of times the tuple $(x, y)$ appears in sequence $S$.
To describe the graph $G$ mentioned above, FVG employs Algorithm 1 to transform $G$ into an $m \times 2m$-dimensional matrix, where $m$ denotes the size of set $\Sigma$. Algorithm 1 takes a peptide sequence as input, simulates the graph construction process described above, and represents the resulting graph as a matrix $M$. For each element $M_{ij}$ of the matrix $M$, its value represents the total width of the edges in graph $G$ passing through position $\langle i, j \rangle$.
Algorithm 1 Constructing Feature Matrix Based on FVG
1: Input: a peptide sequence S
2: Output: a feature matrix M
3: initialize M as an m × 2m zero matrix
4: for each adjacent pair (s_i, s_{i+1}) in S do
5:   u ← F(s_i); v ← F(s_{i+1})
6:   determine the row and the column span of the edge (s_i, s_{i+1}) from u and v
7:   for each position ⟨row, k⟩ covered by the edge do
8:     M[row][k] ← M[row][k] + 1
9:   end for
10: end for
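Because the printed listing leaves some of the index rules implicit, the following Python sketch gives one concrete realization of Algorithm 1 together with the min-max normalization described below. The row and column conventions for each edge (how rules (i)-(iv) place it on the grid) are assumptions for illustration, not the confirmed layout.

```python
import numpy as np

def fvg_matrix(seq: str, alphabet: str) -> np.ndarray:
    """Sketch of the FVG feature matrix (Algorithm 1 plus normalization).

    Assumptions: the two vertices of amino acid x occupy columns 2*F(x)-1
    and 2*F(x); the edge for the adjacent pair (a, b) lies at row
    min(F(a), F(b)) and spans the columns between the two residues'
    vertices; every occurrence of the pair adds one unit of width to each
    position the edge passes through.
    """
    F = {a: i + 1 for i, a in enumerate(alphabet)}   # one-to-one mapping (Eq. 1)
    m = len(alphabet)
    M = np.zeros((m, 2 * m))
    for a, b in zip(seq, seq[1:]):                   # adjacent residue pairs
        row = min(F[a], F[b]) - 1                    # assumed height rule (iii)
        lo, hi = sorted((2 * F[a], 2 * F[b] - 1))    # assumed horizontal span
        M[row, lo - 1:hi] += 1.0                     # accumulate edge width (iv)
    if M.max() > M.min():                            # min-max normalization
        M = (M - M.min()) / (M.max() - M.min())
    return M

# Example over a five-letter sub-alphabet, mirroring the scale of Figure 2.
print(fvg_matrix("ACBDECB", "ABCDE").shape)  # (5, 10)
```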
Variations in both the length and composition of the peptide sequence can lead to differences in the widths and lengths of the edges within their corresponding graphs. Consequently, this disparity may result in differences in the magnitudes of the element values within the associated matrices. Such disparities can impact the accuracy of subsequent peptide predictions. To address this issue, FVG normalizes the matrix $M$ to $M'$. FVG employs the following formula to calculate the value of each element $M'_{ij}$ ($1 \le i \le m$, $1 \le j \le 2m$) in matrix $M'$:

$$M'_{ij} = \frac{M_{ij} - \min(M)}{\max(M) - \min(M)} \quad (2)$$

where $\min(M)$ denotes the minimum element value within matrix $M$, while $\max(M)$ signifies the maximum element value within matrix $M$. The resulting matrix $M'$ obtained using the aforementioned strategy serves as the feature matrix for the FVG method.
Figure 2 presents an example of constructing a fixed-scale vector graph for a peptide sequence over a set of five amino acids and converting it into matrix $M$. As shown in Figure 2b, the fixed-scale vector graph comprises 10 vertices (two for each amino acid in the set) along with 7 edges. One of the edges has a width of 2 because the corresponding adjacent pair occurs twice in sequence $S$. Figure 2c represents the matrix $M$ corresponding to the fixed-scale vector graph, which can be understood as a grayscale representation of this graph. The elements in the matrix represent the sum of the widths of the edges in graph $G$ at the corresponding positions. For example, two edges pass through position $\langle 4, 6 \rangle$, with widths of 2 and 1, respectively. Consequently, the value of $M_{4,6}$ is 3.
2.3. Residue Sparse Matrix
Inspired by the k-mer sparse matrix [36], we propose a feature extraction method named Residue Sparse Matrix (RSM) that aims to capture both the position and composition of amino acids within each peptide sequence. RSM first constructs a Boolean matrix based on the peptide sequence and the one-to-one mapping function $F$ introduced in the previous section. Subsequently, RSM transforms this Boolean matrix into a fixed-size feature vector. Further details are outlined as follows.
For each peptide sequence $S = s_1 s_2 \cdots s_n$, RSM defines an $m \times n$-dimensional Boolean matrix $A$, where $m$ represents the number of amino acids in the alphabet $\Sigma$, and $n$ represents the length of $S$. For each element $A_{ij}$ in the matrix $A$, if $F(s_j) = i$, then $A_{ij} = 1$; otherwise, $A_{ij} = 0$.
The above definition of matrix $A$ indicates that its dimensions vary depending on the length of the sequence, which complicates subsequent feature analysis. To address this issue, RSM performs singular value decomposition on matrix $A$ to convert it into an $m$-dimensional vector $Z$. For the matrix $A$ with dimensions $m \times n$, its singular value decomposition can be represented as follows:

$$A = U S V \quad (3)$$

where $U$ is an $m \times m$ orthogonal matrix whose column vectors are the left singular vectors of $A$, $S$ is an $m \times n$ diagonal matrix whose diagonal elements are the singular values of $A$, typically arranged in descending order, and $V$ represents an $n \times n$ orthogonal matrix whose row vectors are the right singular vectors of $A$. Let $z_i$ represent the $i$-th element of vector $Z$; its computation is as follows:

$$z_i = \sum_{j=1}^{\min(m, n)} u_{ij} \, s_{jj} \quad (4)$$

where $u_{ij}$ denotes the element located at the $i$-th row and $j$-th column of matrix $U$, and $s_{jj}$ denotes the $j$-th diagonal element of matrix $S$. RSM regards $Z$ as the feature vector representing the arrangement and composition of amino acids within the peptide sequence.
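A compact numpy sketch of RSM under the formulation above is given below; the reduction $z_i = \sum_j u_{ij} s_{jj}$ is one plausible reading of the printed formula, so treat the folding step as illustrative rather than definitive.

```python
import numpy as np

def rsm_vector(seq: str, alphabet: str) -> np.ndarray:
    """Sketch of the Residue Sparse Matrix (RSM) feature vector.

    Builds the m x n Boolean occupancy matrix A (A[i, j] = 1 iff residue j
    maps to row i under F) and folds the left singular vectors and singular
    values into an m-dimensional vector; the folding rule is an assumption.
    """
    F = {a: i for i, a in enumerate(alphabet)}
    m, n = len(alphabet), len(seq)
    A = np.zeros((m, n))
    for j, residue in enumerate(seq):
        A[F[residue], j] = 1.0            # position and composition indicator
    U, s, Vt = np.linalg.svd(A)           # A = U S V, singular values in s
    return U[:, :len(s)] @ s              # z_i = sum_j u_ij * s_jj

# Example: any sequence over a 20-letter alphabet yields a 20-dim vector.
print(rsm_vector("GLFDIVKKVV", "ACDEFGHIKLMNPQRSTVWY").shape)  # (20,)
```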
2.4. Temporal Convolutional Network
Analogous to a time series, in which each observation is ordered chronologically, the position of each amino acid within a peptide sequence plays a critical role in its bioactivity. The TCN shines as a deep learning architecture that is meticulously designed for tackling sequence modeling tasks. Its utilization of dilated convolutions and residual connections enables the network to effectively capture long-range dependencies, facilitating the accurate extraction of features from bioactive peptides. Moreover, the TCN can handle different scales of patterns by stacking multiple convolutional layers, each with varying convolutional kernel widths. Therefore, we employ a TCN to extract features from the peptide sequence. The workflow of the TCN can be outlined as follows:
We configure the input accepted by the TCN as a three-dimensional tensor:

$$X \in \mathbb{R}^{b \times t \times i} \quad (5)$$

where $b$ denotes the number of samples in a batch, $t$ represents the length of the time series, and $i$ indicates the number of features at each time step.
The residual block serves as the fundamental building unit in the TCN, typically composed of a series of convolutional layers. Departing from the prior paradigm in which simple one-dimensional causal convolutional layers compose the basic building block, we adopt a scheme where each residual block consists of two convolutional blocks with identical dilation factors, joined by a residual connection. Within each convolutional block, we sequentially apply a one-dimensional convolutional layer, a normalization layer, a rectified linear unit (ReLU) activation layer, and a dropout layer to extract features from the data and perform feature transformations. Furthermore, to ensure smooth connectivity between residual blocks, the outputs of the two convolutional blocks are combined with the block input, and an additional convolution is included within the residual block to ensure compatibility of input and output data shapes.
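A minimal Keras sketch of one such residual block is shown below. The filter counts, the choice of layer normalization, and the dropout rate are illustrative assumptions; the 1×1 convolution on the shortcut plays the shape-matching role described above.

```python
from tensorflow.keras import layers

def tcn_residual_block(x, filters, kernel_size, dilation, dropout_rate=0.2):
    """One TCN residual block: two dilated causal convolutional blocks,
    each applying conv -> normalization -> ReLU -> dropout, plus a shortcut."""
    shortcut = x
    for _ in range(2):                     # two blocks, identical dilation factor
        x = layers.Conv1D(filters, kernel_size,
                          padding="causal", dilation_rate=dilation)(x)
        x = layers.LayerNormalization()(x)
        x = layers.ReLU()(x)
        x = layers.Dropout(dropout_rate)(x)
    if shortcut.shape[-1] != filters:      # match shapes before the residual add
        shortcut = layers.Conv1D(filters, 1)(shortcut)
    return layers.ReLU()(layers.Add()([x, shortcut]))
```

Stacking these blocks with increasing dilation factors (e.g., 1, 2, 4, 8) yields the flexible receptive field discussed next.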
After completing the basic residual block construction, we add the residual blocks to the network through residual connections to construct the residual network. This work is based on the Keras network framework. The construction of the residual network enables the TCN to have multiple stacked convolutional layers. Therefore, by adjusting the dilation factor and filter size, the TCN has a more flexible receptive field than traditional CNNs. The receptive field size $R$ of the residual network can be defined as follows:

$$R = 1 + 2(k - 1) \sum_{i=1}^{r} d_i \quad (6)$$

Specifically, $k$ denotes the size of the convolutional kernel within the residual blocks, $r$ denotes the number of stacked residual blocks, and $d_i$ represents the dilation rate of the $i$-th residual block.
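Assuming Equation (6), which is the standard receptive field formula for two dilated convolutions per residual block, the receptive field can be checked with a short helper:

```python
def receptive_field(k: int, dilations: list[int]) -> int:
    """R = 1 + 2*(k - 1) * sum(d_i): two dilated convs per residual block."""
    return 1 + 2 * (k - 1) * sum(dilations)

# Example: kernel size 3 with dilations doubling over four blocks.
print(receptive_field(3, [1, 2, 4, 8]))  # 61 positions
```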
3. Results
TF-BAPred aims to explore effective features for different types of peptides and apply them to peptide identification. It introduces a novel feature representation method, called FVG, to represent the global structural patterns of each peptide. It also proposes a feature extraction strategy called RSM that outputs a vector containing the types and positional information of amino acids. Additionally, it employs a TCN for automatic feature extraction. TF-BAPred is implemented using the Keras [37] framework with the TensorFlow [38] deep learning backend library. To benchmark TF-BAPred, we initially assessed the effectiveness of the proposed feature extraction methods (TCN and FVG). We subsequently evaluated the performance of TF-BAPred on six challenging datasets and compared it with the ACP predictors ACP-DL [5] and ACP-check [7], the AMP predictors DNNs [10] and Ma's method [39], as well as the CPP predictor CPPred-RF [14]. In addition, we tested the performance of these predictors under different ratios of training and testing datasets.
3.1. Dataset Information
In order to facilitate the comparison of TF-BAPred with state-of-the-art approaches, we collected six challenging datasets encompassing three types of bioactive peptides. These included two datasets related to anticancer peptides, ACP740 [5] and ACPmain [40]; two datasets focused on antimicrobial peptides, Veltri's dataset [10] and Ma's dataset [39]; and two datasets associated with cell-penetrating peptides, CPP924 [41] and CPPsite3 [14]. The details of these benchmark datasets are presented in Table 1. More detailed information about the data can be found in the Supplementary Materials.
3.2. Assessment of TCN and FVG
To validate the effectiveness of the TCN, we employed a single TCN channel for peptide prediction and then replaced the TCN with a CNN or LSTM to evaluate its performance. Additionally, we merged the feature vectors derived from FVG with the aforementioned three methods to gauge the effectiveness of FVG. The average accuracy (ACC), sensitivity (SEN), specificity (SPE), Matthews correlation coefficient (MCC), and F1 score achieved using the above methods across the six datasets are presented in Table 2, while the ROC curves of the aforementioned strategies on each dataset are shown in Figure 3.
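For reference, the reported metrics follow their standard binary-classification definitions; the sketch below (using scikit-learn, an assumption about tooling) shows how they can be computed from class predictions and probability scores.

```python
from sklearn.metrics import confusion_matrix, matthews_corrcoef, f1_score, roc_auc_score

def evaluate(y_true, y_pred, y_score):
    """ACC, SEN, SPE, MCC, F1, and AUC under their standard definitions."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "SEN": tp / (tp + fn),   # sensitivity: recall on positive samples
        "SPE": tn / (tn + fp),   # specificity: recall on negative samples
        "MCC": matthews_corrcoef(y_true, y_pred),
        "F1": f1_score(y_true, y_pred),
        "AUC": roc_auc_score(y_true, y_score),
    }
```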
As shown in Table 2, the TCN outperformed both the CNN and LSTM. Owing to the TCN's integration of a CNN's local receptive fields and an LSTM's long-term dependency modeling capability, it achieves a better understanding of the structural characteristics and patterns within bioactive peptide sequences. When combined with FVG, the performance of all of the above methods improved further; for example, the average F1 scores for the CNN, LSTM, and TCN improved by 5.5%, 7.1%, and 6.9%, respectively. Additionally, the combination of the TCN and FVG outperformed the other methods, showing increases of at least 5.6% in ACC, 7.6% in SEN, 2.1% in SPE, 6.5% in F1 score, and 17.4% in MCC compared to the alternative methods. The results depicted in Figure 3 align with those presented in Table 2, reaffirming that the combination of the TCN and FVG yielded the best performance across all of the datasets. For example, the AUC obtained via the combination of the TCN and FVG on the CPP924 dataset was 0.965, representing increases of 10.8%, 10.2%, 5.5%, 5.8%, and 6.9% compared to the CNN, LSTM, TCN, CNN + FVG, and LSTM + FVG, respectively.
3.3. Evaluation of Generalization
In order to evaluate the generalizability of TF-BAPred, we collected six datasets encompassing three types of bioactive peptides: ACP, AMP, and CPP. We compared TF-BAPred's performance with that of several state-of-the-art methods, most of which were specifically designed for predicting one of these peptide types. The compared methods included the ACP predictors ACP-DL and ACP-check; the AMP predictors DNN, Ma's method, and AMP-EBiLSTM; the CPP predictor CPPred-RF; and the functional peptide predictor pLMFPPred. For a fair comparison, we provided all of the models with consistently divided datasets, splitting each dataset into training, validation, and test sets at a ratio of 7:1:2. The accuracy (ACC), sensitivity (SEN), specificity (SPE), F1 score, and Matthews correlation coefficient (MCC) achieved using these methods are presented in Figure 4.
As shown in Figure 4, TF-BAPred achieves a more competitive performance than the other methods across these three types of bioactive peptides. For example, in predicting ACPs and AMPs, TF-BAPred outperformed the other predictors in terms of ACC, SEN, SPE, F1 score, and MCC, especially MCC. On the ACP740 and ACPmain datasets, TF-BAPred achieved an average MCC 26.3% higher than that of the other predictors. On Veltri's and Ma's datasets, TF-BAPred achieved an average MCC 23.4% higher than that of the other predictors. In predicting CPPs on CPP924, TF-BAPred obtained an ACC, SPE, F1 score, and MCC of 0.934, 0.945, 0.933, and 0.870, respectively, exceeding the other predictors by 5.4–35.6% in ACC, 18.6–59.9% in SPE, 5.7–30.1% in F1 score, and 12.4–125.4% in MCC. These results demonstrate that TF-BAPred generalizes across different types of peptides. Owing to its ability to extract long-term temporal features from peptide sequences and to capture the local correlations of amino acids through multiple feature representations, TF-BAPred achieves a deep understanding of bioactive peptide sequences.
Predictors such as ACP-check and CPPred-RF demonstrate favorable performances across the six datasets, showcasing their consistent capability in predicting different types of peptides. Similarly, pLMFPPred and AMP-EBiLSTM have also shown considerable stability across different datasets. DNN and Ma's method achieved superior results on the AMP datasets compared to the ACP and CPP datasets. On the one hand, this shows that they are highly suitable for predicting specific types of peptide sequences. On the other hand, these two methods prioritize the construction of long-term memory-based temporal features over extracting local amino acid information. Hence, integrating features from both aspects could significantly enhance their general predictive capability for bioactive peptides. Although lacking general predictive capability across different types of peptides, ACP-DL has managed to grasp the structure and patterns of peptide sequences using minimal feature representations, exhibiting remarkable stability on the ACP datasets.
Although TF-BAPred demonstrated generality across these datasets, there remains room for improvement. For instance, on the CPP924 and CPPsite3 datasets, TF-BAPred achieved a mean SEN 5.7% lower and a mean SPE 39.2% higher than those of CPPred-RF. The SEN metric assesses the ability to correctly identify positive samples, whereas SPE measures the ability to correctly identify negative samples. Enhancing the SEN of a test may compromise its SPE, and vice versa. Thus, TF-BAPred needs to strike a balance between its capacity to identify positive and negative samples. In summary, TF-BAPred exhibits generalizability and is usually better than the other predictors at predicting different types of bioactive peptides.
3.4. Impact of Varying Proportions of the Training Dataset
To test the impact of the training set proportion on the algorithm's performance, we benchmarked TF-BAPred against the other predictors on varying proportions of training sets drawn from ACP740. The accuracies of six predictors under varying proportions of the training set are presented in Figure 5.
As shown in Figure 5, the accuracy of TF-BAPred improved slightly as the training set grew and remained higher than that of the other predictors, indicating that the proportion of the training set has a small impact on it. When the training set proportion increased from 50% to 90%, the accuracy of TF-BAPred improved by 4.8%, an improvement rate 70.0%, 20.0%, 73.8%, 52.0%, and 23.1% lower than those of ACP-DL, ACP-check, DNN, Ma's method, and CPPred-RF, respectively. While some methods exhibited favorable performance across varying training set proportions, there were instances where they experienced brief declines in accuracy as the training set proportion increased, as seen for ACP-DL, ACP-check, AMP-EBiLSTM, and the DNN. For example, ACP-check suffered a 3.8% decrease in accuracy when the training set proportion was increased from 85% to 90%. pLMFPPred outperformed most of the methods when the training set was small; however, as the test set shrank, the prediction accuracy of pLMFPPred began to fluctuate, ultimately falling below the level achieved when the training and test sets were balanced. These benchmarking tests indicate the practicality of TF-BAPred under limited training data.