1. Introduction
Dynamic DNA modifications, such as methylation and demethylation, play an essential role in the regulation of gene expression. DNA methylation, a heritable epigenetic marker, is a chemical modification of DNA that alters gene activity without altering the DNA sequence [1,2]. Several studies have shown that it can change DNA-protein interactions, DNA conformation, DNA stability, and chromatin structure. It also regulates diverse functions, including cell development, genomic imprinting, and gene expression [3,4]. N4-methylcytosine (4mC), 5-methylcytosine (5mC), and N6-methyladenine (6mA), the three common methylations introduced by specific methyltransferase enzymes, occur in both prokaryotes and eukaryotes [5,6,7].
In prokaryotes, 6mA and 4mC allow host DNA to be distinguished from exogenous pathogenic DNA [8], and 4mC also regulates DNA replication and the correction of replication errors [9,10]. Moreover, 4mC, as part of a restriction-modification (R-M) system, prevents restriction enzymes from degrading the host DNA [11]. In eukaryotes, 5mC plays a crucial role in transposon suppression, gene imprinting, and gene regulation, whereas 6mA and 4mC can only be detected with highly sensitive techniques [12].
5mC, the most common and best-explored type of cytosine methylation, plays a significant role in several biological processes [13], and its aberrant patterns have been linked to cancer, diabetes, and some neurological diseases [14,15,16]. 4mC is an effective methylation that protects host DNA from restriction enzyme-mediated degradation. It plays an important role in controlling various processes, including the cell cycle, gene expression levels, distinguishing self from non-self DNA, DNA replication, and the correction of DNA replication errors [9,17].
Extensive experimental efforts have been made to detect 4mC sites genome-wide, including 4mC-Tet-assisted bisulfite sequencing, methylation-specific PCR, mass spectrometry, and Single-Molecule Real-Time (SMRT) sequencing [18,19,20,21]. However, these experimental approaches are laborious and expensive when applied at the genome scale. It is therefore necessary to develop computational methods for identifying 4mC sites.
Recently, several 4mC site identifiers [22,23] have been proposed for different species, such as C. elegans, D. melanogaster, A. thaliana, E. coli, G. subterraneus, and G. pickeringii. i4mC-ROSE [24], the first computational tool for predicting 4mC sites within Rosaceae genomes, was proposed to identify 4mC sites in the genomes of F. vesca [25] and R. chinensis [26]. It generates six probabilistic scores using random forest (RF) classifiers with six encoding methods, k-space spectral nucleotide composition (KSNC), electron-ion interaction pseudopotentials (EIIP), k-mer composition (Kmer), binary encoding (BE), dinucleotide physicochemical properties (DPCP), and trinucleotide physicochemical properties (TPCP), which cover various aspects of DNA sequence information. These scores are then combined with a linear regression model to enhance prediction performance [24]. 4mcDeep-CBI [27], a deep learning framework, was proposed to predict 4mC sites in an expanded dataset of Caenorhabditis elegans (C. elegans); a 3-CNN and a bidirectional LSTM (BLSTM) were used to extract deep information and obtain high-level features.
In this work, a novel predictor, DNC4mC-Deep, is established for the identification of 4mC sites in the genomes of F. vesca, R. chinensis, and a newly prepared cross-species dataset. The overall framework of our work is summarized as follows. First, we used six encoding techniques: 2Kmer [28], 3Kmer [29], binary encoding (BE) [30,31], nucleotide chemical property (NCP) [32], nucleotide chemical property and nucleotide frequency (NCPNF) [32], and multivariate mutual information (MMI) [33,34]. Then, we built a deep learning model based on a convolutional neural network (CNN) and applied a grid search algorithm to obtain the optimal model with tuned hyperparameters. All six encoding schemes were fed separately into the selected optimal model, the results of each encoding scheme were recorded, and K-fold cross-validation was used with K set to 10. Performance evaluation metrics were used to evaluate and analyze the results of the model for each encoding scheme. We also present two applications: in silico mutagenesis [35], visualized with heat maps, and identification of the most significant portions of a sequence using saliency maps [36]. After obtaining results for all six feature encoding methods, we found that dinucleotide composition (DNC, i.e., 2Kmer) outperforms the other encoding schemes as well as the state-of-the-art model. In comparison with the state-of-the-art model, DNC4mC-Deep achieves MCC values of 0.635, 0.565, and 0.562 on the F. vesca, R. chinensis, and cross-species independent datasets, respectively.
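The in silico mutagenesis application above scores the effect of every possible single-nucleotide substitution on the model's prediction. A minimal sketch of that scoring loop, assuming a generic predict function that maps a sequence string to a 4mC probability (the toy predictor in the usage note is purely illustrative, not the trained model):

```python
from itertools import product

BASES = "ACGT"

def in_silico_mutagenesis(seq, predict):
    """Score every single-nucleotide substitution in `seq`.

    `predict` maps a sequence string to a 4mC probability; the
    returned matrix deltas[i][b] is the change in that probability
    when position i is mutated to base BASES[b].
    """
    base_score = predict(seq)
    deltas = [[0.0] * len(BASES) for _ in range(len(seq))]
    for i, b in product(range(len(seq)), range(len(BASES))):
        if seq[i] == BASES[b]:
            continue  # no mutation: delta stays 0
        mutant = seq[:i] + BASES[b] + seq[i + 1:]
        deltas[i][b] = predict(mutant) - base_score
    return deltas
```

The resulting position-by-base matrix is exactly what a mutagenesis heat map visualizes: for example, with a toy predictor such as lambda s: s.count("C") / len(s), mutating a position to C yields a positive delta.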
3. The Proposed Deep Learning Model
In this study, an efficient deep learning model based on a CNN is proposed for the identification of 4mC sites in the genomes of F. vesca, R. chinensis, and cross-species data. Unlike conventional supervised learning pipelines, a CNN does not require manually extracted features; its major advantage is that it can extract features automatically for the classification process. Additionally, hand-crafted features can also be fed to a CNN to build a heterogeneous model. CNNs have had a large impact on various fields, including natural language processing, image processing [53,54,55,56], and computational biology [57,58]. To obtain an optimal model, we applied a grid search in which six hyperparameters of the CNN were tuned during training. The ranges over which each hyperparameter was tuned are listed in Table 3.
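The grid search exhaustively evaluates every combination of hyperparameter values. A minimal sketch of that loop, where the hyperparameter names and ranges shown in the usage note are illustrative placeholders (the actual six hyperparameters and their ranges are those listed in Table 3) and evaluate stands for training the CNN with one setting and returning its validation score:

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Exhaustively try every hyperparameter combination.

    `param_grid` maps each hyperparameter name to the list of values
    to try; `evaluate` trains a model with one setting and returns a
    validation score (higher is better).
    """
    names = sorted(param_grid)
    best_score, best_params = float("-inf"), None
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score
```

For example, grid_search({"filters": [16, 32, 64], "dropout": [0.25, 0.5]}, evaluate) would train six models and return the highest-scoring configuration.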
After obtaining the best model from the grid search, we used six different encoding schemes (DNC, TNC, BE, NCP, NCPNF, MMI) as input to the CNN model. Each encoding technique converts the input sequence into a feature vector; the same CNN model was used for training and testing, and its robustness was further verified on the independent dataset. Each feature encoding approach had a different impact on the single model.
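As an example of such a sequence-to-vector conversion, a common formulation of dinucleotide composition (DNC) counts the frequency of each of the 16 possible dinucleotides among the overlapping pairs of a sequence; the exact normalization used in the cited encoding references may differ, so this is a sketch of the general scheme rather than the paper's exact implementation:

```python
from itertools import product

def dnc_features(seq):
    """Dinucleotide composition: frequency of each of the 16 possible
    dinucleotides among the overlapping pairs in `seq`."""
    pairs = ["".join(p) for p in product("ACGT", repeat=2)]
    counts = {p: 0 for p in pairs}
    for i in range(len(seq) - 1):
        di = seq[i:i + 2]
        if di in counts:
            counts[di] += 1
    total = max(len(seq) - 1, 1)  # number of overlapping pairs
    return [counts[p] / total for p in pairs]
```

Each input sequence thus becomes a fixed-length 16-dimensional vector regardless of its nucleotide content, which is what allows a single CNN to consume any of the encodings interchangeably.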
In the proposed model, two blocks with the same layers but different parameter values are used first. Each block contains one convolution layer, Conv1D(f, k, s), where f is the number of filters, k is the kernel size, and s is the stride, set to 32, 5, and 1, respectively, in both blocks. The convolution layer automatically extracts features from the input sequences of positive and negative 4mC samples. To ensure that the model does not overfit, L2 regularization was applied to both the weights and the bias of each Conv1D layer, with a value of 0.0001 in both blocks. An exponential linear unit (ELU) is used as the activation function. Each Conv1D layer is followed by a group normalization (GN) layer, GroupNormalization(g), where g is the number of groups, to normalize the outputs of the convolution layers; group normalization divides the channels into groups and normalizes the features within each group. The number of groups was set to 4 in both blocks. To reduce feature redundancy after the GN layers, we employed a max-pooling layer in each block, MaxPooling1D(l, r), where the pool size l and stride r were set to 4 and 2, respectively; max pooling reduces the dimensionality of the features from the preceding layers. The outputs of the max-pooling layers were passed through dropout layers, Dropout(d), with d = 0.25 in both blocks to prevent overfitting during training; dropout switches off the effect of a few hidden nodes by setting their outputs to zero at training time.
After the two blocks, a flatten layer is used to unstack the feature maps from the previous layers into a single vector. The flatten layer is followed by a fully connected (FC) dense layer, Dense(n), with n = 32 neurons, again with L2 regularization of 0.0001 on both weights and bias and an ELU activation function. Finally, an FC layer with a sigmoid activation is applied for binary classification; the sigmoid squeezes values into the range between 0 and 1, representing the probability of a site being 4mC or non-4mC.
Figure 1 shows the complete architecture of the presented model.
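The layer stack described above can be sketched in Keras as follows; the input length and channel count are placeholders (they depend on the chosen encoding and the dataset's window size, which are not fixed here), and GroupNormalization is assumed to be available as a built-in layer (Keras/TensorFlow 2.11 or later):

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_model(seq_len=41, channels=4):
    """Two identical Conv1D blocks followed by two dense layers, as
    described in the text. seq_len and channels are assumptions for
    illustration, not values taken from the paper."""
    reg = regularizers.l2(1e-4)  # L2 on weights and bias, 0.0001
    inputs = tf.keras.Input(shape=(seq_len, channels))
    x = inputs
    for _ in range(2):  # two blocks with the same parameter values
        x = layers.Conv1D(32, 5, strides=1, activation="elu",
                          kernel_regularizer=reg, bias_regularizer=reg)(x)
        x = layers.GroupNormalization(groups=4)(x)
        x = layers.MaxPooling1D(pool_size=4, strides=2)(x)
        x = layers.Dropout(0.25)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(32, activation="elu",
                     kernel_regularizer=reg, bias_regularizer=reg)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    return tf.keras.Model(inputs, outputs)
```

The single sigmoid output gives the probability of the 4mC class, so a threshold of 0.5 separates 4mC from non-4mC predictions.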
DNC4mC-Deep was implemented with the Keras framework [59]. We used the stochastic gradient descent (SGD) optimizer with a momentum of 0.95 and a learning rate of 0.005, and binary cross-entropy as the loss function. In the fit function, the number of epochs was set to 100 and the batch size to 32. A checkpoint callback was used to save the models and their best weights, and early stopping was also implemented to halt training when the validation performance stopped improving; the patience level was set to 30.
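The training configuration above can be sketched as follows; the checkpoint file path is a placeholder, and the monitored quantity (validation loss) is an assumption, since the text does not name the exact metric watched by the callbacks:

```python
import tensorflow as tf

def training_setup(model, checkpoint_path="best_weights.h5"):
    """Compile the model and build the callbacks described in the
    text: SGD with momentum, binary cross-entropy, checkpointing,
    and early stopping with a patience of 30."""
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.005,
                                          momentum=0.95),
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    callbacks = [
        tf.keras.callbacks.ModelCheckpoint(checkpoint_path,
                                           monitor="val_loss",
                                           save_best_only=True),
        tf.keras.callbacks.EarlyStopping(monitor="val_loss",
                                         patience=30),
    ]
    return callbacks
```

Training would then call model.fit(x, y, epochs=100, batch_size=32, validation_data=..., callbacks=callbacks), matching the epoch count and batch size given above.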