1. Introduction
DNA methylation is a common epigenetic modification in which methyl groups (-CH3) are added to bases within the DNA molecule. This modification is typically catalyzed by DNA methyltransferases (DNMTs) [1]. Methylated regions of DNA are usually associated with gene silencing or the suppression of gene expression [2], which can affect the transcriptional activity of genes [3], thereby playing a key role in the processes of cellular differentiation and development [
4,
5]. In prokaryotic and eukaryotic genomes, three common methylation types have been identified: N4-methylcytosine (4mC) [
6], 5-methylcytosine (5mC) [
7], and N6-methyladenine (6mA) [8]. 5mC is a cytosine methylated at the fifth carbon atom of the cytosine ring. It is the most common form of DNA methylation and is widely present in eukaryotic genomes [9], and its location in the genome can affect chromosome structure. 6mA is an adenine carrying a methyl group at the N6 position, that is, on the exocyclic amino nitrogen attached to the sixth carbon of the adenine ring. Like cytosine methylation, 6mA is an important form of DNA methylation, but it is less common in eukaryotic genomes and occurs mainly in prokaryotes [10]; its presence is closely associated with biological processes such as DNA repair and tolerance to environmental stress. 4mC is a cytosine carrying a methyl group (-CH3) at the N4 position, that is, on the exocyclic amino nitrogen attached to the fourth carbon of the cytosine ring. Studies have shown that the presence of 4mC can affect the physical and chemical properties of DNA, thereby influencing DNA replication and repair processes, which are crucial for maintaining genome stability. This epigenetic modification can also affect gene expression levels by influencing transcription factor binding [
11] and chromatin structure adjustments [
12,
13]. Compared with 5mC and 6mA, 4mC has received less research attention. Studying 4mC can therefore offer new perspectives on epigenetic regulation and may reveal targets for new therapeutic approaches.
Currently, there are several experimental methods available for identifying 4mC sites in DNA. Methylation-specific polymerase chain reaction (PCR) [
14] uses differences in DNA methylation to detect methylation sites in DNA by PCR amplification. Mass spectrometry [
15] detects methylation by analyzing precise mass changes in DNA fragments. Whole-genome bisulfite sequencing [
16] uses bisulfite to convert unmethylated cytosine to uracil while methylated cytosine is unaffected; the methylation sites are then identified by sequencing. Single-molecule real-time (SMRT) sequencing [17] detects methylation sites by monitoring the activity of DNA polymerase during DNA synthesis. However, these experimental methods for detecting 4mC sites are time-consuming and costly [
18]. With the advancement of machine learning and deep learning, several computational methods have been developed for predicting 4mC sites. Deep learning methods can handle large-scale genomic data and support end-to-end learning: they extract features from raw DNA sequences and classify them directly, without complex manual feature engineering, which simplifies the data processing workflow and improves efficiency. Deep learning thus gives researchers a powerful tool for studying the complexity of epigenetic modifications in genomics and biomedical research. 4mCCNN [19] uses one-hot encoding and two one-dimensional convolutions to classify 4mC sites. One-hot encoding represents each base in the DNA sequence as an independent four-dimensional binary vector, and the one-dimensional convolutions learn local feature representations over these vectors, which helps identify modified sites within the sequence. 4mCPred-SVM [
20] integrates four sequence features and combines them with an SVM classifier to train an optimal prediction model. Each DNA sequence is encoded with these four schemes to obtain a feature vector; the SVM then maps the vectors into a high-dimensional space and seeks an optimal hyperplane that separates data points of different categories. Deep4mC [21] encodes 12 features and evaluates them with eight different classifiers. Binary, ENAC, EIIP, and NCP encodings are used as inputs; two one-dimensional convolution layers are used for feature extraction; and an attention layer is used to capture key features. These key features are then passed to an LR classifier to obtain an output score that represents the probability of a 4mC site. All of the above methods have been studied in six species: Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Escherichia coli, Geobacter pickeringii, and Geoalkalibacter subterraneus; comparatively few studies have been conducted in mice [
22,
23,
24]. Research on mice has emerged only in recent years. Because mice are commonly used to model human diseases, study disease mechanisms, and screen drugs, investigating 4mC sites in mouse DNA may help uncover epigenetic changes related to human diseases and provide new targets and strategies for their treatment and prevention.
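The one-hot encoding and one-dimensional convolution mentioned above for 4mCCNN can be illustrated with a minimal sketch. This is not the 4mCCNN implementation: the function names `one_hot` and `conv1d`, the A/C/G/T channel order, and the hand-written kernel `cg_kernel` are all assumptions introduced for illustration, whereas a trained network would learn its kernel weights from data.

```python
# Illustrative sketch only; names, channel order, and weights are assumptions.
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-element vectors in A, C, G, T order.

    Any character outside ACGT (for example N) becomes an all-zero vector.
    """
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

def conv1d(x, kernel, bias=0.0):
    """Valid (unpadded) 1D convolution over sequence positions.

    x:      L x C matrix (positions x channels)
    kernel: K x C matrix of weights
    Returns a list of L - K + 1 activations, one per window.
    """
    k = len(kernel)
    out = []
    for start in range(len(x) - k + 1):
        total = bias
        for offset in range(k):
            for ch in range(len(kernel[offset])):
                total += x[start + offset][ch] * kernel[offset][ch]
        out.append(total)
    return out

# Hypothetical kernel that responds to the dinucleotide "CG".
cg_kernel = [[0, 1, 0, 0],   # first window position: weight on C
             [0, 0, 1, 0]]   # second window position: weight on G

encoded = one_hot("ACGTCG")
activations = conv1d(encoded, cg_kernel)
print(activations)  # [0.0, 2.0, 0.0, 0.0, 2.0]
```

The activation peaks at the two windows that contain "CG", which is the sense in which a convolution kernel acts as a local motif detector over the encoded sequence.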
4mCpred-EL [
22] was the first method developed for identifying 4mC sites in the mouse genome. It uses four machine learning algorithms and seven feature encodings to generate probability features, which are then used for prediction by ensemble classifiers. i4mC-Mouse [23] transforms sequences into feature vectors using six different encodings and classifies them using an RF classifier. Both methods rely on traditional machine learning, which tends to have weaker learning capability and requires complex feature extraction. In contrast, 4mCPred-CNN [
24] and Mouse4mC-BGRU [
25] are based on deep learning. 4mCPred-CNN uses one-hot encoding and nucleotide composition profiles for feature extraction and employs convolutional neural networks (CNNs) to learn more abstract features. Mouse4mC-BGRU encodes sequences by k-mer tokenization and feeds them into a bidirectional gated recurrent unit (GRU) network, which automatically extracts both long-term and short-term dependencies within DNA sequences and thereby learns contextual information. More recently, MultiScale-CNN-4mCPred [26] combined convolutional neural networks with different kernel sizes and long short-term memory (LSTM) to capture features at different scales together with contextual information, improving prediction accuracy for 4mC sites in the mouse genome.
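The k-mer tokenization mentioned above for Mouse4mC-BGRU can be sketched as follows. This is a generic illustration under stated assumptions, not that tool's code: the helper names `build_vocab`, `kmer_tokenize`, and `encode`, the choice k = 3, the stride of 1, and the convention of reserving id 0 for unknown tokens are all hypothetical.

```python
from itertools import product

def build_vocab(k):
    """Map every k-mer over ACGT to an integer id; id 0 is reserved for unknown tokens."""
    return {"".join(p): i + 1 for i, p in enumerate(product("ACGT", repeat=k))}

def kmer_tokenize(seq, k=3, stride=1):
    """Split a DNA string into k-mers taken every `stride` positions."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

def encode(seq, vocab, k=3):
    """Convert a DNA string into the integer ids an embedding layer would consume."""
    return [vocab.get(token, 0) for token in kmer_tokenize(seq, k)]

vocab = build_vocab(3)
tokens = kmer_tokenize("ACGTAC", k=3)
print(tokens)  # ['ACG', 'CGT', 'GTA', 'TAC']

ids = encode("ACGTAC", vocab)
print(ids)
```

Overlapping k-mers give each token some local context, and the resulting integer ids can be fed to an embedding layer followed by a recurrent network.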
However, most of the above methods perform early feature fusion during the encoding stage. Integrating all features into the same encoding space may overlook the differences between features, leading to feature conflicts or information loss. To address this problem, we propose Mus4mCPred, which employs multi-view [27] feature learning. It feeds differently encoded features into separate neural networks to extract multi-view features and then integrates them to better represent DNA sequences. Each network can be optimized for its own type of feature, improving feature extraction. Mus4mCPred comprises adaptive embedding, residual convolutional neural networks, and bidirectional LSTM networks. The embedding layer maps discrete features to dense vector representations, allowing the networks to better learn semantic relationships between features. CNNs efficiently extract local features, capturing spatial or temporal local structure in the data through local connectivity and translation invariance. Bidirectional LSTM captures long-term dependencies in sequential data through its gating mechanisms, providing a better understanding of contextual information in the sequence. The residual structures enable the model to capture the complex features within DNA sequences more effectively, enhancing the network's representational ability.
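The multi-view idea described above, with a separate branch per encoding and fusion only after each branch has produced its own representation, can be sketched in a few lines. This is a toy illustration and not the Mus4mCPred architecture: the weights are random stand-ins for learned parameters, the convolution is a simplified depthwise one, the bidirectional LSTM stage is omitted and replaced by mean pooling for brevity, and every name, dimension, and kernel below is an assumption.

```python
import random
from itertools import product

random.seed(0)  # fixed seed so the sketch is reproducible

def make_table(vocab_size, dim):
    """Randomly initialised embedding table (stand-in for learned weights)."""
    return [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)]

def embed(ids, table):
    """Embedding lookup: map each integer id to its dense vector."""
    return [table[i] for i in ids]

def conv_same(x, kernel):
    """Depthwise 1D convolution with zero padding (odd kernel length assumed).

    The output has the same shape as the input, so it can be added back to it.
    """
    half = len(kernel) // 2
    length, dim = len(x), len(x[0])
    out = []
    for pos in range(length):
        row = [0.0] * dim
        for j, w in enumerate(kernel):
            src = pos + j - half
            if 0 <= src < length:
                for d in range(dim):
                    row[d] += w * x[src][d]
        out.append(row)
    return out

def residual_block(x, kernel):
    """y = x + ReLU(conv(x)): the skip connection adds the input back."""
    conv = conv_same(x, kernel)
    return [[xi + max(0.0, ci) for xi, ci in zip(xr, cr)] for xr, cr in zip(x, conv)]

def mean_pool(x):
    """Average over positions to get one fixed-length vector per sequence."""
    return [sum(col) / len(x) for col in zip(*x)]

def branch(ids, table, kernel):
    """One view: embedding -> residual convolution -> pooling."""
    return mean_pool(residual_block(embed(ids, table), kernel))

# View 1: single nucleotides; view 2: overlapping 3-mers.
seq = "ACGTACGTAC"
base_ids = ["ACGT".index(b) for b in seq]
kmer_vocab = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=3))}
kmer_ids = [kmer_vocab[seq[i:i + 3]] for i in range(len(seq) - 2)]

view1 = branch(base_ids, make_table(4, 8), kernel=[0.25, 0.5, 0.25])
view2 = branch(kmer_ids, make_table(64, 16), kernel=[0.2, 0.2, 0.2, 0.2, 0.2])

fused = view1 + view2  # late fusion: concatenate the per-view representations
print(len(view1), len(view2), len(fused))  # 8 16 24
```

Because each view keeps its own embedding table and kernel until the final concatenation, the two encodings never share an encoding space, which is the contrast with early fusion drawn in the text.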