A Novel Repetition Frequency-Based DNA Encoding Scheme to Predict Human and Mouse DNA Enhancers with Deep Learning
Abstract
1. Introduction
- A novel DNA encoding scheme was proposed and used for the prediction of DNA enhancers.
- DNA enhancers were analyzed and predicted with several DNA encoding schemes. The EIIP, integer number, and atomic number DNA encoding schemes were applied to this field for the first time.
- The BiLSTM deep learning model was applied to these DNA encoding schemes for the first time.
2. Related Works
3. Materials and Methods
3.1. Data Set
3.2. DNA Encoding Schemes
3.2.1. Integer Number DNA Encoding Scheme
3.2.2. Atomic Number DNA Encoding Scheme
3.2.3. EIIP DNA Encoding Scheme
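As a minimal sketch, the three fixed-value schemes of Section 3.2 can be written as lookup tables. The per-base values below are the ones that appear in the worked example table of this article (sequences CATCG and CGAAT); the names `FIXED_SCHEMES` and `encode_fixed` are illustrative and do not come from the original study.

```python
# Per-base values as they appear in the worked example table of this article.
FIXED_SCHEMES = {
    "integer": {"A": 1, "G": 2, "C": 3, "T": 4},
    "atomic": {"A": 70, "G": 78, "C": 58, "T": 66},
    "eiip": {"A": 0.126, "G": 0.081, "C": 0.134, "T": 0.133},
}


def encode_fixed(sequence, scheme):
    """Map each base to its fixed value; position and sequence length are ignored."""
    table = FIXED_SCHEMES[scheme]
    return [table[base] for base in sequence.upper()]


print(encode_fixed("CATCG", "integer"))
print(encode_fixed("CATCG", "atomic"))
print(encode_fixed("CATCG", "eiip"))
```

Because each scheme is a plain dictionary lookup, a base always receives the same value regardless of where it occurs or how long the sequence is.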
3.3. A Novel DNA Encoding Scheme Based on Base Repetition Frequency
- For A base:
- For C base:
- For G base:
- For T base:
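The per-base formulas are not reproduced in this section, so the following is only a sketch inferred from the worked example table, where CATCG is encoded as [0.4 0.2 0.2 0.4 0.2]. It assumes that each base is replaced by the number of times that base occurs in the sequence divided by the sequence length; the function name `encode_bfdna` is illustrative, and the author's exact formulation may differ.

```python
from collections import Counter


def encode_bfdna(sequence):
    """Replace each base by (occurrences of that base) / (sequence length).

    Assumption: this rule is inferred from the worked example table,
    not copied from the original formulas.
    """
    sequence = sequence.upper()
    if not sequence:
        return []
    counts = Counter(sequence)
    length = len(sequence)
    return [counts[base] / length for base in sequence]


print(encode_bfdna("CATCG"))  # C occurs twice in five bases
print(encode_bfdna("CGAAT"))  # A occurs twice in five bases
```

Under this assumption the value assigned to a base depends on the whole sequence, so the same base can receive different values in different sequences.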
3.4. BiLSTM Deep Learning Model
3.5. Evaluation Criteria
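The criteria reported in the result tables can be sketched from a binary confusion matrix as follows. Accuracy, precision, recall, F1-score, MCC, and Cohen's kappa use their standard definitions. Two points are assumptions: CSI is taken to be precision + recall - 1, which is consistent with the reported values (for example, 0.7853 + 0.7576 - 1 = 0.5429 for the integer number scheme), and G-mean is taken to be the geometric mean of sensitivity and specificity. The function name and the counts in the example call are hypothetical.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute evaluation criteria from binary confusion-matrix counts.

    Assumes every denominator is non-zero (no empty row or column).
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    csi = precision + recall - 1  # assumed definition, see lead-in
    g_mean = math.sqrt(recall * specificity)  # assumed definition
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    # Cohen's kappa: observed agreement corrected for chance agreement.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total**2
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "csi": csi,
        "g_mean": g_mean,
        "mcc": mcc,
        "kappa": kappa,
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=45, fp=5, tn=40, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```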
4. Application Results
4.1. Predicting Human and Mouse Enhancers
- The input layer received the encoded DNA sequences used as training data.
- The second layer was a 256-unit BiLSTM with the SeLU (scaled exponential linear unit) activation function.
- A dropout layer with a rate of 15% followed.
- A 128-unit BiLSTM with the SeLU activation function was then used.
- A dropout layer with a rate of 20% followed.
- A 64-unit BiLSTM with the SeLU activation function was then used.
- A dropout layer with a rate of 20% followed.
- Batch normalization was applied, and the output was flattened to one dimension.
- Three fully connected layers with 512, 256, and 128 neurons were employed.
- In the last layer, the sigmoid activation function was used to classify the data.
- Binary cross-entropy was used as the loss function, and the model was optimized with the RMSProp algorithm.
- The model was trained for 500 epochs.
- Seventy percent of the data was used for training, 15% for validation, and 15% for testing.
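The layer stack listed above can be sketched as a small declarative specification together with a Keras builder. Several details are assumptions that the list does not state: the input shape (one encoded value per base), `return_sequences=True` on every BiLSTM layer so that the flattening step receives a sequence, the SeLU activation on the fully connected layers, and a separate single-unit sigmoid output layer. The names `SCENARIO_1_LAYERS` and `build_model` are illustrative.

```python
# Layer stack for the first scenario, written as plain data so it can be
# inspected without a deep learning library installed.
SCENARIO_1_LAYERS = [
    ("bilstm", {"units": 256, "activation": "selu"}),
    ("dropout", {"rate": 0.15}),
    ("bilstm", {"units": 128, "activation": "selu"}),
    ("dropout", {"rate": 0.20}),
    ("bilstm", {"units": 64, "activation": "selu"}),
    ("dropout", {"rate": 0.20}),
    ("batch_norm", {}),
    ("flatten", {}),
    ("dense", {"units": 512, "activation": "selu"}),  # activation assumed
    ("dense", {"units": 256, "activation": "selu"}),  # activation assumed
    ("dense", {"units": 128, "activation": "selu"}),  # activation assumed
    ("dense", {"units": 1, "activation": "sigmoid"}),  # output size assumed
]


def build_model(layer_specs, sequence_length):
    """Build and compile a Keras model from the spec (requires TensorFlow)."""
    from tensorflow import keras  # imported lazily; assumed to be installed

    model = keras.Sequential()
    # Assumed input: one encoded value per base position.
    model.add(keras.Input(shape=(sequence_length, 1)))
    for kind, params in layer_specs:
        if kind == "bilstm":
            lstm = keras.layers.LSTM(
                params["units"],
                activation=params["activation"],
                return_sequences=True,  # assumed, so Flatten sees a sequence
            )
            model.add(keras.layers.Bidirectional(lstm))
        elif kind == "dropout":
            model.add(keras.layers.Dropout(params["rate"]))
        elif kind == "batch_norm":
            model.add(keras.layers.BatchNormalization())
        elif kind == "flatten":
            model.add(keras.layers.Flatten())
        elif kind == "dense":
            model.add(
                keras.layers.Dense(params["units"], activation=params["activation"])
            )
        else:
            raise ValueError(f"unknown layer kind: {kind}")
    model.compile(
        optimizer="rmsprop", loss="binary_crossentropy", metrics=["accuracy"]
    )
    return model
```

Calling `build_model(SCENARIO_1_LAYERS, sequence_length=1000)` would return a compiled model that could then be trained with `model.fit(..., epochs=500)`; the length 1000 is an illustrative value, not a setting reported for the model.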
4.2. Predicting DNA Enhancers
- The input layer received the encoded DNA sequences used as training data.
- The second layer was a 128-unit BiLSTM with the SeLU activation function.
- A dropout layer with a rate of 15% followed.
- A 64-unit BiLSTM with the SeLU activation function was then used.
- A dropout layer with a rate of 20% followed.
- Batch normalization was applied, and the output was flattened to one dimension.
- Two fully connected layers with 256 and 128 neurons were employed.
- In the last layer, the softmax activation function was used to classify the data.
- Categorical cross-entropy was used as the loss function, and the model was optimized with the Adam algorithm.
- The model was trained for 500 epochs.
- Seventy percent of the data was used for training, 15% for validation, and 15% for testing.
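The partitioning of the data into training, validation, and test sets can be sketched as below. The fractions are passed as parameters, and the test set is whatever remains after the first two parts are taken. The function name `split_indices`, the fixed seed, the sample count, and the fractions in the example call are illustrative choices, not details reported by the study.

```python
import random


def split_indices(n_samples, train_frac, val_frac, seed=0):
    """Shuffle sample indices and cut them into train/validation/test parts."""
    if not 0 < train_frac + val_frac < 1:
        raise ValueError("train_frac + val_frac must lie strictly between 0 and 1")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = round(n_samples * train_frac)
    n_val = round(n_samples * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # everything that remains
    return train, val, test


# Illustrative call: 200 hypothetical samples.
train, val, test = split_indices(200, train_frac=0.70, val_frac=0.15)
print(len(train), len(val), len(test))
```

Deriving the test set as the remainder guarantees that the three parts never overlap and always cover the whole data set.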
4.3. Discussion
- Results of studies on genomic sequences vary greatly with the numerical encoding method used. Although the same deep learning model was used throughout this study, the results differed from one encoding scheme to another. The lack of a standard method, together with this sensitivity to the encoding scheme, limits studies in this field and makes their results harder to interpret reliably.
- Furthermore, the number of known mouse and human DNA enhancers is currently small. As this number grows, the results obtained in this study may change, for better or worse. Repeating the analysis on a larger data set will allow a more reliable evaluation of both the proposed method and the other methods.
- The proposed method should also be applied in other DNA analysis studies, and the results obtained there should be interpreted. In this way, the performance of the proposed method can be demonstrated in more detail.
- In addition, only the BiLSTM deep learning model was used in this study. These DNA encoding schemes should also be analyzed with other deep learning algorithms and the results interpreted. In this way, more effective results may be obtained.
- No feature extraction was performed in the study. Signal processing methods such as DWT (discrete wavelet transform), FFT (fast Fourier transform), EMD (empirical mode decomposition), and VMD (variational mode decomposition) may help to obtain more effective features and more successful results.
- Furthermore, optimization algorithms were not used in the study. Performing the optimization process can improve results and increase the performance of DNA encoding schemes.
- As the ROC curve in Figure 8 shows, the data set contains insufficient data, which can cause two different problems in the model: overfitting or underfitting. Obtaining, examining, and interpreting DNA sequences is time-consuming and difficult. Insufficient data is therefore a common problem in bioinformatics studies [48,49], and it is present in this study as well. Researchers need to take this into account.
- Approaches such as ensemble learning, transfer learning, and data duplication (synthetic data) are generally used when data are insufficient. The insufficient data problem observed in this study can be addressed by using one or more of these approaches. Each has advantages and disadvantages, and they should be evaluated in further studies.
- With insufficient data, overfitting is generally observed: the model memorizes patterns in the training data. To avoid overfitting, options such as reducing the network capacity, using regularization methods (L1 and L2), and adding dropout layers are generally used. For the second scenario in this study, the problem could not be avoided even though the network capacity was reduced and dropout layers were used. Regularization methods or other approaches may prevent it.
- Early stopping and pruning can also be used to prevent overfitting, but both have disadvantages. Early stopping halts the training phase before the model starts to learn the noise in the data; if the stopping point is not chosen properly, however, the model will still not produce reliable results. Pruning, also known as feature selection, identifies the most important features in the training set and discards the rest. Identifying effective features in this way takes time and is often tedious.
- In addition, this problem can be mitigated by using cross-validation. Although the number of labels (classes) in the data set is low, there are approximately 1000 features per sequence, because each DNA sequence contains at least 1000 bases. Cross-validation therefore takes time and increases the processing load. On more powerful hardware, this approach could be used and the performance of the developed DNA encoding method interpreted more reliably.
- In addition, the implications obtained from this study can be summarized as follows:
- Other cis-regulatory elements including promoters, insulators, and silencers can be predicted using the suggested BFDNA encoding technique. Experimental approaches are often preferred to determine cis-regulatory elements [50]. However, using experimental approaches takes time and is costly [51]. With this study and similar studies, it has been shown that cis-regulatory elements can be determined by computational approaches rather than experimental approaches.
- The proposed approach can also be used generically for research involving whole genomic sequencing, making it useful for both academics and healthcare professionals. For genomic sequences to be analyzed by computational methods, they must first be converted to numerical expressions. There are various DNA encoding methods in the literature; some of them are included in this study and their performance is evaluated. With this study, a novel DNA encoding method has been contributed to the literature. This method can be used not only to predict DNA enhancers, but also for genomic sequencing research (prediction of intron-exon regions, STR analysis, phylogenetic analysis, species identification, etc.).
- One of the biggest achievements of this study is that the developed BFDNA encoding method has a dynamic structure, unlike the other methods. All other approaches use a static structure: even if the length of the DNA sequence or the locations of the bases differ, each base always takes the same value. To give a simple example, in the atomic number DNA encoding method, the A base takes the value 70, regardless of the length of the DNA sequence or the location of the base. This is not the case in the proposed BFDNA method. Since this approach uses the length of the DNA sequence and the repetition frequency of the bases, no base has a fixed value. This makes the approach adaptive and distinguishes it from other DNA encoding methods.
- In studies on most DNA sequences, it has been observed that the chemical properties of DNA sequences are also used [52,53]. However, various experimental applications are used to determine these chemical properties. As this requires experimental equipment, it is both costly and time-consuming. With this study, it has been shown that DNA encoding methods used in computational approaches can also be effective in studies on DNA sequencing. When DNA enhancer studies were examined, it was observed that some studies focused on the chemical properties of DNA sequences [54,55,56,57,58]. When the performances in those studies were examined, it was observed that the accuracy scores ranged between 41.7% and 78%. In this study, only the integer number DNA encoding method was within this range, and an accuracy score of 77% was obtained. All remaining DNA encoding methods showed accuracy scores of over 85%. Moreover, the proposed BFDNA DNA encoding method achieved a high accuracy score of 92.16%. These results showed that computational features can also be effective.
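The cross-validation option mentioned in the discussion above can be sketched as a plain k-fold index generator. This is an illustration of the general technique only; no cross-validation was run in the study. The function name `k_fold_indices` and the sample count in the example are hypothetical, and in practice the samples would usually be shuffled before the folds are formed.

```python
def k_fold_indices(n_samples, k):
    """Yield (training_indices, validation_indices) for each of k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = indices[start:stop]
        training = indices[:start] + indices[stop:]
        yield training, validation
        start = stop


# Illustrative call: 10 hypothetical samples, 5 folds.
for fold, (training, validation) in enumerate(k_fold_indices(10, k=5)):
    print(fold, validation)
```

Each sample is used for validation exactly once, so a model is trained k times; this is the source of the additional processing load noted in the discussion.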
5. Conclusions
Funding
Data Availability Statement
Conflicts of Interest
References
- Smith, J.; Sen, S.; Weeks, R.J.; Eccles, M.R.; Chatterjee, A. Promoter DNA hypermethylation and paradoxical gene activation. Trends Cancer 2020, 6, 392–406.
- Angeloni, A.; Bogdanovic, O. Enhancer DNA methylation: Implications for gene regulation. Essays Biochem. 2019, 63, 707–715.
- Maricque, B.B.; Chaudhari, H.G.; Cohen, B.A. A massively parallel reporter assay dissects the influence of chromatin structure on cis-regulatory activity. Nat. Biotechnol. 2019, 37, 90–95.
- Boyle, A.P.; Davis, S.; Shulha, H.P.; Meltzer, P.; Margulies, E.H.; Weng, Z.; Furey, T.S.; Crawford, G.E. High-resolution mapping and characterization of open chromatin across the genome. Cell 2008, 132, 311–322.
- Johnson, D.S.; Mortazavi, A.; Myers, R.M.; Wold, B. Genome-wide mapping of in vivo protein-DNA interactions. Science 2007, 316, 1497–1502.
- Giresi, P.G.; Kim, J.; McDaniell, R.M.; Iyer, V.R.; Lieb, J.D. FAIRE (Formaldehyde-Assisted Isolation of Regulatory Elements) isolates active regulatory elements from human chromatin. Genome Res. 2007, 17, 877–885.
- Müeller-Storm, H.P.; Sogo, J.M.; Schaffner, W. An enhancer stimulates transcription in trans when attached to the promoter via a protein bridge. Cell 1989, 58, 767–777.
- Eraslan, G.; Avsec, Z.; Gagneur, J.; Theis, F. Deep learning: New computational modelling techniques for genomics. Nat. Rev. Genet. 2019, 20, 389–403.
- Alakuş, T.B.; Türkoğlu, İ. A comparative study of amino acid encoding methods for predicting drug-target interactions in COVID-19 disease. Stud. Syst. Decis. Control. 2022, 366, 619–643.
- Bu, H.; Gan, Y.; Wang, Y.; Zhou, S.; Guan, J. A new method for enhancer prediction based on deep belief network. BMC Bioinform. 2017, 18, 418.
- Kaur, A.; Chauhan, A.P.S.; Aggarwal, A.K. Prediction of enhancers in DNA sequence data using a hybrid CNN-DLSTM Model. IEEE/ACM Trans. Comput. Biol. Bioinform. 2022, 20, 1327–1336.
- Rajagopal, N.; Xie, W.; Li, Y.; Wagner, U.; Wang, W.; Stamatoyannopoulos, J.; Ernst, J.; Kellis, M.; Ren, B. RFECS: A random-forest based algorithm for enhancer identification from chromatin State. PLoS Comput. Biol. 2013, 9, e1002968.
- Geng, Q.; Yang, R.; Zhang, L. A deep learning framework for enhancer prediction using word embedding and sequence generation. Biophys. Chem. 2022, 286, 106822.
- Liu, F.; Li, H.; Ren, C.; Bo, X.; Shu, W. PEDLA: Predicting enhancers with a deep learning-based algorithmic framework. Sci. Rep. 2016, 6, 28517.
- Vista Enhancer Browser. Available online: https://enhancer.lbl.gov/ (accessed on 10 April 2023).
- Kwan, H.K.; Arniker, S.B. Numerical representation of DNA sequences. In Proceedings of the IEEE International Conference on Electro-Information Technology, Windsor, ON, Canada, 7–9 June 2009.
- Cristea, P. Genetic signal analysis. In Proceedings of the International Symposium on Signal Processing and Its Applications, Kuala Lumpur, Malaysia, 13–16 August 2001.
- Afreixo, V.; Bastos, C.A.C.; Pinho, A.J.; Garcia, S.P.; Ferreira, P.J.S.G. Genome analysis with distance to the nearest dissimilar nucleotide. J. Theor. Biol. 2011, 275, 52–58.
- Hebert, P.D.N.; Cywinska, A.; Ball, S.L.; de Waard, J.R. Biological identifications through DNA barcodes. Biol. Sci. 2003, 270, 313–321.
- Holden, T.; Subramaniam, R.; Sullivan, R.; Cheung, E.; Schneider, C.; Tremberger, G.; Flamholz, A.; Lieberman, D.H.; Cheung, T.D. ATCG nucleotide fluctuation of Deinococcus radiodurans radiation genes. In Proceedings of the Optical Engineering and Applications, San Diego, CA, USA, 26–30 August 2007.
- Cosic, I. Macromolecular bioactivity: Is it resonant interaction between macromolecules?—Theory and applications. IEEE Trans. Biomed. Eng. 1994, 41, 1101–1114.
- Voss, R.F. Evolution of long-range fractal correlations and 1/f noise in DNA base sequences. Phys. Rev. Lett. 1992, 68, 3805–3808.
- Kumar, G.K.; Rani, D.M. Paragraph summarization based on word frequency using NLP techniques. In Proceedings of the 3rd International Conference on Advancements in Aeromechanical Materials in Manufacturing, Hyderabad, India, 24–25 July 2020.
- Hasan, R.; Maliha, M.; Arifuzzaman, M. Sentiment analysis with NLP on Twitter data. In Proceedings of the International Conference on Computer, Communication, Chemical, Material and Electronic Engineering, Rajshahi, Bangladesh, 11–12 July 2019.
- Chen, D.; Wang, J.; Yan, M.; Bao, F.S. A complex prime numerical representation of amino acids for protein function comparison. J. Comput. Biol. A J. Comput. Mol. Cell Biol. 2016, 23, 669–677.
- Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2018.
- Janiesch, C.; Zschech, P.; Heinrich, K. Machine learning and deep learning. Electron. Mark. 2021, 31, 685–695.
- Alakuş, T.B.; Türkoğlu, İ. Prediction of protein-protein interactions with LSTM deep learning model. In Proceedings of the 3rd International Symposium on Multidisciplinary Studies and Innovative Technologies, Ankara, Türkiye, 11–13 October 2019.
- Baldi, P. Deep learning in biomedical data science. Annu. Rev. Biomed. Data Sci. 2018, 1, 181–205.
- Zemouri, R.; Zerhouni, N.; Racoceanu, D. Deep learning in the biomedical applications: Recent and future status. Appl. Sci. 2019, 9, 1526.
- Min, S.; Lee, B.; Yoon, S. Deep learning in bioinformatics. Brief. Bioinform. 2016, 18, 851–869.
- Zhao, Z.Q.; Zheng, P.; Xu, S.T.; Wu, X. Object detection with deep learning: A review. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 3212–3232.
- Chen, A.I.; Balter, M.L.; Maguire, T.J.; Yarmush, M.L. Deep learning robotic guidance for autonomous vascular access. Nat. Mach. Intell. 2020, 2, 104–115.
- Baldi, P.; Sadowski, P.; Whiteson, D. Searching for exotic particles in high-energy physics with deep learning. Nat. Commun. 2014, 5, 4308.
- Song, X.; Liu, Y.; Xue, L.; Wang, J.; Zhang, J.; Wang, J.; Jiang, L.; Cheng, Z. Time-series well performance prediction based on Long Short-Term Memory (LSTM) neural network model. J. Pet. Sci. Eng. 2019, 186, 106682.
- Cheng, X.; Wang, J.; Li, Q.; Liu, T. BiLSTM-5mC: A bidirectional long short-term memory-based approach for predicting 5-methylcytosine sites in genome-wide DNA promoters. Molecules 2021, 26, 7414.
- Rahman, M.; Watanobe, Y.; Nakamura, K. A bidirectional LSTM language model for code evaluation and repair. Symmetry 2021, 13, 247.
- Kang, Y.; Xu, Y.; Wang, X.; Pu, B.; Yang, X.; Rao, Y.; Chen, J. HN-PPISP: A hybrid network based on MLP-Mixer for protein–protein interaction site prediction. Brief. Bioinform. 2022, 24, bbac480.
- Rosset, S. Model selection via the AUC. In Proceedings of the 21st International Conference on Machine Learning, Banff, AB, Canada, 4–8 July 2004.
- Hosmer, D.W.; Lemeshow, S.; Sturdivant, R.X. Applied Logistic Regression; John Wiley and Sons: Hoboken, NJ, USA, 2013.
- Labatut, V.; Cherifi, H. Accuracy measures for the comparison of classifiers. arXiv 2012.
- Chicco, D.; Jurman, G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genom. 2020, 21, 6.
- Chicco, D.; Tötsch, N.; Jurman, G. The Matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation. BioData Min. 2021, 14, 13.
- Munoz, S.R.; Bangdiwala, S.I. Interpretation of Kappa and B statistics measures of agreement. J. Appl. Stat. 1997, 24, 105–112.
- Torre, F.C.; Gonzalez-Trejo, J.I.; Real-Ramirez, C.A.; Hoyos-Reyes, L.F. Fractal dimension algorithms and their application to time series associated with natural phenomena. In Proceedings of the 4th National Meeting in Chaos, Complex System and Time Series, Veracruz, Mexico, 29 November–2 December 2011.
- Ning, J.; Moore, C.N.; Nelson, J. Preliminary wavelet analysis of genomic sequences. In Proceedings of the IEEE Bioinformatics Conference, Stanford, CA, USA, 11–14 August 2003.
- Nair, A.S.; Sreenadhan, S.P. A coding measure scheme employing electron-ion interaction pseudopotential (EIIP). Bioinformation 2006, 1, 197–202.
- Michno, J.M.; Stupar, R.M. The importance of genotype identity, genetic heterogeneity, and bioinformatic handling for properly assessing genomic variation in transgenic plants. BMC Biotechnol. 2018, 18, 38.
- Sun, C.; Ma, S.; Chen, Y.; Kim, N.H.; Kailas, S.; Wang, Y.; Gu, W.; Chen, Y.; Tuason, J.P.W.; Bhan, C.; et al. Diagnostic value, prognostic value, and immune infiltration of LOX family members in liver cancer: Bioinformatic analysis. Front. Oncol. 2022, 12, 843880.
- Vijayabaskar, M.S.; Goode, D.K.; Obier, N.; Lichtinger, M.; Emmett, A.M.L.; Abidin, F.N.Z.; Shar, N.; Hannah, R.; Assi, S.A.; Lie-A-Ling, M.; et al. Identification of gene specific cis-regulatory elements during differentiation of mouse embryonic stem cells: An integrative approach using high-throughput datasets. PLoS Comput. Biol. 2019, 15, e1007337.
- Ho, C.L.; Geisler, M. Genome-wide computational identification of biologically significant cis-regulatory elements and associated transcription factors from rice. Plants 2019, 8, 441.
- Khan, Z.U.; Ali, F.; Khan, I.A.; Hussain, Y.; Pi, D. iRSpot-SPI: Deep learning-based recombination spots prediction by incorporating secondary sequence information coupled with physio-chemical properties via Chou’s 5-step rule and pseudo components. Chemom. Intell. Lab. Syst. 2019, 189, 169–180.
- Alam, W.; Tayara, H.; Chong, K.T. i4mC-Deep: An intelligent predictor of N4-methylcytosine sites using a deep learning approach with chemical properties. Genes 2021, 12, 1117.
- Wang, J.; Lunyak, V.V.; Jordan, K. Chromatin signature discovery via histone modification profile alignments. Nucleic Acids Res. 2012, 40, 10642–10656.
- Hon, G.; Ren, B.; Wang, W. ChromaSig: A probabilistic approach to finding common chromatin signatures in the human genome. PLoS Comput. Biol. 2008, 4, e1000201.
- Firpi, H.A.; Uçar, D.; Tan, K. Discover regulatory DNA elements using chromatin signatures and artificial neural network. Bioinformatics 2010, 26, 1579–1586.
- Bonn, S.; Zinzen, R.P.; Girardot, C.; Gustafson, E.H.; Perez-Gonzalez, A.; Delhomme, N.; Ghavi-Helm, Y.; Wilczynski, B.; Riddell, A.; Furlong, E.E.M. Tissue-specific analysis of chromatin state identifies temporal signatures of enhancer activity during embryonic development. Nat. Genet. 2012, 44, 148–156.
- Yip, K.Y.; Cheng, C.; Bhardwaj, N.; Brown, J.B.; Leng, J.; Kundaje, A.; Rozowsky, J.; Birney, E.; Bickel, P.; Snyder, M.; et al. Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors. Genome Biol. 2012, 13, R48.
DNA Sequence | Integer Number | Atomic Number | EIIP | BFDNA |
---|---|---|---|---|
First: [CATCG] | [3 1 4 3 2] | [58 70 66 58 78] | [0.134 0.126 0.133 0.134 0.081] | [0.4 0.2 0.2 0.4 0.2] |
Second: [CGAAT] | [3 2 1 1 4] | [58 78 70 70 66] | [0.134 0.081 0.126 0.126 0.133] | [0.2 0.2 0.4 0.4 0.2] |
AUC Score | Explanation |
---|---|
0.00–0.49 | No distinction |
0.50–0.69 | Poor classification |
0.70–0.79 | Acceptable classification |
0.80–0.89 | Great classification |
0.90–1.00 | Outstanding classification |
Kappa Coefficient | Explanation |
---|---|
0.00 | No agreement |
0.10–0.20 | Slight agreement |
0.21–0.40 | Fair agreement |
0.41–0.60 | Moderate agreement |
0.61–0.80 | Substantial agreement |
0.81–0.99 | Almost perfect agreement |
1.00 | Perfect agreement |
DNA Encoding Scheme | Accuracy | Precision | Recall | F1-Score | CSI | G-Mean | MCC | Kappa | AUC Score |
---|---|---|---|---|---|---|---|---|---|
Integer number | 76.96% | 78.53% | 75.76% | 77.12% | 0.5429 | 0.7698 | 0.5397 | 0.5393 | 0.82 |
Atomic number | 86.61% | 85.36% | 87.28% | 86.31% | 0.7264 | 0.8663 | 0.7323 | 0.7321 | 0.84 |
EIIP | 89.14% | 87.07% | 90.61% | 88.80% | 0.7768 | 0.8920 | 0.7833 | 0.7826 | 0.87 |
BFDNA | 92.16% | 89.76% | 94.11% | 91.88% | 0.8387 | 0.9224 | 0.8440 | 0.8431 | 0.85 |
DNA Encoding Scheme | Accuracy | Precision | Recall | F1-Score | CSI | G-Mean | MCC | Kappa | AUC Score |
---|---|---|---|---|---|---|---|---|---|
Integer number | 73.68% | 74.58% | 68.96% | 71.66% | 0.4354 | 0.5416 | 0.5865 | 0.5731 | 0.89 |
Atomic number | 68.27% | 68.93% | 64.38% | 66.58% | 0.3281 | 0.4597 | 0.4968 | 0.4836 | 0.81 |
EIIP | 77.80% | 79.10% | 73.11% | 76.00% | 0.5221 | 0.6060 | 0.6462 | 0.6340 | 0.90 |
BFDNA | 84.59% | 85.64% | 80.35% | 82.91% | 0.6599 | 0.7104 | 0.7467 | 0.7394 | 0.92 |
© 2023 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Alakuş, T.B. A Novel Repetition Frequency-Based DNA Encoding Scheme to Predict Human and Mouse DNA Enhancers with Deep Learning. Biomimetics 2023, 8, 218. https://doi.org/10.3390/biomimetics8020218