Utilizing Deep Neural Networks to Fill Gaps in Small Genomes
Abstract
1. Introduction
- We created new datasets from the original genomes of Saccharomyces cerevisiae, Schizosaccharomyces pombe, Neurospora crassa, and Micromonas pusilla, supplemented with homologous gene sequences. The original genome data were 1.62 Gb, 84.04 Mb, 956.17 Mb, and 1.12 Gb in size, respectively; we refined them into datasets of 26.96 Mb, 72.31 Mb, 740.8 Mb, and 28.67 Mb, which serve as the training sets for the DGCNet model.
- We constructed the DGCNet network model and developed a new prediction algorithm, Wave-Beam Search. DGCNet improves feature extraction from gene sequences and contextual learning over the sequences flanking each gap. Wave-Beam Search further improves the predictions of DGCNet by avoiding premature pruning and excessive memory usage.
- We connected deep learning with traditional assembly tools. Integrating deep learning with the traditional assembly tool Sealer further improved Sealer’s gap-filling rate in our experiments. To keep pace with continually advancing gap-filling methods, we also formulated new gap-filling standards and implemented a corresponding evaluation method. The new standards present results more transparently and intuitively and generalize well across a wide range of gap-filling methods.
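For orientation, the sketch below shows conventional beam search over nucleotide sequences, the baseline against which Wave-Beam Search is compared later in the article. It is not the authors' Wave-Beam Search, whose details are not given in this section. The scoring function `toy_next_base_probs` is a hypothetical stand-in for a trained model such as DGCNet, and the names `beam_search`, `seed`, and `beam_width` are illustrative assumptions.

```python
import math
from typing import Callable, Dict, List, Tuple

BASES = "ACGT"


def toy_next_base_probs(prefix: str) -> Dict[str, float]:
    """Hypothetical stand-in for a trained model: favours repeating the last base."""
    last = prefix[-1] if prefix else "A"
    return {base: (0.7 if base == last else 0.1) for base in BASES}


def beam_search(
    seed: str,
    length: int,
    beam_width: int,
    next_probs: Callable[[str], Dict[str, float]] = toy_next_base_probs,
) -> List[Tuple[float, str]]:
    """Extend `seed` by `length` bases, keeping the `beam_width` best hypotheses.

    Returns (log-probability, sequence) pairs sorted from best to worst.
    """
    beams: List[Tuple[float, str]] = [(0.0, seed)]
    for _ in range(length):
        candidates: List[Tuple[float, str]] = []
        for logp, seq in beams:
            # Expand every surviving hypothesis by each possible next base.
            for base, p in next_probs(seq).items():
                candidates.append((logp + math.log(p), seq + base))
        # Prune: only the top `beam_width` candidates survive this step.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    return beams
```

The pruning step is where a narrow beam can discard a hypothesis that would have scored well later, while a wide beam multiplies the number of candidates held in memory; these are the two costs that the text says Wave-Beam Search is designed to avoid.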
2. Results
2.1. Validation Results on the Dataset
2.2. Validation of Filling Quality
3. Discussion
4. Materials and Methods
4.1. DLGapCloser Algorithm
4.2. Short-Read Datasets
4.3. Sealer Assembly and Gap Extraction
4.4. Selecting Homologous Genomes and Dataset Creation
4.5. Deep Learning and Gap Sequence Prediction
4.6. Gene Sequence Encoding
4.7. DGCNet-Model
4.8. Wave-Beam Search Prediction Algorithm
4.9. Gap Filling
4.10. Evaluation of Gap Filling Results
- First, re-trim the gene sequences on both sides of the gap according to the dataset creation steps. In this step, deliberately keep longer flanking sequences on both sides of the gap to improve the accuracy of the reference gene sequence in subsequent steps.
- Next, create the reference gene sequence. In this step, trim the flanking sequences shorter so that sequence alignment focuses on the predicted gap sequence rather than on the flanks.
- Finally, use Exonerate to align the reference gene sequence with the predicted gene sequence. A script post-processes the files that Exonerate generates and sorts the alignment consistency rates in descending order, from 100% to 0%. In this study, we compared only the cases in which the consistency rate of each assembly method was 100% and >90%.
- Model name: AMD Ryzen 7 5800H with Radeon Graphics 3.20 GHz (AMD, Sunnyvale, CA, USA)
- RAM: 16.0 GB
The other experiments were conducted in the following environment:
- Architecture: x86_64
- CPU(s): 384
- Model name: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40 GHz (Intel Corporation, Santa Clara, CA, USA)
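The post-processing described in the evaluation steps of Section 4.10 (sorting alignment consistency rates in descending order, then counting predictions at 100% and at >90%) can be sketched as follows. This is a minimal illustration, not the authors' script: it assumes the Exonerate output has already been parsed into hypothetical `(gap_id, consistency_percent)` pairs, and it assumes that the ">90%" category also includes the 100% cases.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, float]  # (gap identifier, consistency rate in percent); assumed format


def summarize_consistency(records: List[Record]) -> Tuple[List[Record], Dict[str, int]]:
    """Sort records by consistency rate (descending) and count the two reported tiers."""
    ranked = sorted(records, key=lambda r: r[1], reverse=True)
    exact = sum(1 for _, pct in ranked if pct == 100.0)
    # Assumption: the ">90%" tier is cumulative and therefore includes exact matches.
    above_90 = sum(1 for _, pct in ranked if pct > 90.0)
    return ranked, {"100%": exact, ">90%": above_90}


if __name__ == "__main__":
    # Made-up values for illustration only; they are not results from the study.
    demo = [("gap1", 87.5), ("gap2", 100.0), ("gap3", 93.2), ("gap4", 90.0)]
    ranked, counts = summarize_consistency(demo)
    for gap_id, pct in ranked:
        print(f"{gap_id}\t{pct:.1f}")
    print(counts)
```

Running the same summary once per gap-filling method gives directly comparable counts, which is the comparison the evaluation steps call for.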
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Species | Metric | Sealer | GapPredict | DLGapCloser |
---|---|---|---|---|
Saccharomyces cerevisiae S288C | Gap count | 273 | 273 | 273 |
Saccharomyces cerevisiae S288C | Gaps closed | 148 | 163 | 170 |
Schizosaccharomyces pombe | Gap count | 196 | 196 | 196 |
Schizosaccharomyces pombe | Gaps closed | 79 | 101 | 109 |
Neurospora crassa | Gap count | 1207 | 1207 | 1207 |
Neurospora crassa | Gaps closed | 484 | 485 | 501 |
Micromonas pusilla | Gap count | 141 | 141 | 141 |
Micromonas pusilla | Gaps closed | 73 | 77 | 83 |
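The counts in the table above can be turned into gap-closure rates (gaps closed divided by gap count) for easier comparison across genomes with different numbers of gaps. The sketch below is only a convenience calculation on the tabulated values; the names `GAP_RESULTS` and `closure_rates` are illustrative and do not come from the article.

```python
from typing import Dict

# Counts copied from the table above: total gaps and gaps closed per method.
GAP_RESULTS: Dict[str, Dict[str, int]] = {
    "Saccharomyces cerevisiae S288C": {"gaps": 273, "Sealer": 148, "GapPredict": 163, "DLGapCloser": 170},
    "Schizosaccharomyces pombe": {"gaps": 196, "Sealer": 79, "GapPredict": 101, "DLGapCloser": 109},
    "Neurospora crassa": {"gaps": 1207, "Sealer": 484, "GapPredict": 485, "DLGapCloser": 501},
    "Micromonas pusilla": {"gaps": 141, "Sealer": 73, "GapPredict": 77, "DLGapCloser": 83},
}


def closure_rates(results: Dict[str, Dict[str, int]]) -> Dict[str, Dict[str, float]]:
    """Return the percentage of gaps closed per species and method."""
    rates: Dict[str, Dict[str, float]] = {}
    for species, row in results.items():
        total = row["gaps"]
        rates[species] = {
            method: 100.0 * closed / total
            for method, closed in row.items()
            if method != "gaps"
        }
    return rates


if __name__ == "__main__":
    for species, row in closure_rates(GAP_RESULTS).items():
        formatted = ", ".join(f"{method}: {rate:.1f}%" for method, rate in row.items())
        print(f"{species}: {formatted}")
```

In all four genomes the DLGapCloser count is the highest of the three methods, so its closure rate is also the highest.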
Dataset | Homolog Accession Number | Accession Number | RefSeq | #Bases |
---|---|---|---|---|
Saccharomyces cerevisiae | SRR23920092 | ERR156523 | GCA_000146045.2 | 316.5M |
Schizosaccharomyces pombe | SRR26143067 | ERR9706986 | GCF_000002945.1 | 1.6G |
Neurospora crassa | ERR11413973 | SRR19285165 | GCF_000182925.2 | 2.9G |
Micromonas pusilla | SRR14462310 | SRR14462380 | GCF_000090985.2 | 1.6G |
Algorithm | Beam Search | Wave-Beam Search |
---|---|---|
Saccharomyces cerevisiae S288C | 68 | 73 |
Schizosaccharomyces pombe | 7 | 9 |
Neurospora crassa | 7 | 10 |
Micromonas pusilla | 24 | 26 |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Chen, Y.; Wang, G.; Zhang, T. Utilizing Deep Neural Networks to Fill Gaps in Small Genomes. Int. J. Mol. Sci. 2024, 25, 8502. https://doi.org/10.3390/ijms25158502