A Study on the Prediction of Cancer Using Whole-Genome Data and Deep Learning
Abstract
1. Introduction
2. Results
2.1. Experiment Environment
2.2. Experimental Procedure
2.3. Dataset
2.4. Reliability Evaluation of Cancer Prediction Results
3. Discussion
4. Materials and Methods
4.1. Overall Flow Chart of This Study
4.2. Data Preprocessing
4.2.1. Data Classification
4.2.2. Data Screening
4.2.3. Data Transformation
4.3. Deep Learning Network Architecture
Backpropagation
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Siegel, R.L.; Miller, K.D.; Fuchs, H.E.; Jemal, A. Cancer Statistics, 2021. CA Cancer J. Clin. 2021, 71, 7–33.
- Siegel, R.L.; Miller, K.D.; Jemal, A. Cancer statistics, 2020. CA Cancer J. Clin. 2020, 70, 7–30.
- Rajkomar, A.; Oren, E.; Chen, K.; Dai, A.M.; Hajaj, N.; Hardt, M.; Liu, P.J.; Liu, X.; Marcus, J.; Sun, M.; et al. Scalable and accurate deep learning with electronic health records. NPJ Digit. Med. 2018, 1, 18.
- Greenman, C.; Stephens, P.; Smith, R.; Dalgliesh, G.L.; Hunter, C.; Bignell, G.; Davies, H.; Teague, J.; Butler, A.; Stevens, C.; et al. Patterns of somatic mutation in human cancer genomes. Nature 2007, 446, 153–158.
- Emilsson, V.; Thorleifsson, G.; Zhang, B.; Leonardson, A.S.; Zink, F.; Zhu, J.; Carlson, S.; Helgason, A.; Walters, G.B.; Gunnarsdottir, S.; et al. Genetics of gene expression and its effect on disease. Nature 2008, 452, 423–428.
- Chun, H.-W.; Tsuruoka, Y.; Kim, J.-D.; Shiba, R.; Nagata, N.; Hishiki, T.; Tsujii, J. Extraction of gene-disease relations from Medline using domain dictionaries and machine learning. Biocomputing 2006, 2006, 4–15.
- Shuch, B.; Vourganti, S.; Ricketts, C.J.; Middleton, L.; Peterson, J.; Merino, M.J.; Metwalli, A.R.; Srinivasan, R.; Linehan, W.M. Defining early-onset kidney cancer: Implications for germline and somatic mutation testing and clinical management. J. Clin. Oncol. 2014, 32, 431.
- Gilissen, C.; Hoischen, A.; Brunner, H.G.; Veltman, J. Disease gene identification strategies for exome sequencing. Eur. J. Hum. Genet. 2012, 20, 490–497.
- Van Dam, S.; Vosa, U.; van der Graaf, A.; Franke, L.; de Magalhaes, J.P. Gene co-expression analysis for functional classification and gene-disease predictions. Brief. Bioinform. 2018, 19, 575–592.
- Martincorena, I.; Campbell, P.J. Somatic mutation in cancer and normal cells. Science 2015, 349, 1483–1489.
- Antoniou, A.C.; Pharoah, P.D.P.; McMullan, G.; Day, N.E.; Stratton, M.R.; Peto, J.; Ponder, B.J.; Easton, D.F. A comprehensive model for familial breast cancer incorporating BRCA1, BRCA2 and other genes. Br. J. Cancer 2002, 86, 76–83.
- Levy-Lahad, E.; Friedman, E. Cancer risks among BRCA1 and BRCA2 mutation carriers. Br. J. Cancer 2007, 96, 11–15.
- Petrucelli, N.; Daly, M.B.; Pal, T. BRCA1- and BRCA2-Associated Hereditary Breast and Ovarian Cancer; University of Washington: Seattle, WA, USA, 2016.
- Gu, J.; Wang, Z.; Kuen, J.; Ma, L.; Shahroudy, A.; Shuai, B.; Liu, T.; Wang, X.; Wang, G.; Cai, J.; et al. Recent advances in convolutional neural networks. Pattern Recognit. 2018, 77, 354–377.
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30.
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Volume 28.
- Deng, L.; Hinton, G.; Kingsbury, B. New types of deep neural network learning for speech recognition and related applications: An overview. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 8599–8603.
- Erhan, D.; Szegedy, C.; Toshev, A.; Anguelov, D. Scalable object detection using deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 2147–2154.
- Um, T.T.; Pfister, F.M.J.; Pichler, D.; Endo, S.; Lang, M.; Hirche, S.; Fietzek, U.; Kulić, D. Data augmentation of wearable sensor data for Parkinson’s disease monitoring using convolutional neural networks. In Proceedings of the 19th ACM International Conference on Multimodal Interaction, Glasgow, UK, 13–17 November 2017; pp. 216–220.
- Jiao, W.; Atwal, G.; Polak, P.; Karlic, R.; Cuppen, E.; Danyi, A.; de Ridder, J.; van Herpen, C.; Lolkema, M.P.; Steeghs, N.; et al. A deep learning system accurately classifies primary and metastatic cancers using passenger mutation patterns. Nat. Commun. 2020, 11, 728.
- Lee, Y.-J.; Park, J.-H.; Chung, H.-Y.; Kim, K.-M.; Lee, S.-H. A Data Augmentation Methodology for Predicting the Association of Microbiome Community and Diseases Based on Artificial Intelligence. J. Inst. Electron. Inf. Eng. 2021, 58, 59–66.
- Sun, Y.; Zhu, S.; Ma, K.; Liu, W.; Yue, Y.; Hu, G.; Lu, H.; Chen, W. Identification of 12 cancer types through genome deep learning. Sci. Rep. 2019, 9, 17256.
- 1000 Genomes Project Consortium. A map of human genome variation from population scale sequencing. Nature 2010, 467, 1061.
- Koomsubha, T.; Vateekul, P. A character-level convolutional neural network with dynamic input length for Thai text categorization. In Proceedings of the 2017 9th International Conference on Knowledge and Smart Technology (KST), Chon Buri, Thailand, 1–4 February 2017; IEEE: Piscataway, NJ, USA, 2017.
- Zhang, X.; Zhao, J.; LeCun, Y. Character-level convolutional networks for text classification. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Volume 28.
- Conneau, A.; Schwenk, H.; Barrault, L.; LeCun, Y. Very deep convolutional networks for text classification. arXiv 2016, arXiv:1606.01781.
- Tomczak, K.; Czerwińska, P.; Wiznerowicz, M. The Cancer Genome Atlas (TCGA): An immeasurable source of knowledge. Contemp. Oncol./Współczesna Onkol. 2015, 2015, 68–77.
- 33 TCGA Cancer Projects Summary. Available online: https://genome.ucsc.edu/cgi-bin/hgTables?db=hg38&hgta_group=phenDis&hgta_track=gdcCancer&hgta_table=allCancer&hgta_doSchema=describe+table+schema (accessed on 6 May 2019).
- Visa, S.; Ramsay, B.; Ralescu, A.L.; van der Knaap, E. Confusion matrix-based feature selection. MAICS 2011, 710, 120–127.
- Bradley, A.P. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit. 1997, 30, 1145–1159.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016.
- Ding, L.; Bailey, M.H.; Porta-Pardo, E.; Thorsson, V.; Colaprico, A.; Bertrand, D.; Gibbs, D.L.; Weerasinghe, A.; Huang, K.-L.; Tokheim, C.; et al. Perspective on oncogenic processes at the end of the beginning of cancer genomics. Cell 2018, 173, 305–320.
- Ruder, S. An overview of gradient descent optimization algorithms. arXiv 2016, arXiv:1609.04747.
- Hochreiter, S. The vanishing gradient problem during learning recurrent neural nets and problem solutions. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 1998, 6, 107–116.
- Sun, D.; Wulff, J.; Sudderth, E.B.; Pfister, H.; Black, M.J. A fully-connected layered model of foreground and background flow. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013.
- Kalchbrenner, N.; Grefenstette, E.; Blunsom, P. A convolutional neural network for modelling sentences. arXiv 2014, arXiv:1404.2188.
- Nair, V.; Hinton, G.E. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 21–24 June 2010.
- Hecht-Nielsen, R. Theory of the backpropagation neural network. In Neural Networks for Perception; Academic Press: Cambridge, MA, USA, 1992; pp. 65–93.
- Werbos, P.J. Backpropagation through time: What it does and how to do it. Proc. IEEE 1990, 78, 1550–1560.
Label | Cancer Name | Abbreviation | Training Set | Validation Set | Test Set | Total |
---|---|---|---|---|---|---|
1 | Bladder urothelial carcinoma | BLCA | 314 | 63 | 16 | 393 |
2 | Breast invasive carcinoma | BRCA | 776 | 157 | 38 | 971 |
3 | Colon adenocarcinoma | COAD | 258 | 52 | 13 | 323 |
4 | Glioblastoma multiforme | GBM | 304 | 61 | 16 | 381 |
5 | Skin cutaneous melanoma | SKCM | 285 | 58 | 14 | 357 |
6 | Kidney renal clear cell carcinoma | KIRC | 268 | 53 | 14 | 335 |
7 | Brain lower-grade glioma | LGG | 405 | 81 | 21 | 507 |
8 | Lung squamous cell carcinoma | LUSC | 379 | 76 | 19 | 474 |
9 | Ovarian serous cystadenocarcinoma | OV | 344 | 68 | 18 | 430 |
10 | Prostate adenocarcinoma | PRAD | 394 | 79 | 20 | 493 |
11 | Thyroid carcinoma | THCA | 393 | 79 | 20 | 492 |
12 | Uterine corpus endometrial carcinoma | UCEC | 318 | 65 | 15 | 398 |
Total | 12 | 12 | 4438 | 892 | 224 | 5554 |
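The sample counts in the table above imply how the data were divided among the three subsets. The following is a minimal sketch that only recomputes the proportions from the tabulated counts; the names `SPLITS` and `split_percentages` are hypothetical, and the split ratio itself is not stated in this part of the source, so the percentages are derived values rather than the authors' stated design.

```python
# Per-class sample counts copied from the table above: (training, validation, test).
SPLITS = {
    "BLCA": (314, 63, 16), "BRCA": (776, 157, 38), "COAD": (258, 52, 13),
    "GBM": (304, 61, 16), "SKCM": (285, 58, 14), "KIRC": (268, 53, 14),
    "LGG": (405, 81, 21), "LUSC": (379, 76, 19), "OV": (344, 68, 18),
    "PRAD": (394, 79, 20), "THCA": (393, 79, 20), "UCEC": (318, 65, 15),
}


def split_percentages(train, val, test):
    """Return each subset's share of the class total, in percent (one decimal)."""
    total = train + val + test
    return tuple(round(100 * n / total, 1) for n in (train, val, test))


# Column totals, obtained by summing each position across the twelve classes.
overall = tuple(sum(counts[i] for counts in SPLITS.values()) for i in range(3))

if __name__ == "__main__":
    for code, counts in SPLITS.items():
        print(code, split_percentages(*counts))
    print("ALL", overall, split_percentages(*overall))
```

Every class comes out close to the same proportions (about 80% training, 16% validation, 4% test), which suggests the split was applied per cancer type rather than over the pooled samples.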
Label | Cancer Name | Abbreviation | Test Set | Precision (%) | Sensitivity (%) | Specificity (%) | F-Score (%) |
---|---|---|---|---|---|---|---|
1 | Bladder urothelial carcinoma | BLCA | 16 | 58.8 | 62.5 | 96.63 | 60.6 |
2 | Breast invasive carcinoma | BRCA | 38 | 70.7 | 76.3 | 93.55 | 73.4 |
3 | Colon adenocarcinoma | COAD | 13 | 100.0 | 76.9 | 100.0 | 87.0 |
4 | Glioblastoma multiforme | GBM | 16 | 58.8 | 62.5 | 96.63 | 60.6 |
5 | Skin cutaneous melanoma | SKCM | 14 | 68.8 | 78.6 | 97.62 | 73.3 |
6 | Kidney renal clear cell carcinoma | KIRC | 14 | 68.4 | 92.9 | 97.14 | 78.8 |
7 | Brain lower-grade glioma | LGG | 21 | 93.8 | 71.4 | 99.51 | 81.1 |
8 | Lung squamous cell carcinoma | LUSC | 19 | 94.4 | 89.5 | 99.51 | 91.9 |
9 | Ovarian serous cystadenocarcinoma | OV | 18 | 76.5 | 72.2 | 98.06 | 74.3 |
10 | Prostate adenocarcinoma | PRAD | 20 | 64.0 | 80.0 | 95.59 | 71.1 |
11 | Thyroid carcinoma | THCA | 20 | 87.5 | 70.0 | 99.02 | 77.8 |
12 | Uterine corpus endometrial carcinoma | UCEC | 15 | 66.7 | 53.3 | 98.09 | 59.3 |
Average | | | | 75.7 | 73.84 | 97.61 | 74.1 |
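The per-class figures in the table above are consistent with the usual one-vs-rest definitions of precision, sensitivity, specificity and F-score. The sketch below illustrates those definitions on the BLCA row. The confusion counts (TP = 10, FP = 7, FN = 6, TN = 201) are an assumption: they are not reported in the source and were back-calculated here from the tabulated percentages and the 224-sample test set. The function name `one_vs_rest_metrics` is hypothetical.

```python
def one_vs_rest_metrics(tp, fp, fn, tn):
    """Return precision, sensitivity, specificity and F-score in percent.

    Assumes every denominator is non-zero, i.e. the class has at least one
    true sample and at least one predicted sample.
    """
    precision = tp / (tp + fp)        # share of predicted positives that are correct
    sensitivity = tp / (tp + fn)      # share of actual positives that are found (recall)
    specificity = tn / (tn + fp)      # share of actual negatives that are rejected
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "precision": 100 * precision,
        "sensitivity": 100 * sensitivity,
        "specificity": 100 * specificity,
        "f_score": 100 * f_score,
    }


# Assumed BLCA counts (back-calculated, not taken from the source):
# 16 BLCA test samples, 208 test samples from the other eleven classes.
blca = one_vs_rest_metrics(tp=10, fp=7, fn=6, tn=201)

if __name__ == "__main__":
    for name, value in blca.items():
        print(f"{name}: {value:.2f}")
```

With these counts the function gives the tabulated BLCA values after rounding, which is a useful sanity check when re-deriving any other row.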
Method | Accuracy (%) | Sensitivity (%) | Specificity (%) |
---|---|---|---|
AlexNet [33] | 56.68 | 56.89 | 95.62 |
ResNet18 [34] | 66.07 | 64.95 | 96.84 |
ResNet34 [34] | 67.69 | 66.21 | 96.99 |
The Proposed Method | 74.11 | 73.84 | 97.61 |
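The accuracy and sensitivity reported for the proposed method in the table above can be related to the per-class test results. The following is a rough sketch, assuming that accuracy is the fraction of all 224 test samples classified correctly and that sensitivity is the unweighted mean over the twelve classes. The per-class counts of correct predictions are not given in the source; here they are reconstructed by rounding test-set size times sensitivity, and all variable names are hypothetical.

```python
# Test-set sizes and per-class sensitivities (%) copied from the per-class results table.
TEST_SIZE = {
    "BLCA": 16, "BRCA": 38, "COAD": 13, "GBM": 16, "SKCM": 14, "KIRC": 14,
    "LGG": 21, "LUSC": 19, "OV": 18, "PRAD": 20, "THCA": 20, "UCEC": 15,
}
SENSITIVITY = {
    "BLCA": 62.5, "BRCA": 76.3, "COAD": 76.9, "GBM": 62.5, "SKCM": 78.6, "KIRC": 92.9,
    "LGG": 71.4, "LUSC": 89.5, "OV": 72.2, "PRAD": 80.0, "THCA": 70.0, "UCEC": 53.3,
}

# Reconstructed (assumed) number of correctly classified samples per class.
correct = {c: round(TEST_SIZE[c] * SENSITIVITY[c] / 100) for c in TEST_SIZE}

# Overall accuracy: correct predictions over all test samples.
accuracy = 100 * sum(correct.values()) / sum(TEST_SIZE.values())

# Macro-averaged sensitivity: unweighted mean of the per-class values.
macro_sensitivity = sum(SENSITIVITY.values()) / len(SENSITIVITY)

if __name__ == "__main__":
    print(f"accuracy: {accuracy:.2f}")
    print(f"macro sensitivity: {macro_sensitivity:.2f}")
```

Under these assumptions 166 of 224 samples are correct, and both results round to the figures listed for the proposed method. Accuracy weights each sample equally, whereas the averaged sensitivity weights each cancer type equally, which is why the two numbers differ slightly.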
Method | Accuracy (%) | Sensitivity (%) | Specificity (%) |
---|---|---|---|
Sun et al. [24] (using the WES tumor germline variants and somatic mutation data) | 70.4 | 65.92 | 96.27 |
The Proposed Method (using only somatic mutation data) | 74.11 | 73.84 | 97.61 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Lee, Y.-J.; Park, J.-H.; Lee, S.-H. A Study on the Prediction of Cancer Using Whole-Genome Data and Deep Learning. Int. J. Mol. Sci. 2022, 23, 10396. https://doi.org/10.3390/ijms231810396