TCGA-My: A Systematic Repository for Systems Biology of Malaysian Colorectal Cancer
Abstract
1. Introduction
2. Materials and Methods
2.1. Data Collection and Organization
2.2. Functional Annotation
2.3. Data Normalization
- 1NF (First Normal Form): Removal of redundant variant/gene data. The ANNOVAR output repeats the same variants for multiple genes and patients.
- 2NF (Second Normal Form): Insertion of variant/gene primary keys. Each unique variant and gene was assigned its own primary key.
- 3NF (Third Normal Form): Insertion of foreign keys. The links between the variants and genes were converted into foreign keys.
- 4NF (Fourth Normal Form): Separation of variant and gene data into separate tables. The primary keys for variants and genes were transferred into a pivot table.
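The four steps above can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration under stated assumptions, not the TCGA-My schema: the table names (`variant`, `gene`, `variant_gene`), the column names and the sample rows are all hypothetical, and the source does not say which database engine was used.

```python
import sqlite3

# Hypothetical flattened annotation rows: (patient, variant, gene).
# The same variant repeats across genes and patients, as described
# for the ANNOVAR output.
raw_rows = [
    ("P1", "chr1:100A>G", "GENE_A"),
    ("P1", "chr1:100A>G", "GENE_B"),
    ("P2", "chr1:100A>G", "GENE_A"),
    ("P2", "chr2:200C>T", "GENE_B"),
]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per unique variant and per unique gene, each with a primary key.
    CREATE TABLE variant (id INTEGER PRIMARY KEY, label TEXT UNIQUE NOT NULL);
    CREATE TABLE gene (id INTEGER PRIMARY KEY, symbol TEXT UNIQUE NOT NULL);
    -- Pivot table: variant-gene links stored only as foreign keys.
    CREATE TABLE variant_gene (
        variant_id INTEGER NOT NULL REFERENCES variant(id),
        gene_id INTEGER NOT NULL REFERENCES gene(id),
        PRIMARY KEY (variant_id, gene_id)
    );
""")

for _patient, variant, gene in raw_rows:
    # INSERT OR IGNORE skips rows that would duplicate a unique value.
    conn.execute("INSERT OR IGNORE INTO variant (label) VALUES (?)", (variant,))
    conn.execute("INSERT OR IGNORE INTO gene (symbol) VALUES (?)", (gene,))
    conn.execute(
        """INSERT OR IGNORE INTO variant_gene (variant_id, gene_id)
           SELECT v.id, g.id FROM variant v, gene g
           WHERE v.label = ? AND g.symbol = ?""",
        (variant, gene),
    )

n_variants = conn.execute("SELECT COUNT(*) FROM variant").fetchone()[0]
n_genes = conn.execute("SELECT COUNT(*) FROM gene").fetchone()[0]
n_links = conn.execute("SELECT COUNT(*) FROM variant_gene").fetchone()[0]
```

In this sketch the four flattened rows collapse to two variants, two genes and three variant-gene links. Patient linkage is left out for brevity.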
2.4. Database Organization and Architecture
3. Results
3.1. Database Summary
3.2. Main Datasets
3.2.1. Samples
3.2.2. Variants
3.2.3. Genes
3.2.4. Metabolites
3.3. Database Interface and Access
- The homepage displays statistics for the main datasets, i.e., samples, variants, genes and metabolites. Each entry count can be clicked to reach the respective dataset. Menus for the about, browse and search functions are also included on this page. The search box on the homepage can be used to search for any ID or term in the datasets.
- The About page provides general information on TCGA-My and CRC.
- The Browse page lists all six TCGA-My datasets: four main datasets and two datasets that contain functional annotation information. These datasets can also be reached from the dropdown tab named Menu, which is found in the header of each primary page. The datasets on this page are as follows:
  - Sample dataset: contains genome and metabolome data for CRC. Samples are categorized by gender and ethnicity.
  - Variant dataset: contains variants obtained from the 13 samples of genome data for CRC. This dataset is also categorized by chromosome, DNA region, tissue occurrence and type of mutation.
  - Gene dataset: contains genes affected by the variants. A list of driver genes can also be obtained.
  - Metabolite dataset: contains metabolites profiled in the metabolome data for CRC. This dataset is categorized by class and regulation.
  - Gene ontology dataset: contains GO information (biological process, molecular function and cellular component) for the variant-affected genes in CRC.
  - Pathway dataset: contains pathway information for the genes and metabolites of CRC.
- The Search page offers two search options, i.e., simple search and advanced variant search. Simple search serves the same function as the search box on the homepage and in the header of the primary pages. Advanced search allows users to find variants with a combination of different keywords. For example, users can quickly search for the variants from a certain sample that are linked to a specific driver gene.
- The Help page provides an entity-relationship diagram and descriptions of the tables deposited in this database. The entity-relationship diagram shows the relationships between the datasets stored in this database, and the table information defines all terms used in TCGA-My. This page also provides contact details for questions or invitations to collaborate.
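The advanced variant search described above combines several keywords. A minimal sketch of that kind of combined filter follows, assuming the criteria are joined with a logical AND. The record fields, the function name `advanced_search` and the sample data are hypothetical and do not reflect the actual TCGA-My implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class VariantRecord:
    # Hypothetical fields; the real TCGA-My tables may differ.
    variant_id: str
    sample_id: str
    gene: str
    is_driver_gene: bool
    region: str


def advanced_search(
    records: List[VariantRecord],
    sample_id: Optional[str] = None,
    gene: Optional[str] = None,
    driver_only: bool = False,
    region: Optional[str] = None,
) -> List[VariantRecord]:
    """Return the records that match every supplied criterion (logical AND)."""
    hits = []
    for rec in records:
        if sample_id is not None and rec.sample_id != sample_id:
            continue
        if gene is not None and rec.gene != gene:
            continue
        if driver_only and not rec.is_driver_gene:
            continue
        if region is not None and rec.region != region:
            continue
        hits.append(rec)
    return hits


# Hypothetical records for illustration only.
records = [
    VariantRecord("V1", "S1", "GENE_A", True, "exonic"),
    VariantRecord("V2", "S1", "GENE_B", False, "intronic"),
    VariantRecord("V3", "S2", "GENE_A", True, "exonic"),
]

# Variants from sample S1 that are linked to the driver gene GENE_A.
hits = advanced_search(records, sample_id="S1", gene="GENE_A", driver_only=True)
```

Calling `advanced_search(records)` with no criteria returns every record, which mirrors an unfiltered browse of the variant dataset.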
4. Discussion
4.1. Strength of TCGA-My
4.2. Weaknesses of TCGA-My and Future Perspectives
4.3. Example of Applications
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Dataset | Number of Entries |
---|---|
Sample | 50 |
Variant | 1,517,841 |
COSMIC | 1113 |
dbSNP | 291,397 |
Gene | 23,695 |
PDB | 4420 |
RefSeq ncRNA | 2637 |
UniProt | 17,910 |
Metabolite | 89,256 |
Pathway | 344 |
KEGG | 186 |
PANTHER | 158 |
Gene ontology | 17,459 |
Patient | Gender | Age | Ethnicity | Diagnosis | Anatomical Location | Stage (TNM) | Stage (Dukes) |
---|---|---|---|---|---|---|---|
C187 | Male | 63 | Malay | Well differentiated adenocarcinoma | Rectosigmoid | pT3 N2 MX | C2 |
C330 | Male | 71 | Chinese | Well differentiated adenocarcinoma | Sigmoid colon | T3 N0 MX | B2 |
C404 | Male | 68 | Chinese | Well differentiated adenocarcinoma | Rectum | pT3 pN1a MX | - |
 | | | | | Sessile polyp in ascending colon | pT1 | A |
C414 | Male | 76 | Malay | Well differentiated adenocarcinoma (WHO Grade 1) | Sigmoid colon | pT3 pN1b pMX | C |
C449 | Male | 65 | Malay | Moderately differentiated adenocarcinoma | Rectosigmoid colon | pT3 N2 MX | C |
C476 | Male | 72 | Chinese | Well differentiated adenocarcinoma | Recto-sigmoidectomy | pT4a N1 MX | |
C194 | Female | 70 | Malay | Well differentiated adenocarcinoma | Sigmoid colon | - | B |
C273 | Female | 73 | Malay | Moderately differentiated adenocarcinoma | Rectosigmoid colon | pT1 N0 MX | A |
C373 | Female | 74 | Chinese | Moderately differentiated adenocarcinoma | Anterior resection specimen | T2 N0 MX | B |
C388 | Female | 65 | Chinese | Moderately differentiated adenocarcinoma | Anterior resection specimen | pT2 pN1 pMx | C |
C398 | Female | 71 | Chinese | Moderately differentiated adenocarcinoma | Sigmoid colon with bladder | pT4 N1 MX | C |
C467 | Female | 65 | Malay | Well differentiated adenocarcinoma | Rectum | T4b N1b pMX | C |
C474 | Female | 79 | Malay | Well differentiated adenocarcinoma | Left hemicolectomy | pT3 N0 MX | B1 |
Staging System | Component | Category | Explanation |
---|---|---|---|
TNM | Primary tumor (T) | T1 | Tumor invades submucosa. |
TNM | Primary tumor (T) | T2 | Tumor invades muscularis propria. |
TNM | Primary tumor (T) | T3 | Tumor invades into the subserosa or perirectal tissues via muscularis propria. |
TNM | Primary tumor (T) | T4 | Tumor has spread to other organs or structures directly and/or the visceral peritoneum. |
TNM | Primary tumor (T) | T4a | The tumor has expanded into the surface of the visceral peritoneum, where it has penetrated all layers of the colon. |
TNM | Primary tumor (T) | T4b | The tumor has spread to other organs or structures or has attached itself to them. |
TNM | Regional lymph node (N) | N0 | Negative regional lymph node metastases. |
TNM | Regional lymph node (N) | N1 | Metastases in one to three regional lymph nodes. |
TNM | Regional lymph node (N) | N1a | Tumor cells have been detected in one regional lymph node. |
TNM | Regional lymph node (N) | N1b | Tumor cells have been detected in two or three regional lymph nodes. |
TNM | Regional lymph node (N) | N2 | Metastases in four or more regional lymph nodes. |
TNM | Distant metastases (M) | MX | Distant metastases could not be assessed. |
Dukes | - | A | Tumor limited to the submucosa. |
Dukes | - | B | Tumor grows through the colon wall into muscular layers, no lymph nodes involved. |
Dukes | - | B1 | Into but not through the muscularis propria, nodes not involved. |
Dukes | - | B2 | Through the muscularis propria, nodes not involved. |
Dukes | - | C | Lymph node involved. |
Dukes | - | C2 | Through the muscularis propria with nodes involved. |
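
The TNM strings recorded for the patients combine the T, N and M categories explained above, sometimes with a leading "p" (the conventional prefix for pathological staging). A minimal sketch of splitting such a string into its components follows. The function name `parse_tnm` and the regular expression are assumptions for illustration and only cover the category codes listed in the staging table.

```python
import re
from typing import Dict

# Optional "p" prefix, then the component letter (T, N or M),
# then a digit or X with an optional a-c subcategory.
TNM_TOKEN = re.compile(r"^(p?)([TNM])([0-9Xx][a-c]?)$")


def parse_tnm(stage: str) -> Dict[str, str]:
    """Split a TNM string such as 'pT3 pN1a MX' into its components.

    Tokens that do not look like a TNM category (for example '-')
    are ignored, so a missing stage yields an empty dictionary.
    """
    result: Dict[str, str] = {}
    for token in stage.split():
        match = TNM_TOKEN.match(token)
        if match is None:
            continue
        _prefix, component, value = match.groups()
        # Normalize 'x' to 'X' so 'pMx' and 'MX' compare equal.
        result[component] = "X" if value.lower() == "x" else value
    return result


parsed = parse_tnm("pT3 pN1a MX")
```

Here `parsed` maps `"T"` to `"3"`, `"N"` to `"1a"` and `"M"` to `"X"`, and each value can be looked up against the explanations in the staging table.
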
DNA Region | Number of Variants | Description |
---|---|---|
Intergenic | 926,482 | Variant overlaps in intergenic region. |
Intronic | 409,632 | Variant overlaps in intronic region. |
Non-coding RNA, intronic | 84,913 | Non-coding transcript variant overlaps with one of the transcripts in the intronic region. |
Exonic | 8381 | Variant overlaps in exonic region. |
Upstream | 8855 | Variant overlaps a 1-kb region upstream of the transcription start site. |
Downstream | 9116 | Variant overlaps a 1-kb region downstream of the transcription termination site. |
UTR3 | 8603 | Variant overlaps in 3′ untranslated region. |
Upstream, downstream | 922 | Variant overlaps in both upstream and downstream regions. |
UTR5 | 1176 | Variant overlaps in 5′ untranslated region. |
Splicing | 108 | Variant overlaps in splice region. |
Non-coding RNA, splicing | 34 | Non-coding transcript variant overlaps with one of the transcripts in the splice region. |
Exonic, splicing | 2 | Variant overlaps in both exonic and splice regions. |
Type of Mutations | Number of Variants | Description |
---|---|---|
Nonsynonymous SNV | 3922 | A single nucleotide change that alters an amino acid of a protein. |
Frameshift insertion | 510 | Insertion of one or more nucleotides that shifts the codon reading frame. |
Frameshift deletion | 917 | Deletion of one or more nucleotides that shifts the codon reading frame. |
Stop-gain | 271 | Mutation caused by a nonsynonymous SNV, frameshift insertion or frameshift deletion that leads to the gain of a stop codon. |
Stop-loss | 9 | Mutation caused by a nonsynonymous SNV, frameshift insertion or frameshift deletion that leads to the loss of a stop codon. |
Non-frameshift deletion | 587 | Deletion of a number of nucleotides divisible by three, which does not shift the reading frame. |
Synonymous SNV | 2226 | A change of a single nucleotide that retains an amino acid of a protein. |
Non-frameshift insertion | 153 | Insertion of a number of nucleotides divisible by three, which does not shift the reading frame. |
Unknown | 223 | Unknown mutation. |
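
The frameshift and non-frameshift categories above differ only in whether the number of inserted or deleted nucleotides is divisible by three. A minimal sketch of that rule follows, assuming the reference and alternate alleles are given as plain nucleotide strings. The function name `classify_indel` is hypothetical, and the sketch ignores the other checks an annotation tool performs (for example, whether the variant lies in a coding region).

```python
def classify_indel(ref: str, alt: str) -> str:
    """Classify an allele change by its net length difference.

    A length change that is a multiple of three keeps the codon
    reading frame; any other length change shifts it.
    """
    delta = len(alt) - len(ref)
    if delta == 0:
        # Same length: a substitution, not an insertion or deletion.
        return "substitution"
    kind = "insertion" if delta > 0 else "deletion"
    frame = "non-frameshift" if abs(delta) % 3 == 0 else "frameshift"
    return f"{frame} {kind}"


label_three_added = classify_indel("A", "ATGC")   # three bases inserted
label_two_removed = classify_indel("ATG", "A")    # two bases deleted
```

Because only the length difference matters, the rule gives the same answer whether or not the alleles share a leading anchor base.
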
Patient | Number of Genes | Number of Driver Genes |
---|---|---|
C187 | 11,988 | 6 |
C194 | 11,644 | 7 |
C273 | 11,837 | 5 |
C373 | 11,951 | 11 |
C404 | 13,188 | 6 |
C414 | 12,446 | 9 |
C449 | 13,888 | 5 |
C474 | 23,213 | 12 |
C330 | 11,989 | 2 |
C388 | 11,515 | 8 |
C398 | 11,763 | 2 |
C467 | 12,489 | 7 |
C476 | 11,666 | 3 |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Azuwar, M.A.; Muhammad, N.A.N.; Afiqah-Aleng, N.; Ab Mutalib, N.-S.; Md. Yusof, N.F.; Mohd Yunos, R.I.; Ishak, M.; Saidin, S.; Rose, I.M.; Sagap, I.; et al. TCGA-My: A Systematic Repository for Systems Biology of Malaysian Colorectal Cancer. Life 2022, 12, 772. https://doi.org/10.3390/life12060772
APA Style: Azuwar, M. A., Muhammad, N. A. N., Afiqah-Aleng, N., Ab Mutalib, N.-S., Md. Yusof, N. F., Mohd Yunos, R. I., Ishak, M., Saidin, S., Rose, I. M., Sagap, I., Mazlan, L., Mohd Azman, Z. A., Mazlan, M., Ab Rahim, S., Wan Ngah, W. Z., Nathan, S., Hashim, N. A. A., Mohamed-Hussein, Z.-A., & Jamal, R. (2022). TCGA-My: A Systematic Repository for Systems Biology of Malaysian Colorectal Cancer. Life, 12(6), 772. https://doi.org/10.3390/life12060772