Transcriptomic Harmonization as the Way for Suppressing Cross-Platform Bias and Batch Effect
Abstract
1. The Problem of Transcriptomic Data Harmonization
2. Principles of Harmonization Algorithms
- (1) Methods based on statistical transformations (considering quantiles, ranks, means, medians of gene expression levels, etc.):
- (a) Those ranking the expression levels and setting the output levels according to the averaged values, such as QN [28], Feature-Specific QN (FCQN) [64], Quantile Discretization (QD) [36], Gene Quantiles (GQ) [43], Normalized Discretization (NorDi) [37], Distribution Transformation (DisTran) [36,38], Median Rank Scores (MRS) [36], YuGene [71], and Rank-in [68];
- (b)
- (2) Methods using regression and/or maximum likelihood models for validation of predefined statistical hypotheses:
- (a)
- (b) Those assuming a log-normal distribution, with either covariance analysis [73] or conditional/Bayesian models, as in the methods Universal exPression Code (UPC) [63,74], Empirical Bayes (ComBat) [39], Robust Multiarray Average (RMA) [75], GeneChip Robust Multiarray Analysis (gcRMA) [76], Model-Based Expression Indices (MBEI) [77], Probe Logarithmic Intensity ERror (PLIER) estimation [78], frozen Robust Multiarray Analysis (fRMA) [79,80,81,82], MatchMixeR (MM) [66], and Cross-Platform Comparison (XPC) [83];
- (c) Those using Dirichlet and gamma distributions, as in the PLatform-Independent Latent Dirichlet Allocation (PLIDA) method [30];
- (d)
- (e) Those using the Least Absolute Shrinkage and Selection Operator (LASSO) regression models [69];
- (3) Methods finding similar clusters in gene expression matrices of the datasets under normalization and then using iterative corrections to fit each cluster as closely as possible to the target model:
- (4) Methods utilizing machine learning (ML) to find and artificially remove dissimilarities between the datasets to be normalized:
3. Evaluation of the Quality of Harmonization
- (1) First, different statistical criteria may be used to estimate the following endpoints:
- (2) Alternatively, one may classify the samples according to gene expression data after normalization, involving various machine learning (ML) methods:
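The classification-based check in item (2) can be illustrated with a minimal sketch. It assumes two already harmonized datasets: a classifier is trained on samples from one platform and tested on samples from the other. A plain nearest-centroid rule is used here as a simplified stand-in for the classifiers named in the comparison table; the function names and the toy values are made up for illustration and are not taken from any of the cited studies.

```python
import numpy as np

def fit_centroids(expr, labels):
    """Mean expression profile per class; expr is samples x genes."""
    classes = np.unique(labels)
    centroids = np.stack([expr[labels == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict(expr, classes, centroids):
    """Assign each sample to the class with the nearest centroid."""
    dists = np.linalg.norm(expr[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]

# Toy harmonized data (hypothetical): platform A for training, platform B for testing.
train_expr = np.array([[1.0, 5.0], [1.2, 4.8], [4.0, 1.0], [4.2, 0.8]])
train_labels = np.array(["tumor", "tumor", "normal", "normal"])
test_expr = np.array([[0.9, 5.1], [4.1, 1.1]])
test_labels = np.array(["tumor", "normal"])

classes, centroids = fit_centroids(train_expr, train_labels)
predicted = predict(test_expr, classes, centroids)
accuracy = np.mean(predicted == test_labels)
print(predicted, accuracy)
```

If harmonization leaves a strong platform bias, a classifier trained on one platform would be expected to perform poorly on the other, so cross-platform accuracy serves as an indirect quality measure.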
Reference for Comparison | Methods | Materials | Experimental Platform | Qualitative Criteria | Quantitative Criteria | Best Methods |
---|---|---|---|---|---|---|
[29] | Cross-Platform Normalization, XPN [29]; Column Sample (CS); Median Center (MC); Empirical Bayes (EB) [39]; Distance-Weighted Discrimination (DWD) [41,42] | Three breast cancer datasets [96,97,98] | Affymetrix GeneChip U95Av2 arrays [96]; 25K Agilent oligonucleotide arrays [97,98] | ⸺ | Average distance to nearest sample in another platform; correlation with column standardization data; global integrative correlation; preservation of significantly differential genes | XPN |
[31] | XPN; DWD; EB (ComBat) [39]; Median Rank Scores (MRS) [36]; Quantile Discretization (QD) [36]; Normalized Discretization (NorDi) [37]; Distribution Transformation (DisTran) [36,38]; Gene Quantiles (GQ) [43]; Quantile Normalization (QN) [28] | MAQC dataset [17,18,19] | Human Genome Survey Microarray v2.0; Agilent-012391 Whole Human Genome Oligo Microarray G4112A; Affymetrix Human Genome U133 Plus 2.0 Array; Illumina Sentrix Human-6 Expression Beadchip | ⸺ | Mean-mean regression; cross-dataset data transfer for linear SVM [93] and nearest shrunken centroids [94] classification | XPN (for datasets of comparable size); DWD (for datasets of non-comparable size) |
[30] | XPN; DWD; platform-independent latent Dirichlet allocation (PLIDA) [30] | Prostate cancer datasets [99,100]; Breast cancer datasets [97,101]; MAQC | Affymetrix Human Genome U133 Array; Agilent Human 1A (V2); Human Genome Survey Microarray v2.0; Agilent-012391 Whole Human Genome Oligo Microarray G4112A; Affymetrix Human Genome U133 Plus 2.0 Array; Illumina Sentrix Human-6 Expression Beadchip | Visual inspection of PCA plots. | Correlation analysis between the profiles before and after normalization; cross-dataset data transfer for logistic regression classification [92] | PLIDA |
[67] | MatchMixeR (MM) [66]; DWD; XPN; ComBat | NCI60 cell lines (dataset 1: 58 lines; dataset 2: 59 lines) | Affymetrix Human Genome U133A array; Human Genome U133 Plus 2.0 Array; Agilent Human Genome Whole Microarray; Illumina HiSeq 2000 | ⸺ | R2 score analysis (R2 is the proportion of the variation in the dependent variable that is predictable from the independent variable) [102]; F1 score analysis (F1 score is the harmonic mean of precision and recall) [103,104] | MM |
[32] | Shambhala-1; QN; Differential Gene Expression in Sequencing 2 (DESeq2) [59,60,61] | MAQC; SEQC datasets [27] | Agilent-012391 Whole Human Genome Oligo Microarray G4112A; Affymetrix Human Genome U133 Plus 2.0 Array; Illumina Sentrix Human-6 Expression Beadchip; Illumina HiSeq 2000; Illumina HumanHT-12 V4.0 expression beadchip; Affymetrix Human Gene 2.0 ST Array; Affymetrix GeneChip® PrimeView™ Human Gene Expression Array | Visual inspection of cluster dendrograms | ⸺ | Shambhala-1 (linear Shambhala) |
[34] | CuBlock [34]; ComBat [39]; YuGene [71]; DBNorm [105]; Shambhala-1 [32]; Universal exPression Code (UPC) [63] | MAQC | Agilent-012391 Whole Human Genome Oligo Microarray G4112A; Affymetrix Human Genome U133 Plus 2.0 Array; Illumina Sentrix Human-6 Expression Beadchip | Visual inspection of cluster dendrograms and PCA plots | Cross-dataset data transfer for support vector machine (SVM) classification [93] | CuBlock |
[33] | Shambhala-2; Shambhala-1; QN; DESeq2; CuBlock; robust QN (QNR) [91]; Training Distribution Machine (TDM) [62]; UPC | GTEx [11]; The Cancer Genome Atlas (TCGA) [10]; Oncobox Atlas of Normal Tissue Expression (ANTE) [13]; MAQC; SEQC | Illumina HiSeq 2000; Illumina HiSeq 3000; Agilent-012391 Whole Human Genome Oligo Microarray G4112A; Affymetrix Human Genome U133 Plus 2.0 Array; Illumina Sentrix Human-6 Expression Beadchip; Illumina HumanHT-12 V4.0 expression beadchip; Affymetrix Human Gene 2.0 ST Array; Affymetrix GeneChip® PrimeView™ Human Gene Expression Array | Visual inspection of PCA plots | Watermelon Multisection metric for quantitative assessment of clustering on dendrograms [95] | Shambhala-2 (cubic Shambhala) |
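Two of the quantitative criteria in the table above, the R2 score and the F1 score, follow directly from the definitions given there. The sketch below computes both with NumPy on made-up toy values; the function names are illustrative, and how the cited study applied these scores to expression data is not specified here.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return 1.0 - ss_res / ss_tot

def f1_score(labels_true, labels_pred, positive=1):
    """Harmonic mean of precision and recall for one positive class."""
    t = np.asarray(labels_true) == positive
    p = np.asarray(labels_pred) == positive
    tp = np.sum(t & p)                 # true positives
    if tp == 0:
        return 0.0
    precision = tp / np.sum(p)
    recall = tp / np.sum(t)
    return 2 * precision * recall / (precision + recall)

# Toy values, made up for illustration only.
r2 = r2_score([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
f1 = f1_score([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(r2, f1)
```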
4. Application Notes
5. Conclusions
Reference | Method | Mathematical Principle | Algorithmic Complexity | Advantages | Shortcomings |
---|---|---|---|---|---|
[28] | Quantile normalization (QN) | Ranking the expression levels of different genes within each profile and setting the expression level of each gene to the mean value (over all profiles) for the respective rank | Relatively simple | Gold standard for intra-platform normalization of MH data | Not recommended for cross-platform harmonization of MH data; requires recalculation of all gene expression-based values after addition of new samples |
[59,60,61] | Differential Gene Expression in Sequencing 2 (DESeq2) | Transform based on the negative binomial distribution | Moderately complex | Gold standard for intra-platform normalization of RNAseq data | Requires recalculation of all gene expression-based values after addition of new samples |
[29] | Cross-Platform Normalization (XPN) | Piecewise linear iterative transform | Relatively complex | Method of choice for harmonization of two datasets of comparable size | Does not allow normalization of more than two datasets; not recommended for subsequent application to other datasets; requires recalculation of all gene expression-based values after addition of new samples |
[34] | CuBlock | Piecewise cubic iterative transform | Relatively complex | Method of choice for cross-platform normalization of more than two MH datasets | Requires recalculation of all gene expression-based values after addition of new samples |
[32] | Shambhala-1 (linear Shambhala) | Uniformly shaped harmonization based on the XPN method | Complex | Works for harmonization of an unlimited number of datasets of any size, for both MH and RNAseq data or their combinations; does not require recalculation of gene expression-based values after addition of new samples | Resource-demanding |
[33] | Shambhala-2 (cubic Shambhala) | Uniformly shaped harmonization based on the CuBlock method | Complex | Works for harmonization of an unlimited number of datasets of any size, for both MH and RNAseq data or their combinations; does not require recalculation of gene expression-based values after addition of new samples | Resource-demanding |
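The QN principle stated in the first row of the table above (rank the genes within each profile, then assign each gene the mean value over all profiles for its rank) can be sketched in a few lines of NumPy. This is a minimal illustration on made-up numbers, assuming a genes x profiles matrix without missing values and without special handling of ties; it is not the implementation from [28].

```python
import numpy as np

def quantile_normalize(expr: np.ndarray) -> np.ndarray:
    """Quantile-normalize a genes x profiles matrix (ties are not averaged)."""
    # Rank of each gene within its own profile (0 = lowest expression).
    order = np.argsort(expr, axis=0, kind="stable")
    ranks = np.argsort(order, axis=0, kind="stable")
    # Mean expression level at each rank, averaged over all profiles.
    rank_means = np.sort(expr, axis=0).mean(axis=1)
    # Each gene receives the mean value for its rank.
    return rank_means[ranks]

# Toy matrix: 4 genes (rows) x 3 profiles (columns), made up for illustration.
profiles = np.array([[5.0, 4.0, 3.0],
                     [2.0, 1.0, 4.0],
                     [3.0, 4.5, 6.0],
                     [4.0, 2.0, 8.0]])
normalized = quantile_normalize(profiles)
print(normalized)
```

After the transform every profile has the same distribution of values. Because the rank means depend on all profiles, adding a new sample changes them, which is why the table lists recalculation after addition of new samples as a shortcoming.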
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
List of Acronyms:
ANTE | Atlas of Normal Tissue Expression |
ComBat | COMpensation of BATch effects |
CS | Column Sample |
CuBlock | Cubic Blocks |
DBNorm | Distribution-Based Normalization |
DisTran | Distribution Transformation |
DESeq(2) | Differential Gene Expression in Sequencing (2) |
DWD | Distance-Weighted Discrimination |
EB | Empirical Bayes |
ESLR | Elastic Shared LASSO Regularization |
FCQN | Feature-Specific QN |
fRMA | Frozen Robust Multiarray Analysis |
gcRMA | GeneChip Robust Multiarray Analysis |
GEO | Gene Expression Omnibus |
GQ | Gene Quantiles |
GTEx | Genotype Tissue Expression |
IBN | Integrative Bayesian Network |
IG | Information Gain |
LASSO | Least Absolute Shrinkage and Selection Operator |
MAQC | Microarray Quality Control |
MBEI | Model-Based Expression Indices |
MC | Median Center |
MH | Microarray Hybridization |
ML | Machine Learning |
MM | MatchMixeR |
MRS | Median Rank Scores |
NGS | Next-Generation Sequencing |
NorDi | Normalized Discretization |
PAM | Prediction Analysis for Microarrays |
PCA | Principal Component Analysis |
PLIER | Probe Logarithmic Intensity ERror |
PLIDA | PLatform-Independent Latent Dirichlet Allocation |
PRIDE | PRoteomics Identification DatabasE |
QD | Quantile Discretization |
QN | Quantile Normalization |
QNR | Quantile Normalization (Robust) |
RMA | Robust Multiarray Average |
SEQC | Sequencing Quality Control |
SVM | Support Vector Machine |
TCGA | The Cancer Genome Atlas |
TDM | Training Distribution Machine |
UPC | Universal exPression Code |
WM | Watermelon Multisection |
XPC | Cross-Platform Comparison |
XPN | Cross-Platform Normalization |
References
- Lashkari, D.A.; DeRisi, J.L.; McCusker, J.H.; Namath, A.F.; Gentile, C.; Hwang, S.Y.; Brown, P.O.; Davis, R.W. Yeast Microarrays for Genome Wide Parallel Genetic and Gene Expression Analysis. Proc. Natl. Acad. Sci. USA 1997, 94, 13057–13062.
- King, H.C.; Sinha, A.A. Gene Expression Profile Analysis by DNA Microarrays: Promise and Pitfalls. JAMA 2001, 286, 2280.
- Bednár, M. DNA Microarray Technology and Application. Med. Sci. Monit. 2000, 6, 796–800.
- Rew, D.A. DNA Microarray Technology in Cancer Research. Eur. J. Surg. Oncol. 2001, 27, 504–508.
- Edgar, R.; Domrachev, M.; Lash, A.E. Gene Expression Omnibus: NCBI Gene Expression and Hybridization Array Data Repository. Nucleic Acids Res. 2002, 30, 207–210.
- Brazma, A.; Hingamp, P.; Quackenbush, J.; Sherlock, G.; Spellman, P.; Stoeckert, C.; Aach, J.; Ansorge, W.; Ball, C.A.; Causton, H.C.; et al. Minimum Information about a Microarray Experiment (MIAME)-toward Standards for Microarray Data. Nat. Genet. 2001, 29, 365–371.
- Rocca-Serra, P.; Brazma, A.; Parkinson, H.; Sarkans, U.; Shojatalab, M.; Contrino, S.; Vilo, J.; Abeygunawardena, N.; Mukherjee, G.; Holloway, E.; et al. ArrayExpress: A Public Database of Gene Expression Data at EBI. Comptes Rendus Biol. 2003, 326, 1075–1078.
- Parkinson, H.; Kapushesky, M.; Shojatalab, M.; Abeygunawardena, N.; Coulson, R.; Farne, A.; Holloway, E.; Kolesnykov, N.; Lilja, P.; Lukk, M.; et al. ArrayExpress—a Public Database of Microarray Experiments and Gene Expression Profiles. Nucleic Acids Res. 2007, 35, D747–D750.
- The Cancer Genome Atlas Research Network. Comprehensive Genomic Characterization Defines Human Glioblastoma Genes and Core Pathways. Nature 2008, 455, 1061–1068.
- Tomczak, K.; Czerwińska, P.; Wiznerowicz, M. The Cancer Genome Atlas (TCGA): An Immeasurable Source of Knowledge. Contemp. Oncol. 2015, 19, A68–A77.
- Lonsdale, J.; Thomas, J.; Salvatore, M.; Phillips, R.; Lo, E.; Shad, S.; Hasz, R.; Walters, G.; Garcia, F.; Young, N. The Genotype-Tissue Expression (GTEx) Project. Nat. Genet. 2013, 45, 580–585.
- The GTEx Consortium; Ardlie, K.G.; Deluca, D.S.; Segrè, A.V.; Sullivan, T.J.; Young, T.R.; Gelfand, E.T.; Trowbridge, C.A.; Maller, J.B.; Tukiainen, T.; et al. The Genotype-Tissue Expression (GTEx) Pilot Analysis: Multitissue Gene Regulation in Humans. Science 2015, 348, 648–660.
- Suntsova, M.; Gaifullin, N.; Allina, D.; Reshetun, A.; Li, X.; Mendeleeva, L.; Surin, V.; Sergeeva, A.; Spirin, P.; Prassolov, V.; et al. Atlas of RNA Sequencing Profiles for Normal Human Tissues. Sci. Data 2019, 6, 36.
- Yang, W.; Soares, J.; Greninger, P.; Edelman, E.J.; Lightfoot, H.; Forbes, S.; Bindal, N.; Beare, D.; Smith, J.A.; Thompson, I.R.; et al. Genomics of Drug Sensitivity in Cancer (GDSC): A Resource for Therapeutic Biomarker Discovery in Cancer Cells. Nucleic Acids Res. 2013, 41, D955–D961.
- Chen, Y.; Li, Y.; Narayan, R.; Subramanian, A.; Xie, X. Gene Expression Inference with Deep Learning. Bioinformatics 2016, 32, 1832–1839.
- Subramanian, A.; Kuehn, H.; Gould, J.; Tamayo, P.; Mesirov, J.P. GSEA-P: A Desktop Application for Gene Set Enrichment Analysis. Bioinformatics 2007, 23, 3251–3253.
- Liang, P. MAQC Papers over the Cracks. Nat. Biotechnol. 2007, 25, 27–28, author reply 28–29.
- Chen, J.J.; Hsueh, H.-M.; Delongchamp, R.R.; Lin, C.-J.; Tsai, C.-A. Reproducibility of Microarray Data: A Further Analysis of Microarray Quality Control (MAQC) Data. BMC Bioinform. 2007, 8, 412.
- Shi, L.; Shi, L.; Reid, L.H.; Jones, W.D.; Shippy, R.; Warrington, J.A.; Baker, S.C.; Collins, P.J.; de Longueville, F.; Kawasaki, E.S.; et al. The MicroArray Quality Control (MAQC) Project Shows Inter- and Intraplatform Reproducibility of Gene Expression Measurements. Nat. Biotechnol. 2006, 24, 1151–1161.
- Mane, S.P.; Evans, C.; Cooper, K.L.; Crasta, O.R.; Folkerts, O.; Hutchison, S.K.; Harkins, T.T.; Thierry-Mieg, D.; Thierry-Mieg, J.; Jensen, R.V. Transcriptome Sequencing of the Microarray Quality Control (MAQC) RNA Reference Samples Using next Generation Sequencing. BMC Genom. 2009, 10, 264.
- Wen, Z.; Wang, C.; Shi, Q.; Huang, Y.; Su, Z.; Hong, H.; Tong, W.; Shi, L. Evaluation of Gene Expression Data Generated from Expired Affymetrix GeneChip® Microarrays Using MAQC Reference RNA Samples. BMC Bioinform. 2010, 11, S10.
- Stelpflug, S.C.; Sekhon, R.S.; Vaillancourt, B.; Hirsch, C.N.; Buell, C.R.; Leon, N.; Kaeppler, S.M. An Expanded Maize Gene Expression Atlas Based on RNA Sequencing and Its Use to Explore Root Development. Plant Genome 2016, 9, 27898762.
- Han, S.; Van Treuren, W.; Fischer, C.R.; Merrill, B.D.; DeFelice, B.C.; Sanchez, J.M.; Higginbottom, S.K.; Guthrie, L.; Fall, L.A.; Dodd, D.; et al. A Metabolomics Pipeline for the Mechanistic Interrogation of the Gut Microbiome. Nature 2021, 595, 415–420.
- Tanaka, N.; Takahara, A.; Hagio, T.; Nishiko, R.; Kanayama, J.; Gotoh, O.; Mori, S. Sequencing Artifacts Derived from a Library Preparation Method Using Enzymatic Fragmentation. PLoS ONE 2020, 15, e0227427.
- Demetrashvili, N.; Kron, K.; Pethe, V.; Bapat, B.; Briollais, L. How to Deal with Batch Effect in Sequential Microarray Experiments? Mol. Inform. 2010, 29, 387–393.
- Lazar, C.; Meganck, S.; Taminau, J.; Steenhoff, D.; Coletta, A.; Molter, C.; Weiss-Solís, D.Y.; Duque, R.; Bersini, H.; Nowé, A. Batch Effect Removal Methods for Microarray Gene Expression Data Integration: A Survey. Brief. Bioinform. 2013, 14, 469–490.
- Xu, J.; Gong, B.; Wu, L.; Thakkar, S.; Hong, H.; Tong, W. Comprehensive Assessments of RNA-Seq by the SEQC Consortium: FDA-Led Efforts Advance Precision Medicine. Pharmaceutics 2016, 8, 8.
- Bolstad, B.M.; Irizarry, R.A.; Astrand, M.; Speed, T.P. A Comparison of Normalization Methods for High Density Oligonucleotide Array Data Based on Variance and Bias. Bioinformatics 2003, 19, 185–193.
- Shabalin, A.A.; Tjelmeland, H.; Fan, C.; Perou, C.M.; Nobel, A.B. Merging Two Gene-Expression Studies via Cross-Platform Normalization. Bioinformatics 2008, 24, 1154–1160.
- Deshwar, A.G.; Morris, Q. PLIDA: Cross-Platform Gene Expression Normalization Using Perturbed Topic Models. Bioinformatics 2014, 30, 956–961.
- Rudy, J.; Valafar, F. Empirical Comparison of Cross-Platform Normalization Methods for Gene Expression Data. BMC Bioinform. 2011, 12, 467.
- Borisov, N.; Shabalina, I.; Tkachev, V.; Sorokin, M.; Garazha, A.; Pulin, A.; Eremin, I.I.; Buzdin, A. Shambhala: A Platform-Agnostic Data Harmonizer for Gene Expression Data. BMC Bioinform. 2019, 20, 66.
- Borisov, N.; Sorokin, M.; Zolotovskaya, M.; Borisov, C.; Buzdin, A. Shambhala-2: A Protocol for Uniformly Shaped Harmonization of Gene Expression Profiles of Various Formats. Curr. Protoc. 2022, 2, e444.
- Junet, V.; Farrés, J.; Mas, J.M.; Daura, X. CuBlock: A Cross-Platform Normalization Method for Gene-Expression Microarrays. Bioinformatics 2021, 37, 2365–2373.
- Carter, S.L.; Eklund, A.C.; Mecham, B.H.; Kohane, I.S.; Szallasi, Z. Redefinition of Affymetrix Probe Sets by Sequence Overlap with cDNA Microarray Probes Reduces Cross-Platform Inconsistencies in Cancer-Associated Gene Expression Measurements. BMC Bioinform. 2005, 6, 107.
- Warnat, P.; Eils, R.; Brors, B. Cross-Platform Analysis of Cancer Microarray Data Improves Gene Expression Based Classification of Phenotypes. BMC Bioinform. 2005, 6, 265.
- Martinez, R.; Pasquier, N.; Pasquier, C. GenMiner: Mining Non-Redundant Association Rules from Integrated Gene Expression Data and Annotations. Bioinformatics 2008, 24, 2643–2644.
- Jiang, H.; Deng, Y.; Chen, H.-S.; Tao, L.; Sha, Q.; Chen, J.; Tsai, C.-J.; Zhang, S. Joint Analysis of Two Microarray Gene-Expression Data Sets to Select Lung Adenocarcinoma Marker Genes. BMC Bioinform. 2004, 5, 81.
- Johnson, W.E.; Li, C.; Rabinovic, A. Adjusting Batch Effects in Microarray Expression Data Using Empirical Bayes Methods. Biostatistics 2007, 8, 118–127.
- Huang, H.; Lu, X.; Liu, Y.; Haaland, P.; Marron, J.S. R/DWD: Distance-Weighted Discrimination for Classification, Visualization and Batch Adjustment. Bioinformatics 2012, 28, 1182–1183.
- Marron, J.S.; Todd, M.J.; Ahn, J. Distance-Weighted Discrimination. J. Am. Stat. Assoc. 2007, 102, 1267–1271.
- Benito, M.; Parker, J.; Du, Q.; Wu, J.; Xiang, D.; Perou, C.M.; Marron, J.S. Adjustment of Systematic Microarray Data Biases. Bioinformatics 2004, 20, 105–114.
- Xia, X.-Q.; McClelland, M.; Porwollik, S.; Song, W.; Cong, X.; Wang, Y. WebArrayDB: Cross-Platform Microarray Data Analysis and Public Data Repository. Bioinformatics 2009, 25, 2425–2429.
- Chu, Y.; Corey, D.R. RNA Sequencing: Platform Selection, Experimental Design, and Data Interpretation. Nucleic Acid. Ther. 2012, 22, 271–274.
- Nagalakshmi, U.; Wang, Z.; Waern, K.; Shou, C.; Raha, D.; Gerstein, M.; Snyder, M. The Transcriptional Landscape of the Yeast Genome Defined by RNA Sequencing. Science 2008, 320, 1344–1349.
- Maher, C.A.; Kumar-Sinha, C.; Cao, X.; Kalyana-Sundaram, S.; Han, B.; Jing, X.; Sam, L.; Barrette, T.; Palanisamy, N.; Chinnaiyan, A.M. Transcriptome Sequencing to Detect Gene Fusions in Cancer. Nature 2009, 458, 97–101.
- Ingolia, N.T.; Brar, G.A.; Rouskin, S.; McGeachy, A.M.; Weissman, J.S. The Ribosome Profiling Strategy for Monitoring Translation in Vivo by Deep Sequencing of Ribosome-Protected mRNA Fragments. Nat. Protoc. 2012, 7, 1534–1550.
- Wang, Z.; Gerstein, M.; Snyder, M. RNA-Seq: A Revolutionary Tool for Transcriptomics. Nat. Rev. Genet. 2009, 10, 57–63.
- Korir, P.K.; Geeleher, P.; Seoighe, C. Seq-Ing Improved Gene Expression Estimates from Microarrays Using Machine Learning. BMC Bioinform. 2015, 16, 286.
- Taylor, K.C.; Evans, D.S.; Edwards, D.R.V.; Edwards, T.L.; Sofer, T.; Li, G.; Liu, Y.; Franceschini, N.; Jackson, R.D.; Giri, A.; et al. A Genome-Wide Association Study Meta-Analysis of Clinical Fracture in 10,012 African American Women. Bone Rep. 2016, 5, 233–242.
- Hollern, D.P.; Xu, N.; Thennavan, A.; Glodowski, C.; Garcia-Recio, S.; Mott, K.R.; He, X.; Garay, J.P.; Carey-Ewend, K.; Marron, D.; et al. B Cells and T Follicular Helper Cells Mediate Response to Checkpoint Inhibitors in High Mutation Burden Mouse Models of Breast Cancer. Cell 2019, 179, 1191–1206.e21.
- Thind, A.S.; Monga, I.; Thakur, P.K.; Kumari, P.; Dindhoria, K.; Krzak, M.; Ranson, M.; Ashford, B. Demystifying Emerging Bulk RNA-Seq Applications: The Application and Utility of Bioinformatic Methodology. Brief. Bioinform. 2021, 22, bbab259.
- Vellichirammal, N.N.; Albahrani, A.; Li, Y.; Guda, C. Identification of Fusion Transcripts from Unaligned RNA-Seq Reads Using ChimeRScope. In Chimeric RNA; Li, H., Elfman, J., Eds.; Methods in Molecular Biology; Springer: New York, NY, USA, 2020; Volume 2079, pp. 13–25. ISBN 978-1-4939-9903-3.
- Kekeeva, T.; Tanas, A.; Kanygina, A.; Alexeev, D.; Shikeeva, A.; Zavalishina, L.; Andreeva, Y.; Frank, G.A.; Zaletaev, D. Novel Fusion Transcripts in Bladder Cancer Identified by RNA-Seq. Cancer Lett. 2016, 374, 224–228.
- Gu, J.; Chukhman, M.; Lu, Y.; Liu, C.; Liu, S.; Lu, H. RNA-Seq Based Transcription Characterization of Fusion Breakpoints as a Potential Estimator for Its Oncogenic Potential. BioMed. Res. Int. 2017, 2017, 9829175.
- Schmidt, B.M.; Davidson, N.M.; Hawkins, A.D.K.; Bartolo, R.; Majewski, I.J.; Ekert, P.G.; Oshlack, A. Clinker: Visualizing Fusion Genes Detected in RNA-Seq Data. GigaScience 2018, 7, giy079.
- Borisov, N.; Sorokin, M.; Tkachev, V.; Garazha, A.; Buzdin, A. Cancer Gene Expression Profiles Associated with Clinical Outcomes to Chemotherapy Treatments. BMC Med. Genom. 2020, 13, 111.
- Anders, S.; Huber, W. Differential Expression Analysis for Sequence Count Data. Genome Biol. 2010, 11, R106.
- Love, M.I.; Huber, W.; Anders, S. Moderated Estimation of Fold Change and Dispersion for RNA-Seq Data with DESeq2. Genome Biol. 2014, 15, 550.
- Varet, H.; Brillet-Guéguen, L.; Coppée, J.-Y.; Dillies, M.-A. SARTools: A DESeq2- and edgeR-Based R Pipeline for Comprehensive Differential Analysis of RNA-Seq Data. PLoS ONE 2016, 11, e0157022.
- Maza, E. In Papyro Comparison of TMM (edgeR), RLE (DESeq2), and MRN Normalization Methods for a Simple Two-Conditions-Without-Replicates RNA-Seq Experimental Design. Front. Genet. 2016, 7, 164.
- Thompson, J.A.; Tan, J.; Greene, C.S. Cross-Platform Normalization of Microarray and RNA-Seq Data for Machine Learning Applications. PeerJ 2016, 4, e1621.
- Piccolo, S.R.; Withers, M.R.; Francis, O.E.; Bild, A.H.; Johnson, W.E. Multiplatform Single-Sample Estimates of Transcriptional Activation. Proc. Natl. Acad. Sci. USA 2013, 110, 17778–17783.
- Franks, J.M.; Cai, G.; Whitfield, M.L. Feature Specific Quantile Normalization Enables Cross-Platform Classification of Molecular Subtypes Using Gene Expression Data. Bioinformatics 2018, 34, 1868–1874.
- Fauteux, F.; Surendra, A.; McComb, S.; Pan, Y.; Hill, J.J. Identification of Transcriptional Subtypes in Lung Adenocarcinoma and Squamous Cell Carcinoma through Integrative Analysis of Microarray and RNA Sequencing Data. Sci. Rep. 2021, 11, 8709.
- Zhang, S.; Shao, J.; Yu, D.; Qiu, X.; Zhang, J. MatchMixeR: A Cross-Platform Normalization Method for Gene Expression Data Integration. Bioinformatics 2020, 36, 2486–2491.
- Maleknia, S.; Salehi, Z.; Rezaei Tabar, V.; Sharifi-Zarchi, A.; Kavousi, K. An Integrative Bayesian Network Approach to Highlight Key Drivers in Systemic Lupus Erythematosus. Arthritis Res. Ther. 2020, 22, 156.
- Tang, K.; Ji, X.; Zhou, M.; Deng, Z.; Huang, Y.; Zheng, G.; Cao, Z. Rank-in: Enabling Integrative Analysis across Microarray and RNA-Seq for Cancer. Nucleic Acids Res. 2021, 49, e99.
- Huang, H.-H.; Rao, H.; Miao, R.; Liang, Y. A Novel Meta-Analysis Based on Data Augmentation and Elastic Data Shared Lasso Regularization for Gene Expression. BMC Bioinform. 2022, 23, 353.
- Dinalankara, W.; Ke, Q.; Xu, Y.; Ji, L.; Pagane, N.; Lien, A.; Matam, T.; Fertig, E.J.; Price, N.D.; Younes, L.; et al. Digitizing Omics Profiles by Divergence from a Baseline. Proc. Natl. Acad. Sci. USA 2018, 115, 4545–4552.
- Lê Cao, K.-A.; Rohart, F.; McHugh, L.; Korn, O.; Wells, C.A. YuGene: A Simple Approach to Scale Gene Expression Data Derived from Different Platforms for Integrated Analyses. Genomics 2014, 103, 239–251.
- Nguyen, T.N.; Nguyen, H.Q.; Le, D.-H. Unveiling Prognostics Biomarkers of Tyrosine Metabolism Reprogramming in Liver Cancer by Cross-Platform Gene Expression Analyses. PLoS ONE 2020, 15, e0229276.
- Ou-Yang, L.; Zhang, X.-F.; Wu, M.; Li, X.-L. Node-Based Learning of Differential Networks from Multi-Platform Gene Expression Data. Methods 2017, 129, 41–49.
- Piccolo, S.R.; Sun, Y.; Campbell, J.D.; Lenburg, M.E.; Bild, A.H.; Johnson, W.E. A Single-Sample Microarray Normalization Method to Facilitate Personalized-Medicine Workflows. Genomics 2012, 100, 337–344.
- Irizarry, R.A.; Hobbs, B.; Collin, F.; Beazer-Barclay, Y.D.; Antonellis, K.J.; Scherf, U.; Speed, T.P. Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data. Biostatistics 2003, 4, 249–264.
- Wu, Z.; Irizarry, R.A.; Gentleman, R.; Martinez-Murillo, F.; Spencer, F. A Model-Based Background Adjustment for Oligonucleotide Expression Arrays. J. Am. Stat. Assoc. 2004, 99, 909–917.
- Li, C.; Wong, W.H. Model-Based Analysis of Oligonucleotide Arrays: Expression Index Computation and Outlier Detection. Proc. Natl. Acad. Sci. USA 2001, 98, 31–36.
- Therneau, T.M.; Ballman, K.V. What Does PLIER Really Do? Cancer Inform. 2008, 6, 117693510800600.
- McCall, M.N.; Bolstad, B.M.; Irizarry, R.A. Frozen Robust Multiarray Analysis (fRMA). Biostatistics 2010, 11, 242–253.
- McCall, M.N.; Uppal, K.; Jaffee, H.A.; Zilliox, M.J.; Irizarry, R.A. The Gene Expression Barcode: Leveraging Public Data Repositories to Begin Cataloging the Human and Murine Transcriptomes. Nucleic Acids Res. 2011, 39, D1011–D1015.
- McCall, M.N.; Murakami, P.N.; Lukk, M.; Huber, W.; Irizarry, R.A. Assessing Affymetrix GeneChip Microarray Quality. BMC Bioinform. 2011, 12, 137.
- McCall, M.N.; Jaffee, H.A.; Irizarry, R.A. fRMA ST: Frozen Robust Multiarray Analysis for Affymetrix Exon and Gene ST Arrays. Bioinformatics 2012, 28, 3153–3154.
- Zhang, L.; Cham, J.; Cooley, J.; He, T.; Hagihara, K.; Yang, H.; Fan, F.; Cheung, A.; Thompson, D.; Kerns, B.J.; et al. Cross-Platform Comparison of Immune-Related Gene Expression to Assess Intratumor Immune Responses Following Cancer Immunotherapy. J. Immunol. Methods 2021, 494, 113041.
- Lee, J.S.; Nair, N.U.; Dinstag, G.; Chapman, L.; Chung, Y.; Wang, K.; Sinha, S.; Cha, H.; Kim, D.; Schperberg, A.V.; et al. Synthetic Lethality-Mediated Precision Oncology via the Tumor Transcriptome. Cell 2021, 184, 2487–2502.e13.
- Borisov, N.; Sorokin, M.; Garazha, A.; Buzdin, A. Quantitation of Molecular Pathway Activation Using RNA Sequencing Data. In Nucleic Acid Detection and Structural Investigations; Astakhova, K., Bukhari, S.A., Eds.; Springer: New York, NY, USA, 2020; Volume 2063, pp. 189–206. ISBN 978-1-07-160137-2.
- Poddubskaya, E.; Buzdin, A.; Garazha, A.; Sorokin, M.; Glusker, A.; Aleshin, A.; Allina, D.; Moiseev, A.; Sekacheva, M.; Suntsova, M.; et al. Oncobox, Gene Expression-Based Second Opinion System for Predicting Response to Treatment in Advanced Solid Tumors. J. Clin. Oncol. 2019, 37, e13143.
- Tkachev, V.; Sorokin, M.; Garazha, A.; Borisov, N.; Buzdin, A. Oncobox Method for Scoring Efficiencies of Anticancer Drugs Based on Gene Expression Data. In Nucleic Acid Detection and Structural Investigations; Astakhova, K., Bukhari, S.A., Eds.; Springer US: New York, NY, USA, 2020; Volume 2063, pp. 235–255. ISBN 978-1-07-160137-2.
- Tkachev, V.; Sorokin, M.; Mescheryakov, A.; Simonov, A.; Garazha, A.; Buzdin, A.; Muchnik, I.; Borisov, N. FLOating-Window Projective Separator (FloWPS): A Data Trimming Tool for Support Vector Machines (SVM) to Improve Robustness of the Classifier. Front. Genet. 2019, 9, 717.
- Tkachev, V.; Sorokin, M.; Borisov, C.; Garazha, A.; Buzdin, A.; Borisov, N. Flexible Data Trimming Improves Performance of Global Machine Learning Methods in Omics-Based Personalized Oncology. Int. J. Mol. Sci. 2020, 21, 713.
- Turki, T.; Wang, J.T.L. Clinical Intelligence: New Machine Learning Techniques for Predicting Clinical Drug Response. Comput. Biol. Med. 2019, 107, 302–322.
- Bolstad, B. Preprocessing and Normalization for Affymetrix GeneChip Expression Microarrays. In Methods in Microarray Normalization; Stafford, P., Ed.; Drug Discovery Series; CRC Press: Boca Raton, FL, USA, 2008; Volume 0, pp. 41–59. ISBN 978-1-4200-5278-7.
- Friedman, J.; Hastie, T.; Tibshirani, R. Regularization Paths for Generalized Linear Models via Coordinate Descent. J. Stat. Softw. 2010, 33, 1–22.
- Vapnik, V.; Chapelle, O. Bounds on Error Expectation for Support Vector Machines. Neural Comput. 2000, 12, 2013–2036.
- Tibshirani, R.; Hastie, T.; Narasimhan, B.; Chu, G. Diagnosis of Multiple Cancer Types by Shrunken Centroids of Gene Expression. Proc. Natl. Acad. Sci. USA 2002, 99, 6567–6572.
- Zolotovskaia, M.A.; Sorokin, M.I.; Petrov, I.V.; Poddubskaya, E.V.; Moiseev, A.A.; Sekacheva, M.I.; Borisov, N.M.; Tkachev, V.S.; Garazha, A.V.; Kaprin, A.D.; et al. Disparity between Inter-Patient Molecular Heterogeneity and Repertoires of Target Drugs Used for Different Types of Cancer in Clinical Oncology. Int. J. Mol. Sci. 2020, 21, 1580.
- Huang, E.; Cheng, S.H.; Dressman, H.; Pittman, J.; Tsou, M.H.; Horng, C.F.; Bild, A.; Iversen, E.S.; Liao, M.; Chen, C.M.; et al. Gene Expression Predictors of Breast Cancer Outcomes. Lancet 2003, 361, 1590–1596.
- Hu, Z.; Fan, C.; Oh, D.S.; Marron, J.; He, X.; Qaqish, B.F.; Livasy, C.; Carey, L.A.; Reynolds, E.; Dressler, L.; et al. The Molecular Portraits of Breast Tumors Are Conserved across Microarray Platforms. BMC Genom. 2006, 7, 96.
- van’t Veer, L.J.; Dai, H.; van de Vijver, M.J.; He, Y.D.; Hart, A.A.M.; Mao, M.; Peterse, H.L.; van der Kooy, K.; Marton, M.J.; Witteveen, A.T.; et al. Gene Expression Profiling Predicts Clinical Outcome of Breast Cancer. Nature 2002, 415, 530–536.
- Wang, Y.; Xia, X.-Q.; Jia, Z.; Sawyers, A.; Yao, H.; Wang-Rodriquez, J.; Mercola, D.; McClelland, M. In Silico Estimates of Tissue Components in Surgical Samples Based on Expression Profiling Data. Cancer Res. 2010, 70, 6448–6455.
- Jia, Z.; Wang, Y.; Sawyers, A.; Yao, H.; Rahmatpanah, F.; Xia, X.-Q.; Xu, Q.; Pio, R.; Turan, T.; Koziol, J.A.; et al. Diagnosis of Prostate Cancer Using Differentially Expressed Genes in Stroma. Cancer Res. 2011, 71, 2476–2487.
- Desmedt, C.; Piette, F.; Loi, S.; Wang, Y.; Lallemand, F.; Haibe-Kains, B.; Viale, G.; Delorenzi, M.; Zhang, Y.; d’Assignies, M.S.; et al. Strong Time Dependence of the 76-Gene Prognostic Signature for Node-Negative Breast Cancer Patients in the TRANSBIG Multicenter Independent Validation Series. Clin. Cancer Res. 2007, 13, 3207–3214.
- Chicco, D.; Warrens, M.J.; Jurman, G. The Coefficient of Determination R-Squared Is More Informative than SMAPE, MAE, MAPE, MSE and RMSE in Regression Analysis Evaluation. PeerJ Comput. Sci. 2021, 7, e623. [Google Scholar] [CrossRef] [PubMed]
- Chicco, D. Ten Quick Tips for Machine Learning in Computational Biology. BioData Min. 2017, 10, 35. [Google Scholar] [CrossRef]
- Chicco, D.; Jurman, G. The Advantages of the Matthews Correlation Coefficient (MCC) over F1 Score and Accuracy in Binary Classification Evaluation. BMC Genom. 2020, 21, 6. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Meng, Q.; Catchpoole, D.; Skillicorn, D.; Kennedy, P.J. DBNorm: Normalizing High-Density Oligonucleotide Microarray Data Based on Distributions. BMC Bioinform. 2017, 18, 527. [Google Scholar] [CrossRef] [PubMed] [Green Version]
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Borisov, N.; Buzdin, A. Transcriptomic Harmonization as the Way for Suppressing Cross-Platform Bias and Batch Effect. Biomedicines 2022, 10, 2318. https://doi.org/10.3390/biomedicines10092318