rox: A Statistical Model for Regression with Missing Values
Abstract
1. Introduction
2. Methods
2.1. rox Core Model
2.2. Debiased Weighted Rox Model
2.3. Self-Adjusting Rox for Partial LOD and Non-LOD
2.4. Rox-Based Semiparametric Multivariable Model
2.5. Hypothesis Testing
2.6. Simulation Framework
2.7. Metabolomics Datasets
3. Results
3.1. Simulation Results: Strict LOD
3.2. Simulation Results: Probabilistic LOD
3.3. Simulation Results: Multivariable Setting
3.4. Evaluation on Metabolomics Data: Recovering High-Confidence Hits
3.5. Evaluation on Metabolomics Data: Multiplatform Validation
4. Discussion
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Jin, L.; Bi, Y.; Hu, C.; Qu, J.; Shen, S.; Wang, X.; Tian, Y. A comparative study of evaluating missing value imputation methods in label-free proteomics. Sci. Rep. 2021, 11, 1–11.
- Lin, H.; Peddada, S.D. Analysis of microbial compositions: A review of normalization and differential abundance analysis. NPJ Biofilms Microbiomes 2020, 6, 1–13.
- Do, K.T.; Wahl, S.; Raffler, J.; Molnos, S.; Laimighofer, M.; Adamski, J.; Suhre, K.; Strauch, K.; Peters, A.; Gieger, C.; et al. Characterization of missing values in untargeted MS-based metabolomics data and evaluation of missing data handling strategies. Metabolomics 2018, 14, 128.
- Suhre, K.; Shin, S.Y.; Petersen, A.K.; Mohney, R.P.; Meredith, D.; Wägele, B.; Altmaier, E.; Deloukas, P.; Erdmann, J.; Grundberg, E.; et al. Human metabolic individuality in biomedical and pharmaceutical research. Nature 2011, 477, 54–60.
- Gloor, G.B.; Macklaim, J.M.; Pawlowsky-Glahn, V.; Egozcue, J.J. Microbiome Datasets Are Compositional: And This Is Not Optional. Front. Microbiol. 2017, 8, 2224.
- White, I.R.; Carlin, J.B. Bias and efficiency of multiple imputation compared with complete-case analysis for missing covariate values. Stat. Med. 2010, 29, 2920–2931.
- Helsel, D.R. Fabricating data: How substituting values for nondetects can ruin results, and what can be done about it. Chemosphere 2006, 65, 2434–2439.
- Troyanskaya, O.; Cantor, M.; Sherlock, G.; Brown, P.; Hastie, T.; Tibshirani, R.; Botstein, D.; Altman, R.B. Missing value estimation methods for DNA microarrays. Bioinformatics 2001, 17, 520–525.
- Helsel, D.R. Nondetects and Data Analysis. Statistics for Censored Environmental Data; Wiley-Interscience: Hoboken, NJ, USA, 2005.
- Moulton, L.H.; Halsey, N.A. A Mixture Model with Detection Limits for Regression Analyses of Antibody Response to Vaccine. Biometrics 1995, 51, 1570.
- Richardson, D.B. Effects of Exposure Measurement Error When an Exposure Variable Is Constrained by a Lower Limit. Am. J. Epidemiol. 2003, 157, 355–363.
- Kendall, M.G. Rank and Product-Moment Correlation. Biometrika 1949, 36, 177–193.
- Newson, R. Parameters behind “nonparametric” statistics: Kendall’s tau, Somers’ D and median differences. Stata J. 2002, 2, 45–64.
- Somers, R.H. A new asymmetric measure of association for ordinal variables. Am. Sociol. Rev. 1962, 27, 799–811.
- Harrell, F.E.; Califf, R.M.; Pryor, D.B.; Lee, K.L.; Rosati, R.A. Evaluating the yield of medical tests. JAMA 1982, 247, 2543–2546.
- Therneau, T.; Atkinson, E. Concordance. Vignette of Survival Package. Available online: https://cran.r-project.org/web/packages/survival/vignettes/concordance.pdf (accessed on 1 September 2020).
- Dunkler, D.; Schemper, M.; Heinze, G. Gene selection in microarray survival studies under possibly non-proportional hazards. Bioinformatics 2010, 26, 784–790.
- Therneau, T.M.; Watson, D.A. The Concordance Statistic and the Cox Model; Technical Report; Department of Health Science Research, Mayo Clinic: Rochester, MN, USA, 2017; p. 18.
- Wager, S.; Hastie, T.; Efron, B. Confidence intervals for random forests: The jackknife and the infinitesimal jackknife. J. Mach. Learn. Res. 2014, 15, 1625–1651.
- Wald, A. Tests of statistical hypotheses concerning several parameters when the number of observations is large. Trans. Am. Math. Soc. 1943, 54, 426–482.
- Chetnik, K.; Benedetti, E.; Gomari, D.P.; Schweickart, A.; Batra, R.; Buyukozkan, M.; Wang, Z.; Arnold, M.; Zierer, J.; Suhre, K.; et al. maplet: An extensible R toolbox for modular and reproducible metabolomics pipelines. Bioinformatics 2022, 38, 1168–1170.
- Dieterle, F.; Ross, A.; Schlotterbeck, G.; Senn, H. Probabilistic quotient normalization as robust method to account for dilution of complex biological mixtures. Application in 1H NMR metabonomics. Anal. Chem. 2006, 78, 4281–4290.
- Do, K.T.; Pietzner, M.; Rasp, D.J.; Friedrich, N.; Nauck, M.; Kocher, T.; Suhre, K.; Mook-Kanamori, D.O.; Kastenmüller, G.; Krumsiek, J. Phenotype-driven identification of modules in a hierarchical map of multifluid metabolic correlations. NPJ Syst. Biol. Appl. 2017, 3, 1–12.
- Terunuma, A.; Putluri, N.; Mishra, P.; Mathé, E.A.; Dorsey, T.H.; Yi, M.; Wallace, T.A.; Issaq, H.J.; Zhou, M.; Killian, J.K.; et al. MYC-driven accumulation of 2-hydroxyglutarate is associated with breast cancer prognosis. J. Clin. Investig. 2014, 124, 398–412.
- Hakimi, A.A.; Reznik, E.; Lee, C.H.; Creighton, C.J.; Brannon, A.R.; Luna, A.; Aksoy, B.A.; Liu, E.M.; Shen, R.; Lee, W.; et al. An integrated metabolic atlas of clear cell renal cell carcinoma. Cancer Cell 2016, 29, 104–116.
- Scholtens, D.M.; Muehlbauer, M.J.; Daya, N.R.; Stevens, R.D.; Dyer, A.R.; Lowe, L.P.; Metzger, B.E.; Newgard, C.B.; Bain, J.R.; Lowe, W.L., Jr.; et al. Metabolomics reveals broad-scale metabolic perturbations in hyperglycemic mothers during pregnancy. Diabetes Care 2014, 37, 158–166.
- Do, K.T.; Rasp, D.J.P.; Kastenmüller, G.; Suhre, K.; Krumsiek, J. MoDentify: Phenotype-driven module identification in metabolomics networks at different resolutions. Bioinformatics 2019, 35, 532–534.
- Mook-Kanamori, D.O.; Selim, M.M.E.D.; Takiddin, A.H.; Al-Homsi, H.; Al-Mahmoud, K.A.; Al-Obaidli, A.; Zirie, M.A.; Rowe, J.; Yousri, N.A.; Karoly, E.D.; et al. 1,5-Anhydroglucitol in saliva is a noninvasive marker of short-term glycemic control. J. Clin. Endocrinol. Metab. 2014, 99, E479–E483.
- Rubin, D.B. Inference and missing data. Biometrika 1976, 63, 581–592.
- Beretta, L.; Santaniello, A. Nearest neighbor imputation algorithms: A critical evaluation. BMC Med. Inform. Decis. Mak. 2016, 16, 197–208.
- Stekhoven, D.J.; Bühlmann, P. MissForest—Non-parametric missing value imputation for mixed-type data. Bioinformatics 2012, 28, 112–118.
- Karpievitch, Y.; Stanley, J.; Taverner, T.; Huang, J.; Adkins, J.N.; Ansong, C.; Heffron, F.; Metz, T.O.; Qian, W.J.; Yoon, H.; et al. A statistical framework for protein quantitation in bottom-up MS-based proteomics. Bioinformatics 2009, 25, 2028–2034.
- Hart, G.W.; Copeland, R.J. Glycomics hits the big time. Cell 2010, 143, 672–676.
- Silverman, J.D.; Roche, K.; Mukherjee, S.; David, L.A. Naught all zeros in sequence count data are the same. Comput. Struct. Biotechnol. J. 2020, 18, 2789–2798.
Cohort | Number of Samples (Controls/Cases) | Number of Metabolites | Phenotype | Specimen | Reference |
---|---|---|---|---|---|
QMDiab-Plasma (HD2) | 358 (177/181) | 758 | Type 2 Diabetes | Blood | [23] |
QMDiab-Urine | 360 (174/186) | 891 | Type 2 Diabetes | Urine | [23] |
QMDiab-Saliva | 330 (171/159) | 602 | Type 2 Diabetes | Saliva | [23] |
BRCA | 132 (65/67) | 536 | Breast Cancer | Breast Tissue | [24] |
RCC | 276 (138/138) | 877 | Kidney Cancer | Kidney Tissue | [25] |
HAPO | 115 (67/48) | 49 | Hyperglycemia | Plasma | [26] |
QMDiab-Plasma Validation (HD4) | 292 (137/155) | 359 | Type 2 Diabetes | Plasma | [27,28] |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Buyukozkan, M.; Benedetti, E.; Krumsiek, J. rox: A Statistical Model for Regression with Missing Values. Metabolites 2023, 13, 127. https://doi.org/10.3390/metabo13010127