Using Tree-Based Machine Learning for Health Studies: Literature Review and Case Series
Abstract
1. Introduction
- (i) variable selection;
- (ii) estimation of causal effects;
- (iii) propensity score weighting;
- (iv) missing data.
2. Review of Methods
2.1. CART
2.2. Random Forest
2.3. Boosting
2.4. BART
3. Utilities of Tree-Based Methods
3.1. Variable Selection
3.2. Counterfactual Prediction
3.3. Propensity Score Weighting
3.4. Missing Data
4. Case Studies of Tree-Based Methods
4.1. Confounder Selection
4.2. Comparative Effectiveness Analysis
4.3. Propensity Score Weight Estimator
4.4. Handling Missing Data
5. Discussion
6. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Methods | Selected Variables | AUC |
---|---|---|
BART | Charlson comorbidity score, gender, married, histology, year of diagnosis | 0.85 |
XGBoost | Age, year of diagnosis | 0.72 |
RF | Charlson comorbidity score, histology | 0.74 |
Methods | RAS vs. OT | RAS vs. VATS | OT vs. VATS |
---|---|---|---|
BART | 0.94 (0.72, 1.16) | 1.09 (0.84, 1.34) | 1.12 (0.87, 1.37) |
XGBoost | 0.91 (0.64, 1.13) | 1.04 (0.79, 1.28) | 1.08 (0.84, 1.33) |
RF | 0.90 (0.63, 1.14) | 1.03 (0.78, 1.29) | 1.06 (0.82, 1.35) |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Hu, L.; Li, L. Using Tree-Based Machine Learning for Health Studies: Literature Review and Case Series. Int. J. Environ. Res. Public Health 2022, 19, 16080. https://doi.org/10.3390/ijerph192316080