Considerations for the Use of Machine Learning Extracted Real-World Data to Support Evidence Generation: A Research-Centric Evaluation Framework
Simple Summary
Abstract
1. Introduction
2. A Research-Centric Evaluation Framework for ML-Extracted RWD to Support Research and Evidence Generation
2.1. Test Set
2.2. Overall Performance Assessment
2.3. Stratified Performance
- Whether model performance may be worse for a particular sub-cohort of interest, given that analyses often cover only specific sub-populations. For example, if recently diagnosed patients are of interest for an analysis, stratification by initial diagnosis year group can help detect whether changes in documentation patterns over time (e.g., data shift) are impacting model performance in the target population.
- Whether model performance may be worse for a demographic group. High overall performance does not preclude model errors from being concentrated in specific demographic groups, whether through bias introduced during model development or through chance alone. Evaluating performance stratified by demographic subgroups, such as race and gender, may help to minimize unintentional discrimination caused by the model and to ensure model fairness and generalizability (a minimal sketch of such a stratified check follows this list). Readers should keep in mind that these assessments should also be performed during model validation, so that all actions needed to develop fair models are taken; at this stage, only the final findings are reported, as confirmation or as data use limitations.
- Whether performance is differential with respect to important covariates, such as treatment status, stage, biomarker status, and demographic characteristics, since many analytic use cases focus on sub-cohorts receiving specific treatments. Differential error makes it harder to predict whether and how model errors lead to bias in an analysis in which these variables are used as covariates.
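As a minimal sketch of how such a stratified check might be run, the snippet below computes sensitivity and PPV overall and within each race and ethnicity group. The pandas usage, the toy data, and the column names (`y_true`, `y_pred`, `race_ethnicity`) are illustrative assumptions, not part of the framework itself.

```python
import pandas as pd

def sensitivity_ppv(df: pd.DataFrame) -> pd.Series:
    """Compute sensitivity and PPV from binary truth (y_true) and prediction (y_pred)."""
    tp = int(((df["y_true"] == 1) & (df["y_pred"] == 1)).sum())
    fn = int(((df["y_true"] == 1) & (df["y_pred"] == 0)).sum())
    fp = int(((df["y_true"] == 0) & (df["y_pred"] == 1)).sum())
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return pd.Series({"n": len(df), "sensitivity": sensitivity, "ppv": ppv})

# Hypothetical test-set labels (abstracted truth) vs. model predictions,
# with a demographic stratification variable.
test = pd.DataFrame({
    "y_true":         [1, 1, 0, 1, 0, 1, 0, 0, 1, 1],
    "y_pred":         [1, 0, 0, 1, 0, 1, 1, 0, 1, 1],
    "race_ethnicity": ["A", "A", "A", "B", "B", "B", "C", "C", "C", "C"],
})

print(sensitivity_ppv(test))                                  # overall performance
print(test.groupby("race_ethnicity").apply(sensitivity_ppv))  # performance per stratum
```

The same grouping pattern extends directly to the other strata discussed above (diagnosis year group, treatment status, biomarker status, and so on).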
2.4. Quantitative Error Analysis
2.5. Replication of Analytic Use Cases
- Replication of analytic use cases allows us to focus on specific sub-cohorts of interest.
- The impact of model errors may differ based on how the variable is used. For example, a categorical variable can be used to select or stratify the study cohort, or it can be used as a variable in the analysis itself (e.g., as a covariate or in a propensity matching algorithm). A date variable can be used as the index date in a time-to-event analysis (e.g., rwOS from metastatic diagnosis date) or to select the study cohort (e.g., patients that started a particular therapy after metastatic diagnosis).
- Model errors may be correlated rather than randomly distributed across the patient population. Replication analyses can shed light on the combined model performance, which may be higher or lower for a selected cohort (see the replication sketch following this list).
- For event-level (rather than patient-level) variables, such as biomarker testing events, model performance metrics relevant to the actual use case can be difficult to define and interpret. For example, a biomarker model may over-predict the number of biomarker tests patients receive, resulting in lower test-level precision metrics. However, if the majority of use cases are only interested in the biomarker test result closest to diagnosis, additional false-positive predictions at other time points would not introduce bias into the analysis results as long as the patient’s biomarker status at diagnosis is correctly predicted.
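As referenced above, a replication analysis for a time-to-event use case might look like the following sketch, which fits Kaplan–Meier curves for the cohort defined by the ML-extracted metastatic flag and for its abstracted counterpart and compares median rwOS. The use of the lifelines package, the file name, and the column names are illustrative assumptions rather than the authors' implementation.

```python
import pandas as pd
from lifelines import KaplanMeierFitter  # illustrative choice of survival library

def median_rwos(cohort: pd.DataFrame, label: str) -> float:
    """Fit a Kaplan-Meier curve and return median rwOS (in months) for a cohort."""
    kmf = KaplanMeierFitter()
    kmf.fit(durations=cohort["months_from_met_dx"],
            event_observed=cohort["death_observed"],
            label=label)
    return kmf.median_survival_time_

# Hypothetical analysis-ready table: one row per patient with both the
# ML-extracted and abstracted metastatic flags plus follow-up information.
patients = pd.read_csv("analysis_cohort.csv")  # illustrative file name

ml_cohort = patients[patients["ml_metastatic"] == 1]
abstracted_cohort = patients[patients["abstracted_metastatic"] == 1]

print("Median rwOS, ML-extracted cohort:", median_rwos(ml_cohort, "ML-extracted"))
print("Median rwOS, abstracted cohort:  ", median_rwos(abstracted_cohort, "Abstracted"))
```

If the two cohorts yield materially different survival estimates, that difference points to correlated model errors that simple patient-level metrics would not reveal.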
3. Illustrative Use Case
4. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
Variable Type | Example ML-Extracted Variable | Example Performance Metrics
---|---|---
Categorical | Diagnosis (yes/no) | Sensitivity; positive predictive value (PPV, or precision); specificity; negative predictive value (NPV); accuracy; predicted vs. abstracted prevalence; calibration plots (if applicable)
Date | Diagnosis date | Sensitivity with a ±n-day window; PPV with a ±n-day window; distribution of date errors (see the sketch following this table)
Continuous | Lab value | Sensitivity, PPV, and accuracy for classifying the result as within vs. outside the normal range; sensitivity, PPV, and accuracy for classifying the result as within ±X of the true value; mean absolute error (MAE)
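To make the date and continuous metrics above concrete, the following sketch computes sensitivity and PPV within a ±n-day window for an extracted date and the mean absolute error for an extracted lab value. The NumPy usage and all values are illustrative assumptions.

```python
import numpy as np

# Hypothetical abstracted (reference) vs. ML-extracted diagnosis dates, expressed
# as days from a common origin and aligned per patient; NaN = no date extracted.
true_days = np.array([100.0, 250.0, 400.0, 520.0, 610.0])
pred_days = np.array([102.0, 245.0, np.nan, 523.0, 700.0])

window = 15  # a +/- n-day window within which an extracted date counts as correct
extracted = ~np.isnan(pred_days)
within_window = extracted & (np.abs(pred_days - true_days) <= window)

sensitivity = within_window.sum() / len(true_days)  # correct dates among all true dates
ppv = within_window.sum() / extracted.sum()         # correct dates among extracted dates
print(f"Date sensitivity (+/-{window} d): {sensitivity:.2f}, PPV: {ppv:.2f}")

# Continuous variable: mean absolute error between extracted and abstracted lab values.
true_lab = np.array([4.1, 3.8, 5.0, 4.4])
pred_lab = np.array([4.0, 3.9, 5.6, 4.4])
print(f"Lab value MAE: {np.mean(np.abs(pred_lab - true_lab)):.2f}")
```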
Goal | Example Strata Variables
---|---
(I.) Understand performance in sub-cohorts of interest | Year of diagnosis (e.g., before vs. after year x); treatment status (treated vs. not treated); biomarker status (positive vs. negative)
(II.) Fairness | Race and ethnicity group; gender; age group; insurance status
(III.) Risk for statistical bias in analysis | Treatment setting (academic vs. community); cancer stage at diagnosis; age at diagnosis; cancer histology; smoking status; treatment status (treated vs. not treated); biomarker status (positive vs. negative)
Comparison | Usefulness/Interpretation
---|---
True positives vs. false negatives | This comparison informs whether patients incorrectly excluded from the study cohort differ from those correctly included with respect to patient characteristics or outcomes. If true positive and false negative patients appear similar, errors of exclusion are less likely to bias the analysis results (see the sketch following this table).
True positives vs. false positives | This comparison informs whether patients incorrectly included in the study cohort differ from those correctly included with respect to patient characteristics or outcomes. If true positive and false positive patients appear similar, errors of inclusion are less likely to bias the analysis results.
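One way such comparisons could be operationalized is sketched below: test-set patients are split into true positive, false negative, and false positive groups for a cohort-selection variable, and baseline characteristics are summarized per group. The file name and column names are hypothetical, and the summarized characteristics are only examples.

```python
import pandas as pd

# Hypothetical test-set table with the abstracted truth, the model prediction,
# and baseline characteristics/outcomes to compare across error groups.
test = pd.read_csv("test_set.csv")  # illustrative file name

groups = {
    "true_positive":  (test["abstracted"] == 1) & (test["predicted"] == 1),
    "false_negative": (test["abstracted"] == 1) & (test["predicted"] == 0),
    "false_positive": (test["abstracted"] == 0) & (test["predicted"] == 1),
}

# Summarize baseline characteristics (e.g., age, stage) within each error group so
# that incorrectly excluded/included patients can be compared with true positives.
summaries = []
for name, mask in groups.items():
    sub = test[mask]
    summaries.append({
        "group": name,
        "n": len(sub),
        "median_age_at_dx": sub["age_at_dx"].median(),
        "pct_stage_iv": (sub["stage"] == "IV").mean(),
    })
print(pd.DataFrame(summaries))
```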
Variable: Metastatic Diagnosis (yes/no)

Model Description: Inputs to the model include unstructured documents from the EHR (e.g., visit notes, pathology/radiology reports). The output of the model is a binary prediction (yes/no) for whether the patient has a metastatic diagnosis at any time in the record.

Target Dataset/Population: The model is used in a dataset that contains patients with non-small cell lung cancer (NSCLC).

Common Analytic Use Case: Select a cohort of patients with metastatic disease and evaluate real-world overall survival (rwOS) from the metastatic diagnosis date.

ML-Extracted Variable Evaluation

Components | Description | Hypothetical Results and Findings
---|---|---
Test Set | The size of the test set is selected to achieve a target margin of error for the primary evaluation metric (e.g., sensitivity or PPV) within the minority class (metastatic disease); see the sizing sketch following this table. To measure model performance, a random sample of patients is taken from an NSCLC cohort and withheld from model development. | Patients are selected from the target population and are not included in model development.
Overall Performance | As the primary use of this variable is to select a cohort of metastatic patients, sensitivity, PPV, specificity, and NPV are measured. To evaluate how well this variable selects a metastatic cohort, emphasis is placed on sensitivity and PPV to understand the proportion of patients missed and the proportion of patients incorrectly included in the final cohort. | Sensitivity = 0.94; PPV = 0.91; Specificity = 0.90; NPV = 0.90
Stratified Performance | Sensitivity and PPV for both the metastatic and non-metastatic classes are calculated across strata of the variables of interest. Stratifying variables are selected to (I.) understand performance in sub-cohorts of interest, (II.) assess fairness, and (III.) assess the risk of statistical bias in the analysis. | Example finding for race and ethnicity:
Quantitative Error Analysis | To understand the impact of model errors on the selected study cohort, baseline characteristics and rwOS are evaluated for the true positive, false negative, and false positive patient groups. | Example findings from rwOS analysis:
Replication of Use Cases | Evaluate rwOS from the metastatic diagnosis date for patients selected as metastatic by the ML-extracted variable vs. the abstracted counterpart (outcomes in the general population). | rwOS for ML-extracted cohort: 9.8 months (95% CI 8.92–10.75); rwOS for abstracted cohort: 9.8 months (95% CI 8.92–10.69)
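The test-set sizing step in the table can be illustrated with a generic normal-approximation margin-of-error calculation: the number of positive-class (metastatic) patients needed so that a proportion metric such as sensitivity has a target confidence-interval half-width, scaled up by the expected prevalence to obtain the total sample. This is a simplified sketch with illustrative numbers, not the authors' specific sizing method.

```python
import math

def positives_needed(expected_metric: float, margin: float, z: float = 1.96) -> int:
    """Positive-class patients needed so a proportion metric (e.g., sensitivity)
    has a normal-approximation 95% CI half-width equal to `margin`."""
    return math.ceil(z ** 2 * expected_metric * (1 - expected_metric) / margin ** 2)

# Illustrative inputs: expected sensitivity ~0.90, target margin of error +/-0.03,
# and an assumed ~40% prevalence of metastatic disease in the sampled NSCLC cohort.
n_positives = positives_needed(expected_metric=0.90, margin=0.03)
n_total = math.ceil(n_positives / 0.40)
print(f"Metastatic patients needed: {n_positives}; total test-set size: {n_total}")
```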
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).