Machine Learning Algorithms to Predict Breast Cancer Recurrence Using Structured and Unstructured Sources from Electronic Health Records
Simple Summary
Abstract
1. Introduction
2. Materials and Methods
2.1. Experiment Design
2.2. Data Collection and Preprocessing
- Data cleaning: features with more than 20% missing values were excluded. For features with less than 20% missing values, we imputed the missing entries using the mode (ECOG), similar values of related variables (cTNM, pTNM), or linear regression with subsets of variables as predictors (weight, height, Ki67, ER, PR, HER2).
- Feature transformation: the extracted data were transformed for subsequent processing by the ML algorithms. Nominal features were transformed into binary class data. Dates were transformed into numerical (age and age at diagnosis) and binary (recurrence) features. Some features were aggregated into a single integrated feature, as in the case of BMI (Body Mass Index). Comorbidities required a more detailed process. Using all extracted diagnosis codes may present challenges when training ML algorithms, because of the high number of different codes and the low representativeness of each of them in our dataset. We addressed this by mapping all diagnoses to the 31 categories of the Elixhauser Comorbidity Index [37] and counting the number of different diagnoses per category for each patient. Finally, we retained only the categories that contained 50 or more instances in the dataset.
- Scaling: we normalized all variables to the range 0–1 prior to modeling to aid the learning process and avoid large weight values.
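The preprocessing steps listed above can be illustrated with a minimal sketch in plain Python. It is not the pipeline used in the study: the function names, the toy patient records, and the small code-to-category mapping (`TOY_CATEGORY_MAP`) are assumptions introduced for illustration only, and the regression-based imputation is not shown.

```python
from collections import Counter


def drop_sparse_features(rows, max_missing=0.20):
    """Return copies of the rows keeping only features with <= max_missing share of None."""
    n = len(rows)
    kept = [f for f in rows[0]
            if sum(r[f] is None for r in rows) / n <= max_missing]
    return [{f: r[f] for f in kept} for r in rows]


def impute_mode(rows, feature):
    """Fill missing values of one feature, in place, with its most frequent observed value."""
    observed = [r[feature] for r in rows if r[feature] is not None]
    mode = Counter(observed).most_common(1)[0][0]
    for r in rows:
        if r[feature] is None:
            r[feature] = mode


def count_by_category(diagnosis_codes, code_to_category):
    """Count the number of different diagnosis codes per category for one patient."""
    per_category = {}
    for code in set(diagnosis_codes):
        category = code_to_category.get(code)
        if category is not None:  # codes outside the mapping are ignored
            per_category.setdefault(category, set()).add(code)
    return {cat: len(codes) for cat, codes in per_category.items()}


def min_max_scale(values):
    """Rescale a list of numbers to the range 0-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical toy records; "sparse" is missing in 5 of 6 rows and is dropped.
rows = [
    {"ecog": 0, "weight": 70.0, "sparse": None},
    {"ecog": 0, "weight": 82.0, "sparse": None},
    {"ecog": None, "weight": 64.0, "sparse": 1.0},
    {"ecog": 1, "weight": 58.0, "sparse": None},
    {"ecog": 0, "weight": 90.0, "sparse": None},
    {"ecog": 0, "weight": 76.0, "sparse": None},
]
cleaned = drop_sparse_features(rows)
impute_mode(cleaned, "ecog")
scaled_weight = min_max_scale([r["weight"] for r in cleaned])

# Illustrative mapping only; the full mapping covers 31 categories.
TOY_CATEGORY_MAP = {
    "I10": "hypertension uncomplicated",
    "E11.9": "diabetes uncomplicated",
    "J44.9": "chronic pulmonary disease",
    "J45.9": "chronic pulmonary disease",
}
comorbidities = count_by_category(
    ["I10", "I10", "E11.9", "J44.9", "J45.9", "Z99"], TOY_CATEGORY_MAP
)
print(cleaned[2], scaled_weight, comorbidities)
```

In this sketch a repeated code (`I10`) is counted once, since the text counts different diagnoses per category rather than occurrences.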
2.3. Automatic Information Retrieval from Unstructured Data
2.4. Predictive Models
2.4.1. Logistic Regression (LR)
2.4.2. Decision Tree (DT)
2.4.3. Gradient Boosting (GB)
2.4.4. eXtreme Gradient Boosting (XGB)
2.4.5. Deep Neural Network (DNN)
2.5. Model Building and Statistical Analysis
- Precision: proportion of predicted positives that are actual positive cases.
- Recall: proportion of actual positives that are correctly classified.
- F1-score: harmonic mean of precision and recall.
- Area Under the Receiver Operating Characteristic curve (AUROC): measures how well the predicted probabilities of the positive class are separated from those of the negative class, i.e., how adequately predictions are ranked.
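The four metrics defined above can be computed directly from their definitions. The following is a minimal sketch for the binary, positive-class case; the labels, scores, and the 0.5 decision threshold are made-up illustration values, and whether the reported figures were averaged across classes is not stated here.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def auroc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical example: 3 recurrences among 8 patients.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
area = auroc(y_true, scores)
print(precision, recall, f1, area)
```

Precision, recall, and F1 depend on the chosen threshold, whereas AUROC depends only on how the scores rank positives against negatives.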
3. Results
3.1. Study Characteristics
3.2. ML Algorithm Selection
3.3. Comparison of Datasets
4. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A
Algorithm | Dataset | Precision | Recall | F1 | AUROC |
---|---|---|---|---|---|
LR | STR | 0.86 | 0.80 | 0.82 | 0.72 |
LR | UNS | 0.87 | 0.86 | 0.86 | 0.65 |
LR | COMB | 0.82 | 0.78 | 0.80 | 0.55 |
LR | Average | 0.850 | 0.813 | 0.827 | 0.640 |
DT | STR | 0.87 | 0.86 | 0.86 | 0.70 |
DT | UNS | 0.86 | 0.83 | 0.84 | 0.69 |
DT | COMB | 0.83 | 0.86 | 0.84 | 0.54 |
DT | Average | 0.853 | 0.850 | 0.847 | 0.643 |
GB | STR | 0.91 | 0.90 | 0.91 | 0.80 |
GB | UNS | 0.86 | 0.82 | 0.83 | 0.73 |
GB | COMB | 0.88 | 0.86 | 0.87 | 0.80 |
GB | Average | 0.883 | 0.860 | 0.870 | 0.777 |
XGB | STR | 0.92 | 0.93 | 0.92 | 0.84 |
XGB | UNS | 0.89 | 0.90 | 0.88 | 0.80 |
XGB | COMB | 0.89 | 0.89 | 0.89 | 0.78 |
XGB | Average | 0.900 | 0.907 | 0.897 | 0.807 |
DNN | STR | 0.91 | 0.92 | 0.91 | 0.75 |
DNN | UNS | 0.89 | 0.90 | 0.89 | 0.82 |
DNN | COMB | 0.85 | 0.87 | 0.86 | 0.57 |
DNN | Average | 0.883 | 0.897 | 0.887 | 0.713 |
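The Average rows in the table above appear consistent with an unweighted arithmetic mean over the three datasets (STR, UNS, COMB), rounded to three decimals. The short sketch below recomputes them from the per-dataset values; the dictionary layout and function name are assumptions made for illustration.

```python
# Per-dataset (precision, recall, F1, AUROC) values copied from the table above.
RESULTS = {
    "LR": {"STR": (0.86, 0.80, 0.82, 0.72), "UNS": (0.87, 0.86, 0.86, 0.65),
           "COMB": (0.82, 0.78, 0.80, 0.55)},
    "DT": {"STR": (0.87, 0.86, 0.86, 0.70), "UNS": (0.86, 0.83, 0.84, 0.69),
           "COMB": (0.83, 0.86, 0.84, 0.54)},
    "GB": {"STR": (0.91, 0.90, 0.91, 0.80), "UNS": (0.86, 0.82, 0.83, 0.73),
           "COMB": (0.88, 0.86, 0.87, 0.80)},
    "XGB": {"STR": (0.92, 0.93, 0.92, 0.84), "UNS": (0.89, 0.90, 0.88, 0.80),
            "COMB": (0.89, 0.89, 0.89, 0.78)},
    "DNN": {"STR": (0.91, 0.92, 0.91, 0.75), "UNS": (0.89, 0.90, 0.89, 0.82),
            "COMB": (0.85, 0.87, 0.86, 0.57)},
}


def average_over_datasets(per_dataset):
    """Unweighted mean of each metric across datasets, rounded to 3 decimals."""
    columns = zip(*per_dataset.values())
    return tuple(round(sum(col) / len(col), 3) for col in columns)


averages = {algo: average_over_datasets(d) for algo, d in RESULTS.items()}
best_by_auroc = max(averages, key=lambda algo: averages[algo][3])

for algo, metrics in averages.items():
    print(algo, metrics)
print("highest average AUROC:", best_by_auroc)
```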
References
- Bray, F.; Ferlay, J.; Soerjomataram, I.; Siegel, R.L.; Torre, L.A.; Jemal, A. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin. 2018, 68, 394–424.
- Roux, A.; Cholerton, R.; Sicsic, J.; Moumjid, N.; French, D.P.; Giorgi Rossi, P.; Balleyguier, C.; Guindy, M.; Gilbert, F.J.; Burrion, J.B.; et al. Study protocol comparing the ethical, psychological and socio-economic impact of personalised breast cancer screening to that of standard screening in the “My Personal Breast Screening” (MyPeBS) randomised clinical trial. BMC Cancer 2022, 22, 1–13.
- Esserman, L.J. The WISDOM Study: Breaking the deadlock in the breast cancer screening debate. NPJ Breast Cancer 2017, 3, 34.
- Hortobagyi, G.N.; Stephen, B.E.; Armando, G. New and important changes in the TNM staging system for breast cancer. Am. Soc. Clin. Oncol. Educ. Book 2018, 38, 457–467.
- van Maaren, M.C.; de Munck, L.; Strobbe, L.J.; Sonke, G.S.; Westenend, P.J.; Smidt, M.L.; Poortmans, P.M.P.; Siesling, S. Ten-year recurrence rates for breast cancer subtypes in the Netherlands: A large population-based study. Int. J. Cancer 2019, 144, 263–272.
- Liu, F.-F.; Shi, W.; Done, S.J.; Miller, N.; Pintilie, M.; Voduc, D.; Nielsen, T.O.; Nofech-Mozes, S.; Chang, M.C.; Whelan, T.J.; et al. Identification of a low-risk luminal A breast cancer cohort that may not benefit from breast radiotherapy. J. Clin. Oncol. 2015, 33, 2035–2040.
- Tsutsui, S.; Ohno, S.; Murakami, S.; Hachitanda, Y.; Oda, S. Prognostic value of c-erbB2 expression in breast cancer. J. Surg. Oncol. 2002, 79, 216–223.
- Tobin, N.P.; Harrell, J.C.; Lövrot, J.; Brage, S.E.; Stolt, M.F.; Carlsson, L.; Einbeigi, Z.; Linderholm, B.; Loman, L.; Malmberg, M.; et al. Molecular subtype and tumor characteristics of breast cancer metastases as assessed by gene expression significantly influence patient post-relapse survival. Ann. Oncol. 2015, 26, 81–88.
- Dent, R.; Trudeau, M.; Pritchard, K.I.; Hanna, W.M.; Kahn, H.K.; Sawka, C.A.; Lickley, L.A.; Rawlinson, E.; Sun, P.; Narod, S.A. Triple-negative breast cancer: Clinical features and patterns of recurrence. Clin. Cancer Res. 2007, 13, 4429–4434.
- Boyle, P. Triple-negative breast cancer: Epidemiological considerations and recommendations. Ann. Oncol. 2012, 23, vi7–vi12.
- Luz, E.J.d.S.; Schwartz, W.R.; Cámara-Chávez, G.; Menotti, D. ECG-based heartbeat classification for arrhythmia detection: A survey. Comput. Methods Programs Biomed. 2016, 127, 144–164.
- Zou, Q.; Qu, K.; Luo, Y.; Yin, D.; Ju, Y.; Tang, H. Predicting diabetes mellitus with machine learning techniques. Front. Genet. 2018, 9, 515.
- Mahmoudi, E.; Kamdar, N.; Kim, N.; Gonzales, G.; Singh, K.; Waljee, A.K. Use of electronic medical records in development and validation of risk prediction models of hospital readmission: Systematic review. BMJ 2020, 369, m958.
- Liu, X.; Song, L.; Liu, S.; Zhang, Y. A review of deep-learning-based medical image segmentation methods. Sustainability 2021, 13, 1224.
- Bullard, J.; Dust, K.; Funk, D.; Strong, J.E.; Alexander, D.; Garnett, L.; Boodman, C.; Bello, A.; Hedley, A.; Schiffman, Z.; et al. Predicting infectious severe acute respiratory syndrome coronavirus 2 from diagnostic samples. Clin. Infect. Dis. 2020, 71, 2663–2666.
- Agrebi, S.; Anis, L. Use of Artificial Intelligence in Infectious Diseases. In Artificial Intelligence in Precision Health; Academic Press: Cambridge, MA, USA, 2020; pp. 415–438.
- Moncada-Torres, A.; van Maaren, M.C.; Hendriks, M.P.; Siesling, S.; Geleijnse, G. Explainable machine learning can outperform Cox regression predictions and provide insights in breast cancer survival. Sci. Rep. 2021, 11, 6968.
- Othman, M.; Mohd, A.M.B. Probabilistic neural network for brain tumor classification. In Proceedings of the 2011 Second International Conference on Intelligent Systems, Modelling and Simulation, Phnom Penh, Cambodia, 25–27 January 2011.
- Choi, Y.J.; Baek, J.H.; Park, H.S.; Shim, W.H.; Kim, T.Y.; Shong, Y.K.; Lee, J.H. A computer-aided diagnosis system using artificial intelligence for the diagnosis and characterization of thyroid nodules on ultrasound: Initial clinical assessment. Thyroid 2017, 27, 546–552.
- Mambou, S.J.; Maresova, P.; Krejcar, O.; Selamat, A.; Kuca, K. Breast cancer detection using infrared thermal imaging and a deep learning model. Sensors 2018, 18, 2799.
- Stark, G.F.; Hart, G.R.; Nartowt, B.J.; Deng, J. Predicting breast cancer risk using personal health data and machine learning models. PLoS ONE 2019, 14, e0226765.
- Parikh, R.B.; Manz, C.; Chivers, C.; Regli, S.H.; Braun, J.; Draugelis, M.E.; Schuchter, L.M.; Schulman, L.N.; Navathe, A.S.; Patel, M.S.; et al. Machine learning approaches to predict 6-month mortality among patients with cancer. JAMA Netw. Open 2019, 2, e1915997.
- Alabi, R.O.; Elmusrati, M.; Sawazaki-Calone, I.; Kowalski, L.P.; Haglund, C.; Coletta, R.D.; Mäkitie, A.A.; Salo, T.; Almangush, A.; Leivo, I. Comparison of supervised machine learning classification techniques in prediction of locoregional recurrences in early oral tongue cancer. Int. J. Med. Inform. 2020, 136, 104068.
- Xu, Y.; Ju, L.; Tong, J.; Zhou, C.M.; Yang, J.J. Machine learning algorithms for predicting the recurrence of stage IV colorectal cancer after tumor resection. Sci. Rep. 2020, 10, 2519.
- Lou, S.-J.; Hou, M.F.; Chang, H.T.; Chiu, C.C.; Lee, H.H.; Yeh, S.C.J.; Shi, H.Y. Machine learning algorithms to predict recurrence within 10 years after breast cancer surgery: A prospective cohort study. Cancers 2020, 12, 3817.
- Boeri, C.; Chiappa, C.; Galli, F.; De Berardinis, V.; Bardelli, L.; Carcano, G.; Rovera, F. Machine Learning techniques in breast cancer prognosis prediction: A primary evaluation. Cancer Med. 2020, 9, 3234–3243.
- Yang, P.-T.; Wu, W.S.; Wu, C.C.; Shih, Y.N.; Hsieh, C.H.; Hsu, J.L. Breast cancer recurrence prediction with ensemble methods and cost-sensitive learning. Open Med. 2021, 16, 754–768.
- Ngiam, K.Y.; Khor, W. Big data and machine learning algorithms for health-care delivery. Lancet Oncol. 2019, 20, e262–e273.
- Chen, M.; Hao, Y.; Hwang, K.; Wang, L.; Wang, L. Disease prediction by machine learning over big data from healthcare communities. IEEE Access 2017, 5, 8869–8879.
- Zhang, D.; Yin, C.; Zeng, J.; Yuan, X.; Zhang, P. Combining structured and unstructured data for predictive models: A deep learning approach. BMC Med. Inform. Decis. Mak. 2020, 20, 1–11.
- Zeng, Z.; Espino, S.; Roy, A.; Li, X.; Khan, S.A.; Clare, S.E.; Jiang, X.; Neapolitan, R.; Luo, Y. Using natural language processing and machine learning to identify breast cancer local recurrence. BMC Bioinform. 2018, 19, 65–74.
- Karimi, Y.H.; Blayney, D.W.; Kurian, A.W.; Shen, J.; Yamashita, R.; Rubin, D.; Banerjee, I. Development and use of natural language processing for identification of distant cancer recurrence and sites of distant recurrence using unstructured electronic health record data. JCO Clin. Cancer Inform. 2021, 5, 469–478.
- Datta, S.; Bernstam, E.V.; Roberts, K. A frame semantic overview of NLP-based information extraction for cancer-related EHR notes. J. Biomed. Inform. 2019, 100, 103301.
- Barber, E.L.; Garg, R.; Persenaire, C.; Simon, M. Natural language processing with machine learning to predict outcomes after ovarian cancer surgery. Gynecol. Oncol. 2021, 160, 182–186.
- Ribelles, N.; Jerez, J.M.; Rodriguez-Brazzarola, P.; Jimenez, B.; Diaz-Redondo, T.; Mesa, H.; Marquez, A.; Sanchez-Muñoz, A.; Pajares, B.; Carabantes, F.; et al. Machine learning and natural language processing (NLP) approach to predict early progression to first-line treatment in real-world hormone receptor-positive (HR+)/HER2-negative advanced breast cancer patients. Eur. J. Cancer 2021, 144, 224–231.
- González-Castro, L.; Cal-González, V.M.; Del Fiol, G.; López-Nores, M. CASIDE: A data model for interoperable cancer survivorship information based on FHIR. J. Biomed. Inform. 2021, 124, 103953.
- Quan, H.; Sundararajan, V.; Halfon, P.; Fong, A.; Burnand, B.; Luthi, J.C.; Saunders, L.D.; Beck, C.A.; Feasby, T.E.; Ghali, W.A. Coding algorithms for defining comorbidities in ICD-9-CM and ICD-10 administrative data. Med. Care 2005, 43, 1130–1139.
- Bonaccorso, G. Machine Learning Algorithms; Packt Publishing Ltd.: Birmingham, UK, 2017.
- Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016.
- Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
- Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
- Virtanen, P.; Gommers, R.; Oliphant, T.E.; Haberland, M.; Reddy, T.; Cournapeau, D.; Burovski, E.; Peterson, P.; Weckesser, W.; Bright, J.; et al. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nat. Methods 2020, 17, 261–272.
- Kantarjian, H.; Yu, P.P. Artificial intelligence, big data, and cancer. JAMA Oncol. 2015, 1, 573–574.
- Vinayak, R.K.; Gilad-Bachrach, R. DART: Dropouts meet multiple additive regression trees. In Proceedings of the Artificial Intelligence and Statistics, PMLR, San Diego, CA, USA, 9–12 May 2015; pp. 489–497.
- Tomašev, N.; Harris, N.; Baur, S.; Mottram, A.; Glorot, X.; Rae, J.W.; Zielinski, M.; Askham, H.; Saraiva, A.; Magliulo, V.; et al. Use of deep learning to develop continuous-risk models for adverse event prediction from electronic health records. Nat. Protoc. 2021, 16, 2765–2787.
- Gupta, M.; Phan, T.L.T.; Bunnell, H.T.; Beheshti, R. Obesity Prediction with EHR Data: A deep learning approach with interpretable elements. ACM Trans. Comput. Healthc. (HEALTH) 2022, 3, 1–19.
- Pham, T.; Tran, T.; Phung, D.; Venkatesh, S. Predicting healthcare trajectories from medical records: A deep learning approach. J. Biomed. Inform. 2017, 69, 218–229.
- Shwartz-Ziv, R.; Armon, A. Tabular data: Deep learning is not all you need. Inf. Fusion 2022, 81, 84–90.
- Schuster, M.; Paliwal, K.K. Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 1997, 45, 2673–2681.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
- Gianni, C.; Palleschi, M.; Schepisi, G.; Casadei, C.; Bleve, S.; Merloni, F. Circulating inflammatory cells in patients with metastatic breast cancer: Implications for treatment. Front. Oncol. 2022, 12, 882896.
- Onesti, C.E.; Josse, C.; Boulet, D.; Thiry, J.; Beaumecker, B.; Bours, V.; Jerusalem, G. Blood eosinophilic relative count is prognostic for breast cancer and associated with the presence of tumor at diagnosis and at time of relapse. Oncoimmunology 2020, 9, 1761176.
- Onesti, C.E.; Josse, C.; Poncin, A.; Frères, P.; Poulet, C.; Bours, V.; Jerusalem, G. Predictive and prognostic role of peripheral blood eosinophil count in triple-negative and hormone receptor-negative/HER2-positive breast cancer patients undergoing neoadjuvant treatment. Oncotarget 2018, 9, 33719.
Feature | Possible Values | Description |
---|---|---|
Sex | nominal: male, female | Male or female |
Age at diagnosis | numerical | Age of the patient at the time of diagnosis |
BMI | numerical | A patient’s weight in kilograms divided by the square of his/her height in meters |
ECOG | ordinal: 0, 1, 2, 3, 4 | Eastern Cooperative Oncology Group (ECOG) performance status score. Describes the patient’s level of functioning in terms of their ability to care for themselves, daily activity, and physical ability |
Comorbidities (Elixhauser categories) | nominal: Elixhauser categories | Medical condition existing simultaneously but independently with another condition in a patient |
Tumor site | nominal: C501, C502, C503, [...] | Tumor body location |
Grade | ordinal: 1, 2, 3, 4 | The degree of differentiation of the cancer cells |
TNM staging (clinical and pathological) | categorical: T1, T2, T3, T4, TX, [...] | TNM system describes the amount and spread of cancer in a patient’s body. T: tumor size; N: lymph node involvement; M: presence or absence of metastases |
Estrogen Receptor | numerical | Percentage of cancer cells expressing estrogen receptors in the tumor tissue sample. |
Progesterone Receptor | numerical | Percentage of cancer cells expressing progesterone receptors in the tumor tissue sample. |
HER2 | ordinal | Human Epidermal growth factor Receptor 2 |
Ki67 | numerical | Antigen Ki-67, a marker of cell proliferation |
No. surgeries | numerical | Number of surgeries (tumorectomy, mastectomy) |
No. chemotherapies | numerical | Treatment of cancer by cytotoxic and/or other drugs. |
No. radiotherapies | numerical | Treatment of the tumor using X-rays. |
Concept Type | Number of Features |
---|---|
Disease | 35 |
Symptom | 53 |
Medicine | 16 |
Procedure | 8 |
Risk factor | 6 |
Feature | Total (n = 823) | Non-Recurrence (n = 718) | Recurrence (n = 105) | Completeness |
---|---|---|---|---|
Sex | | | | 100% |
male | 5 (0.6%) | 4 (0.6%) | 1 (1.0%) | |
female | 818 (99.4%) | 714 (99.4%) | 104 (99.0%) | |
Age at diagnosis | 60.39 ± 12.71 | 60.58 ± 12.70 | 59.13 ± 12.71 | 100% |
BMI | 25.77 ± 4.83 | 25.70 ± 4.90 | 26.28 ± 4.28 | 86.15% |
ECOG | | | | 99.88% |
ECOG 0 | 745 (91.0%) | 657 (91.5%) | 88 (83.8%) | |
ECOG 1 | 65 (7.9%) | 51 (7.1%) | 14 (13.3%) | |
ECOG 2 | 9 (1.1%) | 7 (1.0%) | 2 (1.9%) | |
ECOG 3 | 3 (0.4%) | 2 (0.3%) | 1 (1.0%) | |
ECOG 4 | 1 (0.1%) | 1 (0.1%) | 0 (0%) | |
Comorbidities | | | | 100% |
hypertension uncomplicated | 1.01 ± 2.81 | 0.99 ± 2.65 | 1.16 ± 3.73 | |
chronic pulmonary disease | 0.24 ± 0.94 | 0.22 ± 0.90 | 0.34 ± 1.26 | |
diabetes uncomplicated | 0.31 ± 1.86 | 0.31 ± 1.84 | 0.41 ± 2.04 | |
hypothyroidism | 0.49 ± 1.92 | 0.45 ± 1.72 | 0.76 ± 2.91 | |
metastatic cancer | 3.32 ± 10.53 | 2.44 ± 7.27 | 9.36 ± 21.67 | |
solid tumor without metastasis | 7.60 ± 9.54 | 6.84 ± 8.53 | 12.84 ± 13.63 | |
obesity | 0.24 ± 0.83 | 0.22 ± 0.60 | 0.37 ± 1.73 | |
alcohol abuse | 0.19 ± 1.43 | 0.21 ± 1.52 | 0.09 ± 0.37 | |
Tumor site | | | | 100% |
C500 | 1 (0.1%) | 1 (0.1%) | 0 (0%) | |
C501 | 71 (8.6%) | 63 (8.8%) | 8 (7.6%) | |
C502 | 83 (10.1%) | 66 (9.2%) | 17 (16.2%) | |
C503 | 48 (5.8%) | 45 (6.3%) | 3 (2.9%) | |
C504 | 259 (31.5%) | 231 (32.2%) | 28 (26.7%) | |
C505 | 63 (7.7%) | 57 (7.9%) | 6 (5.7%) | |
C506 | 11 (1.3%) | 8 (1.1%) | 3 (2.9%) | |
C508 | 267 (32.4%) | 230 (32.2%) | 37 (35.2%) | |
C509 | 20 (2.4%) | 17 (2.4%) | 3 (2.9%) | |
Grade | | | | 89.79% |
G1 | 78 (9.5%) | 74 (10.3%) | 4 (3.8%) | |
G2 | 133 (16.2%) | 118 (16.4%) | 15 (14.3%) | |
G3 | 611 (74.2%) | 525 (73.1%) | 86 (81.9%) | |
G4 | 1 (0.1%) | 1 (0.1%) | 0 (0%) | |
clinical TNM | | | | 87.61% |
T | | | | |
T1 | 400 (48.6%) | 369 (51.4%) | 31 (29.5%) | |
T2 | 260 (31.6%) | 210 (29.2%) | 50 (47.6%) | |
T3 | 30 (3.6%) | 24 (3.3%) | 6 (5.7%) | |
T4 | 28 (3.4%) | 16 (2.2%) | 12 (11.4%) | |
TX | 105 (12.6%) | 99 (13.8%) | 6 (5.7%) | |
N | | | | |
N0 | 552 (67.1%) | 503 (70.1%) | 49 (46.7%) | |
N1 | 113 (13.7%) | 78 (10.9%) | 35 (33.3%) | |
N2 | 11 (1.3%) | 7 (1.0%) | 4 (3.8%) | |
N3 | 10 (1.2%) | 6 (0.8%) | 4 (3.8%) | |
NX | 137 (16.6%) | 124 (17.3%) | 13 (12.4%) | |
M | | | | |
M0 | 444 (53.9%) | 371 (51.7%) | 73 (69.5%) | |
M1 | 20 (2.4%) | 12 (1.7%) | 8 (7.6%) | |
MX | 359 (43.6%) | 335 (46.7%) | 24 (22.9%) | |
pathological TNM | | | | 97.33% |
T | | | | |
T0 | 24 (2.9%) | 19 (2.6%) | 5 (4.8%) | |
T1 | 417 (50.7%) | 380 (52.9%) | 37 (35.2%) | |
T2 | 232 (28.2%) | 190 (26.5%) | 42 (40.0%) | |
T3 | 35 (4.3%) | 22 (3.1%) | 13 (12.4%) | |
T4 | 15 (1.8%) | 12 (1.7%) | 3 (2.9%) | |
TX | 100 (12.2%) | 95 (12.2%) | 5 (4.8%) | |
N | | | | |
N0 | 544 (66.1%) | 494 (68.8%) | 50 (47.6%) | |
N1 | 153 (18.6%) | 127 (17.7%) | 26 (24.8%) | |
N2 | 48 (5.8%) | 32 (4.5%) | 16 (15.2%) | |
N3 | 18 (2.2%) | 9 (1.3%) | 9 (8.6%) | |
NX | 60 (7.3%) | 56 (7.8%) | 4 (3.8%) | |
M | | | | |
M0 | 11 (1.3%) | 9 (1.3%) | 2 (1.9%) | |
M1 | 13 (1.6%) | 7 (1.0%) | 6 (5.7%) | |
MX | 799 (97.1%) | 702 (97.8%) | 97 (92.4%) | |
ER | 79.63 ± 34.92 | 82.34 ± 32.51 | 61.13 ± 44.28 | 87.85% |
PR | 53.48 ± 38.53 | 56.46 ± 37.78 | 33.13 ± 37.59 | 87.85% |
HER2 | | | | 87.97% |
0 | 152 (18.5%) | 133 (18.5%) | 19 (18.1%) | |
1+ | 475 (57.7%) | 429 (59.8%) | 46 (43.8%) | |
2+ | 128 (15.6%) | 107 (14.9%) | 21 (20.0%) | |
3+ | 68 (8.3%) | 49 (6.8%) | 19 (18.1%) | |
Ki67 | 18.07 ± 19.56 | 16.35 ± 17.86 | 29.8 ± 25.8 | 87.97% |
No. surgeries | 1.44 ± 0.81 | 1.44 ± 0.79 | 1.45 ± 0.93 | 100% |
No. chemotherapies | 10.88 ± 22.51 | 8.42 ± 16.41 | 27.70 ± 42.67 | 100% |
No. radiotherapies | 4.24 ± 3.43 | 4.15 ± 3.41 | 4.87 ± 3.49 | 100% |
Concept | Concept Type | Total | Non-Recurrence | Recurrence |
---|---|---|---|---|
Chemotherapy | M | 6.65 ± 9.13 | 5.95 ± 8.92 | 11.47 ± 9.2 |
Axillary Lymphadenopathy | S | 1.08 ± 3.82 | 0.87 ± 3.52 | 2.5 ± 5.22 |
Antineoplastic Agent | M | 3.16 ± 4.91 | 2.82 ± 4.8 | 5.45 ± 5.1 |
Carcinoma | D | 8.21 ± 6.75 | 7.67 ± 6.61 | 11.87 ± 6.59 |
Biopsy | P | 2.44 ± 4.17 | 2.19 ± 3.94 | 4.17 ± 5.21 |
Concept | Total | Non-Recurrence | Recurrence |
---|---|---|---|
Axillary Lymphadenopathy | 1.08 ± 3.82 | 0.87 ± 3.52 | 2.5 ± 5.22 |
Ulcer of Nipple | 0.02 ± 0.56 | 0 ± 0 | 0.15 ± 1.56 |
Breast Mass | 0.77 ± 3.36 | 0.66 ± 3.22 | 1.58 ± 4.11 |
Colon Polyp | 0.04 ± 0.68 | 0.01 ± 0.14 | 0.22 ± 1.87 |
Lymphangitis | 0.08 ± 0.88 | 0.05 ± 0.56 | 0.33 ± 1.98 |
Algorithm | Precision | Recall | F1 | AUROC |
---|---|---|---|---|
LR | 0.850 | 0.813 | 0.827 | 0.640 |
DT | 0.853 | 0.850 | 0.847 | 0.643 |
GB | 0.883 | 0.860 | 0.870 | 0.777 |
XGB | 0.900 | 0.907 | 0.897 | 0.807 |
DNN | 0.883 | 0.897 | 0.887 | 0.713 |
Feature Set | Precision (CI 95%) | Recall (CI 95%) | F1 (CI 95%) | AUROC (CI 95%) |
---|---|---|---|---|
STR | 0.926 (0.924–0.928) | 0.928 (0.927–0.930) | 0.919 (0.917–0.921) | 0.847 (0.843–0.852) |
UNS | 0.894 (0.892–0.897) | 0.903 (0.901–0.905) | 0.882 (0.880–0.885) | 0.793 (0.787–0.799) |
COMB | 0.891 (0.888–0.893) *† | 0.891 (0.889–0.893) *† | 0.889 (0.886–0.891) *† | 0.778 (0.771–0.783) *† |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
González-Castro, L.; Chávez, M.; Duflot, P.; Bleret, V.; Martin, A.G.; Zobel, M.; Nateqi, J.; Lin, S.; Pazos-Arias, J.J.; Del Fiol, G.; et al. Machine Learning Algorithms to Predict Breast Cancer Recurrence Using Structured and Unstructured Sources from Electronic Health Records. Cancers 2023, 15, 2741. https://doi.org/10.3390/cancers15102741