Article

A Comparison of Interpretable Machine Learning Approaches to Identify Outpatient Clinical Phenotypes Predictive of First Acute Myocardial Infarction

1 Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
2 Department of Anesthesiology, University of Michigan, Ann Arbor, MI 48109, USA
3 Michigan Institute for Data Science, University of Michigan, Ann Arbor, MI 48109, USA
4 Max Harry Weil Institute for Critical Care Research and Innovation, University of Michigan, Ann Arbor, MI 48109, USA
5 Department of Emergency Medicine, University of Michigan, Ann Arbor, MI 48109, USA
6 Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109, USA
* Author to whom correspondence should be addressed.
Diagnostics 2024, 14(16), 1741; https://doi.org/10.3390/diagnostics14161741
Submission received: 15 June 2024 / Revised: 30 July 2024 / Accepted: 6 August 2024 / Published: 10 August 2024

Abstract

Background: Acute myocardial infarctions are deadly to patients and burdensome to healthcare systems. Most recorded infarctions are patients’ first, occur out of the hospital, and often are not accompanied by cardiac comorbidities. The clinical manifestations of the underlying pathophysiology leading to an infarction are not fully understood, and little effort has been made to use explainable machine learning to learn predictive clinical phenotypes before hospitalization is needed. Methods: We extracted outpatient electronic health record data for 2641 case and 5287 matched-control patients, all without pre-existing cardiac diagnoses, from the Michigan Medicine Health System. We compare six different interpretable, feature extraction approaches, including temporal computational phenotyping, and train seven interpretable machine learning models to predict the onset of first acute myocardial infarction within six months. Results: Using temporal computational phenotypes significantly improved the model performance compared to alternative approaches. The mean cross-validation test set performance exhibited area under the receiver operating characteristic curve values as high as 0.674. The most consistently predictive phenotypes of a future infarction include back pain, cardiometabolic syndrome, family history of cardiovascular diseases, and high blood pressure. Conclusions: Computational phenotyping of longitudinal health records can improve classifier performance and identify predictive clinical concepts. State-of-the-art interpretable machine learning approaches can augment acute myocardial infarction risk assessment and prioritize potential risk factors for further investigation and validation.

1. Introduction

An acute myocardial infarction (AMI), or “heart attack”, is myocardial necrosis due to sudden ischemia caused by blood clotting around ruptured or exposed plaque in the coronary arteries [1,2]. Globally, more than 3 million people have an AMI each year [3]. Fortunately, AMI incidence rates have declined as researchers and clinicians have identified and managed risk factors [4]. The vast majority of AMIs occur out of the hospital, where patients have limited monitoring [5]. Additionally, most reported AMIs are the patients’ first and are often unaccompanied by comorbidities [1,6]. These observations highlight the inherent difficulty in predicting AMI events.
There have been significant efforts to predict a variety of severe adverse cardiovascular events, including AMIs. Many studies predict AMI onset in patients while they are in the hospital [7,8,9,10]. However, ideally, prediction occurs earlier so clinicians can intervene to avoid hospitalization. Using electronic health record (EHR) data for over 20,000 AMI cases in a cohort of 2.27 million patients from the UCHealth hospital system, Mandair et al. predicted the 6 month risk of a first AMI using several machine learning models [11]. The best-performing model achieved an AUROC of 0.835 and F1 of 0.092. Unfortunately, their model exhibited poor calibration, ignored timing, and did not utilize laboratory values, nor did they provide any insight into how their model made predictions. Moore and Bell used XGBoost on data from over 500,000 patients in the UK Biobank to predict self-reported “heart attack” (11,849 cases) [12]. They interpreted their models using SHAP values. However, they did not give any information regarding the timing of recorded features or heart attack. Wang et al. predicted AMI within 10 years in 11,635 patients but did not provide any model interpretability [13]. Tsarapatsani et al. predicted 10 year AMI onset in a cohort of 3267 patients that had electrocardiogram and angiography data available, and used SHAP values for model explainability [14]. Sievering et al. predicted 5 year AMI onset in 500 patients with coronary artery disease using angiography images and 11 clinical features [15]. While significant effort has been put into predicting AMIs, the resulting models often focus on patients who already have cardiovascular comorbidities, ignore temporal relationships in the data, and lack sufficient interpretability and explainability.
Interpretability and explainability in machine learning models generally derive from two approaches: model structure and post hoc analyses. Common post hoc methods for quantifying feature importance include SHAP [16] and LIME [17] values, albeit with potentially questionable reliability [18]. Many canonical models incorporate inherent interpretability into their structure. Logistic regression models provide variable coefficients that indicate the effect of features on the outcome. Tree-based models like random forest [19] and Extreme Gradient Boosting (XGBoost) [20] compute feature importance based on location in trees and metrics like impurity and gain, respectively. Generalized additive models, such as the Explainable Boosting Machine (EBM) [21], learn a nonlinear function for each variable, or interaction of variables, which describes their impact on the model. Attention mechanisms embedded in deep learning models can indicate which inputs weigh heavily on the outcome and capture relationships between data. For example, TabNet provides global and local feature importance scores [22]. However, the interpretability and reliability of attention is disputed [23]. Matrix and tensor factorization methods learn interpretable factors that provide a low-rank approximation of the data and can be used for clustering, phenotyping, dimensionality reduction, and feature engineering. Applied to EHR data, tensor factorization can automatically discover patterns of co-occurring medical variables across patients and their evolution across time [24]. This has proven useful as the irregular temporal nature of EHR data is a primary challenge. Fuzzy neural networks are models that use fuzzy logic within a neural network structure to map features to interpretable concepts and learn logical rules for prediction. Specifically, the tropical geometry fuzzy neural network (TGFNN) developed by Yao et al. has shown recent promise [25]. We employ several of these interpretable methods in this work.
In this work, we assess whether state-of-the-art interpretable machine learning approaches can learn clinical profiles that predict a patient’s first AMI, before hospitalization is required. We extract five years of longitudinal outpatient EHR data for patients without cardiovascular diagnoses before AMI onset, from the University of Michigan Health System (2698 positive and 5396 matched negative samples). Using tensor factorization, we reduce the dimensionality of the longitudinal health history while preserving interpretability and temporal relationships. Using the EHR phenotypes and other patient data, we train seven state-of-the-art interpretable machine learning models, including TGFNN, to predict AMI onset within six months. We evaluate whether incorporating temporal information via computational phenotyping improves classification performance, and we assess overall model performance and the consistency of important features. We present and clinically validate the learned phenotypes, rules, and relationships that explain the models’ predicted outcomes. We anticipate that these findings can assist researchers and clinicians in better understanding the risk factors of AMI, identifying at-risk patients, and providing preventative care.

2. Materials and Methods

2.1. Dataset

In this study, we used outpatient data collected from adult patients of the University of Michigan Health System (UMHS) from 1 January 2012 to 1 May 2023.
To define our cohort, we retrieved data from adult patients (23–89 years) who had at least three outpatient visits within the five years before their latest visit or their first cardiac diagnosis. We defined cardiac diagnoses as ICD9 codes 410.*–429.* and 785.0–785.1, and ICD10 codes I20.*–I52.* and R00.*. Cases, or positive samples, were defined as those patients in the cohort whose first cardiac diagnosis was an AMI (ICD9: 410.*; ICD10: I21.*). Controls, or negative samples, were any other UMHS patient who met the above criteria but did not have a cardiac diagnosis. We matched two negative patients to each positive patient based on sex, ±2 years in age (at time of AMI or last encounter), and ±2 points in the hospital frailty risk score [26]. We computed each patient’s hospital frailty risk score using all diagnoses on their EHR in the five-year period. Positive patients without control matches were excluded. This resulted in a cohort of 2641 positive patients (those who develop an AMI) and 5287 negative patients (those who do not develop an AMI). We split the patients into training and testing sets with a 70–30 split. For demographic information on the cohort, see Table 1.
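The 2:1 matching described above can be sketched as a greedy assignment; the field names (`sex`, `age`, `frailty`) and the greedy strategy are illustrative assumptions, not the authors' exact implementation:

```python
# Hypothetical sketch of the 2:1 control matching: a control matches a case
# on sex, +/-2 years in age, and +/-2 points in frailty score. Greedy and
# simplified; the study's matching procedure may differ.

def is_match(case, control):
    """Check the sex, age, and frailty matching criteria."""
    return (control["sex"] == case["sex"]
            and abs(control["age"] - case["age"]) <= 2
            and abs(control["frailty"] - case["frailty"]) <= 2)

def match_controls(cases, controls, n_per_case=2):
    """Greedily assign n_per_case unused controls to each case;
    cases that cannot be fully matched are excluded."""
    used = set()
    matched = {}
    for case in cases:
        picks = []
        for i, ctrl in enumerate(controls):
            if i not in used and is_match(case, ctrl):
                picks.append(i)
                if len(picks) == n_per_case:
                    break
        if len(picks) == n_per_case:  # exclude unmatched cases
            used.update(picks)
            matched[case["id"]] = picks
    return matched
```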

2.2. Data Preprocessing

We extracted each patient’s data within the five years before their AMI or last recorded encounter. These data included time-dependent data like diagnoses, medications, vitals, laboratory results, and substance use. Time-independent data were also extracted, including demographics and family health history. We cleaned the data to remove erroneous and ambiguous values (e.g., text entry in a numeric variable column, values outside of the possible range, etc.). We converted all temperature values to Fahrenheit. Continuous variables like laboratory values and vitals were only included if >60% of patients in the training set had at least one measurement. We selected 60% as the missingness cutoff to prioritize the most common and accessible clinical variables, reduce the need for data imputation, and limit errors in downstream imputation. Categorical variables, like race, were one-hot encoded. We excluded patients missing information on their sex. We determined whether patients had a family history of cardiovascular diseases by whether they had at least one familial occurrence of heart disease, heart attack, coronary artery disease, heart failure, heart defect, aortic disease, sudden cardiac death, cardiomyopathy, cardiovascular disease, or rheumatic heart disease. We excluded procedure data. Diagnosis features were originally recorded as codes from the International Statistical Classification of Diseases (ICD) version 9 or 10 [27], and we converted all ICD9 codes to ICD10 via a conversion table provided at https://github.com/bhanratt/ICD9CMtoICD10CM, accessed on 3 March 2024. We removed all “Z” chapter ICD10 codes. Diagnoses were encoded as binary variables to indicate the presence of the ICD10 code, regardless of how often it was recorded.
To condense the diagnosis data, we added the higher-level ICD10 categories as features if one of their child diagnoses was present, e.g., if a patient had ICD10 code E11.0 present, they would also have E11 marked as present. Medication information was also encoded as binary variables indicating its prescription at every encounter between its start and stop dates. Medication feature names were taken directly as recorded in the EHR. For both diagnosis and medication data, we employed carry-forward imputation followed by zero imputation. We removed variables present in less than 1% of both case and control patients in the training set. All data preprocessing was completed in Python (Version 3.9) and all code used in the study is available at https://github.com/kayvanlabs/interpretable-ami-prediction, accessed on 3 March 2024.
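The carry-forward-then-zero imputation can be illustrated on one patient's per-interval indicator series, where intervals with no encounter are `None`; this is a minimal sketch, not the study code:

```python
# Carry-forward-then-zero imputation for interval-encoded binary variables.
# None marks an interval with no encounter; the last observed value is
# carried forward, and any remaining (leading) gaps become zero.

def impute_intervals(values):
    """Impute a list of 0/1/None interval indicators."""
    out = []
    last = None
    for v in values:
        if v is None:
            out.append(last if last is not None else 0)
        else:
            out.append(v)
            last = v
    return out
```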
We extracted three different sets of features from the data: (1) the demographics, or time-independent data, and latest recorded values within six months of AMI onset or the last encounter, (2) summary statistics of the entire five-year history, and (3) computational phenotypes of the five-year health history using unsupervised tensor factorization. We selected these approaches for condensing patients’ EHR histories because they are common, interpretable, easy to compute, and have been shown to be effective in other studies. Additionally, we evaluated the combination of the feature sets. In total, we tested six distinct feature sets:
  • Latest data and demographics;
  • Summary statistics;
  • Computational phenotypes;
  • Latest data, demographics, and summary statistics;
  • Latest data, demographics, and computational phenotypes;
  • Latest data, demographics, summary statistics, and computational phenotypes.

2.3. Latest Recorded Data of Health History

For each patient in the dataset, we extracted the most recent measurement of each variable before AMI onset or the last recorded visit, for positive and negative samples, respectively.

2.4. Summary Statistics of Health History

Summary statistics of clinical variables are fast to compute, easily understandable, and can be predictive of important outcomes [28]. We summarized laboratory and vital data over the five-year observation window by computing the mean, standard deviation, minimum, and maximum of each variable, for each patient. Categorical variables were aggregated by taking the maximum value, indicating whether a patient ever had the feature present in the five-year window. These statistics reduced the multiple, longitudinal measurements of each variable to a single, interpretable value.
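The four summary statistics can be computed per patient and variable as in the following sketch (the use of population standard deviation is an assumption; the study does not specify the variant):

```python
# Collapse a patient's longitudinal measurements of one clinical variable
# into four interpretable features: mean, standard deviation, min, max.
from statistics import mean, pstdev

def summarize(measurements):
    """measurements: list of values recorded in the five-year window."""
    return {
        "mean": mean(measurements),
        "std": pstdev(measurements) if len(measurements) > 1 else 0.0,
        "min": min(measurements),
        "max": max(measurements),
    }
```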

2.5. Computational Phenotypes of Health History

Computational phenotyping of high-dimensional EHR data via tensor decomposition enables automated, low-dimensional representation of co-occurring medical events across patients, different types of variables, and time [24]. In this work, we use tensor decomposition to discover temporal, clinical phenotypes that act as interpretable features for downstream classification. While various tensor factorization approaches exist, we used the unsupervised, non-negative PARAFAC decomposition, with the hierarchical alternating least squares algorithm, implemented in TensorLy (Version 0.8.1), because of its simplicity and wide use [29]. Non-negative PARAFAC decomposition approximates the original data with the sum of rank-one component tensors. Each component tensor is defined by the outer product of vectors, one for each mode of the original data. The values in these vectors are learned via alternating least squares and describe the components. In this work, we decompose three-dimensional tensors with the modes: patients, time, and features. Thus, after applying non-negative PARAFAC decomposition, each component of the factorization can be interpreted as a phenotype defined by three vectors that encode the membership, weight, or importance of patients, time points, and features.
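The sum-of-rank-one-components structure can be made concrete with a toy reconstruction: each component r of a rank-R PARAFAC model contributes the outer product of a patient vector, a time vector, and a feature vector. This pure-Python sketch illustrates the model form only; in practice the factors are fit with TensorLy's alternating least squares:

```python
# Reconstruct a patients x time x features tensor from PARAFAC factors.
# Each component r is one "phenotype": patient_f[r] gives patient
# memberships, time_f[r] time weights, and feature_f[r] feature weights.

def cp_reconstruct(patient_f, time_f, feature_f):
    """Return the rank-R approximation X[i][j][k] = sum_r a_ri * b_rj * c_rk."""
    R = len(patient_f)
    I, J, K = len(patient_f[0]), len(time_f[0]), len(feature_f[0])
    X = [[[0.0] * K for _ in range(J)] for _ in range(I)]
    for r in range(R):
        for i in range(I):
            for j in range(J):
                for k in range(K):
                    X[i][j][k] += patient_f[r][i] * time_f[r][j] * feature_f[r][k]
    return X
```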
We learned temporal phenotypes for laboratory values and vitals separately from diagnoses and medication. All laboratory values and vitals for the five years preceding AMI onset or the patient’s last encounter were separated into ten, six-month segments. Only the last recorded value of each feature was kept per segment. To simplify computation and interpretation, each feature was discretized into quintiles based on the feature distributions in the training set. The resulting three-dimensional tensor representation of the data consisted of modes: patient × time × feature and size 7928 × 10 × 320.
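The quintile discretization step above learns bin edges from the training distribution and applies them to any value; this is an illustrative sketch, as the exact edge convention used in the study is not specified:

```python
# Discretize a continuous feature into quintiles (0..4) using cut points
# derived from the training set only, so test data never leaks into the bins.

def quintile_edges(train_values):
    """Return the 4 cut points splitting the training values into quintiles."""
    s = sorted(train_values)
    n = len(s)
    return [s[int(n * q / 5)] for q in range(1, 5)]

def to_quintile(value, edges):
    """Map a value to its quintile index using training-set edges."""
    for i, e in enumerate(edges):
        if value < e:
            return i
    return 4
```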
We used diagnosis and medication data over the 5 observation years to generate temporal phenotypes. First, we split the 5 years into ten, six-month intervals. Within each interval, we recorded the diagnoses documented at encounters with a “1” and undocumented diagnoses as “0”. If there was no encounter in the six-month interval, diagnosis variables were left as null. We then performed carry forward imputation of all diagnosis codes until the next interval with an encounter. All remaining null values were imputed with zero. For medication data, we marked a “1” if the medication was prescribed during a given interval; otherwise a “0” was inserted. We formatted these data into a three-dimensional tensor of modes: patient × time × feature of size 7928 × 10 × 734.
Determining the optimal number of phenotypes, or rank, of a tensor decomposition is an open problem [24]. One common approach is to evaluate and plot the predictive performance of various ranks and choose the rank at the “elbow” of the curve—effectively identifying the rank after which performance increases are marginal. We carried this out by first factoring the training set using ranks at increments of two, between one and 50, running three replicates at each rank. Next, using the normalized patient membership to the phenotypes as features, we removed 30% of the training set for a validation set, trained a random forest model to discriminate between positive and negative samples, and visualized the performance according to common machine learning metrics, against the rank. For both the lab/vital and diagnoses/medications (Dx/Rx) phenotypes, the F1 score plateaued by a rank of ten. However, both AUROC and AUPRC gradually increased until approximately rank 30 for lab/vital phenotypes and continued to increase to a rank of 50 for Dx/Rx phenotypes (see Figure A1). We decided to use those ranks for each decomposition. While the predictive performance of the Dx/Rx phenotypes may continue to increase beyond rank 50, in practice, more phenotypes may become increasingly redundant or less interpretable.
Using the ranks of 30 and 50, we decomposed the training set tensors and extracted the learned lab/vital and Dx/Rx phenotypes, respectively. To determine the test set patients’ membership of these phenotypes, we fixed the feature and time dimensions of the phenotypes to those of the training set and then set the decomposition to fit only the patient membership mode. This projects the phenotypes onto the test set patients to determine their membership of each, without changing the phenotypes themselves. Lastly, we concatenated the lab/vital phenotype features with the Dx/Rx phenotype features into a single patient × feature table with 80 temporal phenotypes as features to describe each patient’s EHR history.

2.6. Feature Selection

We used the Minimum Redundancy and Maximum Relevance (mRMR) [30] approach to select the most relevant and least redundant subset of features from each feature set: latest/demographics, summary statistics, and computational phenotypes. We opted to use this feature selection approach because it not only identifies the most relevant features but also limits collinearity between the selected features, unlike other feature selection methods. Using mRMR can improve model performance, speed, and interpretability [31]. Feature relevance is determined by random forest feature importance and feature redundancy by Pearson’s correlation. To assess the optimal number of features for each feature set, we incrementally increased the number of features to use when running mRMR and then assessed their performance in three random forest models. Based on these results, we determined, for each feature set, a reasonably small number of features with near-optimal performance. We selected 20 features for the latest and demographic feature set, 30 for the summary statistics feature set, and 30 for the phenotypes feature set. After feature selection, if any variables had missing values, we imputed them using k-nearest neighbors, as implemented in Scikit-learn (Version 1.2.2) [32], fit on the training set only, and applied to both training and testing sets. Next, we performed the same experiment using combinations of the three feature sets and opted to use 20 features in the latest/demographic/summary statistics feature set, 30 in the latest/demographic/phenotypes feature set, and 60 in the “All” feature set (latest/demographic/summary statistics/phenotypes).
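The mRMR idea can be sketched as a greedy loop that maximizes relevance minus redundancy. For simplicity this toy version uses Pearson correlation for both terms, whereas the study uses random forest importance for relevance; the scoring scheme here is one common variant, not necessarily the exact formulation of [30]:

```python
# Toy greedy mRMR: pick features with high relevance to the label and low
# average correlation with already-selected features.

def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if degenerate)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def mrmr(features, label, k):
    """features: dict name -> list of values; returns k selected names."""
    selected = []
    candidates = list(features)
    while candidates and len(selected) < k:
        def score(name):
            relevance = abs(pearson(features[name], label))
            redundancy = (sum(abs(pearson(features[name], features[s]))
                              for s in selected) / len(selected)) if selected else 0.0
            return relevance - redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

Note how a duplicated feature is skipped in favor of a less redundant one, even when the duplicate is more correlated with the label.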

2.7. Model Training and Cross-Validation

We selected a set of machine learning models to evaluate in this work based on their interpretability and accessibility, including decision tree (DT), logistic regression with L2 penalty (LR), random forest (RF), EBM, XGBoost (XGB), TabNet (TNET), and TGFNN. We used the decision tree, logistic regression, and random forest implementations from scikit-learn (Version 1.2.2), the EBM implementation from InterpretML (Version 0.3.2) [21], and XGBoost (Version 1.7.5) [20], TabNet (Version 4.1.0) [22], and the TGFNN as described in [25] and implemented in Pytorch (Version 2.1.0). To compensate for dataset imbalance, we up-weighted the minority class (positive) and down-weighted the majority class (negative). In models not allowing class weights (EBM, XGBoost, and TabNet), we randomly downsampled the majority class (negative) to a 1:1 ratio with the minority class (positive).
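The two imbalance-handling strategies can be sketched as follows; the inverse-frequency weighting shown is one common scheme (as in scikit-learn's "balanced" mode), an illustrative assumption rather than the study's exact weights:

```python
# Two ways to compensate for class imbalance: inverse-frequency class
# weights, and random 1:1 downsampling of the majority class.
import random

def class_weights(y):
    """Weight each class inversely to its frequency (balanced scheme)."""
    n, pos = len(y), sum(y)
    return {1: n / (2 * pos), 0: n / (2 * (n - pos))}

def downsample_majority(X, y, seed=0):
    """Randomly drop majority-class samples until classes are 1:1."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    major, minor = (neg, pos) if len(neg) > len(pos) else (pos, neg)
    keep = sorted(minor + rng.sample(major, len(minor)))
    return [X[i] for i in keep], [y[i] for i in keep]
```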
We performed three-fold cross-validation on 90% of the training set (10% withheld as a validation set) to determine the optimal hyperparameters for each model. We randomly sampled 500 combinations of hyperparameters for each model, except TGFNN. Because of the longer runtime of TGFNN, we evaluated 300 combinations for all feature sets and 200 on the “All” feature set, where the additional features slowed training. After evaluating their performance, the combination of hyperparameters with the highest F1 score was selected. Next, we performed five-fold cross-validation to evaluate the performance of the models and datasets on the training and test sets. This trains five instances of each model on a different subset of data, providing information on the variance in performance. To evaluate model calibration, we calibrated the best-performing replicate according to F1 score on the “All” feature set. Each model was calibrated on the training set according to Platt’s method and plotted with the mean probability of ten uniform bins.

2.8. Tropical Geometry Fuzzy Neural Network

Due to its lesser-known architecture, we briefly describe the TGFNN, though a full description can be found in [25]. The TGFNN is a fuzzy logic classifier built in a neural network architecture that allows flexible and interpretable variable concept encoding, logical rule learning, and inference for a classification task. TGFNN consists of three modules: the encoding module, the rule module, and the inference module.
The encoding module “fuzzifies” continuous input variables into their membership to the concepts “low”, “medium”, and “high”. This encoding is performed via parameterized membership functions that map each variable to three values in the range [0, 1] that represent how much it belongs to each concept. Categorical variables are one-hot encoded. Membership functions are learned during training and help model the intuition and uncertainty in clinician decision making. For example, TGFNN can learn the concept of “low blood pressure” and use that concept in the decision-making rules.
The rule module learns combinations of variable concepts that are predictive of an outcome. The first layer of the rule module learns which concepts are important for each variable within each rule. The second layer learns the importance of each variable within each rule. The more important the variable, the greater the weight within the network, and thus the more it will contribute to inference when activated. Rule activation strength is calculated via a parameterized T-norm which models an AND operation in fuzzy logic via either a product or minimum function. This enables the easily interpretable and logical structure of the decision rules, for example, “if x₁ is low and x₂ is high”.
The final layer of the TGFNN is the inference layer, which learns the importance of each rule in determining the model output. The importances, or contributions, of the rules are aggregated in a T-conorm function, followed by softmax activation. This calculates the probability of each output class, given the activation of the rules by the input sample. Implementation of tropical geometry allows the OR operation to be changed between an addition or maximum function.
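A toy forward pass in the spirit of these three modules can be sketched as follows. The Gaussian membership functions, centers, widths, and the single hand-written rule are illustrative assumptions; in the TGFNN these are learned parameters, and the final aggregation includes a softmax over classes:

```python
# Toy fuzzy-logic forward pass: fuzzify -> rule activation (product T-norm,
# an AND) -> inference (max T-conorm, an OR). Parameters are illustrative.
import math

def fuzzify(x, low_c=0.2, mid_c=0.5, high_c=0.8, width=0.15):
    """Gaussian membership of x (scaled to [0, 1]) in three concepts."""
    g = lambda c: math.exp(-((x - c) ** 2) / (2 * width ** 2))
    return {"low": g(low_c), "medium": g(mid_c), "high": g(high_c)}

def rule_activation(concept_values):
    """Product T-norm: fuzzy AND over the concepts used in one rule."""
    act = 1.0
    for v in concept_values:
        act *= v
    return act

def infer(rule_activations):
    """Max T-conorm: fuzzy OR over rules gives the positive-class score."""
    return max(rule_activations)

# One hand-written rule, "if systolic BP is high and HDL is low":
bp, hdl = fuzzify(0.9), fuzzify(0.1)
score = infer([rule_activation([bp["high"], hdl["low"]])])
```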

2.9. Statistical Analysis

To evaluate whether differences in model performance across feature sets were significant, we performed Friedman’s tests with the Bonferroni corrected alpha of 0.01 (0.05 divided by the number of tests run, five, one for each metric). For significant Friedman test results, we performed pairwise post hoc Nemenyi tests. We selected these tests because they are non-parametric and recommended when comparing machine learning model cross-validation performance [33].
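The Friedman test statistic can be computed from within-block ranks as in this pure-Python sketch; in practice a statistics library supplies the chi-square p-value and the Nemenyi post hoc comparisons, which are omitted here:

```python
# Friedman test statistic for comparing k feature sets (columns) across n
# models (rows/blocks): rank within each block, then compute the
# chi-square-distributed statistic Q with k - 1 degrees of freedom.

def friedman_statistic(scores):
    """scores[i][j]: metric of feature set j under model i (higher is better)."""
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        # Average ranks within the block, 1 = worst ... k = best; ties share
        # the mean of the ranks they span.
        order = sorted(range(k), key=lambda j: row[j])
        ranks = [0.0] * k
        i = 0
        while i < k:
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1
            for t in range(i, j + 1):
                ranks[order[t]] = avg
            i = j + 1
        for j2 in range(k):
            rank_sums[j2] += ranks[j2]
    return (12.0 / (n * k * (k + 1))) * sum(r * r for r in rank_sums) \
        - 3.0 * n * (k + 1)
```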

3. Results

We find that, by applying straightforward, interpretable machine learning approaches to EHR data, we are able to predict the onset of first-AMI events in patients without pre-existing cardiovascular conditions, within six months, with moderately good accuracy. Upon comparing different explainable feature engineering approaches (the feature sets), we report that they can have significantly different performance, depending on the model. Overall, the best-performing feature sets were those that included computational phenotypes. Additionally, we compared seven machine and deep learning models, each with a different level of interpretability, and found them to exhibit significantly different performance. We present these results in detail in the following sections.

3.1. Feature Set Performance

Feature sets including computational phenotypes significantly outperform those without. Overall, the “All” feature set performs best, though only slightly (see Table 2). This is likely because of the large number of diverse features included, the efficacy of computational phenotyping for feature extraction, and the relevance of historical information. Incorporating computational phenotypes as features resulted in performance gains in AUROC as much as 0.05 (see Table A3 and Figure A4a). We evaluated whether the differences in overall model performance between feature sets were significant by performing Friedman’s test followed by pairwise Nemenyi tests. Across every pairwise comparison, all feature sets containing phenotypes had significantly higher performance than feature sets without phenotypes, according to AUROC. Under AUPRC and F1 score, most of these comparisons were also statistically significant. In no pairwise comparison, regardless of metric, do any of the feature sets containing phenotypes exhibit significantly different performance from each other, save the “All” feature set outperforming the “Phenotypes” feature set according to AUPRC.

3.2. Model Performance

We predict the onset of AMI within six months in patients without pre-existing cardiovascular diagnoses with good performance using several interpretable models. Model performance varied significantly between models and feature sets, often depending on the evaluation metric (see Figure A4). While there is no clear “best” model, random forest, logistic regression, and TGFNN performed best overall. In interpreting model performance on an imbalanced dataset, multiple metrics must be appropriately considered as there is no singularly best one. Determining performance criteria is especially important in a clinical application where false positives and negatives could lead to patient harm, either by receiving unnecessary treatment or not receiving needed care, respectively. We note that the models showed varying levels of minimal-to-mild overfitting according to training, validation, and testing set performance (see Table A1, Table A2 and Table A3). When considering all metrics, logistic regression and random forest consistently performed near best, often followed by TGFNN, while XGBoost and decision tree were often among the worst. Depending on the metric, TGFNN, EBM, and TabNet typically performed either best or worst (see Figure A4). We performed Friedman tests followed by Nemenyi tests to evaluate whether, across all feature sets, performance differences between models were statistically significant. According to AUROC, random forest, logistic regression, and EBM all performed significantly better than the other models. For nearly all pairwise comparisons, random forest, logistic regression, and TGFNN performed significantly better than XGBoost, decision tree, TabNet, and EBM, according to F1 score. When considering F1 score, there was no significant difference between random forest, logistic regression, and TGFNN performance.
Several models appear biased to over- or underestimating risk of AMI. Across all feature sets, TGFNN exhibits high recall on average (0.754 ± 0.234), as it heavily predicts the positive class. Conversely, TabNet and EBM have low average recall scores (0.159 ± 0.124, 0.095 ± 0.076) due to relatively fewer positive predictions. The accuracy of EBM’s relatively few positive predictions contributes to its high mean precision and AUROC (see Table 3). These biases are also present in the model calibration plots (see Figure 1). Both TabNet and EBM underestimate the probability of positive samples while TGFNN slightly overestimates. Overall, the best-performing models, according to F1 score on the “All” feature set, exhibit good calibration.
Several models suffered from variable performance. Model stability is an important factor when considering implementation, especially in a healthcare setting. In contrast to the simpler random forest and logistic regression models, more complicated models, like EBM, TabNet, and TGFNN, exhibited higher standard deviations in performance (see Figure A4d,e). This may be due to the greater number of hyperparameters that require precise tuning in these models. This variance in performance makes the interpretation of important features difficult as well.

3.3. Model Interpretation

Each of the employed models exhibits a degree of inherent interpretability, allowing for some explanation of how predictions were made. For brevity, we will focus our analysis on the “All” feature set, as it is generally the best performing and includes features from all subsets. Additionally, we will focus our interpretability analysis on the better and more consistently performing models: logistic regression, random forest, and XGBoost.
Across all models with global feature importance scores (logistic regression, random forest, XGBoost, EBM, and TabNet), the computational phenotype features of longitudinal EHR data are often the most predictive of a future AMI event. The most important features include Dx/Rx phenotype 47, family history of cardiovascular diseases, Dx/Rx phenotype 36, lab/vital phenotype 18, and a high max systolic blood pressure within the five-year observation window (see Table 4). Feature coefficients in logistic regression and SHAP values of XGBoost and random forest models indicate the direction of the relationship between feature magnitude and future AMI prediction (see Figure 2). The patient phenotypes most strongly indicative of a future AMI are characterized by dorsalgia, type 2 diabetes, hypertension, high creatinine and urea nitrogen levels, cardiovascular medications like atorvastatin, and anemia (see Table 5). The temporal factor of these phenotypes may suggest the characteristic timing of its presentation in patients. We present the temporal factor plots of six of the most predictive phenotypes in Figure 3. The temporal components of the phenotypes predominantly range from immediately before AMI onset to three years prior. Apart from phenotypes being among the most predictive features, additional important variables include a history of smoking and high mean body mass index over the five-year window (Table 6).
The importance of features varies between models, making interpretation difficult at times. The features that are, on average, most important in the “All” set across all models with global feature importance have large standard deviations (see Figure 4). Additionally, we compared how each model cross-validation replicate ranked variables by importance via the Kendall rank correlation coefficient (see Figure 4). This metric shows, on a scale of [−1, 1], how negatively or positively correlated two rankings are, with 0 indicating no correlation. We found variable correlation between replicates and models. Logistic regression, random forest, and XGBoost show moderate correlation between replicates, whereas TabNet and EBM show relatively low correlation between replicates. Surprisingly, despite somewhat comparable performance, the feature importance ranking of logistic regression is somewhat negatively correlated with the rankings of both random forest and XGBoost. This may result from specific model properties, such as logistic regression being limited to linear relationships, whereas XGBoost and random forest can capture nonlinear ones. However, it may also reflect the inherent difficulty of predicting future AMI events in this cohort.
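The ranking comparison described above can be reproduced with SciPy's `kendalltau`, shown here on hypothetical importance scores from two cross-validation replicates (the feature names and values are illustrative, not the study's):

```python
from scipy.stats import kendalltau

# Hypothetical global importance scores for the same five features
# from two cross-validation replicates (values are illustrative only).
features = ["Dx/Rx phenotype 47", "family history CVD", "Dx/Rx phenotype 36",
            "max systolic BP", "mean BMI"]
importance_rep1 = [0.30, 0.25, 0.20, 0.15, 0.10]
importance_rep2 = [0.28, 0.22, 0.24, 0.16, 0.10]

# kendalltau compares the two orderings: +1 identical, -1 reversed, 0 unrelated.
tau, p_value = kendalltau(importance_rep1, importance_rep2)
print(f"Kendall tau across {len(features)} features = {tau:.2f}")
```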
Unlike the other models, TGFNN learns precise, interpretable rules that determine predictions. The best-performing TGFNN model uses the “All” feature set and is based on 12 rules learned directly from the data (AUROC = 0.658, AUPRC = 0.479, F1 = 0.456, precision = 0.475, recall = 0.439). We present these rules in Figure 5. Linguistically, the most important rule (R0) is
Patient has a history of Clopidogrel prescription and matches Dx/Rx phenotype 27 (characteristic features include: vitamin D3, simvastatin, vitamin B-12, vitamin C, and malignant neoplasm of bladder (ICD10: C67)) and has high mean creatinine.
Rules 1–5 are similarly simple to understand, each containing a few concepts, and describe combinations of cardiovascular and metabolic medication prescriptions, abnormal laboratory and vital measurements, and a family history of cardiovascular conditions. Interestingly, R6 weights the presence of a family history as important, whereas R5 weights its absence. These rules may be stratifying between different underlying pathologies leading up to AMI. Notably, interpreting the “low”, “medium”, and “high” concepts depends on the shape of the underlying membership functions. Because these functions have flexible, trainable parameters, they may “squish” the “low” or “high” function out of the possible range of values, dynamically simplifying to only two concepts.
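The “low”/“medium”/“high” concepts above can be illustrated with a toy membership function. This sketch assumes Gaussian memberships with hypothetical centers and widths (the actual TGFNN parameterization may differ); it shows how a trained center pushed beyond the observed range effectively removes a concept:

```python
import math

def gaussian_membership(x, center, width):
    """Degree in [0, 1] to which value x belongs to a fuzzy concept."""
    return math.exp(-((x - center) ** 2) / (2 * width ** 2))

# Hypothetical concepts for a variable normalized to [0, 1]. Training has
# pushed the "high" center to 1.8, outside the observable range.
concepts = {"low": (0.0, 0.25), "medium": (0.5, 0.25), "high": (1.8, 0.25)}

x = 0.9  # near the top of the observed range
memberships = {name: gaussian_membership(x, c, w) for name, (c, w) in concepts.items()}
print(memberships)  # "high" stays near zero; the rule effectively uses two concepts
```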
Overall, we found that the interpretable models can accurately identify patients, without pre-existing malignant cardiovascular diagnoses, who have an AMI within six months. While the evaluated models varied in both performance and prioritization of important features, we identified several consistently important medical concepts and phenotypes.

4. Discussion

We demonstrate that accurately predicting an AMI within six months in patients without pre-existing cardiovascular conditions, using only outpatient data and interpretable models, is possible. Furthermore, we show that temporal, computational phenotyping can identify highly predictive clinical profiles of future AMI events. This suggests the relevance of historical information, temporal EHR relationships, and computational phenotyping in evaluating the future risk of AMI, which is often ignored in similar studies. We anticipate that these findings will be informative to researchers and clinicians seeking to develop interpretable machine learning approaches for hard-to-predict events like AMI, as well as leverage high-dimensional longitudinal EHR data.
The Dx/Rx phenotypes predictive of future AMI onset generally agree with strongly supported clinical relationships and also suggest potentially underutilized ones. The predictive Dx/Rx phenotype 47 describes dorsalgia and other pain. While low back pain does not have a known association with AMI, chronic pain is associated with various cardiovascular diseases [34,35], and some pain medications, like non-steroidal anti-inflammatory drugs (NSAIDs), are a known risk factor for AMI [36]. The consistently predictive Dx/Rx phenotype 13 describes a profile of pain medication prescriptions, including the NSAID ibuprofen. However, several models found it to be negatively associated with future AMI. This discontinuity may be dataset specific, or it may reflect an underlying relationship: a patient prescribed such medications is already interacting with healthcare professionals, who may be helping prevent AMI in other ways. On the other hand, a back pain phenotype may capture patients who misinterpret angina (a precursor symptom of AMI) as dorsalgia. This could suggest that clinicians increase their suspicion of underlying cardiovascular disease when patients present with back pain. In congruence with known AMI risk factors, Dx/Rx phenotype 36 characterizes patients with type 2 diabetes, potentially further complicated by hypertension [37,38]. Dx/Rx phenotype 6 encompasses several cardiovascular medications, like the platelet inhibitor clopidogrel. A clopidogrel prescription suggests that a patient may already have had a severe cardiovascular condition, such as coronary stenosis requiring a stent, that was not recorded in the EHR. This phenotype may be predictive due to poor medication adherence followed by in-stent re-stenosis and a subsequent AMI within approximately six months (see the phenotype temporal peak in Figure 3e). However, further analysis is required to ascertain specific and supported claims of this clinical relationship.
The lab/vital phenotypes suggest some clinically valid risk factors but are noticeably harder to interpret due to large quantile ranges. Lab/vital phenotype 28 characterizes patients with mild-to-severe kidney disease, indicated by elevated creatinine [39], high urea nitrogen [40], and hyperchloremia [41]. Kidney disease greatly increases the risk of adverse cardiac events like AMI [42]. Lab/vital phenotype 22 describes a patient with mild-to-severe anemia, a risk factor of AMI [43]. The temporal component of the phenotype suggests that this occurs relatively soon before AMI (see Figure 3f). However, the large range of these lab result quantiles limits the utility of the phenotypes. In the future, more precise partitioning of variables may resolve this. Notably, lab/vital phenotype 18 does not describe abnormal physiology. The range of the “Absolute Early Granulocyte Count” encompasses essentially all possible values. A deeper look at the distribution of values in the training data suggests this is a result of too few unique values to make five equally sized quantiles. Additionally, the feature weights in this phenotype are relatively low, indicating weak membership and thus a rather ambiguous phenotype. The relevance of this phenotype to future AMI may be an artifact of the data or methods. While phenotypes using laboratory values and vital signs can be improved, they successfully capture important abnormal physiology across temporal EHRs.
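The quantile failure described for lab/vital phenotype 18 is easy to reproduce: when a variable has too few unique values, equally sized quantile bins collapse. A sketch with pandas' `qcut` on synthetic counts (not the study's granulocyte data):

```python
import pandas as pd

# A count variable dominated by a few discrete values.
values = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 2, 3])

# Requesting 5 quantile bins yields fewer usable bins once the duplicate
# bin edges produced by the repeated values are dropped.
binned = pd.qcut(values, q=5, duplicates="drop")
print(binned.cat.categories.size)  # fewer than the 5 bins requested
```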
Visualizing SHAP values of the latest value and summary statistic features revealed additional risk factors with known clinical relevance. Unlike standard feature importance scores generated by tree-based models, SHAP values indicate the direction of relationships between features and outcomes. Specifically, the SHAP values of the random forest and XGBoost models trained on the “All” feature set suggest several predictive relationships (see Figure 2). These predictive variables include high blood pressure [44], family history of cardiovascular diseases [45], high body mass index [12], smoking [46], and low mean corpuscular hemoglobin levels [47]. Additionally, the SHAP values agree with the other feature importance scores, indicating that the presence of Dx/Rx phenotypes 36 and 47 (cardiometabolic syndrome and back pain, respectively) is predictive of a future AMI. Other features were not consistently highly predictive across multiple replicates or models. The contradictions and variability in the importance of features and their relationship to future AMI events are likely a result of noise within the EHR data used and reflect a typical challenge of predictive machine learning in healthcare. Computational approaches, such as this work, may best serve as a screening method that surfaces specific clinical relationships to be explored in more controlled settings.
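For intuition on how SHAP values carry direction, consider the linear-model special case, where the SHAP value of a feature is its weight times the feature's deviation from the mean. The weights and the patient's values below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two z-scored features, e.g., max systolic blood pressure and mean BMI.
X = rng.normal(size=(100, 2))
w = np.array([0.8, 0.5])  # hypothetical positive risk weights
mean = X.mean(axis=0)

def linear_shap(x):
    """Exact SHAP values for a linear model: weight times deviation from the mean."""
    return w * (x - mean)

patient = np.array([2.0, -1.0])  # elevated blood pressure, below-average BMI
phi = linear_shap(patient)
print(phi)  # first value positive (pushes predicted risk up), second negative
```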
When compared to summary statistics and the most recent recorded data, computational EHR phenotypes can significantly increase interpretability and performance in biomedical machine learning. Across multiple model architectures and feature sets, the phenotypes consistently ranked as the most important features. These results suggest that the historical and temporal information encoded in EHRs is highly relevant for predictive modeling, and specifically for AMI risk assessment. Additionally, we suggest the increased use of tensor decomposition in EHR feature extraction. The employed tensor factorization algorithm mines temporal, high-dimensional EHR data without supervision, removing the need for clinicians to manually curate phenotypes. These phenotypes capture patterns of co-occurring medical variables across time to describe distinct patient profiles, reducing the dimensionality of the EHR data while maintaining interpretability and improving performance. Predictive phenotypes can highlight for clinicians the sets of presenting conditions that may indicate risk of a future AMI, directing more targeted studies to establish association. They also provide information on the timing of conditions, which may prompt further investigation into the progression and evolution of disease, as well as the potential timing of early intervention. Notably, computational phenotypes may be difficult to interpret if they are redundant, have many features with similar weights, or do not make clinical sense. Many improvements upon the base PARAFAC tensor factorization have been made to address these problems specifically for temporal EHR phenotyping [24]. However, in this work, we focused on the baseline approach due to its wide accessibility.
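The PARAFAC (canonical polyadic) factorization underlying the phenotypes can be sketched in a few lines of NumPy alternating least squares: a patients × features × time tensor is approximated by rank-R components, each pairing a feature profile with a temporal signature. The dimensions and data here are synthetic, and the study's implementation may differ:

```python
import numpy as np

def parafac_als(X, rank, n_iter=200, seed=0):
    """Rank-R CP (PARAFAC) decomposition of a 3-way tensor via alternating least squares."""
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A = rng.standard_normal((I, rank))
    B = rng.standard_normal((J, rank))
    C = rng.standard_normal((K, rank))
    # Mode unfoldings: columns enumerate the remaining two modes, last index fastest.
    X1 = X.reshape(I, J * K)
    X2 = X.transpose(1, 0, 2).reshape(J, I * K)
    X3 = X.transpose(2, 0, 1).reshape(K, I * J)
    # Khatri-Rao (column-wise Kronecker) product of two factor matrices.
    kr = lambda U, V: (U[:, None, :] * V[None, :, :]).reshape(-1, U.shape[1])
    for _ in range(n_iter):
        A = X1 @ np.linalg.pinv(kr(B, C).T)
        B = X2 @ np.linalg.pinv(kr(A, C).T)
        C = X3 @ np.linalg.pinv(kr(A, B).T)
    return A, B, C

# Low-rank synthetic "EHR" tensor: 20 patients x 6 features x 8 time bins.
rng = np.random.default_rng(1)
A0, B0, C0 = rng.random((20, 2)), rng.random((6, 2)), rng.random((8, 2))
X = np.einsum("ir,jr,kr->ijk", A0, B0, C0)

A, B, C = parafac_als(X, rank=2)
X_hat = np.einsum("ir,jr,kr->ijk", A, B, C)
print(np.linalg.norm(X - X_hat) / np.linalg.norm(X))  # small relative reconstruction error
```

Each column of B is a feature profile and the matching column of C its temporal signature, which is what the phenotype temporal-factor plots in Figure 3 visualize.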
While we did not identify a singular superior interpretable machine learning model, we identified several strengths and weaknesses. Overall, random forest, logistic regression, and TGFNN performed best, and all models exhibited good calibration. A similar AMI prediction study [11] reported poor model calibration, likely, as its authors state, a result of very severe class imbalance; in this work, we limited class imbalance via the downsampling of matched negative samples and class-weighted loss functions. Some models, like EBM and TabNet, displayed very poor recall and F1 scores due to biased class predictions. They also showed low concordance in feature importance between cross-validation instances. It is likely that these models were not well suited to this particular dataset and task. The TGFNN provides clear rules for predictions, making it perhaps the most interpretable of the models. The rule-based nature of TGFNN reflects well how clinicians make decisions and identify patterns, and we anticipate that the further development of TGFNN and other interpretable rule-based models will aid clinical adoption. Still, in the example presented in Figure 5, interpretation can be difficult if “medium” concepts cannot be clarified. Logistic regression and random forest showed some of the best performance and consistency of feature importance. These models are often too simplistic to solve difficult tasks; in this case, however, deriving features from computational phenotypes improved their performance.
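The two imbalance mitigations credited above for good calibration, downsampling matched controls and a class-weighted loss, can be sketched as follows. The data are synthetic; the 1:2 case:control downsampling mirrors the cohort design, and scikit-learn's `class_weight="balanced"` stands in for whatever weighting each model used:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_cases, n_controls = 300, 3000
X = np.vstack([rng.normal(0.5, 1.0, (n_cases, 3)),
               rng.normal(0.0, 1.0, (n_controls, 3))])
y = np.array([1] * n_cases + [0] * n_controls)

# 1) Downsample controls to two per case, approximating the matched cohort.
keep = rng.choice(np.where(y == 0)[0], size=2 * n_cases, replace=False)
idx = np.concatenate([np.where(y == 1)[0], keep])
X_ds, y_ds = X[idx], y[idx]

# 2) Class-weighted loss: errors on the rarer class are penalized more.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_ds, y_ds)
print(X_ds.shape, clf.predict_proba(X_ds[:1]))
```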
This study has several limitations that affect the applicability and bias of its results. First, the employed cohort of patients comes from a single hospital system and is predominantly elderly and white. Second, we excluded data on procedures received by patients. Third, we employed mRMR feature selection, which may not find the optimal set of features. Additionally, the interpretation of important features showed high variability between and even within models. We note that while similar studies attempting to predict AMI, such as the work carried out in [11], show higher AUROC values, this work attempts a potentially more difficult task: predicting AMI events within a cohort without pre-existing cardiac conditions. Future work could address these limitations by expanding the cohort inclusion criteria, incorporating data from multiple healthcare systems, and using computational methods to explore the causal relationships between clinical features and AMI onset.
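For reference, mRMR-style selection greedily balances relevance against redundancy, which is why it may miss the optimal feature set. A simplified sketch using absolute Pearson correlation in place of the mutual-information criteria of full mRMR implementations:

```python
import numpy as np

def mrmr_select(X, y, k):
    """Greedily pick k columns maximizing relevance to y minus mean redundancy."""
    n_features = X.shape[1]
    relevance = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_features)])
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            redundancy = np.mean([abs(np.corrcoef(X[:, j], X[:, s])[0, 1]) for s in selected])
            score = relevance[j] - redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected

rng = np.random.default_rng(0)
informative = rng.normal(size=(200, 1))
# Feature 1 is a near-duplicate of feature 0; features 2-4 are noise.
X = np.hstack([informative, informative + 0.01 * rng.normal(size=(200, 1)),
               rng.normal(size=(200, 3))])
y = (informative[:, 0] > 0).astype(int)

print(mrmr_select(X, y, k=2))  # the second pick avoids the near-duplicate of the first
```

Because the search is greedy, a set of individually weaker but jointly informative features can be overlooked, which is the limitation noted above.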
In conclusion, we suggest that temporal, computational phenotyping can improve the utility of outpatient EHRs in both predicting the risk of AMI in otherwise low-risk patients and identifying novel risk factors for further investigation. Additionally, we demonstrate that interpretable machine learning models can consistently identify important risk factors and accurately predict a future AMI event in patients without pre-existing cardiovascular conditions, using only outpatient data. We note that model-derived feature importance scores may be discordant, and encourage researchers to validate findings. We anticipate that these findings will promote further development in computational and machine learning approaches to identify novel phenotypes that can aid clinicians in understanding, predicting, and preventing AMI and subsequent hospitalization.

Author Contributions

Conceptualization, M.H. and K.N.; methodology, M.H., C.M. and M.M.; software, M.H.; validation, M.M.; formal analysis, M.H.; investigation, M.H.; resources, M.H.; data curation, M.H.; writing—original draft preparation, M.H.; writing—review and editing, M.H., C.M. and M.M.; visualization, M.H.; supervision, E.W. and K.N.; project administration, M.H. and E.W.; funding acquisition, K.N. All authors have read and agreed to the published version of the manuscript.

Funding

The authors declare that this study received funding from the Collaborative Safety Research Center at Toyota Motor Engineering & Manufacturing North America, Inc. The funder was not involved in the study design, collection, analysis, or interpretation of data, the writing of this article, or the decision to submit it for publication.

Institutional Review Board Statement

This study was approved by the institutional review board at the University of Michigan (HUM00228545).

Informed Consent Statement

Informed consent was waived due to the retrospective nature of the project.

Data Availability Statement

The datasets generated and/or analyzed during the current study were collected at Michigan Medicine. The University of Michigan’s Innovation Partnerships (UMIP) unit will handle potential charges/arrangements of the use of data by external entities, using such methods as material transfer agreements. Please contact UMIP ([email protected]) for data inquiries.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AMI: Acute Myocardial Infarction
EHR: Electronic Health Record
LR: Logistic Regression
RF: Random Forest
TGFNN: Tropical Geometry Fuzzy Neural Network
EBM: Explainable Boosting Machine
XGBoost: eXtreme Gradient Boosting
XGB: eXtreme Gradient Boosting
TNET: TabNet
AUROC: Area Under the Receiver Operating Characteristic Curve
AUPRC: Area Under the Precision Recall Curve
mRMR: Minimum Redundancy Maximum Relevance
ICD10: 10th revision of the International Statistical Classification of Diseases
SHAP: SHapley Additive exPlanations
NSAID: Non-Steroidal Anti-Inflammatory Drug

Appendix A

Figure A1. The predictive performance of random forest using features from different rank decompositions. The lines (blue = training set, orange = validation set) show the mean score of three replicates, with shaded regions showing 95% confidence intervals. (a) Mean AUROC of lab/vital phenotypes; (b) mean AUPRC of lab/vital phenotypes; (c) mean AUROC of Dx/Rx phenotypes; (d) mean AUPRC of Dx/Rx phenotypes.
Figure A2. The predictive performance of random forest using different-sized feature sets selected by mRMR. The lines (blue = training set, orange = validation set) show the mean score of three replicates, with shaded regions showing 95% confidence intervals. (a) AUROC for phenotypes; (b) AUPRC for phenotypes; (c) F1 for phenotypes; (d) AUROC for latest, demo.; (e) AUPRC for latest, demo.; (f) F1 for latest, demo.; (g) AUROC for summary statistics; (h) AUPRC for summary statistics; (i) F1 for summary statistics.
Figure A3. The predictive performance of random forest using different-sized feature sets selected by mRMR. The lines (blue = training set, orange = validation set) show the mean score of three replicates, with shaded regions showing 95% confidence intervals. (a) AUROC for latest, demo., statistics; (b) AUPRC for latest, demo., statistics; (c) F1 for latest, demo., statistics; (d) AUROC for latest, demo., phenotypes; (e) AUPRC for latest, demo., phenotypes; (f) F1 for latest, demo., phenotypes; (g) AUROC for All; (h) AUPRC for All; (i) F1 for All.
Table A1. Model cross-validation results on training set.
Feature Set | Model | AUROC | AUPRC | F1 | Precision | Recall
All | DT | 0.682 ± 0.01 | 0.514 ± 0.024 | 0.526 ± 0.023 | 0.488 ± 0.038 | 0.577 ± 0.061
All | EBM | 0.796 ± 0.011 | 0.669 ± 0.014 | 0.418 ± 0.047 | 0.776 ± 0.016 | 0.288 ± 0.045
All | LR | 0.677 ± 0.01 | 0.512 ± 0.011 | 0.528 ± 0.009 | 0.479 ± 0.01 | 0.588 ± 0.009
All | RF | 0.729 ± 0.004 | 0.562 ± 0.006 | 0.575 ± 0.004 | 0.505 ± 0.005 | 0.666 ± 0.006
All | TGFNN | 0.647 ± 0.027 | 0.489 ± 0.025 | 0.458 ± 0.04 | 0.478 ± 0.048 | 0.455 ± 0.096
All | TNET | 0.672 ± 0.019 | 0.513 ± 0.05 | 0.318 ± 0.097 | 0.611 ± 0.018 | 0.224 ± 0.094
All | XGB | 0.682 ± 0.006 | 0.511 ± 0.005 | 0.526 ± 0.009 | 0.462 ± 0.006 | 0.612 ± 0.024
Latest, demo., phenotypes | DT | 0.712 ± 0.011 | 0.547 ± 0.011 | 0.557 ± 0.009 | 0.484 ± 0.029 | 0.664 ± 0.064
Latest, demo., phenotypes | EBM | 0.736 ± 0.017 | 0.591 ± 0.022 | 0.232 ± 0.134 | 0.783 ± 0.123 | 0.146 ± 0.092
Latest, demo., phenotypes | LR | 0.655 ± 0.011 | 0.482 ± 0.011 | 0.505 ± 0.012 | 0.458 ± 0.014 | 0.562 ± 0.009
Latest, demo., phenotypes | RF | 0.698 ± 0.005 | 0.529 ± 0.006 | 0.545 ± 0.001 | 0.475 ± 0.004 | 0.638 ± 0.005
Latest, demo., phenotypes | TGFNN | 0.537 ± 0.008 | 0.36 ± 0.005 | 0.464 ± 0.008 | 0.343 ± 0.003 | 0.717 ± 0.031
Latest, demo., phenotypes | TNET | 0.674 ± 0.015 | 0.511 ± 0.039 | 0.331 ± 0.109 | 0.596 ± 0.009 | 0.241 ± 0.117
Latest, demo., phenotypes | XGB | 0.652 ± 0.01 | 0.469 ± 0.014 | 0.506 ± 0.008 | 0.445 ± 0.008 | 0.586 ± 0.017
Latest, demo., statistics | DT | 0.65 ± 0.01 | 0.477 ± 0.008 | 0.526 ± 0.011 | 0.429 ± 0.02 | 0.686 ± 0.071
Latest, demo., statistics | EBM | 0.715 ± 0.013 | 0.568 ± 0.018 | 0.19 ± 0.094 | 0.771 ± 0.05 | 0.113 ± 0.063
Latest, demo., statistics | LR | 0.635 ± 0.01 | 0.475 ± 0.012 | 0.48 ± 0.005 | 0.438 ± 0.01 | 0.532 ± 0.007
Latest, demo., statistics | RF | 0.708 ± 0.007 | 0.559 ± 0.004 | 0.545 ± 0.005 | 0.489 ± 0.012 | 0.617 ± 0.01
Latest, demo., statistics | TGFNN | 0.634 ± 0.015 | 0.479 ± 0.017 | 0.506 ± 0.006 | 0.344 ± 0.012 | 0.96 ± 0.052
Latest, demo., statistics | TNET | 0.616 ± 0.02 | 0.458 ± 0.058 | 0.246 ± 0.093 | 0.563 ± 0.016 | 0.164 ± 0.084
Latest, demo., statistics | XGB | 0.716 ± 0.01 | 0.568 ± 0.008 | 0.549 ± 0.013 | 0.502 ± 0.012 | 0.607 ± 0.021
Latest, demographics | DT | 0.652 ± 0.012 | 0.489 ± 0.021 | 0.508 ± 0.018 | 0.445 ± 0.018 | 0.595 ± 0.063
Latest, demographics | EBM | 0.646 ± 0.018 | 0.49 ± 0.019 | 0.051 ± 0.052 | 0.867 ± 0.126 | 0.027 ± 0.028
Latest, demographics | LR | 0.632 ± 0.016 | 0.476 ± 0.019 | 0.469 ± 0.019 | 0.44 ± 0.013 | 0.504 ± 0.027
Latest, demographics | RF | 0.627 ± 0.013 | 0.449 ± 0.013 | 0.504 ± 0.01 | 0.423 ± 0.012 | 0.622 ± 0.017
Latest, demographics | TGFNN | 0.583 ± 0.035 | 0.402 ± 0.028 | 0.381 ± 0.213 | 0.311 ± 0.174 | 0.493 ± 0.277
Latest, demographics | TNET | 0.655 ± 0.017 | 0.493 ± 0.048 | 0.193 ± 0.102 | 0.673 ± 0.024 | 0.119 ± 0.08
Latest, demographics | XGB | 0.613 ± 0.015 | 0.421 ± 0.013 | 0.465 ± 0.012 | 0.421 ± 0.013 | 0.519 ± 0.026
Phenotypes | DT | 0.684 ± 0.012 | 0.498 ± 0.017 | 0.542 ± 0.007 | 0.482 ± 0.035 | 0.625 ± 0.057
Phenotypes | EBM | 0.726 ± 0.033 | 0.579 ± 0.044 | 0.158 ± 0.162 | 0.85 ± 0.116 | 0.099 ± 0.111
Phenotypes | LR | 0.631 ± 0.018 | 0.456 ± 0.018 | 0.484 ± 0.009 | 0.442 ± 0.018 | 0.535 ± 0.02
Phenotypes | RF | 0.713 ± 0.01 | 0.559 ± 0.01 | 0.552 ± 0.007 | 0.497 ± 0.007 | 0.621 ± 0.012
Phenotypes | TGFNN | 0.628 ± 0.013 | 0.443 ± 0.011 | 0.498 ± 0.029 | 0.384 ± 0.042 | 0.776 ± 0.223
Phenotypes | TNET | 0.631 ± 0.028 | 0.457 ± 0.061 | 0.276 ± 0.201 | 0.531 ± 0.033 | 0.242 ± 0.235
Phenotypes | XGB | 0.726 ± 0.012 | 0.568 ± 0.021 | 0.563 ± 0.013 | 0.514 ± 0.016 | 0.623 ± 0.011
Summary statistics | DT | 0.664 ± 0.006 | 0.512 ± 0.011 | 0.462 ± 0.032 | 0.531 ± 0.06 | 0.421 ± 0.073
Summary statistics | EBM | 0.717 ± 0.031 | 0.575 ± 0.043 | 0.162 ± 0.13 | 0.846 ± 0.099 | 0.096 ± 0.084
Summary statistics | LR | 0.629 ± 0.01 | 0.478 ± 0.012 | 0.464 ± 0.007 | 0.44 ± 0.014 | 0.492 ± 0.004
Summary statistics | RF | 0.711 ± 0.004 | 0.567 ± 0.005 | 0.555 ± 0.005 | 0.486 ± 0.007 | 0.646 ± 0.01
Summary statistics | TGFNN | 0.633 ± 0.02 | 0.477 ± 0.023 | 0.506 ± 0.004 | 0.381 ± 0.056 | 0.828 ± 0.212
Summary statistics | TNET | 0.592 ± 0.013 | 0.435 ± 0.077 | 0.164 ± 0.099 | 0.592 ± 0.043 | 0.1 ± 0.069
Summary statistics | XGB | 0.721 ± 0.009 | 0.572 ± 0.008 | 0.559 ± 0.006 | 0.512 ± 0.011 | 0.615 ± 0.003
Table A2. Model cross-validation results on validation set.
Feature Set | Model | AUROC | AUPRC | F1 | Precision | Recall
All | DT | 0.596 ± 0.033 | 0.422 ± 0.042 | 0.444 ± 0.029 | 0.424 ± 0.045 | 0.479 ± 0.078
All | EBM | 0.671 ± 0.043 | 0.508 ± 0.061 | 0.292 ± 0.034 | 0.622 ± 0.13 | 0.197 ± 0.041
All | LR | 0.634 ± 0.032 | 0.469 ± 0.05 | 0.483 ± 0.045 | 0.448 ± 0.039 | 0.534 ± 0.089
All | RF | 0.643 ± 0.045 | 0.466 ± 0.06 | 0.495 ± 0.016 | 0.45 ± 0.046 | 0.56 ± 0.054
All | TGFNN | 0.614 ± 0.042 | 0.457 ± 0.053 | 0.42 ± 0.077 | 0.441 ± 0.039 | 0.42 ± 0.138
All | TNET | 0.566 ± 0.018 | 0.387 ± 0.224 | 0.156 ± 0.108 | 0.442 ± 0.252 | 0.1 ± 0.079
All | XGB | 0.583 ± 0.025 | 0.414 ± 0.034 | 0.439 ± 0.042 | 0.396 ± 0.028 | 0.504 ± 0.099
Latest, demo., phenotypes | DT | 0.586 ± 0.028 | 0.414 ± 0.03 | 0.452 ± 0.029 | 0.405 ± 0.038 | 0.522 ± 0.085
Latest, demo., phenotypes | EBM | 0.642 ± 0.034 | 0.469 ± 0.053 | 0.165 ± 0.104 | 0.518 ± 0.125 | 0.106 ± 0.073
Latest, demo., phenotypes | LR | 0.632 ± 0.039 | 0.465 ± 0.053 | 0.48 ± 0.049 | 0.447 ± 0.049 | 0.528 ± 0.085
Latest, demo., phenotypes | RF | 0.64 ± 0.048 | 0.464 ± 0.06 | 0.496 ± 0.023 | 0.443 ± 0.049 | 0.571 ± 0.049
Latest, demo., phenotypes | TGFNN | 0.532 ± 0.014 | 0.357 ± 0.007 | 0.46 ± 0.026 | 0.339 ± 0.009 | 0.718 ± 0.086
Latest, demo., phenotypes | TNET | 0.576 ± 0.02 | 0.39 ± 0.226 | 0.156 ± 0.111 | 0.467 ± 0.272 | 0.098 ± 0.072
Latest, demo., phenotypes | XGB | 0.606 ± 0.024 | 0.425 ± 0.026 | 0.467 ± 0.023 | 0.412 ± 0.019 | 0.546 ± 0.069
Latest, demo., statistics | DT | 0.596 ± 0.03 | 0.411 ± 0.022 | 0.485 ± 0.033 | 0.395 ± 0.026 | 0.633 ± 0.075
Latest, demo., statistics | EBM | 0.641 ± 0.041 | 0.487 ± 0.048 | 0.152 ± 0.045 | 0.641 ± 0.128 | 0.088 ± 0.031
Latest, demo., statistics | LR | 0.619 ± 0.039 | 0.464 ± 0.05 | 0.464 ± 0.049 | 0.427 ± 0.034 | 0.517 ± 0.096
Latest, demo., statistics | RF | 0.634 ± 0.039 | 0.472 ± 0.052 | 0.474 ± 0.04 | 0.425 ± 0.032 | 0.543 ± 0.081
Latest, demo., statistics | TGFNN | 0.615 ± 0.037 | 0.461 ± 0.05 | 0.507 ± 0.009 | 0.344 ± 0.013 | 0.962 ± 0.05
Latest, demo., statistics | TNET | 0.562 ± 0.022 | 0.391 ± 0.229 | 0.146 ± 0.086 | 0.472 ± 0.282 | 0.087 ± 0.054
Latest, demo., statistics | XGB | 0.594 ± 0.013 | 0.431 ± 0.022 | 0.437 ± 0.019 | 0.397 ± 0.007 | 0.488 ± 0.044
Latest, demographics | DT | 0.575 ± 0.03 | 0.398 ± 0.03 | 0.456 ± 0.032 | 0.4 ± 0.03 | 0.535 ± 0.074
Latest, demographics | EBM | 0.617 ± 0.059 | 0.46 ± 0.072 | 0.025 ± 0.036 | 0.556 ± 0.37 | 0.013 ± 0.02
Latest, demographics | LR | 0.618 ± 0.063 | 0.464 ± 0.075 | 0.444 ± 0.1 | 0.42 ± 0.048 | 0.495 ± 0.168
Latest, demographics | RF | 0.602 ± 0.048 | 0.423 ± 0.054 | 0.454 ± 0.095 | 0.399 ± 0.039 | 0.551 ± 0.177
Latest, demographics | TGFNN | 0.57 ± 0.051 | 0.391 ± 0.049 | 0.353 ± 0.217 | 0.298 ± 0.168 | 0.476 ± 0.352
Latest, demographics | TNET | 0.556 ± 0.029 | 0.385 ± 0.228 | 0.106 ± 0.074 | 0.513 ± 0.298 | 0.06 ± 0.042
Latest, demographics | XGB | 0.584 ± 0.047 | 0.4 ± 0.046 | 0.45 ± 0.05 | 0.396 ± 0.052 | 0.531 ± 0.086
Phenotypes | DT | 0.584 ± 0.048 | 0.408 ± 0.047 | 0.462 ± 0.053 | 0.411 ± 0.035 | 0.53 ± 0.089
Phenotypes | EBM | 0.632 ± 0.054 | 0.462 ± 0.065 | 0.106 ± 0.111 | 0.428 ± 0.26 | 0.07 ± 0.084
Phenotypes | LR | 0.603 ± 0.065 | 0.434 ± 0.071 | 0.465 ± 0.046 | 0.42 ± 0.059 | 0.523 ± 0.037
Phenotypes | RF | 0.64 ± 0.064 | 0.475 ± 0.078 | 0.493 ± 0.035 | 0.445 ± 0.062 | 0.559 ± 0.028
Phenotypes | TGFNN | 0.605 ± 0.064 | 0.426 ± 0.069 | 0.5 ± 0.042 | 0.393 ± 0.057 | 0.772 ± 0.235
Phenotypes | TNET | 0.547 ± 0.024 | 0.363 ± 0.331 | 0.069 ± 0.073 | 0.381 ± 0.35 | 0.038 ± 0.041
Phenotypes | XGB | 0.607 ± 0.045 | 0.444 ± 0.066 | 0.47 ± 0.018 | 0.427 ± 0.037 | 0.528 ± 0.032
Summary statistics | DT | 0.552 ± 0.02 | 0.382 ± 0.012 | 0.357 ± 0.044 | 0.408 ± 0.06 | 0.336 ± 0.084
Summary statistics | EBM | 0.631 ± 0.046 | 0.477 ± 0.048 | 0.126 ± 0.101 | 0.685 ± 0.209 | 0.078 ± 0.07
Summary statistics | LR | 0.605 ± 0.04 | 0.454 ± 0.047 | 0.447 ± 0.017 | 0.418 ± 0.036 | 0.484 ± 0.029
Summary statistics | RF | 0.613 ± 0.041 | 0.454 ± 0.049 | 0.465 ± 0.02 | 0.404 ± 0.035 | 0.553 ± 0.051
Summary statistics | TGFNN | 0.602 ± 0.044 | 0.447 ± 0.047 | 0.484 ± 0.031 | 0.353 ± 0.018 | 0.817 ± 0.223
Summary statistics | TNET | 0.544 ± 0.021 | 0.361 ± 0.33 | 0.051 ± 0.049 | 0.44 ± 0.403 | 0.027 ± 0.026
Summary statistics | XGB | 0.591 ± 0.02 | 0.432 ± 0.023 | 0.437 ± 0.024 | 0.4 ± 0.027 | 0.485 ± 0.049
Table A3. Model cross-validation results on test set.
Feature Set | Model | AUROC | AUPRC | F1 | Precision | Recall
All | DT | 0.583 ± 0.018 | 0.407 ± 0.021 | 0.432 ± 0.033 | 0.4 ± 0.016 | 0.478 ± 0.079
All | EBM | 0.674 ± 0.003 | 0.494 ± 0.004 | 0.294 ± 0.033 | 0.558 ± 0.025 | 0.202 ± 0.034
All | LR | 0.663 ± 0.005 | 0.478 ± 0.003 | 0.503 ± 0.006 | 0.446 ± 0.005 | 0.576 ± 0.016
All | RF | 0.652 ± 0.002 | 0.458 ± 0.004 | 0.5 ± 0.005 | 0.439 ± 0.006 | 0.581 ± 0.014
All | TGFNN | 0.636 ± 0.016 | 0.463 ± 0.01 | 0.441 ± 0.03 | 0.46 ± 0.043 | 0.438 ± 0.092
All | TNET | 0.615 ± 0.02 | 0.429 ± 0.017 | 0.253 ± 0.091 | 0.49 ± 0.033 | 0.182 ± 0.094
All | XGB | 0.609 ± 0.011 | 0.416 ± 0.01 | 0.465 ± 0.023 | 0.405 ± 0.01 | 0.546 ± 0.049
Latest, demo., phenotypes | DT | 0.596 ± 0.008 | 0.412 ± 0.016 | 0.464 ± 0.018 | 0.404 ± 0.018 | 0.551 ± 0.067
Latest, demo., phenotypes | EBM | 0.662 ± 0.007 | 0.479 ± 0.006 | 0.18 ± 0.106 | 0.597 ± 0.045 | 0.113 ± 0.071
Latest, demo., phenotypes | LR | 0.661 ± 0.003 | 0.478 ± 0.004 | 0.502 ± 0.01 | 0.45 ± 0.003 | 0.567 ± 0.022
Latest, demo., phenotypes | RF | 0.651 ± 0.004 | 0.459 ± 0.005 | 0.5 ± 0.005 | 0.439 ± 0.007 | 0.58 ± 0.011
Latest, demo., phenotypes | TGFNN | 0.537 ± 0.007 | 0.349 ± 0.003 | 0.465 ± 0.007 | 0.341 ± 0.003 | 0.732 ± 0.026
Latest, demo., phenotypes | TNET | 0.617 ± 0.01 | 0.438 ± 0.016 | 0.274 ± 0.095 | 0.499 ± 0.039 | 0.204 ± 0.113
Latest, demo., phenotypes | XGB | 0.597 ± 0.013 | 0.411 ± 0.008 | 0.45 ± 0.015 | 0.395 ± 0.01 | 0.524 ± 0.031
Latest, demo., statistics | DT | 0.575 ± 0.007 | 0.389 ± 0.006 | 0.464 ± 0.019 | 0.374 ± 0.009 | 0.618 ± 0.074
Latest, demo., statistics | EBM | 0.631 ± 0.004 | 0.458 ± 0.004 | 0.151 ± 0.058 | 0.575 ± 0.056 | 0.089 ± 0.04
Latest, demo., statistics | LR | 0.623 ± 0.004 | 0.452 ± 0.004 | 0.468 ± 0.009 | 0.416 ± 0.002 | 0.536 ± 0.025
Latest, demo., statistics | RF | 0.625 ± 0.004 | 0.453 ± 0.005 | 0.467 ± 0.004 | 0.41 ± 0.007 | 0.544 ± 0.015
Latest, demo., statistics | TGFNN | 0.619 ± 0.002 | 0.452 ± 0.003 | 0.498 ± 0.002 | 0.337 ± 0.01 | 0.955 ± 0.06
Latest, demo., statistics | TNET | 0.575 ± 0.007 | 0.4 ± 0.01 | 0.209 ± 0.081 | 0.481 ± 0.039 | 0.142 ± 0.077
Latest, demo., statistics | XGB | 0.591 ± 0.004 | 0.407 ± 0.003 | 0.444 ± 0.006 | 0.4 ± 0.009 | 0.5 ± 0.012
Latest, demographics | DT | 0.581 ± 0.011 | 0.402 ± 0.013 | 0.451 ± 0.02 | 0.391 ± 0.013 | 0.539 ± 0.065
Latest, demographics | EBM | 0.619 ± 0.003 | 0.456 ± 0.003 | 0.04 ± 0.039 | 0.747 ± 0.153 | 0.021 ± 0.021
Latest, demographics | LR | 0.622 ± 0.002 | 0.454 ± 0.003 | 0.459 ± 0.009 | 0.427 ± 0.007 | 0.496 ± 0.02
Latest, demographics | RF | 0.603 ± 0.004 | 0.415 ± 0.005 | 0.471 ± 0.016 | 0.396 ± 0.007 | 0.585 ± 0.049
Latest, demographics | TGFNN | 0.586 ± 0.043 | 0.412 ± 0.04 | 0.385 ± 0.215 | 0.314 ± 0.176 | 0.499 ± 0.282
Latest, demographics | TNET | 0.594 ± 0.007 | 0.424 ± 0.006 | 0.166 ± 0.081 | 0.553 ± 0.069 | 0.105 ± 0.072
Latest, demographics | XGB | 0.587 ± 0.004 | 0.388 ± 0.008 | 0.443 ± 0.014 | 0.394 ± 0.004 | 0.508 ± 0.034
Phenotypes | DT | 0.597 ± 0.006 | 0.403 ± 0.01 | 0.462 ± 0.014 | 0.41 ± 0.019 | 0.534 ± 0.055
Phenotypes | EBM | 0.646 ± 0.005 | 0.454 ± 0.007 | 0.098 ± 0.117 | 0.296 ± 0.27 | 0.063 ± 0.08
Phenotypes | LR | 0.631 ± 0.003 | 0.44 ± 0.004 | 0.477 ± 0.004 | 0.43 ± 0.006 | 0.536 ± 0.017
Phenotypes | RF | 0.651 ± 0.002 | 0.463 ± 0.006 | 0.493 ± 0.004 | 0.442 ± 0.004 | 0.56 ± 0.015
Phenotypes | TGFNN | 0.616 ± 0.005 | 0.421 ± 0.01 | 0.488 ± 0.039 | 0.377 ± 0.039 | 0.766 ± 0.236
Phenotypes | TNET | 0.601 ± 0.019 | 0.413 ± 0.016 | 0.231 ± 0.184 | 0.442 ± 0.071 | 0.224 ± 0.237
Phenotypes | XGB | 0.627 ± 0.013 | 0.441 ± 0.01 | 0.463 ± 0.014 | 0.425 ± 0.011 | 0.508 ± 0.025
Summary statistics | DT | 0.557 ± 0.012 | 0.384 ± 0.012 | 0.366 ± 0.043 | 0.413 ± 0.039 | 0.341 ± 0.077
Summary statistics | EBM | 0.621 ± 0.001 | 0.453 ± 0.005 | 0.138 ± 0.102 | 0.647 ± 0.104 | 0.085 ± 0.068
Summary statistics | LR | 0.609 ± 0.004 | 0.445 ± 0.004 | 0.458 ± 0.005 | 0.415 ± 0.005 | 0.51 ± 0.006
Summary statistics | RF | 0.615 ± 0.004 | 0.44 ± 0.006 | 0.475 ± 0.003 | 0.407 ± 0.005 | 0.572 ± 0.015
Summary statistics | TGFNN | 0.601 ± 0.007 | 0.436 ± 0.008 | 0.486 ± 0.017 | 0.359 ± 0.035 | 0.816 ± 0.227
Summary statistics | TNET | 0.556 ± 0.017 | 0.385 ± 0.011 | 0.147 ± 0.094 | 0.464 ± 0.032 | 0.095 ± 0.071
Summary statistics | XGB | 0.578 ± 0.009 | 0.395 ± 0.011 | 0.436 ± 0.014 | 0.392 ± 0.007 | 0.491 ± 0.026
Table A4. Results from Friedman’s test on feature set mean model performance.
Metric | p-Value
AUROC | 2.79 × 10⁻²⁰
AUPRC | 1.86 × 10⁻¹⁰
F1 | 3.68 × 10⁻⁷
Precision | 1.54 × 10⁻⁴
Recall | 5.04 × 10⁻⁴
Table A5. Statistically significant Nemenyi test results on pairwise feature set mean model performance comparison (alpha = 0.05).
Metric | Feature Set 1 | Feature Set 2 | p-Value
AUROC | Phenotypes | Summary statistics | 0.001
AUROC | Phenotypes | Latest, demographics | 0.001
AUROC | Summary statistics | Latest, demo., phenotypes | 0.001
AUROC | Latest, demographics | Latest, demo., phenotypes | 0.001
AUROC | Phenotypes | Latest, demo., statistics | 0.004
AUROC | Summary statistics | Latest, demo., statistics | 0.009
AUROC | Latest, demo., phenotypes | Latest, demo., statistics | 0.014
AUROC | Summary statistics | All | 0.001
AUROC | Latest, demographics | All | 0.001
AUROC | Latest, demo., statistics | All | 0.001
AUPRC | Summary statistics | Latest, demo., phenotypes | 0.001
AUPRC | Latest, demographics | Latest, demo., phenotypes | 0.006
AUPRC | Latest, demo., phenotypes | Latest, demo., statistics | 0.047
AUPRC | Phenotypes | All | 0.005
AUPRC | Summary statistics | All | 0.001
AUPRC | Latest, demographics | All | 0.001
AUPRC | Latest, demo., statistics | All | 0.001
F1 | Phenotypes | Summary statistics | 0.002
F1 | Phenotypes | Latest, demographics | 0.003
F1 | Summary statistics | Latest, demo., phenotypes | 0.001
F1 | Latest, demographics | Latest, demo., phenotypes | 0.001
F1 | Summary statistics | All | 0.001
F1 | Latest, demographics | All | 0.001
Precision | Latest, demo., phenotypes | Latest, demo., statistics | 0.010
Precision | Summary statistics | All | 0.007
Precision | Latest, demo., statistics | All | 0.001
Recall | Summary statistics | Latest, demo., phenotypes | 0.004
Recall | Latest, demographics | Latest, demo., phenotypes | 0.009
Figure A4. Results of cross-validation-trained models. These plots show the mean test set scores of each model on each feature set. Error bars indicate the standard deviation. The lower limits of the x-axes of the AUROC and AUPRC plots are set to the worst-case scenario of random guessing. (a) Mean test AUROC; (b) mean test AUPRC; (c) mean test precision; (d) mean test recall; (e) mean test F1.
Table A6. Results from Friedman’s test on mean model performance on all feature sets.
Metric | p-Value
AUROC | 7.24 × 10⁻²⁷
AUPRC | 1.06 × 10⁻²³
F1 | 5.52 × 10⁻²⁷
Precision | 4.66 × 10⁻²²
Recall | 5.17 × 10⁻²⁵
Table A7. Statistically significant Nemenyi test results on pairwise model mean performance comparison (alpha = 0.05).
Metric | Model 1 | Model 2 | p-Value
AUROC | RF | TGFNN | 0.004
AUROC | LR | TGFNN | 0.001
AUROC | RF | XGB | 0.001
AUROC | LR | XGB | 0.001
AUROC | RF | TNET | 0.001
AUROC | LR | TNET | 0.001
AUROC | TGFNN | EBM | 0.001
AUROC | XGB | EBM | 0.001
AUROC | TNET | EBM | 0.001
AUROC | RF | DT | 0.001
AUROC | LR | DT | 0.001
AUROC | TGFNN | DT | 0.045
AUROC | EBM | DT | 0.001
AUPRC | LR | TGFNN | 0.018
AUPRC | RF | XGB | 0.001
AUPRC | LR | XGB | 0.001
AUPRC | RF | TNET | 0.004
AUPRC | LR | TNET | 0.001
AUPRC | RF | EBM | 0.021
AUPRC | TGFNN | EBM | 0.001
AUPRC | XGB | EBM | 0.001
AUPRC | TNET | EBM | 0.001
AUPRC | RF | DT | 0.001
AUPRC | LR | DT | 0.001
AUPRC | TGFNN | DT | 0.018
AUPRC | EBM | DT | 0.001
F1 | RF | XGB | 0.001
F1 | TGFNN | XGB | 0.026
F1 | RF | TNET | 0.001
F1 | LR | TNET | 0.001
F1 | TGFNN | TNET | 0.001
F1 | XGB | TNET | 0.003
F1 | RF | EBM | 0.001
F1 | LR | EBM | 0.001
F1 | TGFNN | EBM | 0.001
F1 | XGB | EBM | 0.001
F1 | RF | DT | 0.002
F1 | TGFNN | DT | 0.037
F1 | TNET | DT | 0.002
F1 | EBM | DT | 0.001
Precision | RF | TGFNN | 0.008
Precision | LR | TGFNN | 0.001
Precision | LR | XGB | 0.014
Precision | TGFNN | TNET | 0.001
Precision | XGB | TNET | 0.001
Precision | RF | EBM | 0.001
Precision | LR | EBM | 0.010
Precision | TGFNN | EBM | 0.001
Precision | XGB | EBM | 0.001
Precision | RF | DT | 0.037
Precision | LR | DT | 0.001
Precision | TNET | DT | 0.001
Precision | EBM | DT | 0.001
Recall | RF | XGB | 0.023
Recall | TGFNN | XGB | 0.005
Recall | RF | TNET | 0.001
Recall | LR | TNET | 0.001
Recall | TGFNN | TNET | 0.001
Recall | XGB | TNET | 0.002
Recall | RF | EBM | 0.001
Recall | LR | EBM | 0.001
Recall | TGFNN | EBM | 0.001
Recall | XGB | EBM | 0.001
Recall | TGFNN | DT | 0.045
Recall | TNET | DT | 0.001
Recall | EBM | DT | 0.001

Figure 1. Calibration plots of best model replicate on the “All” feature set, according to F1 score. The diagonal dotted line indicates a classifier with perfect calibration. Samples are grouped into 10 uniformly sized bins, with empty bins excluded. Each point on the curve contrasts the mean model-predicted probability of being a positive sample, with the actual frequency of positive samples, within the bin. (a) Decision tree; (b) logistic regression; (c) random forest; (d) XGBoost; (e) EBM; (f) TabNet; (g) TGFNN.
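The binned calibration curves of Figure 1 can be computed with scikit-learn's `calibration_curve`; the labels and predicted probabilities below are synthetic, constructed so the classifier is well calibrated by design.

```python
# Calibration curve with 10 uniformly sized probability bins, as in Figure 1;
# empty bins are dropped automatically by scikit-learn.
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
y_prob = rng.uniform(0, 1, size=1000)  # model-predicted positive-class probabilities
y_true = (rng.uniform(0, 1, size=1000) < y_prob).astype(int)  # calibrated by construction

frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10, strategy="uniform")
for mp, fp in zip(mean_pred, frac_pos):
    print(f"mean predicted {mp:.2f} -> observed positive frequency {fp:.2f}")
```

A perfectly calibrated model yields points on the diagonal, i.e., observed frequency equal to mean predicted probability in every bin.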
Figure 2. SHAP values of the top 10 features, averaged across random forest and XGBoost model replicates trained on the “All” feature set. Each point represents a sample (patient), positioned relative to the impact of the feature on whether the model predicted the positive (AMI) or negative class. The color of each point reflects the feature magnitude for the sample. (a) Random forest mean SHAP values; (b) XGBoost mean SHAP values.
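The averaging step behind Figure 2 reduces to a mean over the replicate axis of the per-replicate SHAP matrices. A minimal NumPy sketch, where random arrays stand in for SHAP values already computed per replicate (e.g., with a tree explainer):

```python
# Average SHAP matrices across model replicates, then rank features by
# mean absolute SHAP value; arrays here are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
n_replicates, n_samples, n_features = 5, 200, 30
shap_values = rng.normal(size=(n_replicates, n_samples, n_features))

mean_shap = shap_values.mean(axis=0)                # per-sample, per-feature mean
global_importance = np.abs(mean_shap).mean(axis=0)  # mean |SHAP| per feature
top10 = np.argsort(global_importance)[::-1][:10]    # indices of top-10 features
print(top10)
```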
Figure 3. Temporal factors of the top six predictive phenotypes. Y-axis values are weights learned during tensor decomposition and may indicate when the phenotype characteristically presented. (a) Dx/Rx phenotype 47 (back pain); (b) Dx/Rx phenotype 36 (cardiometabolic syndrome); (c) lab/vital phenotype 18 (ambiguous); (d) lab/vital phenotype 28 (kidney disease); (e) Dx/Rx phenotype 6 (cardiovascular medication); (f) lab/vital phenotype 22 (anemia).
Figure 4. Kendall rank correlation coefficients comparing the ranking of features in the “All” set by importance. A value of 1 indicates perfect positive correlation, −1 is perfect negative correlation, and 0 is no correlation.
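The rank agreement plotted in Figure 4 is the Kendall tau statistic between two feature-importance rankings; the two importance vectors below are hypothetical.

```python
# Kendall rank correlation between two rankings of the same features.
from scipy.stats import kendalltau

importance_model_a = [0.53, 0.51, 0.50, 0.44, 0.40]  # hypothetical importances
importance_model_b = [0.48, 0.52, 0.45, 0.41, 0.39]

tau, p_value = kendalltau(importance_model_a, importance_model_b)
print(f"Kendall tau = {tau:.3f} (1 = identical ranking, -1 = reversed)")
```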
Figure 5. Rules from the best-performing TGFNN model on the “All” feature set. Darker red cells indicate concepts that are more important within a rule. The relative contribution of each rule to predicting future AMI events is listed along the x-axis. Rules and concepts with relative importance below 0.1 were removed.
Table 1. Overview of the cohort. Age and hospital frailty risk score are presented as mean ± standard deviation. All percentages are of the patients within the column subset, except for “Number of patients”, which is the percentage of positive/negative patients within the train/test set.
|  | Train: Controls | Train: Cases | Test: Controls | Test: Cases |
|---|---|---|---|---|
| Number of patients | 3705 (66.5%) | 1865 (33.5%) | 1582 (67.1%) | 776 (32.9%) |
| Age, years | 63.5 ± 13.0 | 63.6 ± 2.9 | 63.5 ± 2.9 | 63.8 ± 12.9 |
| Hospital frailty risk score | 4.6 ± 6.1 | 4.9 ± 6.2 | 4.7 ± 6.0 | 4.9 ± 6.4 |
| Sex, male | 2093 (56.5%) | 1092 (58.6%) | 918 (58.0%) | 433 (55.8%) |
| Cardiac family history | 1524 (41.1%) | 971 (52.1%) | 644 (40.7%) | 378 (48.7%) |
| Ethnicity |  |  |  |  |
| White or Caucasian | 3154 (85.1%) | 1553 (83.3%) | 1347 (85.1%) | 671 (86.5%) |
| Black or African American | 248 (6.7%) | 164 (8.8%) | 111 (7.0%) | 61 (7.9%) |
| Asian | 133 (3.6%) | 54 (2.9%) | 68 (4.3%) | 16 (2.1%) |
| Other | 97 (2.6%) | 49 (2.6%) | 32 (2.0%) | 13 (1.7%) |
| Unknown | 33 (0.9%) | 21 (1.1%) | 14 (0.9%) | 8 (1.0%) |
| American Native | 11 (0.3%) | 12 (0.6%) | 2 (0.1%) | 5 (0.6%) |
| Native Pacific Islander | 3 (0.1%) | 2 (0.1%) | 1 (0.1%) | 0 (0.0%) |
Table 2. Mean ± standard deviation scores for each feature set across all models, sorted by F1 score.
| Feature Set | AUROC | AUPRC | F1 | Precision | Recall |
|---|---|---|---|---|---|
| All | 0.63 ± 0.03 | 0.45 ± 0.03 | 0.41 ± 0.10 | 0.46 ± 0.06 | 0.43 ± 0.17 |
| Latest, demo., phenotypes | 0.62 ± 0.04 | 0.43 ± 0.04 | 0.40 ± 0.13 | 0.45 ± 0.08 | 0.47 ± 0.22 |
| Phenotypes | 0.62 ± 0.02 | 0.43 ± 0.02 | 0.39 ± 0.17 | 0.40 ± 0.11 | 0.46 ± 0.25 |
| Latest, demo., statistics | 0.61 ± 0.02 | 0.43 ± 0.03 | 0.39 ± 0.14 | 0.43 ± 0.08 | 0.48 ± 0.28 |
| Summary statistics | 0.59 ± 0.03 | 0.42 ± 0.03 | 0.36 ± 0.15 | 0.44 ± 0.10 | 0.42 ± 0.26 |
| Latest, demographics | 0.60 ± 0.02 | 0.42 ± 0.03 | 0.35 ± 0.18 | 0.46 ± 0.16 | 0.39 ± 0.24 |
Table 3. Mean ± standard deviation scores for each model, across all feature sets, sorted by F1 score.
| Model | AUROC | AUPRC | F1 | Precision | Recall |
|---|---|---|---|---|---|
| RF | 0.633 ± 0.020 | 0.448 ± 0.017 | 0.484 ± 0.015 | 0.422 ± 0.019 | 0.570 ± 0.026 |
| LR | 0.635 ± 0.021 | 0.458 ± 0.015 | 0.478 ± 0.020 | 0.431 ± 0.014 | 0.537 ± 0.034 |
| TGFNN | 0.599 ± 0.037 | 0.422 ± 0.041 | 0.460 ± 0.091 | 0.365 ± 0.085 | 0.701 ± 0.246 |
| XGBoost | 0.598 ± 0.019 | 0.410 ± 0.019 | 0.450 ± 0.017 | 0.402 ± 0.014 | 0.513 ± 0.034 |
| DT | 0.582 ± 0.017 | 0.400 ± 0.016 | 0.440 ± 0.043 | 0.399 ± 0.023 | 0.510 ± 0.108 |
| TabNET | 0.593 ± 0.025 | 0.415 ± 0.022 | 0.213 ± 0.111 | 0.488 ± 0.057 | 0.159 ± 0.124 |
| EBM | 0.642 ± 0.021 | 0.466 ± 0.016 | 0.150 ± 0.110 | 0.570 ± 0.187 | 0.095 ± 0.076 |
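The mean ± standard deviation scores in Tables 2 and 3 come from repeated cross-validation with multiple scorers. A minimal sketch with scikit-learn's `cross_validate` on a synthetic dataset, using two of the seven models as stand-ins:

```python
# Cross-validation with the five metrics reported in the paper; the dataset
# is synthetic and only two models are shown for brevity.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
scoring = ["roc_auc", "average_precision", "f1", "precision", "recall"]

for name, model in [("LR", LogisticRegression(max_iter=1000)),
                    ("RF", RandomForestClassifier(n_estimators=50, random_state=0))]:
    cv = cross_validate(model, X, y, cv=5, scoring=scoring)
    summary = {m: (cv[f"test_{m}"].mean(), cv[f"test_{m}"].std()) for m in scoring}
    print(name, {m: f"{mu:.3f} ± {sd:.3f}" for m, (mu, sd) in summary.items()})
```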
Table 4. Top 10 most important features from the “All” feature set, reported as min–max scaled mean ± standard deviation of all LR, RF, XGB, EBM, and TNET models.
| Feature | Importance |
|---|---|
| Dx/Rx phenotype 47 | 0.533 ± 0.316 |
| Family history of cardiovascular diseases | 0.511 ± 0.324 |
| Dx/Rx phenotype 36 | 0.498 ± 0.374 |
| Lab/vital phenotype 18 | 0.496 ± 0.304 |
| Max systolic blood pressure | 0.440 ± 0.338 |
| Dx/Rx phenotype 13 | 0.405 ± 0.216 |
| Lab/vital phenotype 4 | 0.404 ± 0.256 |
| Lab/vital phenotype 28 | 0.359 ± 0.252 |
| Lab/vital phenotype 9 | 0.342 ± 0.195 |
| Lab/vital phenotype 11 | 0.336 ± 0.245 |
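The min–max scaling used to make per-model importances comparable before averaging (as in Table 4) can be sketched in NumPy; the importance matrix below is hypothetical.

```python
# Scale each model's feature-importance vector to [0, 1], then average
# across models; 5 models x 10 features of random placeholder importances.
import numpy as np

rng = np.random.default_rng(0)
importances = rng.uniform(size=(5, 10))

mins = importances.min(axis=1, keepdims=True)
maxs = importances.max(axis=1, keepdims=True)
scaled = (importances - mins) / (maxs - mins)  # per-model min-max scaling

mean_imp = scaled.mean(axis=0)
std_imp = scaled.std(axis=0)
ranking = np.argsort(mean_imp)[::-1]  # features sorted by mean scaled importance
print([f"feature {i}: {mean_imp[i]:.3f} ± {std_imp[i]:.3f}" for i in ranking])
```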
Table 5. Phenotypes most predictive of impending AMI, based on logistic regression coefficients and SHAP values from random forest and XGBoost. The top five features of each are shown. Units: millimoles per liter (mmol/L), milligrams per deciliter (mg/dL), grams per deciliter (g/dL), multiples of a thousand per cubic millimeter (K/mm³), multiples of a million per cubic millimeter (M/mm³).
| Dx/Rx Phenotype 47 (back pain) | Weight |
|---|---|
| Dorsalgia (M54) | 0.0252 |
| Low back pain (M545) | 0.0148 |
| Other joint disorder, not elsewhere classified (M25) | 0.0118 |
| Pain, not elsewhere classified (G89) | 0.0111 |
| Other chronic pain (G8929) | 0.0097 |

| Dx/Rx Phenotype 36 (cardiometabolic syndrome) | Weight |
|---|---|
| Type 2 diabetes mellitus (E11) | 0.0463 |
| Type 2 diabetes mellitus without complications (E119) | 0.0413 |
| Essential (primary) hypertension (I10) | 0.0113 |
| Type 2 diabetes mellitus with hyperglycemia (E1165) | 0.0076 |
| Disorders of lipoprotein metabolism and other lipidemias (E78) | 0.0064 |

| Lab/vital Phenotype 18 (ambiguous) | Weight |
|---|---|
| Absolute Basophil Count, (−0.001, 0.1] K/mm³ | 0.0017 |
| Absolute Early Granulocyte Count, (−0.001, 13.7] K/mm³ | 0.0017 |
| Potassium, (4.0, 4.3] mmol/L | 0.0009 |
| Chloride, (104.0, 106.0] mmol/L | 0.0009 |
| Urea Nitrogen, (4.999, 13.0] mg/dL | 0.0009 |

| Lab/vital Phenotype 28 (kidney disease) | Weight |
|---|---|
| Creatinine, (1.122, 21.48] mg/dL | 0.1908 |
| Urea Nitrogen, (23.0, 139.0] mg/dL | 0.1631 |
| CO2, (9.999, 26.0] mmol/L | 0.0980 |
| Chloride, (107.0, 122.0] mmol/L | 0.0663 |
| Potassium, (4.7, 9.8] mmol/L | 0.0550 |

| Dx/Rx Phenotype 6 (cardiovascular medication) | Weight |
|---|---|
| Atorvastatin 40 mg tablet | 0.0214 |
| Amlodipine 10 mg tablet | 0.0095 |
| Clopidogrel 75 mg tablet | 0.0050 |
| Lisinopril 40 mg tablet | 0.0025 |
| Metoprolol succinate ER 25 mg 24 h tablet | 0.0016 |

| Lab/vital Phenotype 22 (anemia) | Weight |
|---|---|
| Hemoglobin, (5.699, 12.1] g/dL | 0.0559 |
| Hematocrit, (18.099, 36.5] % | 0.0547 |
| Red Blood Cell Count, (1.87, 4.02] M/mm³ | 0.0517 |
| Calcium, (4.799, 9.2] mg/dL | 0.0512 |
| Albumin, (1.499, 4.0] g/dL | 0.0501 |
Table 6. Top 10 logistic regression feature coefficients from the best-performing model on the “All” feature set, reported as mean ± standard deviation. Milligrams (mg).
| Feature | Coefficient |
|---|---|
| Dx/Rx phenotype 7 | −1.346 ± 0.191 |
| Lab/vital phenotype 4 | −0.911 ± 0.080 |
| Lab/vital phenotype 18 | 0.858 ± 0.154 |
| Lab/vital phenotype 11 | −0.790 ± 0.046 |
| Clopidogrel, 75 mg tablet (within last 5 years) | 0.777 ± 0.076 |
| Dx/Rx phenotype 47 | 0.768 ± 0.379 |
| Dx/Rx phenotype 46 | −0.712 ± 0.156 |
| Dx/Rx phenotype 13 | −0.708 ± 0.089 |
| Lab/vital phenotype 9 | −0.690 ± 0.106 |
| Dx/Rx phenotype 35 | −0.626 ± 0.221 |
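Extracting and ranking logistic regression coefficients by magnitude, as in Table 6, is straightforward once a model is fitted; the dataset and feature names below are synthetic stand-ins.

```python
# Fit a logistic regression on synthetic data and rank features by the
# absolute value of their coefficients; names are hypothetical.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
names = [f"phenotype {i}" for i in range(8)]  # placeholder feature names

lr = LogisticRegression(max_iter=1000).fit(X, y)
coefs = lr.coef_.ravel()
order = np.argsort(np.abs(coefs))[::-1]  # largest-magnitude coefficients first
for i in order:
    print(f"{names[i]}: {coefs[i]:+.3f}")
```

The coefficient sign indicates direction: positive values push the prediction toward the positive (AMI) class, negative values away from it.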

Hodgman, M.; Minoccheri, C.; Mathis, M.; Wittrup, E.; Najarian, K. A Comparison of Interpretable Machine Learning Approaches to Identify Outpatient Clinical Phenotypes Predictive of First Acute Myocardial Infarction. Diagnostics 2024, 14, 1741. https://doi.org/10.3390/diagnostics14161741