1. Introduction
An acute myocardial infarction (AMI), or “heart attack”, is myocardial necrosis due to sudden ischemia caused by blood clotting around ruptured or exposed plaque in the coronary arteries [
1,
2]. Globally, more than 3 million people have an AMI each year [
3]. Fortunately, AMI incidence rates have declined as researchers and clinicians have identified and managed risk factors [
4]. The vast majority of AMIs occur out of the hospital, where patients have limited monitoring [
5]. Additionally, most reported AMIs are the patients’ first and are often unaccompanied by comorbidities [
1,
6]. These observations highlight the inherent difficulty in predicting AMI events.
There have been significant efforts to predict a variety of severe adverse cardiovascular events, including AMIs. Many studies predict AMI onset in patients while they are in the hospital [
7,
8,
9,
10]. However, ideally, prediction occurs earlier so clinicians can intervene to avoid hospitalization. Using electronic health record (EHR) data for over 20,000 AMI cases in a cohort of 2.27 million patients from the UCHealth hospital system, Mandair et al. predicted the 6 month risk of a first AMI using several machine learning models [
11]. The best-performing model achieved an AUROC of 0.835 and F1 of 0.092. Unfortunately, their model exhibited poor calibration, ignored timing, and did not utilize laboratory values, nor did they provide any insight into how their model made predictions. Moore and Bell used XGBoost on data from over 500,000 patients in the UK Biobank to predict self-reported “heart attack” (11,849 cases) [
12]. They interpreted their models using SHAP values. However, they did not give any information regarding the timing of recorded features or heart attack. Wang et al. predicted AMI within 10 years in 11,635 patients but did not provide any model interpretability [
13]. Tsarapatsani et al. predicted 10 year AMI onset in a cohort of 3267 patients that had electrocardiogram and angiography data available, and used SHAP values for model explainability [
14]. Sievering et al. predicted 5 year AMI onset in 500 patients with coronary artery disease using angiography images and 11 clinical features [
15]. While significant effort has been put into predicting AMIs, the resulting models often focus on patients who already have cardiovascular comorbidities, ignore temporal relationships in the data, and lack sufficient interpretability and explainability.
Interpretability and explainability in machine learning models generally derive from two approaches: model structure and post hoc analyses. Common post hoc methods for quantifying feature importance include SHAP [
16] and LIME [
17] values, albeit with potentially questionable reliability [
18]. Many canonical models incorporate inherent interpretability into their structure. Logistic regression models provide variable coefficients that indicate the effect of features on the outcome. Tree-based models like random forest [
19] and Extreme Gradient Boosting (XGBoost) [
20] compute feature importance based on location in trees and metrics like impurity and gain, respectively. Generalized additive models, such as the Explainable Boosting Machine (EBM) [
21], learn a nonlinear function for each variable, or interaction of variables, which describes their impact on the model. Attention mechanisms embedded in deep learning models can explain what data heavily weighs the outcome and relationships between data. For example, TabNet provides global and local feature importance scores [
22]. However, the interpretability and reliability of attention is disputed [
23]. Matrix and tensor factorizations methods learn interpretable factors that provide a low-rank approximation of the data and can be used for clustering, phenotyping, dimensionality reduction, and feature engineering. Applied to EHR data, tensor factorization can automatically discover patterns of co-occurring medical variables across patients and their evolution across time [
24]. This has proven useful as the irregular temporal nature of EHR data is a primary challenge. Fuzzy neural networks are models that use fuzzy logic within a neural network structure to map features to interpretable concepts and learn logical rules for prediction. Specifically, the tropical geometry fuzzy neural network (TGFNN) developed by Yao et al. has shown recent promise [
25]. We employ several of these interpretable methods in this work.
In this work, we assess whether state-of-the-art interpretable machine learning approaches can learn clinical profiles that predict a patient’s first AMI, before hospitalization is required. We extract five years of longitudinal outpatient EHR data for patients without cardiovascular diagnoses before AMI onset, from the University of Michigan Health System (2698 positive and 5396 matched negative samples). Using tensor factorization, we reduce the dimensionality of the longitudinal health history while preserving interpretability and temporal relationships. Using the EHR phenotypes and other patient data, we train seven state-of-the-art interpretable machine learning models, including TGFNN, to predict AMI onset within six months. We evaluate whether incorporating temporal information via computational phenotyping improves model performance, overall model performance, and the consistency of important features. We present and clinically validate the learned phenotypes, rules, and relationships that explain the models’ predicted outcomes. We anticipate that these findings can assist researchers and clinicians in better understanding the risk factors of AMI, identifying at-risk patients, and providing preventative care.
2. Materials and Methods
2.1. Dataset
In this study, we used outpatient data collected from adult patients of the University of Michigan Health System (UMHS) from 1 January 2012 to 1 May 2023.
To define our cohort, we retrieved data from adult patients (23–89 years) who had at least three outpatient visits within the five years before their latest visit or their first cardiac diagnosis. We defined cardiac diagnoses as ICD9 codes 410.*–429.* and 785.0–785.1, and ICD10 codes I20.*–I52.* and R00.*. Cases, or positive samples, were defined as those patients in the cohort whose first cardiac diagnosis was an AMI (ICD9: 410.*; ICD10: I21.*). Controls, or negative samples, were any other UMHS patient who met the above criteria but did not have a cardiac diagnosis. We matched two negative patients to each positive patient based on sex, ±2 years in age (at time of AMI or last encounter), and ±2 points in the hospital frailty risk score [
26]. We computed each patient’s hospital frailty risk score using all diagnoses on their EHR in the five-year period. Positive patients without control matches were excluded. This resulted in a cohort of 2641 positive patients (those who develop an AMI) and 5287 negative patients (those who do not develop an AMI). We split the patients into training and testing sets with a 70–30 split. For demographic information on the cohort, see
Table 1.
2.2. Data Preprocessing
We extracted each patient’s data within the five years before their AMI or last recorded encounter. These data included time-dependent data like diagnoses, medications, vitals, laboratory results, and substance use. Time-independent data were also extracted, including demographics and family health history. We cleaned the data to remove erroneous and ambiguous values (e.g., text entry in numeric variable column, values outside of possible range, etc.). We converted all temperature values to Fahrenheit. Continuous variables like laboratory values and vitals were only included if >60% of patients in the training set had at least one measurement. We arbitrarily selected 60% as a missingness cutoff to prioritize the most common and accessible clinical variables as well as limit errors in downstream imputation. Removing rare variables reduces the need for data imputation and prioritizes results based off common, accessible variables. Categorical variables, like race, were one-hot encoded. We excluded patients missing information on their sex. We determined whether patients had a family history of cardiovascular diseases by whether they had at least one familial occurrence of heart disease, heart attack, coronary artery disease, heart failure, heart defect, aortic disease, sudden cardiac death, cardiomyopathy, cardiovascular disease, or rheumatic heart disease. We excluded procedure data. Diagnosis features were originally recorded as codes from the International Statistical Classification of Diseases (ICD) version 9 or 10 [
27] and we converted all ICD9 codes to ICD10 via a conversion table provided at
https://github.com/bhanratt/ICD9CMtoICD10CM, accessed on 3 March 2024. We removed all “Z” chapter ICD10 codes. Diagnoses were encoded as binary variables to indicate the presence of the ICD10 code, regardless of how often it was recorded. To condense the diagnosis data, we added the higher-level IDC10 categories as features if one of their children diagnoses was present, e.g., if a patient had ICD10 code E11.0 present, they would also have E11 marked as present. Medication information was also encoded as binary variables indicating its prescription at every encounter between its start and stop dates. Medication feature names were taken directly as recorded in the EHR. For both diagnosis and medication data, we employed carry forward imputation followed by zero imputation. We removed variables present in less than 1% of both case and control patients in the training set. All data preprocessing was completed in Python (Version 3.9) and all code used in the study is available at
https://github.com/kayvanlabs/interpretable-ami-prediction, accessed on 3 March 2024.
We extracted three different sets of features from the data: (1) the demographics, or time-independent data, and latest recorded values within six months of AMI onset or the last encounter, (2) summary statistics of the entire five-year history, and (3) computational phenotypes of five-year health history using unsupervised tensor factorization. We selected these approaches for condensing patients EHR history because they are common, interpretable, easy to compute, and have been shown to be effective in other studies. Additionally, we evaluated the combination of the feature sets. In total, we tested six distinct feature sets:
Latest data and demographics;
Summary statistics;
Computational phenotypes;
Latest data, demographics, and summary statistics;
Latest data, demographics, and computational phenotypes;
Latest data, demographics, summary statistics, and computational phenotypes.
2.3. Latest Recorded Data of Health History
For each patient in the dataset, we extracted the most recent measurement of each variable before AMI onset or the last recorded visit, for positive and negative samples, respectively.
2.4. Summary Statistics of Health History
Summary statistics of clinical variables are fast to compute, easily understandable, and can be predictive of important outcomes [
28]. We summarized laboratory and vital data over the five-year observation window by computing the mean, standard deviation, minimum, and maximum of each variable, for each patient. Categorical variables were aggregated by taking the maximum value, indicating whether a patient ever had the feature present in the five-year window. These statistics reduced the multiple, longitudinal measurements of each variable to a single, interpretable value.
2.5. Computational Phenotypes of Health History
Computational phenotyping of high-dimensional EHR data via tensor decomposition enables automated, low-dimensional representation of co-occurring medical events across patients, different types of variables, and time [
24]. In this work, we use tensor decomposition to discover temporal, clinical phenotypes that act as interpretable features for downstream classification. While various tensor factorization approaches exist, we used the unsupervised, non-negative PARAFAC decomposition, with the hierarchical alternating least squares algorithm, implemented in TensorLy (Version 0.8.1) because of its simplicity, wide use, and ease of use [
29]. Non-negative PARAFAC decomposition approximates the original data with the sum of rank-one component tensors. Each component tensor is defined by the outer product of vectors, one for each mode of the original data. The values in these vectors are learned via alternating least squares and describe the components. In this work, we decompose three-dimensional tensors with the modes: patients, time, and features. Thus, after applying non-negative PARAFAC decomposition, each component of the factorization can be interpreted as a phenotype defined by three vectors that encode the membership, weight, or importance of patients, time points, and features.
We learned temporal phenotypes for laboratory values and vitals separately from diagnoses and medication. All laboratory values and vitals for the five years preceding AMI onset or the patient’s last encounter were separated into ten, six-month segments. Only the last recorded value of each feature was kept per segment. To simplify computation and interpretation, each feature was discretized into quintiles based on the feature distributions in the training set. The resulting three-dimensional tensor representation of the data consisted of modes: patient × time × feature and size 7928 × 10 × 320.
We used diagnosis and medication data over the 5 observation years to generate temporal phenotypes. First, we split the 5 years into ten, six-month intervals. Within each interval, we recorded the diagnoses documented at encounters with a “1” and undocumented diagnoses as “0”. If there was no encounter in the six-month interval, diagnosis variables were left as null. We then performed carry forward imputation of all diagnosis codes until the next interval with an encounter. All remaining null values were imputed with zero. For medication data, we marked a “1” if the medication was prescribed during a given interval; otherwise a “0” was inserted. We formatted these data into a three-dimensional tensor of modes: patient × time × feature of size 7928 × 10 × 734.
Determining the optimal number of phenotypes, or rank, of a tensor decomposition is an open problem [
24]. One common approach is to evaluate and plot the predictive performance of various ranks and choose the rank at the “elbow” of the curve—effectively identifying the rank after which performance increases are marginal. We carried this out by first factoring the training set using ranks at increments of two, between one and 50, running three replicates at each rank. Next, using the normalized patient membership to the phenotypes as features, we removed 30% of the training set for a validation set, trained a random forest model to discriminate between positive and negative samples, and visualized the performance according to common machine learning metrics, against the rank. For both the lab/vital and diagnoses/medications (Dx/Rx) phenotypes, the F1 score plateaued by a rank of ten. However, both AUROC and AUPRC gradually increased till approximately 30 for lab/vital phenotypes and continued to increase to a rank of 50 for Dx/Rx phenotypes (see
Figure A1). We decided to use those ranks for each decomposition. While the predictive performance of the Dx/Rx phenotypes may continue to increase beyond rank 50, in practice, more phenotypes may become increasingly redundant or less interpretable.
Using the ranks of 30 and 50, we decomposed the training set tensors and extracted the learned lab/vital and Dx/Rx phenotypes, respectively. To determine the test set patients’ membership of these phenotypes, we fixed the feature and time dimensions of the phenotypes to those of the training set and then set the decomposition to fit only the patient membership mode. This projects the phenotypes onto the test set patients to determine their membership of each, without changing the phenotypes themselves. Lastly, we concatenated the lab/vital phenotype features with the Dx/Rx phenotype features into a single patient × feature table with 80 temporal phenotypes as features to describe each patient’s EHR history.
2.6. Feature Selection
We used the Minimum Redundancy and Maximum Relevance (mRMR) [
30] approach to select the most relevant and least redundant subset of features from each feature set: latest/demographics, summary statistics, and computational phenotypes. We opted to use this feature selection approach because it not only identifies the most relevant features but also limits collinearity between the selected features, unlike other feature selection methods. Using mRMR can improve model performance, speed, and interpretability [
31]. Feature relevance is determined by random forest feature importance and feature redundancy by Pearson’s correlation. To assess the optimal number of features for each feature set, we incrementally increased the number of features to use when running mRMR and then assessed their performance in three random forest models. By looking at the results, for each feature set, we determined a reasonably small number of features with near-optimal performance. We selected 20 features for the latest and demographic feature set, 30 for the summary statistics feature set, and 30 for the phenotypes feature set. After feature selection, if any variables had missing values, we imputed them using k-nearest neighbors, as implemented in Scikit-learn (Version 1.2.2) [
32], fit on the training set only, and applied to both training and testing sets. Next, we performed the same experiment using combinations of the three feature sets and opted to use 20 features in the latest/demographic/summary statistics feature set, 30 in the latest/demographic/phenotypes feature set, and 60 in the “All” feature set (latest/demographic/summary statistics/phenotypes).
2.7. Model Training and Cross-Validation
We selected a set of machine learning models to evaluate in this work based on their interpretability and accessibility, including decision tree (DT), logistic regression with L2 penalty (LR), random forest (RF), EBM, XGBoost (XGB), TabNet (TNET), and TGFNN. We used the decision tree, logistic regression, and random forest implementations from scikit-learn (Version 1.2.2), the EBM implementation from InterpretML (Version 0.3.2) [
16], and XGBoost (Version 1.7.5) [
20], TabNet (Version 4.1.0) [
22], and the TGFNN as described in [
25] and implemented in Pytorch (Version 2.1.0). To compensate for dataset imbalance, we up-weighted the minority class (positive) and down-weighted the majority class (negative). In models not allowing class weights (EBM, XGBoost, and TabNet), we randomly downsampled the majority class (negative) to a 1:1 ratio with the minority class (positive).
We performed three-fold cross-validation on the 90% of the training set (10% withheld as a validation set) to determine the optimal hyperparameters for each model. We randomly sampled 500 combinations of hyperparameters for each model, except TGFNN. Because of the longer runtime of TGFNN, we evaluated 300 combinations for all feature sets besides evaluating 200 on the “All” feature set, due to slower training from the additional features. After evaluating their performance, the combination of hyperparameters with the highest F1 score was selected. Next, we performed five-fold cross-validation to evaluate the performance of the models and datasets on the training and test sets. This trains five instances of each model on a different subset of data, providing information on the variance in performance. To evaluate model calibration, we calibrated the best-performing replicate according to F1 score on the “All” feature set. Each model was calibrated on the training set according to Platt’s method and plotted with the mean probability of ten uniform bins.
2.8. Tropical Geometry Fuzzy Neural Network
Due to its lesser-known architecture, we briefly describe the TGFNN, though a full description can be found in [
25]. The TGFNN is fuzzy logic classifier built in a neural network architecture that allows flexible and interpretable variable concept encoding, logical rule learning, and inference for a classification task. TGFNN consists of three modules: the encoding module, the rule module, and the inference module.
The encoding module “fuzzifies” continuous input variables into their membership to the concepts “low”, “medium”, and “high”. This encoding is performed via parameterized membership functions that map each variable to three values in the range [0, 1] that represent how much it belongs to each concept. Categorical variables are one-hot encoded. Membership functions are learned during training and help model the intuition and uncertainty in clinician decision making. For example, TGFNN can learn the concept of “low blood pressure” and use that concept in the decision-making rules.
The rule module learns combinations of variable concepts that are predictive of an outcome. The first layer of the rule module learns which concepts are important for each variable within each rule. The second layer learns the importance of each variable within each rule. The more important the variable, the greater the weight within the network, and thus the more it will contribute to inference when activated. Rule activation strength is calculated via a parameterized T-norm which models an AND operation in fuzzy logic via either a product or minimum function. This enables the easily interpretable and logical structure of the decision rules, for example, “if is low and is high”.
The final layer of the TGFNN is the inference layer, which learns the importance of each rule in determining the model output. The importances, or contributions, of the rules are aggregated in a T-conorm function, followed by softmax activation. This calculates the probability of each output class, given the activation of the rules by the input sample. Implementation of tropical geometry allows the OR operation to be changed between an addition or maximum function.
2.9. Statistical Analysis
To evaluate whether differences in model performance across feature sets were significant, we performed Friedman’s tests with the Bonferroni corrected alpha of 0.01 (0.05 divided by the number of tests run, five, one for each metric). For significant Friedman test results, we performed pairwise post hoc Nemenyi tests. We selected these tests because they are non-parametric and recommended when comparing machine learning model cross-validation performance [
33].
3. Results
We find that, by applying straightforward, interpretable machine learning approaches to EHR data, we are able to predict the onset of first-AMI events in patients without pre-existing cardiovascular conditions, within six months, with moderately good accuracy. Upon comparing different explainable feature engineering approaches (the feature sets), we report that they can have significantly different performance, depending on the model. Overall, the best-performing feature sets were those that included computational phenotypes. Additionally, we compared seven machine and deep learning models, each with a different level of interpretability, and found them to exhibit significantly different performance. We present these results in detail in the following.
3.1. Feature Set Performance
Feature sets including computational phenotypes significantly outperform those without. Overall, the “All” feature set performs best, though only slightly (see
Table 2). This is likely because of the large number of diverse features included, the efficacy of computational phenotyping for feature extraction, and the relevance of historical information. Incorporating computational phenotypes as features resulted in performance gains in AUROC as much as 0.05 (see
Table A3 and
Figure A4a). We evaluated whether the differences in overall model performance between feature sets were significant by performing Friedman’s test followed by pairwise Nemenyi tests. Across every pairwise comparison, all feature sets containing phenotypes had significantly higher performance than feature sets without phenotypes, according to AUROC. Under AUPRC and F1 score, most of these comparisons were also statistically significant. In no pairwise comparison, regardless of metric, do any of the feature sets containing phenotypes exhibit significantly different performance from each other, save the “All” feature set outperforming the “Phenotypes” feature set according to AUPRC.
3.2. Model Performance
We predict the onset of AMI within six months in patients without pre-existing cardiovascular diagnoses with good performance using several interpretable models. Model performance varied significantly between models and feature sets, often depending on the evaluation metric (see
Figure A4). While there is no clear “best” model, random forest, logistic regression, and TGFNN performed best overall. In interpreting model performance on an imbalanced dataset, multiple metrics must be appropriately considered as there is no singularly best one. Determining performance criteria is especially important in a clinical application where false positives and negatives could lead to patient harm, either by receiving unnecessary treatment or not receiving needed care, respectively. We note that the models showed varying levels of minimal-to-mild overfitting according to training, validation, and testing set performance (see
Table A1,
Table A2 and
Table A3). When considering all metrics, logistic regression and random forest consistently performed near best, often followed by TGFNN, while XGboost and decision tree were often among the worst. Depending on the metric, TGFNN, EBM, and TabNet typically performed either best or worst (see
Figure A4). We performed Friedman tests followed by Nemenyi tests to evaluate whether, across all models, performance differences between feature sets were statistically significant. According to AUROC, random forest, logistic regression, and EBM all performed significantly better than the other models. For nearly all pairwise comparisons, random forest, logistic regression, and TGFNN performed significantly better than XGBoost, decision tree, TabNet, and EBM, according to F1 score. When considering F1 score, there was no significant difference between random forest, logistic regression, and TGFNN performance.
Several models appear biased to over- or underestimating risk of AMI. Across all feature sets, TGFNN exhibits high recall on average (0.754 ± 0.234), as it heavily predicts the positive class. Conversely, TabNet and EBM have low average recall scores (0.159 ± 0.124, 0.095 ± 0.076) due to relatively fewer positive predictions. The accuracy of the, albeit relatively few, positive predictions of EBM contributes to its high mean precision and AUROC (see
Table 3). These biases are also present in the model calibration plots (see
Figure 1). Both TabNet and EBM underestimate the probability of positive samples while TGFNN slightly overestimates. Overall, the best-performing models, according to F1 score on the “All” feature set, exhibit good calibration.
Several models suffered from variable performance. Model stability is an important factor when considering implementation, especially in a healthcare setting. In contrast to the simpler random forest and logistic regression models, more complicated models, like EBM, TabNetm, and TGFNN, exhibited higher standard deviations in performance (see
Figure A4d,e). This may be due to the greater number of hyperparameters that require precise tuning in these models. This variance in performance makes the interpretation of important features difficult as well.
3.3. Model Interpretation
Each of the employed models exhibits a degree of inherent interpretability, allowing for some explanation of how predictions were made. For brevity, we will focus our analysis on the “All” feature set, as it is generally the best performing and includes features from all subsets. Additionally, we will focus our interpretability analysis on the better and more consistently performing models: logistic regression, random forest, and XGBoost.
Across all models with global feature importance scores (logistic regression, random forest, XGBoost, EBM, and TabNet), the computational phenotype features of longitudinal EHR data are often the most predictive of a future AMI event. The most important features include Dx/Rx phenotype 47, family history of cardiovascular diseases, Dx/Rx phenotype 36, lab/vital phenotype 18, and a high max systolic blood pressure within the five-year observation window (see
Table 4). Feature coefficients in logistic regression and SHAP values of XGBoost and random forest models indicate the direction of the relationship between feature magnitude and future AMI prediction (see
Figure 2). The patient phenotypes most strongly indicative of a future AMI are characterized by dorsalgia, type 2 diabetes, hypertension, high creatinine and urea nitrogen levels, cardiovascular medications like atorvastatin, and anemia (see
Table 5). The temporal factor of these phenotypes may suggest the characteristic timing of its presentation in patients. We present the temporal factor plots of six of the most predictive phenotypes in
Figure 3. The temporal components of the phenotypes predominantly range from immediately before AMI onset to three years prior. Apart from phenotypes being among the most predictive features, additional important variables include a history of smoking and high mean body mass index over the five-year window (
Table 6).
The importance of features varies between models, making interpretation difficult at times. When computing the, on average, most important features in the “All” set across all models with global feature importance, there are large standard deviations (see
Figure 4). Additionally, we compared how each model cross-validation replicate ranked variables by importance via the Kendall rank correlation coefficient (see
Figure 4). This metric shows on a scale of [−1, 1] how negatively or positively correlated the rankings are, with 0 indicating no correlation. We found variable correlation between replicates and models. Logistic regression, random forest, and XGBoost show decent correlation between replicates. On the other hand, TabNet and EBM show relatively low correlation between replicates. Surprisingly, despite somewhat comparable performance, the feature importance ranking of logistic regression is somewhat negatively correlated with the rankings of both random forest and XGBoost. This may be a result of the specific model properties, such as the limitation of logistic regression in identifying linear relationships, whereas XGBoost and random forest can identify nonlinear ones. However, it may also reflect the inherent difficulty in the task of predicting future AMI events in this cohort.
Unlike the other models, TGFNN learns precise, interpretable rules that determine predictions. The best-performing TGFNN model uses the “All” feature set and is based on 12 rules learned directly from the data (AUROC = 0.658, AUPRC = 0.479, F1 = 0.456, precision = 0.475, recall = 0.439). We present these rules in
Figure 5. Linguistically, the most important rule (R0) is
Patient has a history of Clopidogrel prescription and matches Dx/Rx phenotype 27 (characteristic features include: vitamin D3, simvastatin, vitamin B-12, vitamin C, and malignant neoplasm of bladder (ICD10: C67)) and has high mean creatinine.
Rules 1–5 are similarly simple to understand, containing a couple of concepts each, and describe combinations of cardiovascular and metabolic medication prescriptions along with abnormal lab and vital measurements, as well as a family history of cardiovascular conditions. Interestingly, while R6 has family history as important, R5 has the lack of family history as important. These rules may be stratifying between different underlying pathologies leading up to AMI. Notably, interpreting the “low”, “medium”, and “high” concepts is dependent on the shape of the underlying membership functions. Because of the flexible, trainable parameters of these functions, they may “squish” the “low” or “high” function out of the possible range of values to dynamically simplify to only two concepts.
Overall, we found the interpretable models able to accurately identify patients, without pre-existing malignant cardiovascular diagnoses, that have an AMI within six months. While the evaluated models varied in both performance and prioritization of important features, we identified several consistently important medical concepts and phenotypes.
4. Discussion
We demonstrate that accurately predicting an AMI within six months in patients without pre-existing cardiovascular conditions, using only outpatient data and interpretable models, is possible. Furthermore, we show that temporal, computational phenotyping can identify highly predictive clinical profiles of future AMI events. This suggests the relevance of historical information, temporal EHR relationships, and computational phenotyping in evaluating the future risk of AMI, which is often ignored in similar studies. We anticipate that these findings will be informative to researchers and clinicians seeking to develop interpretable machine learning approaches for hard-to-predict events like AMI, as well as leverage high-dimensional longitudinal EHR data.
The Dx/Rx phenotypes predictive of future AMI onset generally agree with strongly supported clinical relationships and also suggest potential underutilized relationships. The predictive Dx/Rx phenotype 47 describes dorsalgia and other pain as predictive. While low back pain does not have a known association with AMI, chronic pain is associated with various cardiovascular diseases [
34,
35] and some pain medications, like non-steroidal anti-inflammatory drugs (NSAIDs), are a known risk factor of AMI [
36]. The consistently predictive Dx/Rx phenotype 13 describes a profile of pain medication prescriptions, including the NSAID ibuprofen. However, it was negatively related with future AMI according to several models. This discontinuity may be dataset specific or indicate an underlying relationship such as if a patient is on a certain type of medication it reflects their interaction with healthcare professionals that may be helping prevent AMI in other ways. On the other hand, a back pain phenotype may capture patients who are misinterpreting angina (precursor symptom of AMI) for dorsalgia. This could suggest clinicians increase their suspicion of underlying cardiovascular diseases when patients present with back pain. In congruence with known AMI risk factors, Dx/Rx phenotype 36 characterizes patients with type 2 diabetes, potentially further complicated with hypertension [
37,
38]. The Dx/Rx phenotype number 6 encompasses several cardiovascular medications like the platelet inhibitor clopidogrel. A clopidogrel prescription suggests the patients may have already had severe cardiovascular conditions, like coronary stenosis, that were not recorded in the EHR, that required a stent. This phenotype may be predictive due to poor medication adherence followed by in-stent re-stenosis and a subsequent AMI within approximately six months (see phenotype temporal peak in
Figure 3e). However, further analysis is required to ascertain specific and supported claims of this clinical relationship.
The lab/vital phenotypes suggest some clinically valid risk factors, but are noticeably harder to interpret due to large quantile ranges. Lab/vital phenotype 28 characterizes patients with mild-to-severe kidney disease, indicated by elevated creatinine [
39], high urea nitrogen [
40], and hyperchloremia [
41]. Kidney disease greatly increases the risk of adverse cardiac events like AMI [
42]. Lab/vital phenotype 22 describes a patient with mild-to-severe anemia, a risk factor of AMI [
43]. The temporal component of the phenotype suggests that this occurs relatively soon before AMI (see
Figure 3f). However, the large range of these lab result quantiles limits the utility of the phenotypes. In the future, more precise partitioning of variables may resolve this. Notably, lab/vital phenotype 18 does not described abnormal physiology. The range of the “Absolute Early Granulocyte Count” encompasses essentially all possible values. A deeper look at the distribution of values in the training data suggested this is a result of too few unique values to make five equally sized quantiles. Additionally, the feature weights in this phenotype are relatively low, indicating weak membership and thus a rather ambiguous phenotype. The relevance of this phenotype with future AMI may be an artifact of the data or methods. While phenotypes using laboratory values and vital signs can be improved, they can successfully capture important abnormal physiology across temporal EHRs.
Visualizing SHAP values of the latest value and summary statistic features revealed additional risk factors with known clinical relevance. Unlike standard feature importance scores generated by tree-based models, SHAP values indicate the direction of relationships between features and outcomes. Specifically, the SHAP values of the random forest and XGBoost models trained on the “All” feature set suggest several predictive relationships (see
Figure 2). These predictive variables include high blood pressure [
44], family history of cardiovascular diseases [
45], high body mass index [
12], smoking [
46], and low mean corpuscular hemoglobin levels [
47]. Additionally, the SHAP values agree with other feature importance scores, indicating the presence of Dx/Rx phenotypes 36 and 47, back pain and cardiometabolic syndrome, respectively, are predictive of a future AMI. Other features were not consistently highly predictive across multiple replicates or models. The contradictions and variability in the importance of features and their relationship to future AMI events are likely a result of noise within the used EHR data and reflect a typical challenge in predictive machine learning in healthcare. Computational approaches, such as this work, may best serve as a screening method for a specific clinical relationship to be explored in more controlled settings.
When compared to summary statistics and the most recent recorded data, computational EHR phenotypes can significantly increase interpretability and performance in biomedical machine learning. Across multiple model architectures and feature sets, the phenotypes consistently ranked as the most important features. These results suggest that historical and temporal information encoded in EHRs is highly relevant for predictive modeling, and specifically for AMI risk assessment. Additionally, we suggest the increased use of tensor decomposition in EHR feature extraction. The employed tensor factorization algorithm mines temporal, high-dimensional EHR data without supervision, removing the need for clinicians to manually curate phenotypes. These phenotypes capture patterns of co-occurring medical variables across time to describe distinct patient profiles. These patient phenotypes reduce the dimensionality of the EHR data while maintaining interpretability and improving performance. Predictive phenotypes can prioritize to clinicians the important sets of conditions patients present with in the clinic, that may be indicative of risk for a future AMI. These can direct more targeted studies to establish association. Additionally, they can provide information regarding the timing of conditions, which may prompt further investigation into understanding the progression and evolution of disease, as well as potential timing for early intervention. Notably, computational phenotypes may be difficult to interpret if they are redundant, have many features with similar weight, or do not make clinical sense. Many improvements upon the base PARAFAC tensor factorization have been made to address these problems specifically for temporal EHR phenotyping [
24]. However, in this work, we focused on the baseline approach due to its wide accessibility.
While we did not identify a singular superior, interpretable machine learning model, we identified several strengths and weaknesses. Overall, random forest, logistic regression, and TGFNN performed best. All models exhibit good calibration. In a similar AMI prediction study, [
11] presents poor model calibration results. As the authors state, poor calibration in [
11] is likely a result of very severe class imbalance, whereas in this work we limit class imbalance via the downsampling of matched negative samples and class-weighted loss functions. Some models, like EBM and TabNet, displayed very poor recall and F1 scores due to biased class predictions. They also showed low concordance in feature importance between cross-validation instances. It is likely that these models were not well suited for this particular dataset and task. The TGFNN provides clear rules for predictions, making it perhaps the most interpretable of the models. The rule-based nature of TGFNN well reflects how clinicians make decisions and identify patterns. We anticipate that the further development of TGFNN and other interpretable rule-based models will aid clinical adoption. Still, in the example presented in
Figure 5, interpretation can be difficult if “medium” concepts cannot be clarified. Logistic regression and random forest both showed some of the best performance and consistency of feature importance. These models are often too simplistic to solve difficult tasks; however, in this case, deriving features from computational phenotypes improved performance.
This study has several limitations that affect the applicability and bias of results. First, the employed cohort of patients comes from a single hospital system and is predominantly elderly and white. We excluded data on procedures received by patients. We employed mRMR feature selection, which may not find the optimal set of features. Additionally, the interpretation of important features showed high variability between and even within models. We note that while similar studies attempting to predict AMI, such as the work carried out in [
11], show higher AUROC values, this work attempts a potentially more difficult task to predict AMI events within a cohort without pre-existing cardiac conditions. Future work could address these limitations by expanding the cohort inclusion criteria, incorporating data from multiple healthcare systems, as well as using computational methods to explore the causal relationships between clinical features and AMI onset.
In conclusion, we suggest that temporal, computational phenotyping can improve the utility of outpatient EHRs in both predicting the risk of AMI in otherwise low-risk patients and identifying novel risk factors for further investigation. Additionally, we demonstrate that interpretable machine learning models can consistently identify important risk factors and accurately predict a future AMI event in patients without pre-existing cardiovascular conditions, using only outpatient data. We note that model-derived feature importance scores may be discordant, and encourage researchers to validate findings. We anticipate that these findings will promote further development in computational and machine learning approaches to identify novel phenotypes that can aid clinicians in understanding, predicting, and preventing AMI and subsequent hospitalization.