1. Introduction
Coronary artery disease (CAD) is the most common type of cardiovascular disease and the leading cause of mortality worldwide; the proportion of deaths attributed to CAD is higher than for any other heart disease. CAD occurs when the coronary blood vessels narrow and blood no longer circulates properly around the heart. Early diagnosis of any form of heart disease, especially CAD, is crucial so that the patient receives the proper medication and treatment [1,2].
A well-established method in clinical practice for the assessment of CAD is single-photon emission computed tomography (SPECT) myocardial perfusion imaging (MPI) with technetium-99m-labeled (99mTc) perfusion agents. The method offers rest and stress representations of the patient’s heart, allowing areas with myocardial perfusion abnormalities to be identified [3]. Importantly, SPECT provides three-dimensional information while reducing scanning time and procedure cost [4,5]. SPECT is the most popular clinical diagnostic technique for CAD, and computer-aided medical diagnostics built on it have been increasingly used in recent years, offering doctors time savings and error avoidance. The increasing accuracy of computer-aided medical diagnostics is due not only to advancements in computed tomography and newly developed image capture hardware, but also to the research and innovation achieved in deep learning (DL) and other machine learning (ML) methodologies.
DL-based solutions are increasingly being used to solve medical image analysis and computer-assisted intervention problems in medical informatics, owing to their ability to extract patterns and features directly from input images with remarkable results. Their most significant advantage is their adaptability to the specific requirements of medical image analysis, which researchers have exploited to adapt DL to a variety of medical diagnostic tasks with minimal implementation and infrastructure harmonization effort. Additionally, DL provides a pipeline for medical imaging applications such as segmentation, regression, image generation, and representation learning in medical diagnostics [3]. DL is an advancement of artificial neural networks (ANNs) that adds more layers, allowing for higher levels of abstraction and better predictions. It has thus emerged as the most powerful ML tool in the general imaging and computer vision domains, and therefore in medical informatics.
Computer-aided diagnosis systems are increasingly utilized in clinical practice and play an irreplaceable role in the diagnosis of MPI images, which represent the blood flow of the heart in high contrast [6]. An automated algorithm for classifying CAD images is highly desirable for nuclear physicians, since the increasing number of cases creates a bottleneck for doctors [7]. Concerning CAD diagnosis in nuclear image analysis, ML has been introduced and investigated as a methodology for automatic classification in various research studies [1,8,9,10,11,12]. Such systems have proven highly stable in cardiovascular data analysis and, combined with ML techniques, can provide automatic SPECT image classification without the need for additional data. DL methodologies, and especially convolutional neural networks (CNNs), have exhibited promising results in CAD diagnosis, and a wide number of studies have focused on developing CNNs for imaging-based CAD diagnosis, given their high reliability in image classification. In what follows, the most prominent and relevant CNNs, along with various efficient architectures previously proposed to classify SPECT MPI scans for the detection of CAD, are presented.
In particular, Berkaya et al. [4] sought to correctly classify SPECT MPI images and diagnose myocardial abnormalities such as ischemia and/or infarction. To do so, the authors developed two classification models, one DL-based and one knowledge-based. The proposed models achieved an accuracy of 94%, a sensitivity of 88%, and a specificity of 100%, results close to those of expert analysis, so they can assist in clinical decisions regarding CAD. Papandrianos et al. [13] explored the ability of CNNs to automatically classify SPECT MPI images in a two-class problem with normal and ischemia as the possible outputs. The outcomes (AUC 93.77% and accuracy 90.2075%) indicate that the proposed model is a promising asset in nuclear cardiology. Furthermore, Papandrianos et al. [8] addressed the same problem, investigating the capabilities of a Red Green Blue (RGB)-CNN model and comparing its performance with transfer learning. The proposed RGB-CNN emerged as an efficient, robust, and straightforward deep neural network (NN) able to detect perfusion abnormalities related to myocardial ischemia on SPECT images in nuclear medicine image analysis. It was also shown to be a model of low complexity with good generalization capabilities compared to state-of-the-art deep NNs. Betancur et al. [14] combined semi-upright and supine polar maps to analyze SPECT MPI abnormalities. A CNN model was implemented and compared against the standard total perfusion deficit (TPD) method for the prediction of obstructive disease. The authors concluded that the DL approach yielded promising results (area under the curve (AUC) per vessel 0.81, AUC per patient 0.77) compared with quantitative methods. In another research study, Betancur et al. [15] explored deep CNNs utilizing MPI images to automatically predict obstructive disease, using raw and quantitative polar maps in stress mode. Although the results seemed adequate (AUC per vessel 0.76, AUC per patient 0.823), further investigation of this task is needed.
Other studies on CAD diagnosis in nuclear medicine have focused on DL methods that categorize polar maps into normal and abnormal. Liu et al. [16] developed a CNN-based DL model to automatically diagnose myocardial perfusion abnormalities in stress MPI count profile maps, classifying each map as normal or abnormal. The DL method was compared against the standard quantitative perfusion defect size method, and the study concluded that DL for stress MPI has the potential to provide notable help in clinical cases. Zahiri et al. [3] implemented a CNN to classify SPECT images for CAD diagnosis, using a dataset of polar maps in stress and rest mode. The extracted results for AUC (0.7562), sensitivity (0.7856), and specificity (0.7434) demonstrate the network’s ability to assist future SPECT MPI applications. Apostolopoulos et al. [17] investigated a CNN for automatically characterizing polar maps acquired with the MPI procedure, aiming to diagnose CAD using both attenuation-corrected and non-corrected images. According to its evaluation results (74.53% accuracy, 75% sensitivity, and 73.43% specificity), the DL model performed well.
Apostolopoulos et al. [18] developed a hybrid CNN-random forest (RF) model utilizing polar map images and clinical data to classify patients’ status into normal and abnormal, and compared the results against nuclear experts’ diagnoses. The proposed model was evaluated with 10-fold cross-validation and achieved 79.15% accuracy, 77.36% sensitivity, and 79.25% specificity, exhibiting overall results similar to those provided by experts. Nakajima et al. [19] developed an ANN for the diagnosis of CAD, in contrast to the conventional quantitative approach; validated on 364 external cases, the ANN outperformed with an AUC of 0.92. Filho et al. [20] compared four ensemble ML algorithms, namely adaptive boosting (AB), gradient boosting (GB), eXtreme gradient boosting (XGB), and RF, for the classification of SPECT images into normal and abnormal. Evaluated with cross-validation, the proposed model achieved an AUC of 0.853, an accuracy of 0.938, and a sensitivity of 0.963. Ciecholewski et al. [21] presented three methodologies, SVM, principal component analysis (PCA), and NN, to diagnose ischemic heart disease. The results showed that SVM achieved the greatest accuracy (92.31%) and specificity (98%), whereas PCA achieved the best sensitivity.
Additionally, various recent studies demonstrate outstanding achievements in explainable artificial intelligence (XAI), which manages to provide transparency for black-box models [22]. In particular, Otaki et al. [23] developed an explainable DL model for the diagnosis of CAD utilizing Grad-CAM, which highlights the regions that correspond to the predicted outcome. To evaluate the model, the authors compared its output against both automated quantitative total perfusion deficit (TPD) and the experts’ diagnoses; the DL model outperformed both. Spier et al. [24] explored the ability of graph convolutional neural networks (GCNNs) to diagnose CAD; GCNNs apply the same convolution operation as CNNs directly to polar maps without the need to re-sample them. The authors developed a model that can provide a second opinion on CAD diagnosis from polar maps. Chen et al. [1] explored 3D gray CZT SPECT images, developing a CNN with Grad-CAM capable of successfully diagnosing CAD; the model achieved an accuracy of 87.64%, a sensitivity of 81.58%, and a specificity of 92.16%. Otaki et al. [25] developed a CNN model for classifying polar maps into normal and abnormal, in comparison with TPD, with clinical data such as sex and BMI added; Grad-CAM was again applied to visualize the regions correlated with the predicted output. The sensitivity of the CNN approach was 82% in men and 71% in women, while TPD for upright and supine acquisitions reached 75% and 73% in men and 71% and 70% in women, respectively. It is worth mentioning that XAI models have not been examined thoroughly in nuclear medicine; thus, this remains a promising and open research field in CAD diagnosis using SPECT MPI images.
According to previously conducted studies and their results, CNNs attain accuracy comparable to that of medical experts and of previously developed approaches [8,10,11]. However, no study so far has performed a three-class categorization into ischemia, infarction, and normal. On that basis, this research work focuses on extracting information from SPECT data classified by an expert reader, towards the automatic classification of obstructive CAD images with three possible outputs: infarction, ischemia, and normal. The innovation lies in the contribution of a predefined RGB-CNN architecture, determined after a thorough exploration of parameters, and in the proposal of pre-trained models, namely VGG-16 and DenseNet-121, for improving CAD detection. The model is evaluated through reliable metrics and k-fold cross-validation.
The contribution of this research is to propose a CNN model and utilize transfer learning in order to develop a model that can automatically classify SPECT images and predict CAD. For that reason, we have carefully selected reliable and robust metrics, such as accuracy, loss, AUC value, and ROC curve [4]. The extracted results demonstrate high consistency, so the produced models can assist in future nuclear imaging applications. The structure of this article is as follows: we first present a compact yet detailed literature review of the research community’s contributions to DL and ML processing of SPECT MPI imaging, exploring the notions of transfer learning, clustering, and classification. Next, the materials and a short description of the available datasets are discussed. In the following section, we illustrate the processing methodology as a predefined sequence of steps, prototyping a directed process imposed by the CNN implementation, validation, and testing. Finally, we provide an in-depth presentation of the outcomes and a discussion of the results of the three-class classification of the available images according to CAD conditions.
3. Results
In this research study, the problem we seek to address is the automatic classification of obstructive CAD images, with three possible outputs: infarction, ischemia, and normal. To that end, various CNN architectures, commonly used for image classification [24,26,27,28,29], were explored to find the ideal one for our dataset while ensuring stability and generalizability. Each examined architecture combination was run at least 10 times to confirm its reliability.
The developed algorithm was run in the Google Colab [21,30] runtime environment, which gives access to a powerful machine with faster GPUs and an increased amount of RAM and disk space, making it suitable for training large-scale ML and DL models. This platform provides free space for uploading data, allowing users to run code entirely in the cloud and thus overcome the computational limitations of local machines. The proposed CNN was developed in Python 3.6, utilizing the TensorFlow 2.6 and Keras 2.4.0 framework packages. The OpenCV library was used to load the dataset, the scikit-learn library for splitting the dataset and computing the results, and Matplotlib for rendering the plots. The main specifications of our PC are: (1) Processor: Intel® Core™ i9-9980HK CPU @ 2.40 GHz; (2) GPU: NVIDIA Quadro RTX 3000; (3) RAM: 32 GB; (4) Operating system: 64-bit, x64-based processor.
To find the best architecture for the provided dataset, a thorough exploration was conducted. On that basis, we examined different values for pixel size (200 × 200, 250 × 250, 300 × 300, and 350 × 350), batch size (8, 16, 32, and 64), the number of filters in the convolutional layers (32-64-128, 16-32-64-128, and 16-32-64-128-256), and the number of nodes in the dense layers (32-32, 64-64, 128-128, and 256-256). Additionally, we examined various numbers of epochs (400 up to 800), while keeping a fixed dropout rate of 0.2 for all simulations, as this value provides adequate results.
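The exploration described above amounts to a grid search over the listed values. A minimal sketch of how the examined combinations could be enumerated is shown below; the function and variable names are ours, not taken from the study's code:

```python
from itertools import product

# Hyperparameter grid matching the exploration described in the text.
pixel_sizes = [200, 250, 300, 350]
batch_sizes = [8, 16, 32, 64]
conv_filters = [(32, 64, 128), (16, 32, 64, 128), (16, 32, 64, 128, 256)]
dense_nodes = [(32, 32), (64, 64), (128, 128), (256, 256)]

def enumerate_configs():
    """Yield every combination examined during the grid search."""
    for px, bs, conv, dense in product(pixel_sizes, batch_sizes,
                                       conv_filters, dense_nodes):
        yield {"pixels": (px, px), "batch": bs,
               "conv": conv, "dense": dense, "dropout": 0.2}

configs = list(enumerate_configs())
# 4 pixel sizes x 4 batch sizes x 3 conv stacks x 4 dense stacks = 192 combos
```

In practice, of course, the paper explores one hyperparameter at a time while keeping the others fixed, which requires far fewer runs than the full grid enumerated here.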
To properly assess the results, robust metrics were selected: accuracy, loss, AUC, ROC curve, precision, recall, sensitivity, and specificity. More specifically, accuracy is the ratio of the number of correct predictions to the total number of predictions. Loss quantifies the classification error by measuring the discrepancy between predicted and true outputs. The AUC value numerically expresses a model’s ability to discriminate between classes; it ranges between 0 and 1, where 1 is ideal, and the ROC curve is the corresponding visual representation of the classifier’s performance. Precision is the ratio of correctly classified positive samples to the total number of samples labeled as positive, while recall is the percentage of actual positive samples that are classified correctly. Sensitivity likewise measures the percentage of positive samples predicted correctly, whereas specificity measures the corresponding percentage of negative samples [8,11,27].
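For a three-class problem such as ours, per-class precision, sensitivity, and specificity are computed one-vs-rest from the confusion counts. A small NumPy sketch follows; the class indexing and the toy label arrays are illustrative, not taken from the study's data:

```python
import numpy as np

def per_class_metrics(y_true, y_pred, n_classes=3):
    """One-vs-rest precision, sensitivity (= recall), and specificity per class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    out = {}
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))  # true positives for class c
        fp = np.sum((y_pred == c) & (y_true != c))  # false positives
        fn = np.sum((y_pred != c) & (y_true == c))  # false negatives
        tn = np.sum((y_pred != c) & (y_true != c))  # true negatives
        out[c] = {
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
            "specificity": tn / (tn + fp) if tn + fp else 0.0,
        }
    accuracy = float(np.mean(y_true == y_pred))
    return accuracy, out

# Tiny illustrative example (0=normal, 1=ischemia, 2=infarction is an assumption)
acc, m = per_class_metrics([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
```

The same quantities can equivalently be obtained from scikit-learn's confusion matrix; this hand-rolled version just makes the definitions above explicit.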
The initial default values for the exploration process were the following: 250 × 250 pixel size, batch size 16, 16-32-64-128 convolutional layers, 128-128 dense layers, and 400 epochs. Firstly, various batch sizes were examined while keeping the rest of the values fixed, to find the best batch size.
Table 1 presents the respective results for each of the examined batch sizes. It emerges that for batch size 32 the overall results are superior to those of the other batch sizes.
Afterwards, taking 32 as the best batch size, we examined different pixel sizes while keeping the values for the convolutional and dense layers fixed. In Figure 4, the 300 × 300 size clearly yields the best results compared with the other configurations, making it the best choice for our dataset.
In the following step, different convolutional layer configurations were examined in order to determine the best combination. Figure 5 visually illustrates the calculated results for the various convolutional layers. The 16-32-64-128 combination performed best in terms of overall metrics, whereas the 32-64-128 and 16-32-64-128-256 combinations did not perform as well. Moreover, we performed further experiments with various numbers of epochs in the range between 400 and 800; however, we observed no improvement in the results above 400 epochs.
The last step deals with the selection of the best number of nodes for the dense layers. Different numbers of nodes were examined, and the produced results are depicted in Figure 6. It is concluded that the best sequence is 128-128 when all metrics are considered.
At this point, various architectures had been explored for our RGB-CNN model, each combination executed for at least 10 runs, so that a robust and reliable model was finally obtained. The final RGB-CNN architecture uses a 300 × 300 pixel size, batch size 32, 16-32-64-128 convolutional layers, and 128-128 dense layers. Table 2 reports the average value of each metric for each class individually, showing that the model distinguishes each class very well.
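Kernel sizes, pooling, and optimizer settings are not listed in the text, so the following Keras sketch fills them with common defaults (3 × 3 kernels, 2 × 2 max pooling, Adam); only the 300 × 300 input, the 16-32-64-128 convolutional stack, the two 128-node dense layers, the 0.2 dropout, and the three-class softmax come from the described architecture:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_rgb_cnn(input_shape=(300, 300, 3), n_classes=3):
    """Sketch of the RGB-CNN; kernel sizes, pooling, and the optimizer
    are our assumptions, not stated in the paper."""
    inputs = tf.keras.Input(shape=input_shape)
    x = inputs
    for filters in (16, 32, 64, 128):            # convolutional stack
        x = layers.Conv2D(filters, (3, 3), activation="relu", padding="same")(x)
        x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Flatten()(x)
    for nodes in (128, 128):                     # dense stack with 0.2 dropout
        x = layers.Dense(nodes, activation="relu")(x)
        x = layers.Dropout(0.2)(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy", tf.keras.metrics.AUC()])
    return model

model = build_rgb_cnn()
```

Training on the augmented dataset with batch size 32 and 400 epochs would then proceed via `model.fit`, which is omitted here.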
In addition to the procedure followed for the determination of the RGB-CNN model, we also utilized transfer learning, a powerful method commonly used in imaging problems: a model pre-trained on various classification problems can perform adequately in related tasks and has strong generalization capabilities [4]. In our research, we used two pre-trained models, Visual Geometry Group (VGG)-16 and DenseNet-121. We selected the architectures for the pre-trained networks following the literature on VGG-16 and DenseNet-121 applications in medical image analysis, as well as our previous works in this domain. Regarding VGG-16, the best architecture entails a 200 × 200 pixel size, batch size 32, a 0.2 dropout rate, 400 epochs, 14 trainable layers, global average pooling, and two fully connected layers with 32 nodes each. VGG-16 has 16 trainable layers, as its name indicates; for our comparative process, we “froze” the last two layers of VGG-16, meaning that the weights of these layers are not updated during training, and added two fully connected layers to adapt the output computation to our dataset. Table 3 shows, through the robust metrics, how well the model distinguishes each class.
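A minimal Keras sketch of this transfer-learning setup is given below. Exactly which two layers were frozen is our assumption (here, the two deepest layers of the base), and `weights=None` is used only to keep the sketch self-contained; in practice `weights="imagenet"` would load the pre-trained weights:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_vgg16_transfer(input_shape=(200, 200, 3), n_classes=3):
    """Sketch of the VGG-16 transfer model: convolutional base, two frozen
    layers, global average pooling, and two 32-node dense layers as in the
    text. The choice of frozen layers is an assumption."""
    base = tf.keras.applications.VGG16(include_top=False, weights=None,
                                       input_shape=input_shape)
    for layer in base.layers[-2:]:
        layer.trainable = False                  # weights not updated in training
    x = layers.GlobalAveragePooling2D()(base.output)
    for nodes in (32, 32):                       # new fully connected head
        x = layers.Dense(nodes, activation="relu")(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return tf.keras.Model(base.input, outputs)

model = build_vgg16_transfer()
```

The DenseNet-121 variant described next would follow the same pattern with `tf.keras.applications.DenseNet121` and a 250 × 250 input.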
DenseNet-121’s best architecture consists of a 250 × 250 pixel size, batch size 32, a 0.2 dropout rate, 400 epochs, 12 trainable layers, global average pooling, and two fully connected layers with 32 nodes each. In Table 4, the metric results demonstrate the efficiency of DenseNet-121 in distinguishing the classes.
Table 5 compares the results of the RGB-CNN and the pre-trained networks (VGG-16, DenseNet-121). The RGB-CNN performed efficiently, VGG-16 achieved adequate metrics, and DenseNet-121 produced the lowest results. Regarding computation time, RGB-CNN and VGG-16 are similar, while DenseNet-121 has the highest calculation time.
Figure 7, Figure 8 and Figure 9 show the accuracy and loss plots for each model, and Figure 10, Figure 11 and Figure 12 show the ROC curves for each model and each class individually. In both cases, it is clear that the RGB-CNN performed best.
Figure 13 compares RGB-CNN, VGG-16, and DenseNet-121 with respect to the produced validation and testing accuracy, AUC, and validation and testing loss for each model. The RGB-CNN extracted the best results, with VGG-16 showing similar validation and testing loss, while DenseNet-121 performed adequately.
To ensure robust and reliable models, we utilized the 10-fold cross-validation technique to further evaluate performance [6]. This method splits the dataset into ten parts, each time using nine parts for training and one for testing; the procedure is repeated until every part has been used for testing. Table 6, Table 7 and Table 8 present the results for RGB-CNN, VGG-16, and DenseNet-121, respectively, while Table 9, Table 10 and Table 11 depict the performance metrics of each CNN model for each possible output. The RGB-CNN outperformed the pre-trained networks, demonstrating great robustness and efficiency.
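The 10-fold split can be sketched with scikit-learn's `KFold`; the index array below is a stand-in for the actual 647 SPECT images, and the shuffle seed is arbitrary:

```python
import numpy as np
from sklearn.model_selection import KFold

n_images = 647                     # size of the dataset used in this study
indices = np.arange(n_images)      # stand-ins for the actual SPECT images

kf = KFold(n_splits=10, shuffle=True, random_state=42)
test_fold_sizes = []
for fold, (train_idx, test_idx) in enumerate(kf.split(indices)):
    # Here the CNN would be trained on train_idx and evaluated on test_idx;
    # the training call itself is omitted in this sketch.
    test_fold_sizes.append(len(test_idx))

# Across the ten folds, every sample is used exactly once for testing.
```

A stratified variant (`StratifiedKFold`) would additionally preserve the class balance in each fold, though the paper's description matches plain k-fold splitting.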
For comparative analysis, Table 12 summarizes the results of each of the best CNN models under 10-fold cross-validation. According to the produced results, the RGB-CNN exhibits strong and robust performance, VGG-16 performed efficiently, whereas DenseNet-121 extracted adequate outcomes.
4. Discussion
In this study, three CNN models were examined to differentiate between infarction, ischemia, and normal perfusion in SPECT MPI images in stress and rest mode. The first DL-based model, the RGB-CNN, was built from scratch, comprising four convolutional layers and two dense layers. The second model, VGG-16, which is well established for image classification tasks, was designed with fourteen trainable layers to prevent overfitting, and the third model, DenseNet-121, was implemented with twelve trainable layers for the same purpose. Transfer learning, a widely used and robust technique for image classification, was applied to the last two models. The provided dataset consists of 647 images; the data are heterogeneous, which is an important factor in terms of generalizability. Since the dataset is considered rather small, data augmentation was employed to increase its size. Furthermore, the dataset was split into three parts, training, validation, and testing, so as to evaluate the model both on data with known outputs, through the validation set, and on unseen data, through the testing set. Additionally, 10-fold cross-validation was performed, and the results demonstrated great stability and reliability for our models.
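The augmentation transforms used are not detailed in the text; the NumPy sketch below grows a small image set with simple label-preserving transforms (flips and 90-degree rotations, chosen purely for illustration — for SPECT polar maps the transforms would have to be chosen to remain clinically meaningful):

```python
import numpy as np

def augment(image, rng):
    """Randomly flip and rotate one image; illustrative transforms only."""
    if rng.random() < 0.5:
        image = np.fliplr(image)
    return np.rot90(image, k=int(rng.integers(0, 4)))

def augment_dataset(images, factor, seed=0):
    """Grow a dataset to `factor` times its size via random augmentation."""
    rng = np.random.default_rng(seed)
    out = list(images)
    for _ in range(factor - 1):
        out.extend(augment(img, rng) for img in images)
    return out

imgs = [np.zeros((300, 300, 3)) for _ in range(4)]
bigger = augment_dataset(imgs, factor=3)   # 4 images -> 12 images
```

In a Keras pipeline the same effect is usually obtained on the fly with preprocessing layers or an image data generator rather than by materializing the enlarged dataset.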
Nuclear experts also performed visual classification of the images, providing the known outputs required for supervised learning. It should be highlighted that we conducted an in-depth exploration process in which various combinations of image sizes, batch sizes, and convolutional and dense layers were examined to determine the best architecture of the DL models for the corresponding dataset. Concerning the selection of batch size, we investigated values between 8 and 64, following the reported literature [31,32]. The overall outcomes pinpoint that small batch sizes (16, 32) can offer reliable results. After a thorough exploration, it emerged that batch size 32 provides the best results for CAD diagnosis, yielding a better training process with a low learning rate, thus avoiding overfitting and achieving higher classification accuracy. These findings are consistent with those reported in the literature.
The clear separation of results and the improvement of the metrics during the experiments are promising indications of the models’ capabilities. In this direction, robust and reliable metrics were employed to evaluate the models’ stability and performance: accuracy, loss, AUC, ROC curve, precision, recall, sensitivity, and specificity, applied to all three models. The results demonstrated that all three models can achieve high accuracy and minimize loss even on a small dataset, which is a key factor for attaining reliability. VGG-16 and RGB-CNN performed better than DenseNet-121, achieving accuracies of 93.33%, 88.54%, and 86.11%, respectively. These results support reliability and show that fine-tuning a model is both important and effective. It turns out that the proposed RGB-CNN model can perform on a par with other robust and widely recognized models.
Due to their generalizability and reliability, as proven through k-fold cross-validation, the proposed models can be deployed by nuclear experts on any SPECT MPI images without further adjustments, yielding sufficient results. However, there are some limitations in this work that need to be considered and addressed in the future. Firstly, only SPECT images were used as input to the proposed models, since CNNs can operate directly on images. Secondly, the size of our dataset is limited for CNN classifiers, so we had to increase it through augmentation to feed our models, thereby extracting more abstract feature maps and further optimizing our approach. Despite these limitations, the proposed models are capable of performing satisfactorily and operating adequately on unseen data.
The clinical implications of employing the proposed methodology in medical imaging include invaluable support towards an instant and automatic clinical diagnosis of SPECT MPI images, which could protect patients from undesirable heart conditions such as infarction. The DL-based approach could potentially serve as a crucial assisting tool for medical experts in their effort to (i) correctly evaluate and diagnose SPECT MPI images of patients suffering from angina or known coronary artery disease, and (ii) instantly deliver proper treatment suggestions. The integration of the proposed RGB-CNN model into everyday clinical practice constitutes a challenge in the medical imaging domain, provided that its enhanced performance is also validated on larger datasets.
A notable aspect reported in recent studies is explainable artificial intelligence (XAI). Even though CNNs can extract spectacular results, it remains unclear how CNN models arrive at a specific output and reach certain decisions. CNNs are considered black boxes, as they do not provide any evidence for their conclusions, so we depend mostly on the extracted metrics to trust their output. CNNs accept images as input, perform computations through their hidden layers, and produce an output in the form of a prediction probability. Therefore, scientists and medical experts fear that CNNs may operate with bias and remain skeptical about their trustworthiness, especially in the medical domain, where each decision is crucial for the patient’s health. To address this persistent distrust, explainable artificial intelligence systems have been developed to provide interpretability and an understanding of the internal computations of CNNs. In future studies, our aim is to thoroughly explore and further develop XAI systems. In view of the above, XAI is a realm that needs to be explored to a greater extent in the future, especially for medical diagnosis, where the predicted output of the proposed models plays such an essential role [33].
To sum up, our three-class classification models are capable of performing well and distinguishing images between infarction, ischemia, and normal cases, with the utilization of 10-fold cross-validation, transfer learning, and data augmentation. All three models can be an integral tool assisting the automatic diagnosis of coronary artery disease and can be clinically deployed with the help of a user interface (UI) platform.