1. Introduction
The number of patients with atrial fibrillation (AF) is increasing annually, and this trend is naturally related to the aging of the population [
1]. In recent years, the aging of patients with AF has brought to light clinical problems that were previously invisible. The European Society of Cardiology (ESC) notes that six main problems are closely associated with AF: mortality, stroke, hospitalization, reduced quality of life, left ventricular dysfunction/heart failure, and cognitive decline/vascular dementia [
2]. Therefore, the early detection and treatment of AF are important to prevent complications. In AF, the interval between episodes gradually shortens over time, and the arrhythmia eventually becomes persistent, long-standing persistent, and then permanent. Thus, AF can be viewed as a disease that progresses through stages [
3]. Catheter ablation has been shown to be effective for paroxysmal atrial fibrillation (PAF). However, its efficacy is less well established for persistent AF and long-standing persistent atrial fibrillation (LSAF), for which the guidelines for non-pharmacological therapy give a Class IIa or Class IIb recommendation [
4]. It is therefore very important to determine which patients with persistent AF will benefit from catheter ablation, taking into account its outcomes and possible complications in this group [
5]. However, postoperative recurrence is difficult to predict, and the indication for catheter ablation is currently determined from the operator’s empirical judgment and the patient’s self-reported AF duration.
AF recurrence after catheter ablation therapy and its predictors have been the subject of many studies [
6,
7,
8,
9]. Njoku et al. showed, in a meta-analysis comparing left atrial volume between patients with and without recurrent AF after radiofrequency catheter ablation, that left atrial volume predicts AF recurrence [
6]. Other factors, such as the duration of AF, structural changes in the left atrium and pulmonary veins, and age, may also affect the outcome of catheter ablation therapy.
In recent years, many methods have been reported to classify AF types [
10,
11]. Ortigosa et al. proposed a method that classifies AF subtypes by extracting features from ECG waveforms with a general Fourier time-frequency transform and classifying them with a support vector machine [
8]. The classification accuracy on the test data was approximately 77%. However, ECG-based classification is limited because the waveform characteristics can change substantially when other diseases coexist.
Therefore, we attempted to classify AF types by extracting image features, such as left atrial diameter and structural changes in pulmonary veins due to persistent AF, from contrast-enhanced computed tomography (CT) images using convolutional neural networks (CNNs), which have been applied in medical practice in recent years [
12,
13,
14,
15,
16,
17,
18]. Although methods that classify the AF type from electrocardiogram waveforms have been reported, no method using contrast-enhanced CT images has been proposed. Furthermore, although the relationship between left atrial volume and AF type has been studied, there are no reports of applying it to the classification of AF type. In this study, we propose a clinically novel method of classifying paroxysmal AF and long-standing persistent AF on contrast-enhanced CT images using conventional CNN models, focusing on structural remodeling of the left atrium. The purpose of this study is to enable a standardized assessment using a deep learning approach that considers the information physicians use to evaluate structural remodeling of the left atrium, including left atrial enlargement, poor contrast, structural changes in the pulmonary veins, the presence of thrombi in the left atrium, and coronary artery calcification. For this purpose, contrast-enhanced CT has an advantage over dynamic modalities in that it accurately captures the shape of the left atrium and its surrounding structures. Furthermore, we hypothesize that a method based on contrast-enhanced CT images will enable standardized evaluation with reduced subjective bias, even when the ECG fails to capture intermittent episodes, as in paroxysmal AF, or when concomitant diseases affect the ECG waveform. If such a system is applied in clinical workflows, it will be possible to evaluate the load on the atrial muscle when AF is first detected and, if signs of long-term persistence are confirmed, to begin treatment early.
In this study, we also compared the results of the CNN classification with physicians’ clinical judgment by surveying cardiologists regarding AF type classification. Physicians estimate the type of AF from factors such as the size of the left atrium, enlargement of the pulmonary veins, thrombus formation in the left atrial appendage, and fibrosis of the atrial septum. Focusing on these features, the physicians viewed the same images as those input to the CNN and predicted the corresponding disease type.
2. Materials and Methods
2.1. Outline
In this study, target slices were selected from contrast-enhanced CT images. The number of images was increased using data augmentation, and the images were then input into a CNN model. The input images were classified into two classes, PAF and LSAF. Saliency maps generated with Score-CAM, which highlight the pixels that contributed to the classification result according to their importance, were used to compare what each model focused on in the image when making its judgment. Persistent AF was excluded because its duration varies widely, from 7 days to less than 1 year, which makes it difficult to identify accurately from the left atrial shape. This study was conducted with the approval of the ethics committee of the first author’s institution (approval number HM22-095).
2.2. Image Dataset
This study included 60 patients with AF who underwent contrast-enhanced CT (CE-CT) at Fujita Health University Bantane Hospital between May 2021 and July 2022. During this period, 162 contrast-enhanced CT examinations were performed, 116 in patients with PAF and 46 in patients with LSAF. From these, 30 patients of each disease type were randomly selected, so that PAF and LSAF each accounted for half of the study population. The only patients excluded were those who could not undergo contrast-enhanced CT because of contrast medium allergy or impaired renal function. The patients’ disease types were diagnosed as defined in the guidelines [
4]. Specifically, PAF was defined as AF that returns to sinus rhythm within 7 days of onset, and LSAF was defined as AF that persists beyond 1 year. Basic patient information is shown in
Table 1. An Aquilion ONE CT system (Canon Medical Systems, Inc., Tochigi, Japan) was used to obtain the images. The details of the imaging protocol are shown in
Table 2. We used transaxial images with a matrix size of 512 × 512 pixels and a pixel size of 0.625 mm. The images were stored in DICOM format, and all images were converted to 8-bit PNG images with a window level of 30 HU and a window width of 1000 HU.
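The conversion from CT values to 8-bit gray levels can be sketched as follows. This is a minimal illustration that assumes a linear window mapping and NumPy; the function name `apply_window` is hypothetical, and reading the DICOM file and writing the PNG file are not shown because the paper does not state which tools were used.

```python
import numpy as np


def apply_window(hu, level=30.0, width=1000.0):
    """Map CT values (HU) to 8-bit gray levels with a linear window.

    Values below (level - width / 2) become 0 and values above
    (level + width / 2) become 255. The linear mapping is an assumption;
    the paper only states the window level and width.
    """
    low = level - width / 2.0
    high = level + width / 2.0
    clipped = np.clip(np.asarray(hu, dtype=np.float64), low, high)
    scaled = (clipped - low) / (high - low) * 255.0
    return np.round(scaled).astype(np.uint8)


# Example: air, the lower window edge, the window center, and dense bone.
example = apply_window(np.array([-1000.0, -470.0, 30.0, 2000.0]))
```

With the default setting (WL = 30 HU, WW = 1000 HU), the window spans −470 HU to 530 HU, so everything outside that range is saturated to black or white.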
2.3. Atrial Fibrillation Type Classification Using Contrast-Enhanced CT Images
The flow of this study is shown in
Figure 1, and the details of each process are described below.
2.3.1. Image Pre-Processing
The slice showing the largest left atrium in the contrast-enhanced CT images was selected together with the slices located 5 and 10 mm above and below it, so that five images per patient were used for analysis. If the bed was depicted in an image, it was removed by manually setting the CT value of the bed region to −1000 HU.
2.3.2. Data Augmentation
Data augmentation is a method of increasing the amount of training data by transforming the images. For example, variations can be created by rotating, flipping, shifting, scaling, or distorting an image, adjusting its brightness and contrast, or adding noise. In this study, the number of images was increased ninefold through data augmentation [
19]. CT examinations are usually performed in the supine position; however, in some facilities, the patient is positioned so that the heart, which lies on the left side of the body, is centered in the field of view (FOV). In such cases, the curvature of the bed may cause the body to rotate by about 10°. To simulate this, each image was rotated by −10° and +10° so that the tilt of the heart matched that observed in actual CT images, which tripled the number of images. In addition, because the density of the contrast agent varies from case to case in contrast-enhanced CT, we augmented the pixel values to make the model robust to such changes. The CT values of the left atrium were examined across the entire dataset, and the window level (WL) and window width (WW) were adjusted so that the CT values after augmentation fell within the range of real CT images. As a result, in addition to the initial setting of WL = 30 HU, WW = 1000 HU, two variations (WL = −50 HU, WW = 950 HU and WL = 160 HU, WW = 1500 HU) were added, which increased the number of images a further threefold. An example of the created images is shown in
Figure 2.
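The ninefold increase can be illustrated by enumerating the augmentation settings. The sketch below assumes that the three rotation angles (including the unrotated image) are fully combined with the three window settings, which is inferred from the stated factor of nine rather than confirmed by the text; the function and key names are hypothetical, and the image transformations themselves are not implemented here.

```python
from itertools import product

# Rotation angles in degrees; 0 is the original, unrotated image.
ROTATIONS_DEG = (-10, 0, 10)

# (window level, window width) pairs in HU, as given in the text.
WINDOWS_HU = ((30, 1000), (-50, 950), (160, 1500))


def augmentation_settings():
    """Return every (rotation, window) combination as a list of dicts.

    Assumption: rotations and windows are combined as a full cross
    product, giving 3 x 3 = 9 variants per original image.
    """
    return [
        {"angle_deg": angle, "window_level": wl, "window_width": ww}
        for angle, (wl, ww) in product(ROTATIONS_DEG, WINDOWS_HU)
    ]


settings = augmentation_settings()
images_per_patient = 5 * len(settings)  # five slices per patient
```

Under this assumption, each patient contributes 45 images (five slices, nine variants each), one of which per slice is the original image.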
2.3.3. Classification by CNN
In this study, we used six network models (VGG16, VGG19, ResNet50, DenseNet121, DenseNet169, and DenseNet201). These networks were trained on 1.2 million images across 1000 categories in the ImageNet database [
20,
21,
22]. To adapt these networks to PAF and LSAF classification, we removed the fully connected layers of each pretrained network and replaced them with three new fully connected layers (the last being the output layer) with 1024, 256, and 2 units, respectively. Fine-tuning was employed in this study. Fine-tuning is a form of transfer learning in which a network pretrained on a large dataset provides the initial parameters, and the network is then retrained on a different dataset for a different target task. It makes it possible to learn accurate models from small datasets by adjusting the pretrained CNN. Here, the weights of the convolutional layers were initialized with the pretrained weights, and both the convolutional and fully connected layers were retrained on the CT images. The five continuous output values that the CNN produced for a patient’s five slices were averaged, and this average was used as the patient-level score. A fixed cutoff value of 0.5 was applied to this score.
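The patient-level decision described above can be sketched in a few lines. This illustration assumes that each slice output is the network's probability for the LSAF class and that a mean at or above the cutoff is labeled LSAF; neither detail is stated in the text, and the function name `classify_patient` is hypothetical.

```python
def classify_patient(slice_scores, cutoff=0.5):
    """Average per-slice CNN outputs and apply a fixed cutoff.

    slice_scores: per-slice outputs for one patient, assumed here to be
    the predicted probability of the LSAF class.
    Returns (mean_score, label).
    """
    if not slice_scores:
        raise ValueError("at least one slice score is required")
    mean_score = sum(slice_scores) / len(slice_scores)
    # Assumption: scores at or above the cutoff are labeled LSAF.
    label = "LSAF" if mean_score >= cutoff else "PAF"
    return mean_score, label


# Hypothetical outputs for the five slices of one patient.
score, label = classify_patient([0.9, 0.8, 0.4, 0.7, 0.6])
```

Averaging means that a single slice with a low score, such as the third value in the example, does not by itself change the patient-level result.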
For CNN training, we used a learning rate of 0.000001, a batch size of eight, and Adam as the optimization algorithm, with early stopping determining the number of epochs (maximum: 100). Categorical cross-entropy was used as the loss function. The training environment consisted of Windows 10 Pro, an AMD Ryzen 7 2700X CPU, and an NVIDIA TITAN RTX GPU.
2.4. Saliency Map
In this study, we used Score-CAM, a class activation mapping (CAM) method, to visualize the points of interest by highlighting the pixels that contributed to the classification results according to their importance. Score-CAM eliminates the dependence on gradients by obtaining the weight of each activation map from its forward-pass score on the target class; the final result is a linear combination of the weights and activation maps [
23]. Specifically, each feature map obtained when the trained CNN infers a given image is enlarged to the size of the input, normalized to values between 0 and 1, and multiplied by the input image; the resulting masked images are fed to the CNN, and the scores obtained determine the importance of each feature map. The weighted maps are combined into a heatmap, which is displayed overlaid on the image. This heatmap is referred to as a saliency map. The input and saliency map images are shown in
Figure 3.
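The Score-CAM procedure can be sketched with NumPy as follows. This is a simplified illustration, not the authors' implementation: it uses nearest-neighbour upsampling instead of interpolation, takes the trained network as an arbitrary scoring function `score_fn` (a hypothetical stand-in that returns the target-class score for an image), and works on a single-channel image.

```python
import numpy as np


def score_cam(image, activations, score_fn):
    """Minimal Score-CAM sketch.

    image:       2-D array (H, W), the network input.
    activations: 3-D array (K, h, w), feature maps of one conv layer.
    score_fn:    callable mapping an (H, W) image to the target-class
                 score (stand-in for a forward pass of the trained CNN).
    Returns a saliency map of shape (H, W) scaled to [0, 1].
    """
    height, width = image.shape
    masks = []
    for act in activations:
        # Enlarge the feature map to the input size (nearest neighbour).
        rows = np.arange(height) * act.shape[0] // height
        cols = np.arange(width) * act.shape[1] // width
        up = act[np.ix_(rows, cols)].astype(np.float64)
        # Normalize to [0, 1]; a constant map carries no information.
        span = up.max() - up.min()
        masks.append((up - up.min()) / span if span > 0 else np.zeros_like(up))
    masks = np.stack(masks)

    # Score each masked input with a forward pass, then softmax the scores.
    scores = np.array([score_fn(image * mask) for mask in masks])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()

    # Linear combination of maps, followed by ReLU and rescaling.
    cam = np.maximum((weights[:, None, None] * masks).sum(axis=0), 0.0)
    return cam / cam.max() if cam.max() > 0 else cam
```

Because the weights come only from forward-pass scores, no gradients are needed, which is the property the text attributes to Score-CAM.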
2.5. Validation and Evaluation Metrics
In this study, cross-validation was used to assess the generalizability of the model. We chose 10-fold cross-validation, a relatively large number of folds, to obtain a more reliable estimate of generalization performance and to reduce bias. The dataset is divided into 10 subsets; in each round, 70% of the data are used for training, 20% for validation, and 10% for testing.
Figure 4 shows a schematic of the 10-fold cross-validation method.
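One way to realize the 70%/20%/10% division is sketched below. It assumes that the split is made at the patient level, that in each round one subset is the test set and the next two subsets form the validation set, and that no stratification by AF type is applied; these details are assumptions for illustration, and `ten_fold_splits` is a hypothetical name.

```python
import random


def ten_fold_splits(patient_ids, n_folds=10, n_val_folds=2, seed=0):
    """Yield (train, validation, test) patient lists for each round.

    Assumptions: splitting by patient, validation made of the folds
    that follow the test fold, and no stratification by class.
    """
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        val_idx = {(k + j) % n_folds for j in range(1, n_val_folds + 1)}
        test = list(folds[k])
        val = [p for i in sorted(val_idx) for p in folds[i]]
        train = [
            p
            for i in range(n_folds)
            if i != k and i not in val_idx
            for p in folds[i]
        ]
        yield train, val, test


# With 60 patients, each round uses 42 / 12 / 6 patients.
rounds = list(ten_fold_splits(range(60)))
```

Splitting by patient keeps all five slices of a patient, together with their augmented copies, in the same subset, which avoids leakage between training and test data.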
Using this method, the prediction results were compared in terms of patient-level accuracy, sensitivity, specificity, and precision. The final classification performance was evaluated using the overall accuracy of the CNN classification results, calculated with Equation (1), where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (1)$$
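The four metrics named above follow directly from the confusion-matrix counts. The sketch below uses the standard definitions; the function name and the counts in the example are hypothetical and are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, sensitivity, specificity, and precision.

    A metric whose denominator is zero is returned as NaN.
    """

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),  # true positive fraction
        "specificity": ratio(tn, tn + fp),  # 1 - false positive fraction
        "precision": ratio(tp, tp + fp),
    }


# Hypothetical counts for 60 patients (30 positive, 30 negative).
metrics = classification_metrics(tp=25, tn=24, fp=6, fn=5)
```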
The ROC curve represents the relationship between the true positive fraction, TPF = TP/(TP + FN), and the false positive fraction, FPF = FP/(FP + TN). It was created by plotting the FPF on the horizontal axis and the TPF on the vertical axis while continuously varying the cutoff value that separates positive from negative results. To smooth the ROC curve, the FPF and TPF were plotted on normal probability scales on both axes, where they fall on an approximately straight line, and the smoothed curve was drawn from the fitted relationship between the two.
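Varying the cutoff to trace the ROC curve can be sketched as follows. Note that this produces the empirical (stepwise) curve and a trapezoidal AUC; it does not reproduce the smoothing on normal probability scales described above. The function names and the scores in the example are hypothetical.

```python
def roc_points(scores, labels):
    """Return (FPF, TPF) points obtained by sweeping the cutoff.

    scores: continuous classifier outputs, one per patient.
    labels: 1 for positive, 0 for negative.
    A case is called positive when its score is >= the cutoff.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical patient-level scores and true labels.
pts = roc_points([0.9, 0.8, 0.7, 0.6, 0.4, 0.2], [1, 1, 0, 1, 0, 0])
area = auc(pts)
```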
Each CNN model was trained and evaluated three times, and the median value and standard deviation were used as the final classification result. The slice with the largest left atrium and the two slices above and two below it were used for training and evaluation so that the left atrium could be evaluated continuously along the body axis. In addition, the number of training images was increased by data augmentation. To demonstrate the effectiveness of these choices, we performed two additional validations: one using only the central slice for training and evaluation (Additional Study 1) and one without data augmentation (Additional Study 2).
2.6. Classification by Physicians
In this study, we administered a questionnaire to physicians asking them to classify the AF type from only the same five images per patient that were input to the CNN, and their correct response rate and points of focus were compared with those of the CNN classification.
2.6.1. Participants
A questionnaire survey was conducted among physicians in the Department of Cardiovascular Medicine at Fujita Health University Bantane Hospital, and responses were obtained from 18 physicians. In this survey, we asked the physicians to evaluate the type of AF in terms of structural changes around the left atrium. The purpose of this questionnaire was to compare the classification results of this study with the physicians’ clinical judgments.
2.6.2. Questionnaire Items
Questions included: (1) years of experience as a physician, (2) specialty, (3) number of catheter ablation procedures performed per year, (4) whether preoperative CT imaging could predict the efficacy of catheter ablation, and (5) type classification of atrial fibrillation (20 cases) and the basis for decision.
Items (3) and (4) were optional and were answered only by physicians who perform catheter ablation. For AF type classification (5), 10 cases of PAF and 10 cases of LSAF were randomly selected from the cases used in the CNN classification, and the responses were recorded on a 6-point scale. In addition, the basis for each classification was asked with an open-ended question (e.g., “Please tell us the reason why you answered that way”). This question was intended to allow the points of interest of the CNN to be compared with those of the physicians.
4. Discussion
4.1. Comparison of CNN Models
In this study, six CNN models were evaluated on their performance in classifying AF types. ResNet50 performed best in terms of overall accuracy, followed by VGG19. One possible reason these networks outperformed DenseNet121, DenseNet169, and DenseNet201 is that they have fewer layers, which may have favored the extraction of features from localized regions. Long-term persistence of AF causes structural remodeling, such as changes in left atrial shape and atrial appendage enlargement, and these localized changes likely affected the results. ResNet50 and VGG19 may therefore have focused on these localized regions for classification. ResNet50 may have achieved the best overall accuracy because it learns residual functions and applies batch normalization in each residual block. We hypothesize that this resulted in stable training without the vanishing gradient problem.
In addition,
Figure 6 shows a comparison of the classification accuracy between the proposed method and Additional Studies 1 and 2. In most cases, the proposed method performed better than both. The accuracy was likely higher than in Additional Study 1 because the proposed method uses five slices for training (the slice with the most enlarged left atrium and those 5 and 10 mm above and below it); information along the body axis can therefore be analyzed in addition to in-plane information, and classification is more accurate than when only one slice is used. The accuracy was likely higher than in Additional Study 2 because data augmentation added variations that simulate the differing body inclinations and contrast-dependent CT values, which made the model robust to such imaging-related effects. Furthermore, data augmentation increased the number of training images ninefold, which presumably made training more efficient.
4.2. Insights from Saliency Map in CNN Classifications
Score-CAM was used to output a color map showing the pixels contributing to the CNN classification results. In the heatmap output for the correct classification in
Figure 8, the left atrium and pulmonary veins tended to attract more CNN attention. In contrast, in the heatmaps of incorrectly classified patients, attention was often focused on structures other than the heart, as shown in
Figure 10. Focusing on the left atrium, PAF cases were misclassified when they showed the main findings of LSAF, including an enlarged left atrium, loss of the comb-like pectinate muscle structure, and a large left atrium rounded in the anteroposterior direction. Conversely, LSAF cases tended to be misclassified when the left atrium was not enlarged, especially when its anteroposterior diameter was short. These findings suggest that the CNN focuses on the shape of the left atrium and its surrounding structures and that its classification is consistent with the known findings of LSAF.
4.3. Comparison with Physician’s Results
In the physicians’ descriptions of the basis for their judgments, enlargement of the left atrium was cited as a feature of LSAF in many cases. In the correctly classified cases shown in
Figure 7, (a) the PAF case has a small, flat left atrium, whereas (b) the LSAF case has a large left atrium that is rounded in the anteroposterior direction. Because the heatmap also shows that the left atrial region attracts the most attention, the CNN model appears to classify patients using criteria similar to those of physicians. The cases in which the CNN model and the averaged physician responses differed are shown in
Figure 13. Case (a) was LSAF, but the left atrium was relatively small (left), and there was no loss of the pectinate muscle structure (right). The CNN model was able to classify such cases. However, as shown in (b), it misclassified a case in which the typical LSAF finding of left atrial enlargement was present. A possible reason is that, because the entire CT image was used as the input, information from outside the left atrial region may have led to misclassification. This problem could be mitigated by increasing the variation with more training data and by narrowing the field of view to the left atrial region alone.
4.4. Comparison with Previous Studies
The results of this study showed a higher accuracy than those of the study by Ortigosa et al. using ECG (classification accuracy rate 77.1%) [
10]. Furthermore, our method has the advantage that it can classify the AF type by assessing structural remodeling of the left atrium even when other diseases that affect the ECG waveform coexist. AF is usually detected using an ECG, but a limitation of the ECG is that the time at which an episode is first detected is taken to be the time of onset. The advantage of using contrast-enhanced CT images is that the state of the atrium can be evaluated objectively regardless of the disease type. By evaluating the load on the atrial muscle when AF is first discovered and checking for findings of long-term persistence, it may be possible to start treatment at an earlier stage.
4.5. Practical Applications in Clinical Settings
We hypothesize that by using deep learning to classify AF types from CT images, this study will facilitate a standardized assessment of structural remodeling of the left atrium, which was originally determined subjectively by physicians, thereby reducing subjective bias. By integrating these systems into clinical workflows, it will become possible to evaluate the strain on the atrial muscle at the initial detection of AF. Additionally, if signs of long-term persistence are confirmed, early treatment can be initiated. This approach could potentially reduce unnecessary catheter ablation procedures, allow for more tailored treatment recommendations, and decrease healthcare costs. Furthermore, computational resources and processing time need to be discussed for practical application. Although model training requires substantial hardware resources and prolonged processing time (2–5 h), we believe that once the model is trained, the prediction process can be completed in under one minute, making it sufficiently feasible for clinical use because of the reduced hardware requirements for inference.
4.6. Limitation of This Study
This study has two main limitations. The first is that the dataset is small and comes from a single facility. Furthermore, potential confounding factors, such as patient comorbidities, were not examined. When the dataset is enlarged and external validation is performed in the future, comorbidities should be included in the analysis. In addition, although contrast-enhanced CT depicts the left atrium more clearly than non-contrast (plain) CT, patients who cannot receive contrast media and variations in contrast protocols and image quality among facilities remain challenges. We hypothesize that these challenges can be addressed by using plain CT images or by preparing a dataset that includes images from other facilities and applying data augmentation, as in this study. The second limitation is that the classification does not include persistent AF, so AF progression cannot be evaluated continuously. Because the duration of persistent AF is defined as ranging from 7 days to less than 1 year, it is difficult to identify accurately from left atrial geometry. Therefore, persistent AF was excluded, and only PAF and LSAF were classified in this study; these two types differ markedly in ablation outcomes and can be evaluated for structural remodeling from imaging features. In the future, a method that evaluates AF types continuously should be developed by adding cases of persistent AF. The use of left atrial volume, dynamic modality information, additional machine learning models, and natural language processing models is also possible and will be explored.
5. Conclusions
Catheter ablation therapy is a treatment for AF; however, its efficacy is not well established because of the high recurrence rate in patients with persistent AF. In this study, we attempted to classify AF types using a convolutional neural network based on features obtained from contrast-enhanced CT images. Among the CNN models evaluated, ResNet50 showed the best performance in terms of overall accuracy and AUC. The heatmap outputs and the survey of physicians’ judgment criteria indicated that both the CNN and the physicians tended to focus on the shape of the left atrium, suggesting that this method can classify AF types more accurately than physicians while using criteria similar to theirs. In the future, we plan to address the challenges of this study by using plain CT images, preparing a dataset that includes images from other facilities, and conducting continuous evaluations that include persistent AF. Once these issues are resolved, the method could potentially be applied to predicting the efficacy of catheter ablation therapy in patients with AF from contrast-enhanced CT images, with the goal of providing quality information that helps patients choose among treatment options.