1. Introduction
Heart diseases are a leading cause of death worldwide. They include arrhythmias, mitral valve prolapse, coronary artery disease, congenital heart disease, congestive heart failure, and many others [1,2]. Various traditional approaches such as blood tests, chest X-rays, and ECGs are used to detect such diseases [3]. The ECG is a widely used approach and is recognized as the most effective means of detecting heart problems in the current era [4]. It is a painless procedure for monitoring heart health and is used to detect various heart conditions, including arrhythmias and arterial blockages that cause chest pain or even lead to a heart attack [1,5]. Precise diagnosis of irregular heartbeats through ECG analysis can contribute significantly to the early detection of cardiac illness. Extracting the relevant and significant information from ECG signals using computer systems poses a considerable challenge. The automatic identification and categorization of cardiac abnormalities has the potential to assist clinicians in diagnosing a growing number of ECGs [6,7]. However, accomplishing this task is highly challenging.
Although various machine learning algorithms have been developed over the past decades for classifying cardiac abnormalities, their testing has often been limited to small or homogeneous datasets [8,9,10,11]. To address this issue, the PhysioNet/CinC Challenge 2020 provided a substantial collection of datasets from diverse sources worldwide. This extensive dataset offers a valuable opportunity to develop automatic systems that are capable of effectively classifying cardiac abnormalities. Previously, several ML techniques have been employed to classify heart diseases based on raw ECG data, reflecting the growing interest in the automated detection of abnormal behavior in ECG signals. Recently, DL has emerged as a valuable tool in biomedical and healthcare systems by learning intricate patterns and features from signal data [11,12,13]. In DL, various deep neural networks such as recurrent neural networks (RNNs) [14], deep residual networks (ResNets) [15], transformers, and attention networks have been developed to classify diseases. ResNet is one of the most popular networks and handles the complexity of large datasets efficiently with its deep, residual-block-based architecture.
Despite the many advancements in the field of machine learning (ML) and DL for ECG signal analysis, existing solutions are limited in their ability to accurately detect cardiac abnormalities, especially when many diagnostic categories are involved. These limitations highlight the need for more robust and generalized DL models that can effectively handle large and diverse datasets. The challenge lies in developing models that not only improve detection accuracy but also enhance computational efficiency and generalizability. To address the shortcomings of existing methods, this work presents the HRIDM (Hybrid Residual/Inception-based Deeper Model), a novel deep learning model. This work aims to enhance the detection efficiency of heart disease using a large 12-lead ECG dataset through DL neural networks. Specifically, this study aims to:
Develop a novel deep learning model for ECG abnormality detection that considerably enhances performance and outperforms prior state-of-the-art (SOTA) models;
Utilize the annotated dataset provided by the PhysioNet/CinC Challenge 2020 to accurately classify the 27 different cardiac abnormalities, demonstrating the model’s capability to handle complex and large-scale datasets effectively;
Train the proposed model on a large 12-lead ECG dataset from over 10,000 patients, to improve its generalizability and robustness;
Illustrate the computational efficiency of the proposed model by achieving high accuracy with limited resources, such as minimal training epochs and no GPU support;
Benchmark the HRIDM against several SOTA models, including Inception, LeNet-5, AlexNet, VGG-16, ResNet-50, and LSTM, to validate its performance.
2. Literature Review
Numerous studies have been carried out on the application of ML/DL for the analysis of ECG signals [3,11] and the detection of cardiac arrhythmias [16,17]. In [18], the authors proposed a Residual CNN-GRU-based deep learning model with an attention mechanism for classifying cardiac disorders. In this work, they used 24 of the 27 groups for classification. The proposed approach attained a test accuracy of 12.2% and obtained 30th position among 41 teams in the official ranking. The authors of [19] introduced a modified ResNet model that includes numerous basic blocks and four modified residual blocks. The updated model is a 1D-CNN that supports attention across feature maps. The authors fine-tuned the pre-trained model, and their proposed algorithm attained a test accuracy of 20.8%. The authors of [20] presented a 1D-CNN with global skip connections to classify ECG signals from 12-lead ECG recordings into numerous classes. The authors also utilized several preprocessing and learning methods, such as a customized loss function, Bayesian threshold optimization, and a dedicated classification layer. Their technique produced an accuracy of 20.2% on the test dataset. The authors of [21] introduced an SE-ResNet model with 34 layers to classify cardiac arrhythmias. They assigned different weights to different classes based on their similarity and utilized these weights in metric calculations. The DL model attained a validation accuracy of 65.3% and a test accuracy of 35.9%. In [22], the authors presented a hybrid Recurrent Convolutional Neural Network (CRNN) with 49 1D convolutional layers, 16 skip connections, and one Bi-LSTM layer to detect heart abnormalities using 12-lead ECG signals. Using the proposed model with 10-fold cross-validation and without preprocessing, the authors achieved 62.3% validation accuracy and 38.2% test accuracy. The authors of [23] utilized an SE-ECGNet to detect arrhythmias from 12-lead ECG recordings. In their model design, they used squeeze-and-excitation networks in each model path and an attention mechanism that learns feature weights corresponding to the loss. In this study, the authors achieved 64.0% validation accuracy and 41.1% test accuracy. In [24], the authors presented a deep 1D-CNN model consisting of exponentially dilated causal convolutions. Their proposed model achieved a challenge score of 0.565 ± 0.005 and an AU-ROC of 0.939 ± 0.004 using 10-fold cross-validation, and it achieved 41.7% accuracy on the test dataset. In [25], the authors introduced an 18-layer residual CNN for arrhythmia classification with four stages for abnormality classification. The authors also utilized preprocessing techniques, 10-fold cross-validation, and post-training procedure refinement. The proposed approach achieved a 69.5% validation score and 42% test accuracy. The authors of [26] utilized a hybrid DL model integrating a CNN with an LSTM along with adversarial domain generalization for the detection of arrhythmias from 12-lead ECG signals. In this study, the proposed model obtained 43.7% accuracy on the test dataset. In [27], the authors presented a method for ECG signal classification that combines the scattering transform with deep residual networks (ResNets) and obtained 48% test accuracy. In [28], the authors designed an SE-ResNet-based DL model, a variant of the ResNet architecture. In their design, SE blocks are utilized to learn from the first 10 and 30 s segments of ECG signals. The authors also used an external open-source dataset for model validation. To correct and verify the output, they developed a rule-based bradycardia model based on clinical knowledge. With this approach, the authors detected heart arrhythmias from 12-lead ECG recordings and obtained a validation accuracy of 68.2% and a testing accuracy of 51.4%. The authors of [29] introduced a DL method for the classification of arrhythmias using 12-lead ECG signals, presenting a modified ResNet with SE blocks. Additionally, they applied zero padding to extend the signals to 4096 samples and downsampled them to 257 Hz. Using a custom weighted accuracy measure and 5-fold cross-validation, they obtained a validation accuracy of 68.4% and a test accuracy of 52%.
In [30], the authors proposed a novel approach using a Wide and Deep Transformer Neural Network for the detection of cardiac abnormalities from 12-lead ECG recordings. In their methodology, they combine two kinds of features: transformer neural network features and random-forest-based handcrafted ECG features. The approach achieved an impressive accuracy of 58.7% on the validation dataset and 53.3% on the test dataset.
3. Materials and Methods
In this paper, we developed a new model (HRIDM) for classifying ECG signal abnormalities that integrates the strength of an inception network with residual blocks. The deep inception network can learn complex features from the dataset, whereas residual blocks improve model accuracy by resolving the problem of vanishing gradients. We validated the proposed model using a dataset of 12-lead ECG signals from patients with a range of cardiac disorders. We evaluated our model’s performance against a variety of SOTA models, including LeNet, AlexNet, VGG, LSTM, ResNet, and Inception. All the models were trained on the PhysioNet/CinC Challenge 2020 dataset and tested using the independent test dataset provided by the organizers. The reported accuracy was achieved through our own testing and validated using the validation techniques provided by the PhysioNet/CinC Challenge 2020 organizers. We observed that our model outperformed DNNs and the models outlined in previous research. The improved outcomes indicate that the proposed model is a novel and promising approach to classifying ECG data in order to identify cardiac anomalies.
3.1. Datasets
The study's dataset, which includes recordings, diagnostic data, and demographic information, was collected from several open-source databases that are freely available to download (Table 1). Five different sources were used to generate this large dataset. All the datasets contain 12-lead ECG recordings with sampling frequencies ranging from 257 Hz to 1 kHz. The datasets also include demographic information such as age, sex, and type of diagnosis. There are 27 ECG diagnosis classes, which are presented along with their SNOMED CT codes (Systematized Nomenclature of Medicine Clinical Terms). The following subsections detail the specific sources comprising the dataset.
CPSC Database (CPSC2018): The initial source of the ECG data is the China Physiological Signal Challenge 2018 [31], which contains 13,256 ECG recordings from 9458 patients;
INCART Database: The second source of the ECG data is the 12-lead ECG arrhythmias dataset, an open-source, publicly available dataset from the St. Petersburg Institute of Cardiological Technics (INCART), St. Petersburg, Russia [32]. This dataset has only 74 recordings, contributed by 32 patients;
PTB and PTB-XL Database: The third dataset is a combination of two databases (PTB and PTB-XL), which contains 22,353 ECG recordings and was contributed by 19,175 patients;
Georgia 12-lead ECG Challenge (G12EC) Database: The fourth ECG dataset is also a 12-lead ECG dataset, made available by Emory University, Atlanta, Georgia, USA [32]. It was collected from 15,742 patients and contains 20,678 ECG recordings;
Undisclosed: This dataset was used only for testing model performance in the challenge. Its source is an undisclosed American institution, and it is completely different from the other datasets. It has never been posted or disclosed publicly and will not be disclosed in the future either. It contains 10,000 recordings, the number of patients is unknown, and no training or validation data were drawn from it.
Detailed information about the datasets is provided in Appendix A, specifically in Table A2. This table includes the number of recordings, mean duration of recordings, mean age of patients, sex of patients, and the sampling frequency for each dataset included in the PhysioNet/CinC Challenge 2020 dataset [33].
The ECG dataset used in this study was obtained from the PhysioNet/CinC Challenge 2020 database [32,33]. The dataset consists of 66,361 ECG recordings, of which 43,101 are for training, 6630 for validation, and 16,630 for testing. To train the model efficiently, we utilized the 43,101 training recordings and split them into training and validation sets in a 90:10 ratio, resulting in 38,790 training samples and 4311 validation samples. To ensure rigorous model evaluation and prevent data leakage, we opted to partition the provided training dataset (excluding the designated validation set of 6630 recordings) into separate training and validation subsets. This approach allowed for robust model training and hyperparameter tuning without compromising evaluation integrity. For testing, we used an undisclosed hidden dataset of 10,000 recordings, which has different sampling frequencies.
Figure 1 shows the distribution of ECG signal lengths in the dataset. The figure shows that 95% of the ECG signals in the dataset have a length of 5000 samples. The remaining 5% of the ECG signals have lengths that range from 5500 to 115,200 samples.
Figure 2 visualizes the distribution of cardiac abnormalities in the utilized datasets. The vertical axis represents the number of occurrences, and the horizontal axis lists the names of the 27 classes in abbreviated form. The abbreviations used, corresponding to the Diagnosis and SNOMED CT codes, are provided in Appendix A, Table A1. It is worth noting that "Sinus Rhythm (SNR)" appears as the most frequent class in the graph.
3.2. Data Preprocessing
Our machine learning strategy for classifying ECG signals with the hybrid residual/inception-based deeper model (HRIDM) relies on several preprocessing steps. Our ECG dataset from PhysioNet is large and diverse, with recordings of different lengths and sizes (42,720). Hence, preprocessing is essential to prepare the data for useful analysis. We use the multi-step preprocessing approach of Algorithm 1 to ensure that the data are standardized and appropriate for use in machine learning models. To support efficient training, this iterative method serves as a generator function, continuously producing batches of features and labels.
Algorithm 1: Data preprocessing algorithm utilized for ECG data.
Initialization: generator for features, generator for labels, status result, database D
Input: ECG database D (recordings and labels), batch size B
Output: batches of preprocessed features X and shuffled labels y
#Step 1: Initialize Parameters and Data Structures
Set the batch size B and the pre-calculated mean μ and standard deviation σ of the training set
Create a shuffled index array (order) over the training recordings
#Step 2: Generate Batches
For each batch of B indices taken from order:
    Initialize empty arrays for the batch features X and labels y
    Retrieve the features and labels for each index via the generator functions
    Normalize the retrieved features using the mean and standard deviation of the current batch
#Step 3: Shuffle Labels
Shuffle the order of the training examples so that the model cannot learn the original recording sequence
#Step 4: Preprocess Features
While True:
    For each index i in the shuffled order:
        Load the raw 12-lead ECG signal for index i
        Zero-pad (or truncate) the signal to 5000 samples per lead
        Reshape the padded signal into a 5000 × 12 matrix (time samples × leads)
        Normalize the signal with the pre-calculated mean μ and standard deviation σ (Equation (1))
        Append the preprocessed signal to X and its label to y
    Yield the batch (X, y)
The basic idea behind Algorithm 1 is the designation of a batch size, which establishes how many data points are processed simultaneously during training. Additionally, to prevent the model from picking up biases from the original recording sequence, we shuffle the order of the training data points. Generator functions retrieve the features and labels for each data point inside each batch, most frequently by accessing external data sources. We apply normalization to the obtained features to compensate for possible differences in signal strength between recordings. By scaling the features to a particular range (often between 0 and 1) based on the mean and standard deviation of the current batch, this normalization ensures that each feature contributes equally during the training process [34]. Another loop shuffles the labels in the training set while batch generation is taking place. By preventing the model from picking up possible dependencies based on the initial label order, this step ultimately enhances the model's capacity to generalize to unseen inputs.
Preparing each individual feature is an additional vital component of preprocessing; this is achieved via an additional loop that iterates over the shuffled order array. Here, we use the current index in the order array to access a particular training data point, which is a raw ECG signal [35,36]. Padding with zeros ensures that all input data points have the same format; this is required when the duration of the retrieved ECG signal is shorter than the required input size, which is 5000 samples. After padding, the data are rearranged into a two-dimensional structure with 12 columns (representing the 12 ECG leads) and 5000 rows (representing time samples). Through this reshaping, the one-dimensional data are effectively transformed into a format that allows the machine learning model to treat each ECG lead as an independent channel.
In the last stage of preprocessing, the whole training set is normalized using the pre-calculated mean $\mu$ and standard deviation $\sigma$. The normalization of the ECG data is given by Equation (1) [37]:

$$x_{\text{norm}} = \frac{x - \mu}{\sigma} \tag{1}$$
As a result, the model learns more robustly, and all characteristics are normalized to an identical scale across the training phase [38,39]. We produce batches of preprocessed features and shuffled labels continually by executing these processes recursively within the generator function. These carefully developed preprocessing steps successfully tackle the issues posed by our sizable and varied ECG dataset and allow the machine learning model to train on the provided data for accurate ECG signal classification in medical applications. The preprocessing technique utilized in this work is presented as a flow chart in Figure 3, and Figure 4 presents the segmented and preprocessed 12-lead ECG signals.
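To make the pipeline concrete, the following minimal Python sketch implements a generator of the kind described above. The function and variable names (batch_generator, signals, labels) are hypothetical, and the exact padding and normalization details of the released code may differ.

```python
import numpy as np

def batch_generator(signals, labels, batch_size=32, target_len=5000,
                    n_leads=12, mu=0.0, sigma=1.0):
    """Hypothetical sketch of the generator in Algorithm 1.

    Assumes signals[i] has shape (n_leads, T) and labels is a
    (n_samples, n_classes) binary matrix. Each recording is zero-padded
    or truncated to target_len samples, reshaped to (target_len, n_leads),
    and normalized with the pre-computed mean/std (Equation (1)).
    """
    order = np.arange(len(signals))
    n_classes = labels.shape[1]
    while True:                               # endless generator for model.fit
        np.random.shuffle(order)              # avoid order-based bias
        for start in range(0, len(order) - batch_size + 1, batch_size):
            idx = order[start:start + batch_size]
            X = np.zeros((batch_size, target_len, n_leads), dtype=np.float32)
            y = np.zeros((batch_size, n_classes), dtype=np.float32)
            for j, i in enumerate(idx):
                sig = np.asarray(signals[i], dtype=np.float32)  # (n_leads, T)
                sig = sig[:, :target_len].T                     # (T', n_leads)
                X[j, :sig.shape[0], :] = sig                    # zero-padding
                y[j] = labels[i]
            yield (X - mu) / sigma, y                           # Equation (1)
```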
3.3. SOTA Models
The SOTA models used to benchmark the proposed model (HRIDM) and validate the proposed methodology are LeNet-5, AlexNet, VGG16, ResNet50, Inception, and LSTM. These are prominent and commonly utilized models for various tasks, especially signal and image classification.
LeNet-5 was the first basic convolutional neural network (CNN) model, introduced in 1998 by Yann LeCun et al. [40]. It consists of seven layers: three convolution (Conv) layers, two pooling layers (average pooling), and two fully connected (FC) layers, along with sigmoid or tanh activation functions. It was the first CNN model successfully trained on the MNIST dataset for a digit recognition task. However, because its structure contains relatively few layers, it is not suitable for more complex tasks.
AlexNet was first introduced in 2012 by Alex Krizhevsky et al. [41]; it popularized the ReLU activation function and used dropout layers to overcome overfitting [42]. AlexNet consists of eight layers: five Conv layers, of which the first, second, and fifth are followed by max-pooling layers, and three fully connected layers. All layers use ReLU activations except the output layer, which uses a softmax activation function. To capture hierarchies in the data, this model makes use of the filters' depth and stride. However, because it has an extensive number of parameters, its computational cost is high.
VGG16 is a prominent deep CNN model introduced in 2014 by the Visual Geometry Group [43] at Oxford University. It consists of 16 weight layers: 13 convolutional layers and three fully connected layers. VGG16 uses small 3 × 3 filters in the Conv layers throughout its structure, and max-pooling layers are applied after some of the Conv layers to downsample the feature maps. Due to its deeper architecture, it is very effective at capturing fine details of the features and is capable of performing more complex tasks effectively. Its strength lies in its simple and uniform structure, which is easy to design and extend. However, its limitation is slow training due to its large number of parameters and depth.
The ResNet50 model, developed in 2015 by Kaiming He et al. [44], introduced the concept of the residual block; it contains 50 layers and is capable of resolving the vanishing gradient problem in deeper networks. The residual block is mathematically defined as $y = F(x) + x$, where $F(x)$ is the CNN output within the block and $x$ is the input. This model introduced the concept of skip connections, which allow the gradient to pass directly through these connections. It also introduced a bottleneck design in its residual blocks, consisting of 1 × 1, 3 × 3, and 1 × 1 Conv layers. The 50 layers of the ResNet50 model with residual blocks enable the capture of more complex patterns. However, its limitation is the high computational cost due to its complex and deep structure.
The Inception model is a deep CNN that introduced the concept of inception modules to enhance the efficiency and accuracy of DL models. Inception v3, introduced in 2015 by Szegedy et al. [45], uses parallel Conv layers within the same module to capture very fine details at different levels. Inception v3 has 48 layers with inception modules consisting of multiple parallel Conv layers with different filters (1 × 1, 3 × 3, 5 × 5) along with max-pooling layers. It is capable of capturing very complex patterns with a smaller number of parameters compared to similar DL models.
LSTM (Long Short-Term Memory) networks, introduced in 1997 by Hochreiter and Schmidhuber [46], are especially designed for sequential data patterns. LSTM was introduced to resolve the problem of short-term memory by incorporating gates and states. An LSTM network consists of many cells, each including a cell state $c_t$ and a hidden state $h_t$, as well as gates such as the input gate $i_t$, the forget gate $f_t$, and the output gate $o_t$. LSTM has the ability to capture long-term dependencies in sequential data, making it highly suitable for time series tasks and language modeling.
3.4. Proposed Model (HRIDM)
The aim of this research was to determine the most effective algorithm on the utilized dataset. Figure 5 depicts the proposed model and the comprehensive methodology employed. The proposed HRIDM consists of three main sections. The first section serves as the input layer, incorporating multiple convolutional, residual, and inception blocks to extract the primary and fine features from the data, and is also responsible for producing output. We integrated residual blocks with inception blocks in our proposed model because this combination leverages the strengths of both types of blocks, enhancing the overall performance of the model for arrhythmia detection in 12-lead ECG recordings. The residual blocks address the vanishing gradient problem and enable deeper training through skip connections, which allows the model to learn complex features effectively, whereas the inception blocks capture fine, multi-scale detail thanks to parallel Conv blocks with varying filter sizes. The fusion of both techniques yields a powerful DL structure that learns diverse features efficiently and effectively and enhances the model's ability to discriminate between different arrhythmia types, resulting in improved accuracy and robustness compared to other models.
The second section is connected to the first and further refines the extracted features, subsequently concatenating them. The third section is connected to both preceding sections and combines the concatenated features. The output of the final section is fed to the dense layer of the first section to produce the desired output. A detailed description of each section and its constituent blocks follows. The first section of the proposed model consists of the following layers:
1D convolutional (Conv) layers: In our proposed model, we have utilized multiple 1D-convolution (Conv) layers for extracting high-level features from the provided dataset. The first 1D-Conv layer employs 512 kernels, each of size 5 × 5, to learn informative patterns from the input data. The second 1D-Conv layer includes 256 kernels of size 3 × 3, further refining the extracted features. Following each 1D-Conv layer, we incorporate batch normalization (batchNorm) to improve training stability and accelerate convergence. Further, ReLU activation functions are included to introduce non-linearity and improve the model's ability to learn complex relationships within the data. The convolutional layer computes the output $y_j[n]$ at spatial position $n$ and output channel $j$ as given in Equation (2):

$$y_j[n] = \sum_{c=1}^{C} \sum_{m=0}^{K-1} w_{j,c}[m]\, x_c[n+m] + b_j \tag{2}$$

where $C$ is the number of input channels, $K$ is the kernel size, $w_{j,c}$ are the filter weights, $x_c$ is the input, and $b_j$ is the bias.
1D max-pooling (MaxPool) layer: This layer is utilized to downsample the data while preserving prominent features. The 1D-maxPool layer employs a 3 × 3 size filter with stride 2, and it computes the output $y[n]$ at position $n$ as given in Equation (3):

$$y[n] = \max_{0 \le m < k} x[sn + m], \quad k = 3,\ s = 2 \tag{3}$$
Residual block: This block is used to address the vanishing gradient problem and facilitate weight transfer. The residual block consists of three stacks, each comprising a 1D-Conv layer, batchNorm, and a Leaky ReLU activation function with an alpha value of 1 × 10⁻², given by Equations (4)–(6), respectively:

$$z = W * x + b \tag{4}$$

$$\hat{z} = \gamma\, \frac{z - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta \tag{5}$$

$$a = \max(\alpha \hat{z}, \hat{z}) \tag{6}$$

where $W$ is the filter weights, $x$ is the input data, $b$ is the bias term, and $a$ is the output of the activation function (Leaky ReLU). $\gamma$ and $\beta$ are the scaling and shifting parameters, $\mu_B$ and $\sigma_B$ are the batch mean and standard deviation, and $\alpha$ is the leakiness factor (0.01 in this case).
The convolutional layer sizes for the three stacks are 128, 128, and 256 filters, respectively, with a kernel size of 1 × 1, as shown by Equations (7)–(9):
- (1) First Stack:
$$a_1 = \mathrm{LReLU}\big(\mathrm{BN}(W_1 * x + b_1)\big) \quad (128 \text{ filters}) \tag{7}$$
- (2) Second Stack:
$$a_2 = \mathrm{LReLU}\big(\mathrm{BN}(W_2 * a_1 + b_2)\big) \quad (128 \text{ filters}) \tag{8}$$
- (3) Third Stack:
$$a_3 = \mathrm{LReLU}\big(\mathrm{BN}(W_3 * a_2 + b_3)\big) \quad (256 \text{ filters}) \tag{9}$$
To preserve the weights, an additional convolutional layer with 256 filters and batch normalization is incorporated into the skip connection, linking it with the output of the third stack in the residual block, as shown by Equations (10) and (11):
$$s = \mathrm{BN}(W_s * x + b_s) \tag{10}$$
$$y = a_3 + s \tag{11}$$
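As an illustration only, a minimal Keras sketch of the residual block described by Equations (4)–(11) is given below. It assumes the TensorFlow/Keras functional API mentioned in Section 4; the exact layer ordering and hyperparameters of the published implementation may differ.

```python
from tensorflow.keras import layers

def residual_block(x, filters=(128, 128, 256), alpha=1e-2):
    # Projection on the skip connection (Equation (10)): Conv1D + BatchNorm
    # so that the shortcut matches the 256 channels of the third stack.
    shortcut = layers.Conv1D(filters[-1], kernel_size=1, padding="same")(x)
    shortcut = layers.BatchNormalization()(shortcut)

    # Three stacks of Conv1D -> BatchNorm -> LeakyReLU (Equations (4)-(9)).
    y = x
    for f in filters:
        y = layers.Conv1D(f, kernel_size=1, padding="same")(y)
        y = layers.BatchNormalization()(y)
        y = layers.LeakyReLU(alpha)(y)

    # Element-wise addition of the main path and the shortcut (Equation (11)).
    return layers.Add()([y, shortcut])
```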
Inception block: This block is used to extract further low-dimensional features. The inception block involves stacks of 1D convolutional layers, followed by batch normalization and Leaky ReLU activation with an alpha value of 1 × 10⁻². Each stack utilizes 64 filters, with kernel sizes of 1, 3, and 5.
- (4) Kernel Size 1:
$$u_1 = \mathrm{LReLU}\big(\mathrm{BN}(W^{(1)} * x + b^{(1)})\big) \tag{12}$$
- (5) Kernel Size 3:
$$u_3 = \mathrm{LReLU}\big(\mathrm{BN}(W^{(3)} * x + b^{(3)})\big) \tag{13}$$
- (6) Kernel Size 5:
$$u_5 = \mathrm{LReLU}\big(\mathrm{BN}(W^{(5)} * x + b^{(5)})\big) \tag{14}$$
Equations (12)–(14) illustrate how each stack within the inception block handles the input data $x$. The max operation integrates the Leaky ReLU output with a scaled and shifted variant to ensure non-linearity and the extraction of features across various receptive fields (kernel sizes). The second and third sections contain almost identical layers, a repetitive structure of convolution, batch normalization, and Leaky ReLU, to progressively extract increasingly detailed and refined features. The second section concatenates the features extracted by its blocks with those from the first section. The third section follows a similar set of layers but incorporates skip connections to facilitate the flow of information across layers.
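For the inception block of Equations (12)–(14), a comparable hedged sketch follows. Concatenating the three parallel branches is an assumption based on standard inception designs, as the text does not state how the branch outputs are merged.

```python
from tensorflow.keras import layers

def inception_block(x, filters=64, alpha=1e-2):
    # Three parallel Conv1D branches with kernel sizes 1, 3, and 5,
    # each followed by BatchNorm and LeakyReLU (Equations (12)-(14)).
    branches = []
    for k in (1, 3, 5):
        b = layers.Conv1D(filters, kernel_size=k, padding="same")(x)
        b = layers.BatchNormalization()(b)
        b = layers.LeakyReLU(alpha)(b)
        branches.append(b)
    # Merge the multi-scale features; concatenation is assumed here.
    return layers.Concatenate()(branches)
```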
Convolutional blocks: These blocks are used to capture complex patterns within the data. Each convolutional block consists of a 1D-Conv layer, batchNorm, and Leaky ReLU activation with an alpha value of 1 × 10⁻². The first convolutional block uses 128 filters, a filter size of 5 × 5, and a stride of 1 × 1, complemented by instance normalization and parametric ReLU activation; a dropout layer with a rate of 20% and a 1D max-pooling layer with a filter size of 2 × 2 were added. The second convolutional block is similar to the first, except for the filter count in the convolutional layer; it employs 256 filters of size 11 × 11. The third convolutional block omits the 1D pooling layer and uses a Conv layer with 512 filters of size 21 × 21;
1D global average pooling (Global Avg. Pool) layer: This layer is utilized mainly for reducing the dimensionality of the feature data, as given by Equation (15):

$$g_c = \frac{1}{N} \sum_{i=1}^{N} x_{i,c} \tag{15}$$

where $N$ is the number of features and $x_{i,c}$ are the input features. The pooled features are passed to a dense output layer with softmax activation, given by Equation (16):

$$z_k = \sum_{i} w_{i,k}\, g_i + b_k, \qquad \hat{y}_k = \frac{e^{z_k}}{\sum_{j} e^{z_j}} \tag{16}$$

where $z_k$ is the class logit, $w_{i,k}$ is the weight from input $i$ to output $k$, $g_i$ is the input, $b_k$ is the bias for class $k$, and $\hat{y}_k$ is the softmax output, ensuring the probabilities sum to 1 for predictions.
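The output stage of Equations (15) and (16) can be sketched as follows; the 27-unit softmax dense layer mirrors the equations above, though the released implementation may differ.

```python
from tensorflow.keras import layers

def classification_head(features, num_classes=27):
    # Equation (15): average each feature map over the time axis.
    pooled = layers.GlobalAveragePooling1D()(features)
    # Equation (16): dense layer producing class logits, normalized by softmax.
    return layers.Dense(num_classes, activation="softmax")(pooled)
```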
The proposed HRIDM model is an effective method for classifying time-series data: it is capable of extracting high-level features, capturing complex patterns within the data, and generalizing well to new data.
3.5. Activation Functions
Activation functions also play an essential role in the design of deep learning models. The selection of the activation function depends on the type of input data and the category of classification. In this work, we utilized the Leaky ReLU and ReLU activation functions and compared them with the most commonly used activation functions, sigmoid and tanh, as visualized in Figure 6.
Leaky ReLU: We employed the Leaky ReLU activation function in the present study, as it offers strong benefits for ECG classification; it is important to choose activation functions suited to the task and dataset. Negative values in ECG signals are frequently indicative of certain types of cardiac activity. Leaky ReLU ensures that neurons continue to contribute to learning features from the data by keeping them from going into inactive states as a result of these negative inputs. Leaky ReLU is computationally more efficient than tanh and sigmoid, which is advantageous for training deeper and bigger neural networks on ECG data [47,48]. Leaky ReLU, in contrast to ReLU, keeps a small non-zero gradient for negative inputs. This property is useful in applications where it is important to detect even minute deviations from normal cardiac rhythm in order to capture minor changes in ECG patterns. Leaky ReLU can be mathematically expressed by Equation (17) [48]:

$$f(x) = \begin{cases} x, & x \ge 0 \\ \alpha x, & x < 0 \end{cases} \tag{17}$$
ReLU (Rectified Linear Unit): This is simple and effective in terms of computation; it promotes sparsity by generating a large number of zero outputs. In comparison with sigmoid and tanh, it allows models to converge more quickly during training. However, it suffers from the "dying ReLU" issue, which prevents learning by allowing neurons with negative inputs to be stuck at zero indefinitely. The mathematical definition of the ReLU activation function is given by Equation (18) [48,49]:

$$f(x) = \max(0, x) \tag{18}$$
Sigmoid: The sigmoid function reduces input values to the range [0, 1]. It is especially effective for binary classification problems that require probabilistic outputs. The function has a smooth gradient over its full range, allowing for effective gradient-based optimization during training. The sigmoid activation may be mathematically described as illustrated in Equation (19) [49]:

$$\sigma(x) = \frac{1}{1 + e^{-x}} \tag{19}$$
Tanh (Hyperbolic Tangent): The tanh function converts input values to the range [−1, 1]. Similar to the sigmoid function, its output is zero-centered, which can help neural networks converge. The tanh activation function is mathematically stated as follows in Equation (20) [48]:

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \tag{20}$$
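For reference, the four activation functions of Equations (17)–(20) can be implemented in a few lines of NumPy, which may help in reproducing the comparison in Figure 6:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):          # Equation (17)
    return np.where(x >= 0, x, alpha * x)

def relu(x):                            # Equation (18)
    return np.maximum(0.0, x)

def sigmoid(x):                         # Equation (19)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):                            # Equation (20)
    return np.tanh(x)

x = np.linspace(-5, 5, 11)
print(leaky_relu(x), relu(x), sigmoid(x), tanh(x), sep="\n")
```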
3.6. Evaluation Metrics
This section presents an overview of our proposed model and approach for identifying anomalies in ECG data. To understand the functioning of our model, we extract characteristics and employ several metrics that provide insights into its performance. In this research, we utilized the following metrics:
The Area Under the Receiver Operating Characteristic Curve (AUC) is used to assess the model's performance across different classification thresholds. A higher AUC score (closer to 1) denotes better performance. The ROC curve plots the true positive rate (recall) against the false positive rate (1 − specificity) at different threshold values. Compared to a single criterion (such as accuracy), AUC offers a more thorough assessment of a model's performance.
The total number of ECG samples in the test data is equal to the sum of the number of positive and negative samples ($N = N_{\text{pos}} + N_{\text{neg}}$). The confusion matrix is the foundation for assessing classification models. For a specific task, this matrix carefully counts the number of accurate identifications (TN, true negatives; TP, true positives) and inaccurate detections (FN, false negatives; FP, false positives). When missing positive cases has significant consequences, the confusion matrix becomes very important [50]. A high recall value implies that, even at the cost of a higher number of false positives, the model reduces the possibility of missing significant instances. This trade-off is important, especially when the cost of missing a positive instance is greater than the cost of incorrectly flagging a negative one [38,51].
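As an illustration of how these metrics can be computed for a 27-class problem with scikit-learn, consider the following sketch; the toy arrays merely stand in for the real labels and model outputs and are not part of the original evaluation code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, multilabel_confusion_matrix

# Toy stand-ins for the real label matrix and model output (n_samples x 27).
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(100, 27))
y_prob = rng.random(size=(100, 27))
y_pred = (y_prob >= 0.5).astype(int)

# Macro-averaged AUC across the 27 classes.
macro_auc = roc_auc_score(y_true, y_prob, average="macro")

# One 2x2 matrix (TN, FP / FN, TP) per class; recall for the first class.
per_class_cm = multilabel_confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = per_class_cm[0].ravel()
recall = tp / (tp + fn)
print(f"macro AUC = {macro_auc:.3f}, class-0 recall = {recall:.3f}")
```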
4. Results
This section provides a detailed outline of the experiments performed and their corresponding outcomes. The experiments were implemented primarily in Python, with most of the computations performed on a computer system with 16 GB of RAM and a Tesla T4 GPU. To optimize model performance, we utilized a dataset comprising 43,101 training samples, which was divided into 90% training (38,790 samples) and 10% validation (4311 samples) sets to facilitate model training and evaluation. For testing, two categories of datasets were provided: 6630 known and disclosed records, and 10,000 unknown and undisclosed records, totaling 16,630. However, for this work, we utilized the undisclosed hidden dataset of 10,000 recordings provided by the CinC 2020 organizers to ensure unbiased testing of the proposed model. Throughout the experiments, a number of libraries were used: for data visualization, Matplotlib, Seaborn, and ecg-plot; for data processing, NumPy and Pandas; and for modeling, TensorFlow and Keras. For the assessment of the models, scikit-learn was used, along with other libraries such as SciPy and WFDB. The following hyperparameters were used for training all the DL models, including the proposed model: the Adam optimizer, a batch size of 32, a min_delta of 0.0001, a dropout rate of 0.2, and a filter size of 5 × 5. A learning rate decay mechanism was used as a callback function depending on the AUC score during training: the learning rate was reduced by a factor of 0.1 in the optimizer if the AUC score did not improve from one epoch to the next, suggesting a lack of convergence.
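A hedged sketch of this training configuration is shown below; the tiny stand-in model is only a placeholder for HRIDM, and the binary cross-entropy loss is an assumption for the multi-label targets, since the loss function is not named in the text.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Placeholder model standing in for HRIDM (input: 5000 samples x 12 leads).
model = tf.keras.Sequential([
    layers.Input(shape=(5000, 12)),
    layers.Conv1D(32, 5, activation="relu"),
    layers.GlobalAveragePooling1D(),
    layers.Dense(27, activation="sigmoid"),
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss="binary_crossentropy",                      # assumed loss
    metrics=[tf.keras.metrics.AUC(name="auc")],
)

# Reduce the learning rate by a factor of 0.1 when the validation AUC stops
# improving (min_delta = 0.0001), mirroring the callback described above.
lr_decay = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_auc", mode="max", factor=0.1, patience=1, min_delta=1e-4)
```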
We now present the experimental outcomes of the models employed in this investigation: LeNet, AlexNet, VGG-16, ResNet-50, Inception, and LSTM. To ensure a fair comparison with the proposed HRIDM model, all SOTA models were trained from scratch on the same dataset rather than using pre-trained weights. This approach allowed a direct evaluation of model performance under identical conditions. The experimental setup previously described was used for training and evaluating these models on the dataset. The outcomes offer valuable perspectives on how well each model performs in appropriately classifying cardiac disorders. To evaluate the models' performance, evaluation criteria including accuracy, precision, recall, and F1-score were used. To further illustrate the trade-off between the true positive rate and false positive rate at various classification thresholds, ROC curves were calculated. The experimental results provide insight into how well each model detects and classifies different heart problems. To satisfy the research objectives, we analyzed each model's performance, compared the outcomes, and discussed the advantages and disadvantages of each.
4.1. LeNet-5
The first model we trained on our training and validation datasets was LeNet-5. Figure 7 displays the model's performance as training history curves: Figure 7a illustrates the accuracy, precision, and AUC metrics for both the training and validation datasets, and Figure 7b depicts the loss curve of the model. The training curves show that the LeNet-5 model obtained an accuracy of roughly 70% on the training data but only 60% on the validation data. A similar tendency can be seen in the AUC score. On both datasets, the precision score remained stable at roughly 50%. We limited training to 20 epochs in order to observe the model's learning curve. This limited training undoubtedly led to the model's underfitting, as seen in the declining precision near the end of the epochs. Furthermore, the loss curve exhibits a rather smooth fall, with values of about 20% for both training and validation data. While a falling loss curve is desirable, the absence of a considerable decrease in this case implies that the model is not efficiently reducing the cost function.
4.2. AlexNet
In our second experiment, we examined the AlexNet model's training history with particular emphasis on AUC, accuracy, loss, and precision (Figure 8). In comparison to the LeNet-5 model, AlexNet performed noticeably better. Each epoch exhibited a steady decline in the training and validation loss curves, suggesting good convergence towards the minimal loss value. Furthermore, the model consistently produced a precision of around 60% for both sets of data, and it demonstrated a stable and impressive accuracy of about 80% on both the training and validation sets. Promising AUC values of around 70% during training are also shown in Figure 8a.
The training loss curve in Figure 8b, on the other hand, does not demonstrate the same smooth drop as the validation loss curve, indicating probable overfitting. This mismatch suggests that the model is remembering the training data rather than generalizing effectively to unknown data. Furthermore, the dataset's high volatility and imbalanced classes continue to be a concern. Addressing these data restrictions may improve the AlexNet model's performance and capacity to handle classification challenges successfully.
4.3. VGG-16
We trained and evaluated the VGG-16 model on both the training and validation datasets in our third experiment. The model's accuracy, precision, AUC, and loss training history curves are displayed in Figure 9. With accuracy and AUC averaging 80% throughout the training process, VGG-16 produced very smooth training curves. The precision curve, however, stayed mostly stable. Despite the lack of gain in precision, VGG-16 outperformed the preceding models. The model exhibited consistent and noteworthy performance, with an accuracy of over 80% for both the training and validation data. These figures demonstrate that both accuracy and AUC were around 80% throughout training (Figure 9a).
While the loss curve (Figure 9b) shows a smooth decline for both training and validation data, more research into potential overfitting is needed. Similar to the preceding models, the existence of high data volatility and class imbalance is likely to impede optimal performance. Addressing these data concerns might improve the VGG-16 model's ability to handle classification tasks.
4.4. ResNet-50
In this subsection, we examine the accuracy and loss observations during the training of the ResNet-50 model, as depicted in Figure 10. The loss plot reveals that the ResNet model exhibited better performance compared to previous approaches when applied to validation data. However, it is important to note that the model faced challenges in minimizing the parameters from the start, resulting in slow learning. This was likely due to the large size of the model, which had 23.5 million parameters. The dataset contained 27 classes, with one particular class having a disproportionately large amount of training data. This imbalance led to misclassifications, as the model erroneously classified each validation data point as belonging to that specific class. Nevertheless, the training plot demonstrates that the model made progress over the course of several epochs.
Despite these constraints, the model demonstrated progress over the training epochs, achieving an accuracy of around 85% and an AUC of around 81%. Nonetheless, precision fluctuated and stayed at a low level (around 50%). This shows that, perhaps as a result of the data variation, the model had difficulty correctly predicting some classes. These metrics suggest that the model was able to learn to distinguish between different classes to some extent. However, there is still room for improvement, as the model's performance suffered due to the significant variance present in the data. Overall, the model exhibited misclassifications, highlighting the need for further refinement.
4.5. Inception Network
In the fifth experiment, we trained the Inception model and plotted the training history in terms of accuracy, precision, AUC, and loss, as shown in Figure 11. These plots provide a comprehensive overview of the recorded training metrics for each epoch. The accuracy plot reveals that the model exhibited commendable performance during both the training and validation phases, with accuracy scores surpassing 95%. Notably, its precision on the validation and training sets fluctuated due to the influence of the learning rate. However, it is worth mentioning that the model's training process was not optimal, as indicated by the similar performance observed on the validation set. This phenomenon can be attributed to the excessive layering of the architecture, leading to a diminished number of features in the output feature matrix. Consequently, despite employing learning decay, there was a spike in the loss for the validation data, indicating a lack of convergence compared to the gradual convergence observed for the training curves (Figure 11a).
While the validation precision reached 74% and the validation AUC exceeded 70%, these measures did not show significant improvements on the training data, adding to the likelihood of overfitting. The positive aspect is that the loss curves (Figure 11b) for both training and validation data converged smoothly, confirming the model's overall learning capability, and the precision score on the training dataset was good. Nevertheless, the significant variance present in the data limited performance and led to misclassifications, again highlighting the need for further refinement.
4.6. LSTM
Our sixth experiment explored the Long Short-Term Memory (LSTM) model, which is ideal for dealing with sequential data. Figure 12 depicts the training curves for both the training and validation datasets. While LSTMs excel with sequential data, our model's performance deteriorated owing to dataset limitations. The considerable volatility in the data distribution, as well as the imbalanced class representation, caused challenges.
The model most likely ignored other classes in favor of predicting the one that predominated in the imbalanced dataset. This emphasizes how crucial it is to deal with data imbalances prior to training subsequent models. The data variance prevented the cost function from fully converging, even though the graphs showed that it was heading towards the minimum. Compared to the earlier models, the performance was not as good. While the training accuracy increased to 70–75%, other parameters, such as precision and AUC, continued to decline, averaging 58–64% (Figure 12a). There were notable fluctuations (20–30%) in the loss values as well (Figure 12b). These findings highlight the difficulties of applying LSTMs to imbalanced, highly variable datasets. Although the LSTM model works well with sequential data, our dataset's constraints made it less effective in this experiment. Improving the effectiveness of LSTMs in this particular situation may require addressing the data imbalance and possibly investigating methods to handle data volatility.
4.7. Proposed Model (HRIDM)
In the final experiment, our proposed HRIDM model was trained for 20 epochs, like the other models. As seen in Figure 13, the model showed consistently high accuracy, with both training and validation accuracy settling between 95 and 97%. This demonstrates high learning capacity, as the model accurately captured the patterns in the training data and generalized well to previously unseen data in the validation set. The precision graphs support the model's learning progress: training precision was roughly 80%, whereas validation precision was around 70% (Figure 13a). These findings indicate the HRIDM model's capacity to achieve excellent accuracy and precision on both the training and validation datasets. Figure 13b depicts the decreasing trend in both training and validation loss across the training procedure, representing a satisfactory convergence towards the ideal solution (global minimum). Over the course of 20 epochs, the training loss dropped from above 0.15 to less than 0.090. The validation loss followed a similar pattern, beginning above 0.14 and decreasing to less than 0.085. Lower loss values indicate that the model is better able to predict the target variable (abnormality in the ECG data). Finally, the AUC values for both training and validation data remained around 70%, with an increasing trend noted near the end of training. This indicates the model's efficacy in terms of accuracy, recall (as represented by AUC), and overall classification ability. These combined findings demonstrate the HRIDM model's effectiveness: the model improved its ability to detect anomalies in the large ECG dataset by including both inception and residual attributes in its structure.
After training the models, we tested them to evaluate their performance on the unseen dataset, as shown in Table 2. This section provides a comprehensive analysis of the results obtained from all the implemented models, including the proposed model. As shown in Figure 14, the proposed model achieved significantly higher accuracy than the other models, with a test score of 50.87%. The Inception network emerged as the second-best model, with a test score of 40.6%, outperforming most of the models mentioned in the literature. ResNet-50, VGG-16, and AlexNet demonstrated comparable performance, with test scores ranging from 33.8% to 35.1%. However, LeNet-5 and LSTM did not meet the performance standards. This is likely due to the highly imbalanced class distribution in the data, making it difficult for these models to learn complex patterns.
To provide a visual representation of the results, Figure 15 presents the confusion matrices. The X-axis of each matrix represents the actual values, while the Y-axis represents the values predicted by our model. The challenge score obtained for our model, using the evaluation metric of the challenge site, was 0.50897. The confusion matrices demonstrate the model's attempt to classify each class label despite the highly imbalanced data. Among the SOTA models, our proposed model outperformed the others, exhibiting favorable performance compared to previous approaches mentioned in the literature review.
5. Discussion
In this section, we review research papers published on the CinC 2020 Program website [52] and in IEEE Xplore that used the PhysioNet/CinC Challenge 2020 dataset to classify abnormalities in ECG signal data (Table 3). The objective was to develop a model capable of learning complex patterns within ECG signals and accurately distinguishing between 27 different abnormalities.
Among all the state-of-the-art techniques, the highest test score, 53.3%, was achieved by the Wide and Deep Transformer neural network model [30]. In that work, the authors utilized an extensive feature engineering approach (handcrafted ECG features) along with preprocessing techniques (a finite impulse response bandpass filter). The second highest official ranking, with a 52.0% test score, was obtained using a Modified ResNet with a Squeeze-and-Excitation layer (MResNet-SE). This approach used zero padding to extend the signals to 4096 samples and downsampling to resample the signals to 257 Hz [29]. In this study, the authors did not utilize any preprocessing or feature extraction techniques, but they used additional approaches such as a constrained grid search for addressing data imbalance and a bespoke weighted accuracy metric for evaluating model performance.
Another study [28] also utilized SE-ResNet and acquired a test score of 51.4% on the undisclosed test dataset. In this work, the authors used squeeze-and-excitation blocks to extract fine details from 10 s and 30 s segments of the ECG signal and validated model performance using an external open-source dataset. The authors also used wavelet-based denoising techniques to remove noise from the ECG dataset before training, but no additional feature extraction techniques were utilized. All other methods had lower test scores compared to our proposed model. Among them, ref. [18] had the lowest score, at 12.2%, but due to the correct evaluation strategy, the authors secured a place in the official ranking. In this work, the authors used polyphase filter resampling for preprocessing of the ECG data; no additional feature extraction techniques were used. With this approach, they classified 24 of the 27 categories and used 75 epochs for training.
Among all the reviewed studies, most authors [18,20,24,25,26] utilized CNNs in their model design, either directly or indirectly. The second most utilized model was ResNet, employed by many authors [19,21,27,28,29] in their methodology, directly or indirectly. The analysis concluded that all of the top four studies, including ours, utilized a residual structure in the model design. Regarding preprocessing techniques such as denoising or data augmentation, the studies [19,20,21,22,29] did not utilize any of these techniques for preprocessing of the ECG data; among these studies, only [29] achieved a better score than the proposed technique. While the studies [21,23,26,30] employed additional feature extraction techniques to enhance ECG abnormality classification, only [30] achieved a better score than our proposed technique. Additionally, only the studies [19,20,22,29] used neither preprocessing nor feature extraction techniques, although they employed some additional approaches for result enhancement; among them, only study [29] achieved a better score than our proposed methodology.
Although our proposed method would place fourth, with a 50.87% test score, when compared against the official rankings, its strength lies in its efficiency and simplicity. Unlike other studies, we did not employ any feature extraction, denoising, data augmentation, or other additional strategies in our proposed methodology. Due to the integration of residual and inception blocks in our model architecture, its performance is exceptionally good without relying on common preprocessing steps.
Among the state-of-the-art classifiers, the highest accuracy achieved was 40.6% using the Inception network. However, the results obtained using these established architectures did not yield satisfactory performance. The DNNs, LSTM, and LeNet-5 struggled to classify beyond a single class. This limitation can be attributed to the imbalanced dataset and the significant variance in data distribution. Consequently, these shorter networks with fewer parameters failed to converge and could not effectively identify complex patterns within the data. On the other hand, more recent and deeper architectures such as ResNet-50, VGG-16, and AlexNet attempted to classify multiple classes but fell short of achieving the desired results.
In contrast, our proposed hybrid residual/inception-based deeper model (HRIDM) outperformed all the aforementioned architectures. Despite its superior performance, the training time for our model was comparable to that of the Inception network. When comparing the accuracy achieved by our model with previous research discussed in the literature review, it surpassed many existing approaches. Only two models exhibited significantly higher test accuracy. One of these models employed a transformer, which is a relatively larger model that would require substantially more training time. The other approach proposed a hybrid network comprising an SE-ResNet and a fully connected network that incorporated age and gender as additional input features, which were later integrated for classification.
Overall, our proposed approach demonstrated remarkable performance in comparison to previous research and state-of-the-art architectures, achieving an impressive test score of 50.87%.
6. Conclusions
In this research, we addressed the crucial medical challenge of identifying heart diseases using 12-lead ECG data. We presented a novel DL model (HRIDM) which integrates two key components: residual blocks and inception blocks. This work utilized the official PhysioNet/CinC Challenge 2020 dataset, which includes over 41,000 training samples and 27 categories of ECG abnormalities. We carefully tuned the hyperparameters of each block to achieve the best possible results. By allowing the network to identify complex patterns in the inputs, the use of inception blocks increased performance, while the addition of residual blocks reduced the impact of the vanishing gradient issue. Our model outperformed most previous investigations, achieving an accuracy score of 50.87% on the test dataset. We also validated and compared the outcomes of our proposed model with SOTA models and techniques. Our findings open up new avenues for heart disease diagnosis research and demonstrate the promise of deep learning models in the field of cardiology.
7. Future Work
There are several avenues for further exploration and extension of this research in the future. Firstly, the application of data augmentation techniques can be employed to address the issue of imbalanced datasets. By augmenting the existing data, we can achieve a more balanced representation of each class, reducing the likelihood of misclassifications.
Additionally, incorporating demographic features such as age and sex into the model architecture can lead to the development of a hybrid network. This hybrid network can leverage these additional features in conjunction with the ECG data for more accurate classification.