1. Introduction
The coronavirus (COVID-19) is possibly the biggest human threat of the twenty-first century. Since the disease outbreak in December 2019, the COVID-19 pandemic has become a worldwide health problem. According to the World Health Organization (WHO) statistics, in January 2022, there were more than 365 million confirmed COVID-19 cases, resulting in approximately 5.5 million deaths [
1]. COVID-19 is caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). SARS-CoV-2 belongs to the coronavirus family, whose members cause illnesses ranging from the common cold to Middle East respiratory syndrome (MERS) and severe acute respiratory syndrome (SARS) [
2]. The spread of the disease and the rise in mortality in a number of countries have made it necessary to protect health services and the community from COVID-19. Thus, remote disease monitoring, including early diagnosis, quarantine and follow-up, is of immense importance. Every day, thousands of cases were reported in many countries around the world, which led some governments to implement lockdown measures to contain the virus. Unfortunately, lockdowns have not only had a global impact on health but have also harmed the economy. Thus, lockdown restrictions were gradually relaxed, with strict obligations to wear masks and maintain physical distance. Health professionals from all over the world have since been working tirelessly to develop drugs and vaccines for the disease [
3].
Most COVID-19 infected patients have been diagnosed with pneumonia; thus, radiological examinations can be helpful for diagnosis and evaluation as well as for follow-up on disease progression. Initial screening using chest computed tomography (CT) showed a higher sensitivity than reverse transcription-polymerase chain reaction (RT-PCR) and also revealed COVID-19 infection in cases with negative or weakly positive RT-PCR results. CT can therefore sometimes outperform the RT-PCR-based test, which has a low success rate of 70% and a sensitivity of 60–70% [
4]. Recent surveys show a pooled sensitivity rate of 94% (95% confidence interval (CI)
) but a low specificity rate of 37% (95% CI
) for CT-dependent analysis [
5]. As a result, CT-dependent analysis could be used to compensate for the insufficient sensitivity of the PCR test [
6]. Most recent COVID-19 research has centred on findings from chest CT. Nonetheless, the surge in COVID-19 prevalence limits the routine use of CT, which places a significant burden on patient health through frequent radiation exposure and possible infection in CT suites.
Recent studies claim to obtain reliable results for the automated detection of the disease from chest X-ray (CXR) scans [
7]. Although diagnosis has become a relatively rapid operation, the cost of diagnostic tests remains a problem for patients worldwide, particularly in countries with private health systems or where prohibitive costs limit access to health services. The number of publicly available CXRs from stable cases as well as COVID-19 patients has also increased. This helps researchers around the globe analyze medical images and recognise potential patterns that may contribute to an automatic diagnosis of the disease [
8]. Moreover, CXR tests have the advantages of a fast operating speed, low cost, and ease of use for radiologists [
9]. Thus, the need to identify COVID-19 features on CXR is growing [
10].
Noninvasive approaches based on artificial intelligence (AI) for the analysis of patient data (e.g., CXRs and CTs) are extensively exploited for the successful diagnosis of COVID-19. In particular, deep learning (DL)-based techniques applied to radiomic features of thoracic imaging (CXR and CT), as well as to other clinical, pathological and genomic parameters, have been shown to provide valuable assistance in this direction. In the context of medical image analysis, DL automatically discovers the hidden features required for detection and/or disease classification from raw data. That is, the image pixels/voxels are used directly as input instead of representative (extracted or selected) features. This in turn reduces errors caused by incorrect segmentation and/or the subsequent extraction of hand-crafted features. The literature has demonstrated that machine learning (ML) and DL provide quick, automated and efficient strategies for detecting abnormalities and extracting key features of altered lung parenchyma, which may be connected to the unique signatures of the COVID-19 virus. However, the available COVID-19 datasets are insufficient for the development of deep neural networks [
11].
In short, early, accurate and rapid COVID-19 diagnosis plays a key role in timely quarantine and medical care. This is also of great significance for the prognosis of patients, the prevention of this epidemic and the protection of public health [
12]. In this work, we propose a DL approach for the detection of COVID-19 from CXR images. The proposed pipeline is based on transfer learning (TL) and employs the pre-trained Xception model for the feature extraction stage. We then used a global average pooling (GAP) layer to mitigate the vanishing gradient problem in deep networks and to reduce the probability of overfitting; an activation layer was used to reduce the losses. We used a dataset consisting of COVID-19, pneumonia and normal CXR images. We investigated different activation functions and optimization algorithms and compared them to select the combination that provides the best performance with minimum losses. We also compared our approach with traditional pre-trained models and with other studies in the literature. Our design achieves the highest performance compared to the previous studies and the pre-trained models.
The rest of the paper is partitioned into the following sections.
Section 2 provides an overview of the related work for recent COVID-19 detection studies using different AI algorithms. Details of the methods and the processing stages of the developed framework are fully described in
Section 3. Then,
Section 4 describes the performance evaluation and validation methods/metrics for evaluating classification accuracy. The experimental results showing the potential of the proposed pipeline are given in
Section 5, and
Section 6 discusses the results. Finally,
Section 7 contains the conclusions.
2. Related Work
Recent research work has proven that imaging tests (e.g., CXRs and CT) can provide rapid identification of COVID-19 and also help to monitor the spread of the disease. Various image-based diagnostic systems, ranging from hand-crafted features to feature learning, have been introduced to illustrate the possibility of the identification of COVID-19 [
13]. Convolutional neural networks (CNN) have been found to be one of the most common and successful techniques for diagnosing COVID-19 from medical images. Yudong and Kulwa [
14] used deep TL techniques for a multi-classification task on a small dataset by testing 15 different pre-trained models. Their dataset consisted of 860 images (260 for COVID-19, 300 normal and 300 for pneumonia). Their study revealed that VGG19 was the best algorithm, achieving a classification accuracy of 89.3% with average precision, recall, and F1-Score values of 0.90, 0.89, 0.90, respectively. Yama et al. [
15] proposed an approach based on a convolution support estimation network (CSEN), which constructs a sparse support set of representation coefficients using a dictionary and a set of training samples. The dataset used included 6286 CXR images of COVID-19 as well as three other classes: bacterial pneumonia, viral pneumonia, and normal. Using 5-fold cross-validation, their CSEN-based model achieved a sensitivity and specificity of more than 98% and 95%, respectively, for COVID-19 detection. A generative adversarial network (GAN)-based method was presented by Nour Eldeen et al. [
16] to detect COVID-19 infection from CXR images. Their dataset consisted of 307 images divided into 4 classes (COVID-19, normal, bacterial pneumonia, and viral pneumonia). They used AlexNet, GoogleNet, and ResNet18 as the TL models. Their model achieved an accuracy of 80.6% on 4 classes (GoogleNet), 85.3% on 3 classes (AlexNet) and 100% on 2 classes (GoogleNet). A similar approach by Khan et al. [
17] combined TL with the pre-trained Xception model. Their system attained accuracies of 89.6% and 95% for 4-class and 3-class classification, respectively. Another deep CNN (DCNN) TL-based pipeline by Asif et al. [
18] utilized Inception V3 for the detection of COVID-19 in infected patients using chest X-ray scans. The test data contained 864 scans for COVID-19, 1345 for viral pneumonia and 1341 for normal scans. The model provided a classification accuracy of greater than 98% (training accuracy of 97% and validation accuracy of 93%). Suat and Alakus [
19] produced a convolutional capsule network, called CapsNet, for COVID-19 detection using chest scans. Their system was evaluated using a total of 2331 images, of which 231 were COVID-19 and 1050 each were normal and pneumonia. For binary and multi-class classification, the 11-layer architecture achieved 97.24% and 84.22% accuracy, respectively.
Kumari et al. [
20] used the ResNet50 plus support vector machine (SVM) model as a feature extractor in a framework for identifying COVID-19 patients. They proposed a multi-classification task on 381 CXR images. Their model achieved an accuracy, sensitivity, FPR and an F1-Score of 95.33%, 95.33%, 2.33% and 95.34%, respectively. The limitation of their method is the leakage of the used dataset. In [
21], the DenseNet121 model was used by Sarker et al. for both binary and multi-classification tasks of COVID-19 patients on a dataset consisting of 238 images for COVID-19, 6045 for pneumonia and 8851 for normal persons. Classification results of 2 and 3 classes obtained 96.49% and 93.71% accuracy, respectively. A study by Dilbag et al. [
22] compared several TL models with their proposed model on a multi-class COVID-19 classification task using a dataset that included three classes: COVID-19, pneumonia and other diseases. Their model achieved the highest test accuracy, 97.4%, among the models compared. Barshooi et al. [
23] proposed a model for screening for COVID-19 infection using a GAN DL approach. Data augmentation was employed using different filter banks. A total of 4560 CXR images of patients with COVID-19 (360 cases), as well as viral, bacterial, fungal, and other diseases, were used. Their model was compared with 10 existing COVID-19 identification techniques and achieved an accuracy of 98.5% in the 2-class classification task. The advantage of their method is that the utilization of different filters improved the performance of the classifiers. The limitation of their study is that the dataset used was small.
A 2-way classification framework for detecting 15 different types of chest diseases including COVID-19 using CXR images was proposed by Rehman et al. [
24]. First, a CNN architecture with a softmax classifier was used. Then, TL was used with a fully connected layer of the proposed CNN to extract deep features. The deep features were then fed into traditional ML classifiers. They performed 10-fold and 5-fold validation on the best-performing of the 7 ML classifiers. They collected 2800 CXR images belonging to 14 classes and 200 CXR images belonging to the COVID-19 class. Their method achieved an overall validation accuracy of 99.40% with the KNN-fine algorithm using 5-fold validation and 99.77% with the Bag-ensemble algorithm using 10-fold validation. Although their method has the advantage of fusing deep and machine learning models, the dataset used was small, and training took a long time (500 epochs). In [
25], Brima et al. designed a ResNet50 CNN-based architecture for detecting and classifying four types of classes using CXRs. Their dataset consisted of 21,165 CXR images divided into 6012 images for lung opacity, 3616 images for COVID-19, 1345 images for viral pneumonia and 10,192 images for the normal class. Their approach scored a test accuracy of 94% using the 5-fold cross validation technique. The advantage of this study is the large dataset used to compare different pre-trained models to detect the most suitable model. The limitation is the large number of epochs (100) used for training with the limited computing needed to perform hyper-parameter space searches, and, as a result, it took a long time. Additionally, the output accuracy was not high, although the dataset used was large and the principle of the TL was applied. To alleviate the burden placed on a single network, a multi-step classification by Albahli et al. was proposed in [
26] for detecting COVID-19 and other chest diseases using X-ray images. They applied the TL with different pre-trained models using data augmentation and semantic segmentation in order to increase the model’s accuracy in a 10-fold cross-validation. For the first level of classification (i.e., 3 classes), their technique achieved an average test accuracy of 92.52%. In the second level of classification (i.e., 14 classes), using a ResNet50 model, their technique achieved a maximum test accuracy of 66.63%. For all 16 classes, which were classified at once, the overall accuracy for COVID-19 detection decreased, achieving a rate of 71.91%. The advantage of this study is that the combination between the two classifiers provided a compatible accuracy on the dataset used, and the dataset contained more than two classes. The limitations of this study are that the accuracy was not high despite applying the TL and data augmentation techniques, and the dataset used was of a small size. Manokaran et al. proposed a modified DenseNet201 network[
27] for COVID-19 detection that included a global average pooling layer, a batch normalization layer, a dense layer with ReLU activation, and a final classification layer. Their model was trained using 8644 images (4000 each for normal and pneumonia cases and 644 COVID-19 cases) and tested on 1729 images (129 COVID-19, 800 normal, and 800 pneumonia), yielding an overall accuracy of 92.19% in a comparison with 7 pre-trained models. The advantage of this study is that the TL technique provided an acceptable accuracy. The limitation of this study is that the overall accuracy was not high even though the dataset was large and TL was employed. Additionally, the model was trained for 100 epochs; thus, training took a long time.
In summary, after the COVID-19 outbreak, a tremendous amount of research regarding the detection of COVID-19 has been conducted using different types of medical images (e.g., CXR and CT [
28,
29,
30,
31,
32]). The existing techniques have their own advantages and disadvantages. Handcrafted vs. feature learning-based methods, time complexity, sample data size, binary and multi-level classification are the main criteria used for evaluation and models’ comparisons. In this paper, we propose and investigate a deep learning-based COVID-19 detection model that integrates TL with an Xception pre-trained module for feature extraction. Compared to [
27], our system provides a multi-level approach for the accurate detection of COVID-19 using X-ray scans. In our model, we did not employ the pre-trained model as is. We applied the TL principle in the first stage, using the pre-trained network for the feature extraction step so that the model benefits from its learned representations and achieves high performance. Additionally, we have tested our method on a larger dataset and investigated different activation layers, functions, and optimizers to attain the best performance with minimum losses.
4. Performance Evaluation and Validation
Generally, a variety of metrics can be used to evaluate the performance of classification models. These include classification accuracy, precision, and the F1-score. All of these metrics depend on four possible outcomes: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). In our particular model, TP represents the COVID-19 patients identified correctly by our system, FP represents healthy subjects identified as patients, TN represents the healthy subjects identified correctly, and FN represents COVID-19 patients categorised as healthy subjects.
Based on those outcomes, various evaluation parameters can be measured. For example, the classification accuracy represents the number of labels correctly categorised divided by the total number of labels to be classified [
40]. The recall (REC) and the specificity calculate the fraction of positive and negative patterns, respectively, that are correctly categorised. The precision (PRE) is the fraction of patterns predicted as positive that are truly positive. Finally, the F1-score is the harmonic mean of the REC and PRE values. Those metrics are, respectively, defined mathematically by [
41] as:
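For reference, the standard forms of these metrics in terms of TP, FP, TN and FN are given below. They follow from the verbal definitions above and are assumed, not confirmed, to match the formulation in [41]; the abbreviations ACC and SPE are introduced here for illustration only.

```latex
\begin{align}
\mathrm{ACC} &= \frac{TP + TN}{TP + TN + FP + FN} \\
\mathrm{REC} &= \frac{TP}{TP + FN} \\
\mathrm{SPE} &= \frac{TN}{TN + FP} \\
\mathrm{PRE} &= \frac{TP}{TP + FP} \\
F_1 &= \frac{2 \cdot \mathrm{PRE} \cdot \mathrm{REC}}{\mathrm{PRE} + \mathrm{REC}}
\end{align}
```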
In addition to accuracy metrics, the receiver operating characteristics (ROC) curve is also employed to confirm and support the robustness and accuracy of our deep learning model. ROCs depict a model’s classification output based on the true positive and false positive rates at various classification thresholds. Quantitatively, an additional evaluation metric is usually employed in classification systems using the area under the curve (AUC) of an ROC. The AUC value illustrates how well the model performs by discriminating between classes [
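42]. AUC can be calculated according to Equation (10):
$$\mathrm{AUC} = \frac{S_p - n_p (n_p + 1)/2}{n_p \, n_n} \tag{10}$$
where $S_p$ is the sum of the ranks of all the positive examples, and $n_p$ and $n_n$ signify the numbers of positive and negative examples, respectively.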
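As an illustration of the rank-based AUC computation described above, a minimal Python sketch follows. It assumes the usual rank-sum (Mann-Whitney) form, in which examples are ranked by ascending score and tied scores share their average rank; the function name `auc_from_ranks` is hypothetical and the inputs are toy values, not results from this study.

```python
def auc_from_ranks(scores, labels):
    """Binary AUC from the rank sum of the positive examples.

    scores: predicted scores for the positive class.
    labels: 1 for positive examples, 0 for negative examples.
    """
    pairs = sorted(zip(scores, labels))  # ascending by score
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied scores starting at position i.
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1

    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative example")

    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Toy example: one positive is ranked below one negative.
print(auc_from_ranks([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))
```

For a three-class problem such as the one considered here, one common choice is to apply this computation per class in a one-vs-rest fashion, treating the class of interest as positive and the other two as negative.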
5. Experimental Results
The dataset contains 7395 images that have been prepared as a benchmark for the research community for the multi-class task of detecting COVID-19 from CXR images. It was collected from several sources available on the Kaggle website [
43,
44,
45,
46]. The COVID-19 images were labelled by a licensed radiologist, and only those showing a specific sign were used in the study. A total of 7395 images were used (1371 for COVID-19, 1751 for normal, and 4273 for pneumonia);
Figure 3 shows samples of every class in the dataset.
In this research, to precisely identify COVID-19, we employed a DL algorithm that utilized TL for the multi-class classification of chest X-rays into three classes: COVID-19, normal and pneumonia. The performance of the proposed model for COVID-19 classification depends on the optimizer and the activation function used. The model was developed in the Python programming language. In particular, we used TensorFlow, an open-source library commonly used for machine learning applications such as neural networks, and Keras, a high-level neural network API built on top of TensorFlow. The proposed model was trained with a learning rate
of
, a batch size of 32, categorical cross-entropy as the loss function and RMSprop as the optimizer. Training used 80% of the dataset, and validation and testing used 10% each. During training, the validation loss dropped to a low value after only a few epochs (see
Figure 4); we therefore trained for 10 epochs in our experiments. All images were rescaled and resized to a target size of 200 × 200 × 3, and augmentation was applied before they were fed into the neural network. In addition to quantitative accuracy, the system performance against the number of epochs was analyzed. First, we needed to define the parameters used in the DL criteria. The training and validation losses were calculated as the sum of the errors made for each example in the training or validation set. Generally, DL-based classification systems, including CNNs, utilize gradient descent (GD)-based optimization techniques to adjust the network parameters and lower the error (or loss score) during training. In this work, we have examined multiple optimization algorithms, namely Adam, RMSprop, SGD, AdaGrad, AdaDelta and AdaMax [
47] and compared their performance.
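The training setup described above can be sketched in Keras as follows. This is a minimal illustration under stated assumptions, not the exact network used in this work: the order of the head layers (GAP, then LeakyReLU, then a softmax layer) is assumed, whether the Xception base is frozen is not specified and is left at the Keras default, and the function name `build_model` is hypothetical. The learning rate is a required argument because its value is not given in the text above.

```python
import tensorflow as tf

NUM_CLASSES = 3              # COVID-19, normal, pneumonia
INPUT_SHAPE = (200, 200, 3)  # image size stated in the text


def build_model(learning_rate, alpha=0.1, weights="imagenet"):
    """Xception feature extractor followed by an assumed GAP/LeakyReLU/softmax head."""
    base = tf.keras.applications.Xception(
        include_top=False, weights=weights, input_shape=INPUT_SHAPE)

    inputs = tf.keras.Input(shape=INPUT_SHAPE)
    x = base(inputs)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    # Slope passed positionally so the call works across Keras versions,
    # where the keyword name differs.
    x = tf.keras.layers.LeakyReLU(alpha)(x)
    outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.RMSprop(learning_rate=learning_rate),
        loss="categorical_crossentropy",
        metrics=["accuracy"])
    return model
```

With one-hot encoded labels, training would then be a call such as `model.fit(x_train, y_train, batch_size=32, epochs=10, validation_data=(x_val, y_val))`, where the array names are placeholders. Swapping the optimizer (e.g., `tf.keras.optimizers.Adam` or `tf.keras.optimizers.Adamax`) in `compile` is enough to reproduce the kind of optimizer comparison described here.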
In the first set of experiments, we evaluated the performance using different optimizers to select the best network parameters for diagnosis with the LeakyReLU activation function at
and
of
. The overall result accuracies for different optimizers are shown in
Table 2. As can be readily seen, the performance measures of the proposed model for COVID-19 classification depend on the employed optimizer. Additionally,
Table 3 summarizes the class-wise performance metrics of each optimizer in detail, and
Figure 5 and
Figure 6 represent the confusion matrices and the ROC curves for different optimizers.
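To make the link between a confusion matrix and the class-wise metrics concrete, the following Python sketch derives one-vs-rest TP, FP, FN and TN counts for each class and computes the corresponding metrics. It assumes rows are true classes and columns are predicted classes; the function name `classwise_metrics` is hypothetical, and the matrix in the example is a made-up placeholder, not a result from this study.

```python
def classwise_metrics(cm):
    """Overall accuracy and per-class metrics from a square confusion matrix.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    per_class = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                         # class k predicted as another class
        fp = sum(cm[i][k] for i in range(n)) - tp    # other classes predicted as k
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_class.append({"precision": precision, "recall": recall,
                          "specificity": specificity, "f1": f1})
    accuracy = sum(cm[k][k] for k in range(n)) / total
    return accuracy, per_class


# Placeholder 3-class matrix (order: COVID-19, normal, pneumonia).
example = [[8, 1, 1],
           [0, 9, 1],
           [2, 0, 8]]
accuracy, per_class = classwise_metrics(example)
print(accuracy, per_class[0])
```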
Similarly, the system’s performance against different optimizers using the ELU function has also been studied.
Table 4 illustrates the performance metrics of the proposed model using ELU at $\alpha$ = 0.2, and the associated confusion matrices are shown in
Figure 7. Furthermore,
Figure 8 displays the ROC curves for the examined optimizers, and the detailed class-wise performance metrics are given in
Table 5.
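For context, the two activation functions compared in these experiments are conventionally defined as follows, where $\alpha$ sets the slope (LeakyReLU) or the saturation level (ELU) for negative inputs. These are the standard definitions and are assumed to be the ones used here.

```latex
\mathrm{LeakyReLU}(x) =
\begin{cases}
x, & x > 0 \\
\alpha x, & x \le 0
\end{cases}
\qquad
\mathrm{ELU}(x) =
\begin{cases}
x, & x > 0 \\
\alpha \left( e^{x} - 1 \right), & x \le 0
\end{cases}
```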
In addition to the evaluation of our system using different activations and optimizers, we evaluated the performance of our system against other DL methods. In particular, we applied different pre-trained networks, namely Inception V3, DenseNet121, ResNet50, MobileNet, and VGG16, to our dataset. The class identification was done using the softmax layer of the pre-trained networks.
Table 6 displays the performance of each model with all evaluation metrics and TP. Namely, the proposed model achieved a high validation accuracy of 99.3% compared to other traditional pre-trained models.
Figure 9 presents the confusion matrices for different pre-trained models, and
Figure 10 shows their respective ROC curves. It is worth mentioning that our approach achieved high accuracy with the advantage of being shallow (unlike the deep pre-trained models) with a low training burden (10 epochs). We also investigated the use of different activation functions and optimizers. In addition, our results are promising and comparable to those of other DL studies for COVID-19 detection. The results of other DL studies, with the details of their respective datasets, are listed in
Table 7.
6. Discussion
Recent statistical data from the WHO and CDC showed a significant increase in COVID-19 cases, hospitalizations, and deaths in the first week of January 2022. In the US, 98% of cases were caused by the spread of the Omicron variant of the coronavirus [
49]. Therefore, the early prediction of COVID-19 is of immense importance to help in avoiding the disease spread and thus protecting immune-vulnerable people. As in many other diseases, early detection can be very helpful for diagnosis and evaluation, as well as for follow-up on disease progression. Thus, an automated system that utilizes patient data could be a valuable decision support tool for COVID-19 detection and provide significant downstream treatment/follow-up implications.
For that purpose, we have developed a deep learning system (DLS) to predict COVID-19. The proposed DLS is validated experimentally on a benchmark dataset of chest X-ray images, which contains three classes: COVID-19, normal control, and pneumonia. The proposed pipeline utilized a pre-trained Xception model and modified the network structure by including a global average pooling (GAP) layer. We used the GAP layer to mitigate the vanishing gradient problem and reduce the probability of overfitting; the activation layer was used to reduce losses. Furthermore, we have explored different activation functions at different thresholds to improve performance and reduce the losses of the proposed pipeline. Additionally, extensive evaluation has been conducted using different optimizers.
Developing an analysis pipeline with high accuracy is our ultimate goal, and various experiments were conducted to evaluate performance using several metrics, such as precision, recall, accuracy, F1-score, and area under the ROC curve. Before any experiment, data augmentation (i.e., rotation by 25 degrees, a zoom range equal to 0.2 and fill mode set to nearest) was applied to the data fed into the DLS to overcome the dataset imbalance.
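The augmentation settings listed above could be expressed with Keras preprocessing layers as in the sketch below. This is an assumed mapping rather than the exact augmentation code used in this work: the text does not say which Keras utility was used, and `RandomRotation` and `RandomZoom` are substituted here as one way to apply a rotation range of 25 degrees, a zoom range of 0.2 and nearest-pixel filling. The helper name `augment_batch` is hypothetical.

```python
import tensorflow as tf

# RandomRotation takes its range as a fraction of a full turn,
# so 25 degrees corresponds to 25 / 360.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomRotation(25 / 360, fill_mode="nearest"),
    tf.keras.layers.RandomZoom(0.2, fill_mode="nearest"),
])


def augment_batch(images):
    """Apply the random transforms to a batch shaped (N, 200, 200, 3)."""
    # training=True forces the random transforms to be applied.
    return augment(images, training=True)


batch = tf.random.uniform((4, 200, 200, 3))
print(augment_batch(batch).shape)
```

The transforms keep the image size unchanged, so the augmented batch can be passed directly to a network that expects 200 × 200 × 3 inputs.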
Experimental results have revealed that the performance of the proposed DLS can be improved by carefully choosing the nonlinear activation function and the optimization method. In particular, the proposed model achieved an accuracy of 99.3% and a minimum loss of 0.02 using the combination of the RMSprop optimizer and LeakyReLU with $\alpha$ = 0.1. Other optimization techniques (e.g., AdaMax) produced similar results in shorter times. This has been documented using overall and class-wise accuracies, ROC curves and confusion matrices. Similar experiments were conducted using the ELU activation at $\alpha$ = 0.2. As
Figure 7 shows, the COVID-19 class samples (i.e., 130) were correctly classified, while six normal cases resulted in negative and positive prediction errors. Overall, the LeakyReLU activation function with $\alpha$ equal to 0.1 combined with the RMSprop optimizer is the best case (see
Table 2 and
Table 3 and
Figure 5 and
Figure 6). It is also worth mentioning that the use of TL let our model start from good initial parameters, so only a few training epochs were needed to achieve good performance. In particular, the training and validation loss/accuracy curves showed that the validation accuracy became high, with small losses, after only a few epochs (10).
Additional comparisons between the developed DLS and other pre-trained deep learning-based methods, including Inception V3, DenseNet121, ResNet50, MobileNet, and VGG16, have been conducted. We unified the parameters for all systems by using the ReLU activation function,
= 0.1, and the Adam optimizer. The results shown in
Table 6 and
Figure 9 and
Figure 10 emphasize the benefits of the proposed DLS, especially its advantage of being shallow (unlike the deep pre-trained models) with a low training burden (10 epochs). The comparative results with other deep learning systems, tested on different COVID datasets, highlighted that although the proposed model is shallow, it provides high accuracy in a short time when tested on CXR images. The above results document that the proposed DLS system can be helpful in healthcare systems in several ways, such as decreasing consultation-associated costs, enhancing detection consistency, and thus lowering the risk of disease spread.
Despite the promising results, our analysis pipeline has some limitations. Firstly, we trained and tested our approach on a single benchmark dataset. Thus, evaluation on external test datasets from other centers should further establish the robustness of our automated DLS. Secondly, our DLS only utilized a single input data type: chest X-ray images. In clinical practice, multiple data sources are usually used, such as clinical biomarkers and, sometimes, chest CT scans, which can provide more localized views of the affected lungs. Recent studies, e.g., [
28,
29,
30,
31,
32], explored the various DL methods available to diagnose COVID-19 based on chest CT images. The studies have documented high accuracy rates for both binary and multi-class classification tasks. Therefore, an update to the proposed model could integrate multiple inputs and give a more robust prediction at the patient level. Finally, our system concentrated on two activation functions without further investigation of the effect of $\alpha$ or other activation functions.