1. Introduction
Coronavirus disease 2019 (COVID-19), caused by the SARS-CoV-2 virus, has wreaked havoc on humanity, and especially on healthcare systems. For example, the recent wave of infections in India forced a great number of families to seek care at home due to a lack of intensive care units. Worldwide, millions have succumbed to the pandemic and many more have suffered short- and long-term health problems. The most common symptoms of this viral syndrome are fever, dry cough, fatigue, aches and pains, loss of taste/smell, and breathing problems [1]. Other less common symptoms are also possible (e.g., diarrhea, conjunctivitis) [2]. Infections are officially confirmed using real-time reverse transcription polymerase chain reaction (RT-PCR) [3]. However, chest radiographs using plain chest X-rays (CXRs) and computerized tomography (CT) play an important role in confirming the infection and evaluating the extent of damage to the lungs. CXR and CT scans are considered major evidence for the clinical diagnosis of COVID-19 [4].
Chest X-ray imaging is one of the most common clinical diagnosis methods. However, reaching the correct judgement requires specialist knowledge and experience. The strain that the COVID-19 pandemic has placed on medical staff worldwide, in addition to the already inadequate number of radiologists per capita [5], necessitates innovative, accessible solutions. Advances in artificial intelligence have enabled sophisticated applications that can meet clinical accuracy requirements and handle large volumes of data. Incorporating computer-aided diagnosis tools into the medical hierarchy has the potential to reduce diagnostic errors, improve workload conditions, increase reliability, and enhance the workflow by providing radiologists with reference diagnoses.
The fight against COVID-19 has taken several forms and fronts. Computerized solutions offer contactless alternatives to many aspects of dealing with the pandemic [6]. Some examples include robotic solutions for physical sampling, vital sign monitoring, and disinfection. Moreover, image recognition and AI are actively used to identify confirmed cases not adhering to quarantine protocols. In this work, we propose an automatic diagnosis artificial intelligence (AI) system that identifies COVID-19-related pneumonia from chest X-ray images with high accuracy. One customized convolutional neural network model and two pre-trained models (i.e., MobileNets [7] and VGG16 [8]) were incorporated. Moreover, CXR images of confirmed COVID-19 subjects were collected from a large local hospital and inspected by board-accredited specialists over a period of 6 months. These images enrich the limited number of existing public datasets and form a larger training/testing set of images than in the related literature. Importantly, the reported results come from testing the models with this completely foreign set of images, in addition to evaluating the models on the fused aggregate set. This approach exposes any overfitting of the models to a specific set of CXR images, especially as some datasets contain multiple images per subject.
2. Background and Related Work
COVID-19 patients who have clinical symptoms are more likely to show abnormal CXRs [9]. The main findings of recent studies suggest that these lung images display patchy or diffuse reticular–nodular opacities and consolidation, with basal, peripheral, and bilateral predominance [10]. For example, Figure 1 shows the CXR of a mild case of lung tissue involvement with right infrahilar reticular–nodular opacity. Moreover, Figure 2 shows the CXR of a moderate to severe case of lung tissue involvement; it displays right lower zone lung consolidation and diffuse bilateral airspace reticular–nodular opacities that are more prominent in the peripheral parts of the lower zones. Similarly, Figure 3 shows the CXR of a severe case of lung tissue involvement, with diffuse bilateral airspace reticular–nodular opacities that are more prominent in the peripheral parts of the lower zones, and ground-glass opacity in both lungs, predominant in the mid and lower zones. On the other hand, Figure 4 shows an unremarkable CXR with clear lungs and acute costophrenic angles (i.e., normal).
AI, with its machine learning (ML) foundation, has taken great strides toward deployment in many fields. For example, Vetology AI [11] is a paid service that provides AI-based radiograph reports. Similarly, widespread research on and usage of AI in medicine has been observed for many years now [12,13]. AI-based web or mobile applications for automated diagnosis can greatly aid clinicians by reducing errors, providing remote and inexpensive diagnosis in poor, understaffed, and underequipped areas, and improving the speed and quality of healthcare [14]. In the context of COVID-19 radiographs, ML methods can feasibly evaluate CXR images to detect the aforementioned markers of COVID-19 infection and its adverse effects on the state of the patients' lungs. This is of special importance considering that health services were stretched to their limits, and sometimes to the brink of collapse, by the pandemic.
Deep learning AI enables the development of end-to-end models that learn and discover classification patterns and features using multiple processing layers, rendering explicit feature extraction unnecessary. The sudden spread of the COVID-19 pandemic has necessitated innovative ways to cope with the rising healthcare demands of this outbreak. To this end, many recent models have been proposed for COVID-19 detection. These methods rely mainly on CXR and CT images as input to the diagnosis model [15,16]. Hemdan et al. [17] proposed the COVIDX-Net deep learning framework to classify CXR images as either positive or negative COVID-19 cases. Although they employed seven deep convolutional neural network models, the best results were 89% and 91% F1-scores for normal and positive COVID-19 cases, respectively. However, their results were based on only 50 CXR images, which is a very small dataset on which to build a reliable deep learning system.
Several existing out-of-the-box deep learning convolutional neural network algorithms are available in the literature [18], and they have been widely used in the COVID-19 identification literature with and without modifications [15]. They provide proven image detection and identification capabilities in many disciplines and research problems. Some of the most commonly used models are: (1) GoogLeNet, VGG-16, VGG-19, AlexNet, and LeNet, which are spatial exploitation-based CNNs; (2) MobileNet, ResNet, Inception-V3, and Inception-V4, which are depth-based CNNs; and (3) other models, including DenseNet, Xception, SqueezeNet, etc. These architectures can be used pre-trained with deep transfer learning (e.g., Sethy et al. [19]) or customized (e.g., CoroNet [20]), as sketched below.
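To make the pre-trained option concrete, the following is a minimal transfer-learning sketch in Keras. It assumes 224 × 224 RGB CXR inputs, a binary COVID-19/normal output, and VGG16 as the frozen feature extractor; the classifier head and its layer sizes are illustrative, not the exact configuration of any cited work.

```python
# A minimal transfer-learning sketch: frozen pre-trained features plus a
# small trainable classifier head (sizes are illustrative assumptions).
import tensorflow as tf
from tensorflow.keras import layers, models

base = tf.keras.applications.VGG16(
    weights="imagenet",        # start from ImageNet features
    include_top=False,         # drop the original classifier head
    input_shape=(224, 224, 3),
)
base.trainable = False         # freeze convolutional features (transfer learning)

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),  # COVID-19 vs. normal
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```

Freezing the base network keeps the pre-trained features intact and leaves only the small head to train, which is what makes transfer learning practical on limited medical datasets.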
Rajaraman et al. [21] used iteratively pruned deep learning ensembles to classify CXRs as normal, COVID-19, or bacterial pneumonia with 99.01% accuracy. Several models were tested, and the best results were combined using various ensemble strategies to improve the classification accuracy. However, such methods are mainly suitable for small numbers of COVID-19 images, as the computational overhead of multiple model calculations is high, and there is no guarantee that they will retain their accuracy with large datasets [15,22]. Other works for three-class classification using deep learning were also proposed in this context. The studies by Ucar et al. [23], Rahimzadeh and Attar [24], Narin et al. [25], and Khobahi et al. [26] classify cases as COVID-19, normal, or pneumonia. Others replace pneumonia with a generic non-COVID-19 category [27,28] or with severe acute respiratory syndrome (SARS) [29]. Less frequently, studies distinguish between viral and bacterial pneumonia in a four-class classification [18]. A significant number of studies conducted binary classification into COVID-19 and non-COVID-19 classes [19,30]. Although these methods achieved high accuracies (i.e., greater than 89%), the number of COVID-19 images in the total dataset is small. For example, Ucar et al. [23] used only 45 COVID-19 images. Moreover, subsequent testing of the models used a subset of the same dataset, which may give falsely improved results, especially as the same subject may have multiple CXR images in the dataset.
4. Results and Discussion
Four different approaches were used to evaluate the three deep learning models, as outlined in the sketch below. First, only the public dataset was used to train and test the models. Second, the fused dataset was used to train and test the models (i.e., the sets were combined and treated as one without any distinction). Third, the public dataset was used for training and the locally collected dataset for testing; this approach shows the ability of the models to generalize to new data and avoid overfitting to specific images/subjects. Fourth, the combined (i.e., fused) dataset was used for training and the local dataset for testing.
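Schematically, the four approaches reduce to train/test pairs over the two image sources. In the Python sketch below, `public` and `local` stand for the two image sets, and `split_train_test`, `build_model`, and `train_and_evaluate` are hypothetical placeholders, not part of the actual pipeline.

```python
# Schematic of the four evaluation approaches; split_train_test, build_model,
# and train_and_evaluate are hypothetical helpers standing in for the pipeline.
fused = public + local  # datasets merged with no distinction between sources

public_train, public_test = split_train_test(public)
fused_train, fused_test = split_train_test(fused)

approaches = {
    "1: public -> public": (public_train, public_test),
    "2: fused  -> fused":  (fused_train, fused_test),
    "3: public -> local":  (public, local),       # fully foreign test set
    "4: fused  -> local":  (fused_train, local),  # local train/test parts disjoint
}

for name, (train_set, test_set) in approaches.items():
    accuracy = train_and_evaluate(build_model(), train_set, test_set)
    print(f"{name}: accuracy = {accuracy:.3f}")
```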
Table 2 shows the number of training and testing subjects used for each approach. Note that the local dataset did not include normal CXR images, as those are abundantly available. The confusion matrices resulting from testing were analyzed to produce several standard performance measures, including accuracy, specificity, sensitivity, F1-score, and precision, as defined in Equations (1)–(5):

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (1)$$
$$\mathrm{Specificity} = \frac{TN}{TN + FP} \quad (2)$$
$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \quad (3)$$
$$\mathrm{F1\text{-}score} = \frac{2\,TP}{2\,TP + FP + FN} \quad (4)$$
$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (5)$$

where
TP: true positive, the number of subjects correctly classified into the predefined (positive) class;
FN: false negative, the number of subjects misclassified into the other (negative) class;
FP: false positive, the number of subjects misclassified into the predefined (positive) class;
TN: true negative, the number of subjects correctly classified into the other (negative) class.
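As a minimal sketch, Equations (1)–(5) follow directly from the four confusion matrix counts (the function and variable names below are illustrative):

```python
# Compute Equations (1)-(5) from a binary confusion matrix.
def metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    accuracy    = (tp + tn) / (tp + tn + fp + fn)  # Eq. (1)
    specificity = tn / (tn + fp)                   # Eq. (2)
    sensitivity = tp / (tp + fn)                   # Eq. (3), also called recall
    f1_score    = 2 * tp / (2 * tp + fp + fn)      # Eq. (4)
    precision   = tp / (tp + fp)                   # Eq. (5)
    return {"accuracy": accuracy, "specificity": specificity,
            "sensitivity": sensitivity, "f1": f1_score, "precision": precision}
```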
Figure 5 shows the training and validation loss for the models under the two training configurations (i.e., with the public dataset and with the fused dataset). The 2D CNN model required more epochs to reach an appropriate accuracy, but the training was smooth with little oscillation. The other two models required very few epochs (e.g., VGG-16 required one epoch with the fused dataset, hence the missing plot). Figure 6 shows the training and validation accuracy. The figures generally show that the models are able to properly fit the training data and improve with experience. It is clear that the MobileNets and VGG-16 models achieve superior classification accuracy. Note that, in the cross-dataset approaches, the testing dataset (i.e., the locally collected COVID-19 CXR images) is different from the training dataset. Curves such as these can be reproduced directly from the training history, as in the sketch below.
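The following sketch assumes `history` is the object returned by Keras's `model.fit(..., validation_data=...)` with the accuracy metric enabled; it is a generic plotting helper, not the exact script used to produce Figures 5 and 6.

```python
# Plot training/validation loss and accuracy from a Keras History object.
import matplotlib.pyplot as plt

def plot_history(history):
    for metric in ("loss", "accuracy"):
        plt.figure()
        plt.plot(history.history[metric], label=f"training {metric}")
        plt.plot(history.history[f"val_{metric}"], label=f"validation {metric}")
        plt.xlabel("epoch")
        plt.ylabel(metric)
        plt.legend()
    plt.show()
```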
4.1. 2D Sequential CNN
Table 3 and Table 4 show the performance evaluation metrics and the corresponding confusion matrices for the 2D sequential CNN model. The architecture achieved a best accuracy of 96.1% across all training and testing approaches. However, the accuracy drops sharply to 79% when testing was carried out using a dataset (i.e., the locally collected COVID-19 CXR images) different from the training one (i.e., the public dataset). This indicates the failure of the model to generalize to new data, and that there may be subtle or obscure differences between the images from the two datasets. This is further confirmed by the fact that normal images (see Table 4c), which were taken from the public dataset, were mostly correctly classified; the errors came from false negative classifications (i.e., type II errors). However, the accuracy improved to 89.3% when a separate part of the testing dataset was included in the training. Still, most of the errors were type II (see Table 4d). This is a model performance mismatch problem of the custom CNN, typically caused by unrepresentative data samples. However, since the other models were trained on the same data, this cause can be discounted. The MobileNets and VGG-16 models were employed using transfer learning, which inherently reduces overfitting. Moreover, these models are larger and deeper than the custom CNN, and such overparameterization can lead to better generalization performance [41]. An illustrative model of this kind is sketched below.
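For illustration only, a small 2D sequential CNN for binary CXR classification might look as follows in Keras; the layer counts and filter sizes here are assumed typical values, not the exact architecture evaluated in this work.

```python
# An illustrative 2D sequential CNN for binary CXR classification
# (layer counts and filter sizes are assumptions, not the evaluated model).
from tensorflow.keras import layers, models

cnn = models.Sequential([
    layers.Input(shape=(224, 224, 1)),      # grayscale CXR input
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # COVID-19 vs. normal
])
cnn.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```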
4.2. MobileNets
Table 5 and Table 6 show the performance evaluation metrics and the corresponding confusion matrices for the MobileNets model. It achieved accuracy values between 97.1% and 98.7%, which shows stability when faced with new data and the ability to generalize. Errors, although few, were caused by misclassifying COVID-19 CXR images as normal. However, the type I errors increased slightly (Table 6c).
4.3. VGG-16
Table 7 and Table 8 show the performance evaluation metrics and the corresponding confusion matrices for the VGG-16 model. The model achieved the best accuracy over all models (i.e., 99%) when the fused dataset was used for training and the local dataset for testing, which indicates its ability to capture various properties from different sets. However, it fell slightly behind MobileNets when the training dataset (i.e., the public dataset) was different from the testing dataset. Moreover, the model achieved the highest accuracy (98.7%) with the fused dataset for both training and testing, although MobileNets achieved slightly higher accuracy when trained and tested with the public dataset alone. Such slight performance differences when the dataset is augmented with data from other sources may need further investigation. The confusion matrices show that, for VGG-16, the majority of errors are type I across all evaluation methods, unlike the CNN and MobileNets errors (i.e., type II). Improving VGG-16's handling of normal images should therefore cut the error rate significantly.
4.4. Comparison to Related Work
Table 9 shows a performance comparison of deep learning studies in binary classification using CXR images. Some studies did not report accuracy, as their datasets were largely imbalanced. Although most related studies reported high accuracy values, a common theme among them is the lack of a significant number of COVID-19 cases for this type of classification model. For example, Narin et al. [25] mention that the excess number of normal images resulted in higher accuracy in all of their models. Such inflated accuracy is of little value, considering that very few differences exist among normal lung images across different subjects. Similarly, Hemdan et al. [17] stated that the limited number of COVID-19 X-ray images was the main problem in their work. Moreover, the dataset that we included in this work contains only one image per subject, unlike other datasets, which include more images than subjects. In addition, special consideration was paid to the type of cases included in the dataset, because the effect of COVID-19 on the lungs does not necessarily appear immediately with symptoms and may take a few days.
The literature on deep learning for medical diagnosis in general, and COVID-19 classification in particular, is vast and expanding. However, large datasets are required to build truly reliable, generalizable models. We believe that the development of mobile, easy-to-access applications that capture and store data on the fly will enable better data collection and improved deep learning models.