1. Introduction
The coronavirus disease (COVID-19) pandemic is a respiratory infection caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and has infected more than 44 million individuals globally [1]. This 21st-century pandemic has negatively affected global activities such as finance [2], security [3], food security, education, and global peace [4], with some positive effects in reducing urban pollution [5]. From the alpha to the beta variant, the virus has affected both the health and the welfare of citizens around the world [6]. The World Health Organization (WHO) declared this novel coronavirus disease a Public Health Emergency of International Concern (PHEIC) on 30 January 2020 due to its easy spread, high transmission rate, and communicability [7].
Previous studies have shown that some clinical signs of patients infected with COVID-19 closely resemble those of other viral upper respiratory diseases such as respiratory syncytial virus (RSV) infection, influenza, and bacterial pneumonia, while other common symptoms include sore throat, pleurisy, shortness of breath, dry cough, fever, and headache [8]. Different tools and methods have been used for monitoring and diagnosing this virus, such as real-time polymerase chain reaction (RT-PCR) [9]; medical imaging, including computed tomography (CT) scans [10,11], chest X-rays [12,13], and lung ultrasonography [14]; as well as blood samples [15], urine [16], and feces [17]. However, limitations reported in previous studies include inaccurate results, cost implications, the varying quality and reliability of available SARS-CoV-2 nucleic acid detection kits, and the insufficient number and throughput of laboratories performing the RT-PCR test [18]. Similarly, the use of medical images for diagnosis has its share of limitations, such as the cost of setup and the insufficient number of machines in hospitals for conducting timely COVID-19 screening [19]. These medical images are processed using various machine learning, deep learning [20], and other artificial intelligence methods [21], making them more effective.
Recently, the use of respiratory sounds or human audio samples such as coughing, breathing, counting, and vowel sounds for the detection of COVID-19 [22,23,24] has been presented as an alternative, simple, and inexpensive method for monitoring the disease. Sound and audio classification tasks have continued to increase thanks to their wide span of applications in everyday life, including medical diagnostics for cognitive decline [25] and laryngeal cancer [26]. Sound or audio recognition involves recognizing an audio stream related to various environmental sounds. The advancement of deep convolutional neural network (CNN) applications in sound classification has shown very impressive performance, based on the strong capability of deep CNN architectures to identify key features that map audio spectrograms to different sound labels, such as the time and frequency energy modulation patterns over spectrogram inputs [27]. The need for deep CNN models in sound classification arises from the challenges posed by conventional machine learning methods, including their inability to effectively identify features in spectro-temporal patterns for different sounds [28]. The recent adoption of deep networks is mainly due to their stronger representational ability, which achieves better classification performance [29]. However, real-life applications of deep neural networks suffer from overfitting, which usually results from limited datasets (data starvation), class imbalance, and the difficulty of obtaining proper annotations in many practical scenarios due to the cost and time complexity of such annotations [30]. In addition to these challenges, traditional audio feature techniques such as Mel Frequency Cepstral Coefficients (MFCC) struggle to identify the features within different audio signals that matter for efficient classification. Therefore, alternative representations such as cochleagrams are sought for audio feature extraction [31].
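To make the spectrogram pipeline concrete, the following minimal Python sketch converts a recording into a log-Mel spectrogram image of the kind a CNN consumes; the file name, sampling rate, and filterbank parameters are illustrative assumptions rather than the configuration used in this study.

```python
# Minimal sketch: turning a breath/cough recording into a log-Mel
# spectrogram image for a CNN. All parameter values are illustrative.
import numpy as np
import matplotlib.pyplot as plt
import librosa
import librosa.display

y, sr = librosa.load("breath_sample.wav", sr=22050)          # hypothetical file
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                     hop_length=512, n_mels=128)
log_mel = librosa.power_to_db(mel, ref=np.max)               # compress to dB scale

fig = plt.figure(figsize=(2.24, 2.24), dpi=100)              # ~224x224 px output
librosa.display.specshow(log_mel, sr=sr, hop_length=512)
plt.axis("off")
fig.savefig("mel_input.png", bbox_inches="tight", pad_inches=0)
plt.close(fig)
```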
CNNs are effective at learning from images, and deep CNNs are particularly well suited to the problem of sound classification for two reasons: first, when used with spectrogram-like image inputs, they can capture energy modulation patterns across time and frequency, which has been shown to be a key characteristic for differentiating between different sounds [32]; second, they can learn discriminative spectro-temporal patterns directly from the data [27]. Sounds produced by the human body are complex, making it difficult to spot the data's underlying patterns. Image-based sound classification allows for the efficient representation of a variety of sound patterns, including those coming from the heart and lungs [33,34]. However, in many situations, data augmentation is required to accomplish generalization [35].
Data augmentation has consistently shown its relevance in improving generalization by applying one or more deformations to a set of labelled training samples, thus generating additional training data. Some of the most effective data augmentation methods proposed in existing studies for audio datasets include semantics-preserving deformations in music datasets, random time shifting [36], pitch shifting, and time stretching (a minimal sketch of such audio deformations follows this paragraph). However, some traditional data augmentation techniques have proven insufficient on other sound datasets, incurring very high training time while having an insignificant impact on the performance of some state-of-the-art models [27]. Wang et al. [30] applied GAN-based semi-supervised learning using a low-density sample annealing scheme to generate new fake audio spectrograms from labelled IFER data. Other studies adopted image augmentation techniques to increase the number of spectrogram images. Mushtaq et al. [37] applied some of the most widely used image augmentation techniques to audio files converted to spectrogram images. The authors also applied five of the most popular deformation approaches to the audio files, including pitch shift, time stretch, and trim silence. The study concluded that their proposed data augmentation method improved the performance of the DCNN model more than traditional image augmentation methods, increasing accuracy on the training, validation, and test datasets. Based on the findings of these recent studies, we can agree that combining appropriate feature extraction methods with deep learning models and suitable data augmentation technique(s) can aid the performance of classifiers in sound classification. Therefore, this paper introduces effective and improved data augmentation schemes for deep learning models for sound record classification in COVID-19 detection.
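As a hedged illustration of the classical audio deformations mentioned above, the sketch below applies time stretching, pitch shifting, and additive Gaussian noise with librosa; the deformation ranges are assumptions chosen for demonstration, not values taken from the cited studies.

```python
# Sketch of common audio-domain augmentations (time stretch, pitch shift,
# additive noise). Ranges are illustrative assumptions.
import numpy as np
import librosa

def augment_audio(y, sr, rng=None):
    rng = rng or np.random.default_rng(0)
    stretched = librosa.effects.time_stretch(y, rate=float(rng.uniform(0.8, 1.2)))
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=int(rng.integers(-2, 3)))
    noisy = y + 0.005 * rng.standard_normal(len(y))   # low-level Gaussian noise
    return stretched, shifted, noisy

y, sr = librosa.load(librosa.example("trumpet"))      # any mono signal works here
variants = augment_audio(y, sr)
```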
In summary, the main contributions of our study are as follows. Firstly, we applied simple and effective data augmentation schemes for efficient data generalization in COVID-19 detection. Secondly, a pre-trained CNN architecture called DeepShufNet was analyzed and evaluated. The experimental analysis of the augmented datasets, compared with baseline results, showed significant improvements in performance metrics, better data generalization, and enhanced optimal test results. In addition, we compared and investigated the impact of data augmentation on two feature extraction methods (Mel-spectrograms and GFCC) for the detection of COVID-19 symptomatic cases, positive asymptomatic cases, and fully recovered cases. The results showed near-optimal performance, especially in recall, precision, and F1-score. The remainder of this paper is organized as follows.
The related work is presented in Section 2, where we discuss in detail the significant approaches used for data augmentation and learning classifiers in audio/sound classification. In Section 3, our proposed methodology is fully discussed, with emphasis on the dataset used as well as our proposed data augmentation and deep learning methods. Detailed results and discussions comparing the proposed method with other published results are presented in Section 4. In Section 5, concluding remarks are given.
2. Related Work
This section discusses in detail some of the state-of-the-art data augmentation techniques and classification models used by previous researchers for sound/audio classification. Research trends in COVID-19 detection include the use of conventional machine learning algorithms on sound datasets, including but not limited to coughing, deep breathing, and sneezing. Machine learning algorithms have been applied to the detection of COVID-19 with improved results; for example, Sharma et al. [22] analyzed audio texture for COVID-19 detection using datasets with different sound samples and a weighted KNN classifier. Tena et al. [38] conducted COVID-19 detection using five classifiers, namely Random Forest (RF), SVM, LDA, LR, and Naïve Bayes. The RF classifier outperformed the other machine learning methods with significant improvements in accuracy on five datasets; however, its shortfall was a lower specificity rate. Chowdhury et al. [39] presented an ensemble method using multi-criteria decision making (MCDM), and the best performance was obtained with the Extra Trees classifier.
The authors of [40] applied Gaussian noise augmentation and AUCOResNet for the detection of COVID-19. Loey and Mirjalili [41] compared six deep learning architectures, namely GoogleNet, ResNet-18, ResNet-50, ResNet-101, MobileNet, and NASNetMobile, for the detection of COVID-19 using a cough dataset. The study showed that ResNet-18 outperformed the other models with a significant performance result. Pahar et al. [42] presented three pre-trained deep neural networks, a CNN, an LSTM, and a ResNet50 architecture, for the detection of COVID-19 using five datasets. Erdogan and Narin [43] applied deep features from ResNet50 and MobileNet architectures to a support vector machine for the detection of COVID-19; their feature extraction used two conventional approaches, empirical mode decomposition (EMD) and discrete wavelet transform (DWT). The study showed a high-performance result with ResNet50 deep features. Sait et al. [44] proposed a transfer learning model called CovScanNet for the classification of COVID-19 using multimodal datasets. Soltanian and Borna [45] investigated the impact of a lightweight deep learning model on classifying COVID-19 from non-COVID-19 coughs in the Virufy dataset, combining separable kernels in deep neural networks for COVID-19 detection.
Despotovic et al. [46] applied a CNN model based on VGGish to a Cough and Voice Analysis (CDCVA) dataset, and the study gave an improved performance of 88.52% accuracy, while Mohammed et al. [47] presented shallow machine learning, a convolutional neural network (CNN), and pre-trained CNN models on the Virufy and Coswara datasets, with performance metrics showing 77% accuracy. Brown et al. [48] presented ML algorithms such as Logistic Regression (LR), Gradient Boosting Trees, and Support Vector Machines for the detection of COVID-19.
Some of the data augmentation techniques presented by previous researchers include the study by Lella and Pja [49], which applied traditional audio augmentation methods to a one-dimensional CNN for diagnosing COVID-19 respiratory disease using human-generated sounds such as voice/speech, cough, and breath datasets. Salamon and Bello [27] examined the impact of different data augmentation methods on a CNN model and concluded that class-conditional data augmentation is needed to improve the performance of deep learning models. Leng et al. [29] proposed a Latent Dirichlet Allocation (LDA) approach for augmenting audio events from audio recordings and compared the performance of the proposed LDA algorithm to other data augmentation techniques such as time and pitch shifting and Gaussian noise. Based on this literature review, we can conclude that, to a great extent, existing data augmentation and classification methods for COVID-19 sound/audio datasets still struggle to identify an appropriate and lightweight data augmentation method that overcomes the problems of limited training data and data imbalance. Background noise in sound datasets hampers effective feature extraction; therefore, creating synthetic datasets from such noisy datasets would also affect the classification efficiency of deep learning models. There is a need to collect more quality data and thereby improve the performance of the learning models [38,50]. Therefore, this study proposes a simple and efficient deep learning architecture, referred to as the DeepShufNet model, for improved classification of COVID-19. In addition, we applied effective data augmentation techniques using noise and color transformation methods to generate better synthetic datasets, thus improving data generalization and COVID-19 detection (a sketch of such image-level augmentation follows this paragraph).
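For illustration, a plausible image-level realization of such noise and color transformations with torchvision is sketched below; the exact transforms and magnitudes used to build the synthetic datasets in this study are not reproduced here, so the values should be treated as assumptions.

```python
# Hypothetical sketch of spectrogram-image augmentation in the spirit of
# the noise and color-transformation schemes described above.
import torch
from torchvision import transforms

class AddGaussianNoise:
    """Add zero-mean Gaussian noise to a [0, 1] image tensor."""
    def __init__(self, std=0.05):
        self.std = std
    def __call__(self, img):
        return (img + self.std * torch.randn_like(img)).clamp(0.0, 1.0)

color_aug = transforms.Compose([                  # color-transformation variant
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3),
    transforms.ToTensor(),
])
noise_aug = transforms.Compose([                  # noise-injection variant
    transforms.ToTensor(),
    AddGaussianNoise(std=0.05),
])
```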
4. Experimental Results and Discussion
This section is based on an extensive experiment and an effective investigation of all the different datasets on the proposed DeepShufNet. All experiments were conducted in MATLAB R2020b on a desktop PC with an Intel(R) Core i5 (3.2 GHz) processor, 8 GB of RAM, and an NVIDIA GeForce GTX 1070 GPU server with 120 GB of memory.
Taking into consideration the hardware constraints and the issue of out-of-memory errors, we reduced the batch size to 200 for both training and testing. Considering the huge data sparsity within the Coswara dataset classes, each experiment was repeated five times.
4.1. Training and Testing Prediction
The proposed DeepShufNet model was trained and tested on the feature-extracted images combined from all Coswara datasets. Cross-validation was applied to find the optimal parameter configuration, and the model was trained and validated on 80% of the total images extracted from the sound dataset, consisting of 1706 samples across the healthy, positive asymptomatic, positive mild, positive moderate, recovered full, RINI, and NRIE classes with 1098, 34, 185, 58, 79, 120, and 132 samples, respectively. The adaptive moment estimation (Adam) algorithm was used for training, with the hyperparameter values summarized in Table 6. The learning rate controls the rate of the weight updates, thereby reducing the prediction error, while the batch size determines the number of sample rows processed at a time before the internal network parameters are updated. The baseline experiment was evaluated using the raw feature-extracted images, and the training process was run both with and without fine-tuning. The final DeepShufNet model was selected as the model with the least validation loss during training (a sketch of this transfer-learning setup follows this paragraph).
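The experiments in this study were run in MATLAB R2020b; as a framework-agnostic illustration of the same transfer-learning setup, the PyTorch sketch below loads a pretrained ShuffleNetV2 backbone and replaces its classifier head for the seven Coswara categories. The learning rate and other unstated settings are assumptions.

```python
# PyTorch analogue of the transfer-learning setup described above
# (the study itself used MATLAB R2020b); unstated settings are assumptions.
import torch
import torch.nn as nn
from torchvision.models import shufflenet_v2_x1_0, ShuffleNet_V2_X1_0_Weights

num_classes = 7                                           # Coswara classes listed above
model = shufflenet_v2_x1_0(weights=ShuffleNet_V2_X1_0_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, num_classes)   # new classifier head

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # Adam, as in the paper

def train_step(images, labels):
    """One optimization step over a batch of spectrogram images."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```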
The training model for each experiment was analyzed, and improvements in the classification results in terms of validation accuracy and loss were noted. The results on the original dataset without augmentation suffer from an increasing misclassification rate for the minority classes, especially when classifying the positive asymptomatic and positive COVID-19 classes, with recall and precision rates ranging from undefined (no true positives) to less than 10%. However, training the DeepShufNet model with our categories of synthetic datasets gave near-optimal results with better performance in the detection of COVID-19.
The experimental results are presented in four comparative categories, and all results were obtained from experiments on the test dataset. The overall performance of the model on each category of dataset is compared using the optimal model from the five repeated experiments. In each comparative experiment, the combination of accuracy, recall, and specificity is the main criterion for judging the performance of the model on each dataset category, since it examines the outcomes for both classes and the improvement in classification results for the minority class (a sketch of how these metrics follow from the confusion matrix is given after the list below). The detailed summaries of all measures for each category are stated as follows.
All Positive vs. Healthy;
Positive Asymptomatic vs. Healthy;
Healthy vs. Recovered Full.
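For reference, the metrics reported throughout this section follow directly from the binary confusion-matrix counts; the sketch below shows the computation (an illustration, not the evaluation code used in the study). The undefined (NaN) cases correspond to the "undefined" rates noted above, where a minority class yields no positive predictions.

```python
# Metrics from binary confusion-matrix counts: tp, fp, tn, fn.
def metrics(tp, fp, tn, fn):
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    recall      = tp / (tp + fn) if (tp + fn) else float("nan")   # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    precision   = tp / (tp + fp) if (tp + fp) else float("nan")
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else float("nan"))
    return accuracy, recall, specificity, precision, f1
```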
4.2. Classification of Deep Breath Sounds (All Positive COVID-19 vs. Healthy)
This section compares the results of the transfer learning DeepShufNet on the resized input images for binary classification of healthy versus all positive classes. Due to the similarities between the positive mild and positive moderate classes, we combined these two classes to create a new class called the all-positive-COVID class. A comparison of the detection power of our proposed DeepShufNet on the Mel spectrogram feature images and GFCC features is shown in Table 7. The classification results reflect the improvement and stability of DeepShufNet on the augmented datasets.
On the test set, the best performance for DeepShufNet was achieved using the Mel spectrogram images in the COCOA-2 dataset (see Table 7), with positive COVID-19 detection summarized as a mean accuracy of 85.1 (standard deviation [SD], 4.23), 70.85 (SD, 7.7) for recall/sensitivity, 59.64 (SD, 13.12) for precision, 88.25 (SD, 6.14) for specificity, and 63.61 (SD, 6.7) for F1-score. The test set results of our proposed model on COCOA-3 show a substantial improvement, with a mean accuracy of 87.82 (SD, 1.3), 69.49 (SD, 4.9) for recall/sensitivity, 64.82 (SD, 4.7) for precision, 91.75 (SD, 1.9) for specificity, and 66.9 (SD, 2.8) for F1-score. The original dataset without augmentation performed the worst on the test set when compared with the other datasets. The datasets using the GFCC images with augmentation still outperform the original dataset, with accuracies of 83.1 (SD, 1.4), 83.05 (SD, 0.9), 76.4 (SD, 2.5), and 74.9 (SD, 3.8) for COCOA-3, COCOA-2, COCOA-1, and the raw data (no augmentation), respectively. More interesting is the mean recall for DeepShufNet: 71.33 (SD, 2.2), 48.7 (SD, 14.1), 46.7 (SD, 11.5), and 38.8 (SD, 9.3) for COCOA-1, no augmentation, COCOA-2, and COCOA-3, respectively.
The summary of DeepShufNet on the Mel spectrogram images is presented in Figure 6, which reflects the best experimental outcome for COCOA-2, with values for accuracy, recall, specificity, precision, and F1-score of 90.1%, 62.71%, 95.99%, 77.1%, and 69.2%, respectively. The second-best results were achieved with COCOA-3, with an accuracy of 89.5%, 71.2% recall, 93.4% specificity, 70% precision, and 70.6% F1-score. The worst results were achieved on the raw dataset without augmentation, with an accuracy of 79%, 54.23% recall, 84.3% specificity, 42.67% precision, and 47.76% F1-score.
In the same manner, Figure 7 shows the comparison results of DeepShufNet for the GFCC images. The application of noise augmentation (COCOA-2) and the combined dataset (COCOA-3) yields 84.1% and 84.7% accuracy, respectively. The two best recall results were achieved by COCOA-1 and COCOA-2, which indicates that the data augmentation approach helps to improve the classification results.
4.3. Experimental Results: Positive Asymptomatic vs. Healthy
To demonstrate the contribution of our proposed DeepShufNet models, a second experiment was conducted to classify healthy versus positive asymptomatic alone. The wide margin in data sparsity between these two classes could result in serious overfitting of the model. The growth in the performance metrics for both Mel-spectrogram and GFCC images was not continuous for the raw dataset, but applying the data augmentation approach to the training data reduced overfitting, with a training accuracy lower than the test accuracy in the last epoch. In summary, the experimental results indicate that training with the augmented datasets did not significantly improve classification accuracy; however, training the model with COCOA-1 showed good classification performance on the test sets in terms of accuracy, but the second-worst recall rate. On the other hand, training our DeepShufNet with COCOA-2 slightly increased the test classification accuracy, specificity, and F1-score. Considering the efficiency of the data augmentation methods, classification using noise augmentation is more suitable for practical application when the dataset is small, as reflected in Table 7.
Figure 8 and Figure 9 show the improvement from augmentation of the Mel-spectrogram images, with higher recall, precision, and F1-score. Therefore, we can claim that data augmentation with both feature extraction images achieved a remarkable improvement in the classification results of the proposed DeepShufNet model.
The experimental results in Table 8 show an improvement using the data augmentation methods compared to the baseline experiment, with the best accuracy achieved by COCOA-1 at 97.15% (SD, 0.5); 95.8% (SD, 1.1) for COCOA-2; 92.7% (SD, 0.17) for COCOA-3; and 92.2% (SD, 0.9) for the non-augmented data.
4.4. Experimental Results: Healthy vs. Recovered-Full
In this experiment, we validated the effectiveness of our proposed model by analyzing the detection rate of the DeepShufNet model in classifying healthy against fully recovered cases. The model was applied to the four datasets of Mel-spectrogram feature-extracted images, namely raw data (no augmentation), COCOA-1, COCOA-2, and COCOA-3, which gave the following accuracies: 93.45 (SD, 0.41) for COCOA-2; 93.33 (SD, 0.51) for COCOA-1; 91.68 (SD, 4.0) for COCOA-3; and 91.03 (SD, 0.8) for no augmentation (see Table 9).
Figure 10 and Figure 11 show the best results of all four datasets on the DeepShufNet model and reflect that the combination of the two data augmentation techniques (COCOA-3) gave the best results.
4.5. Limitations
One of the major issues faced in this study is the problem of misclassification errors associated with the poor generalization of some noisy images. As expected, the majority of the misclassification errors can be attributed to the serious class imbalance and the limited data samples. The different classes of sound, when represented as either Mel-spectrogram or GFCC feature images, show almost similar power representations, and this could affect the ability of the model to generalize the data efficiently. The generated spectrogram for each audio file is a two-dimensional array of intensity values that is largely noisy because of the environmental noise in the audio signals [55]. Therefore, it is important to equalize the value distribution to enhance feature learning (a minimal sketch follows).
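One simple way to equalize the value distribution, offered here only as an illustrative assumption rather than the method used in this study, is per-image min-max scaling of the spectrogram array:

```python
import numpy as np

def normalize_spectrogram(spec: np.ndarray) -> np.ndarray:
    """Scale a 2-D spectrogram's intensity values to [0, 1]."""
    lo, hi = spec.min(), spec.max()
    if hi == lo:                          # constant image; avoid divide-by-zero
        return np.zeros_like(spec, dtype=float)
    return (spec - lo) / (hi - lo)
```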
The proposed model is designed based on existing data augmentation techniques (color transformation and noise) and features in the frequency domain, which makes the model simple and intuitive with a low space cost. However, image spectra for sound signals can form a complex system, since some of the images cannot fully reflect the characteristic information of the sound signals, although frequency-domain features have been used by previous researchers in sound classification tasks.
Regardless of these limitations, the proposed DeepShufNet model has proven effective in detecting COVID-19 despite the gross class imbalance and the limited dataset. Moreover, it has low computational complexity in terms of resources and time. In the future, more complex data augmentation methods still need to be explored to overcome some of the misclassification errors by generating a cleaner dataset for proper generalization.
4.6. Comparison to Related Work on COVID-Sound Databases and Discussion
A further comparison in terms of accuracy, recall, and precision was carried out between our proposed system and other existing COVID-19 sound database systems. Although the existing studies applied different experimental conditions to each classification task, the proposed DeepShufNet model shows improved and promising results for COVID-19 detection compared to those studies. The comparison with related work is summarized in Table 10.
5. Conclusion
The increasing popularity of different deep neural network models in sound classification tasks is quite impressive. However, although there has been research on COVID-19 detection based on different CNN architectures, some of the publicly available datasets still suffer from huge data imbalance and limited size, and some machine learning models still classify poorly. Therefore, this work applies a deep learning model, called DeepShufNet, to different categories of data augmentation techniques. The main contributions of this work include:
Bridging the gap caused by limited datasets and class imbalance by creating a larger corpus of synthetic datasets using simple and effective data augmentation techniques. Three different synthetic datasets were created, namely COCOA-1, COCOA-2, and COCOA-3.
A deep learning model based on the pre-trained ShuffleNet architecture, called the DeepShufNet model, was trained and evaluated on the analyzed datasets for comparison. The experimental analysis of the augmented datasets, compared with baseline results, showed significant improvements in performance metrics, better data generalization, and enhanced optimal test results.
We compared and analyzed the effects of two different feature extraction methods, namely Mel-spectrogram and GFCC imaging, on the DeepShufNet model. This study investigated the effects of augmented images on the detection of COVID-19, including positive asymptomatic cases and fully recovered cases. The results showed that the DeepShufNet model had the highest accuracy on COCOA-2 Mel-spectrogram images in almost all the comparison cases. The proposed DeepShufNet models showed improved performance, especially in recall, precision, and F1-score, for all three types of augmented images. The proposed model showed the highest test results, with scores for accuracy, precision, recall, specificity, and F1-score of 90.1%, 77.1%, 62.7%, 95.98%, and 69.1%, respectively, for positive COVID-19 detection using the Mel COCOA-2 training datasets. In the same manner, the experiment for the detection of positive asymptomatic cases achieved the best recall rate of 62.5%, a specificity rate of 97.1%, and a 48% F1-score.
In the future, we will explore advanced data augmentation techniques such as the application of generative adversarial networks (GANs) to train and test the model. Furthermore, more deep learning architectures will be implemented to improve and enhance COVID-19 recognition performance. In addition, the proposed DeepShufNet deep learning model could also be applied and evaluated with the combination of all the different sound datasets.