1. Introduction
Globally, lung diseases are acknowledged as highly fatal and dangerous, affecting millions of people every year. According to the Forum of International Respiratory Societies (FIRS), respiratory disorders cause almost four million fatalities annually and are among the leading causes of morbidity worldwide [
1]. Furthermore, the World Health Organization (WHO) reported that, after cardiovascular diseases, respiratory diseases are the second largest contributor to the global disease burden, claiming approximately 10 million lives every year [
2]. The diagnostic procedures for respiratory diseases primarily involve auscultation, wherein medical specialists listen with a stethoscope to the sounds as air moves in and out of the lungs [
3]. Lung auscultation remains one of the standard techniques employed [
4] by medical specialists to assess the status of respiratory diseases. Crackles and wheezes are the two most frequently heard abnormal lung sounds [
5]. These sounds are identified based on their frequency, pitch, energy, intensity, and duration. Wheezes are continuous, high-pitched noises typically occurring in the 400–500 Hz range with a duration longer than 100 ms; they are generally heard in individuals with asthma and chronic obstructive pulmonary disease (COPD) [
6]. Crackles are discontinuous sounds with a pitch ranging between 100 and 2000 Hz; they are generally heard in patients suffering from heart failure, pneumonia, and bronchitis [
7]. Auscultation is cost-effective, easy to apply, and provides essential details about lung conditions and symptoms for a quick diagnosis [
8]. However, traditional auscultation with a stethoscope is not infallible, because it depends on the clinician’s expertise and auditory sensitivity. This dependence can lead to misclassification during an examination, even when it is carried out by an expert physician [
9]. Research by Salvatore and Nieman [
10] revealed that more than half of the pulmonary sounds were incorrectly identified by medical trainees in a hospital setting. Since lung sounds are non-stationary, it is challenging to distinguish them through traditional auscultation techniques. Therefore, there is a need to develop a respiratory disease detection system to ensure more efficient clinical diagnoses.
The 2017 public respiratory sound dataset released by the International Conference on Biomedical and Health Informatics (ICBHI 2017) [
11] has attracted significant interest among research teams developing automated systems for distinguishing lung sounds. Deep learning (DL) and conventional machine learning (ML) have been utilized in studies over the last decade to address the classification task [
12,
13,
14]. Several attempts have been made to develop algorithms and methods for feature extraction aimed at automatically identifying abnormal lung sounds. Among them, some common feature extraction techniques include spectrograms [
15], mel spectrograms [
16], wavelet coefficients [
17], and the mel-frequency cepstral coefficient (MFCC) [
18], which have been paired with a wide range of DL and ML approaches.
Pham et al. [
4] extracted various features, including short-time Fourier transform (STFT) and mel spectrogram representations. Gairola et al. [
19] employed a convolutional neural network (CNN), leveraging mel spectrograms to identify adventitious lung sounds. Bardou et al. [
20] utilized the MFCC and traditional ML features (such as local binary patterns) for feature extraction, trained these features with fully connected layers in place of the convolutional stages, and integrated the outputs of four CNN models through softmax activation. The authors of [
21] optimized a pre-trained AlexNet CNN model, utilizing scalograms as visual time-frequency representations to accurately detect and classify lung sounds. Tariq et al. [
22] developed a model that concatenates three distinct features (a chromagram, the MFCC, and a spectrogram) to classify lung audio samples using ideal CNN models. Similarly, the study in [
23] presented various feature extraction techniques to classify different respiratory diseases such as COPD and asthma.
In addition to lung sound analysis, other research has utilized methods such as the wavelet transform and the spectrogram [
24], or empirical mode decomposition (EMD) with bandpass filtering for scale selection, as well as continuous wavelet transform (CWT)-based scalogram representations processed by a lightweight CNN to classify various respiratory diseases. Recent advancements in noninvasive monitoring have led to significant progress in deriving respiratory signals from ECG data, thereby enhancing traditional respiratory sound analysis. Yi and Park [
25] demonstrated the derivation of respiratory signals using wavelet transforms directly from the ECG, establishing a foundation for reliable respiratory monitoring without the subject’s awareness. O’Brien and Heneghan [
26] presented a comparative examination of respiratory signal extraction approaches from the ECG, highlighting the accuracy and robustness of these techniques across various body postures during sleep. Furthermore, Campolo et al. [
27] introduced a novel technique employing EMD to derive respiratory signals from the ECG, showcasing its superior performance in accurately reconstructing respiratory waveforms. This approach offers a dual-modality method that enhances diagnostic capabilities by simultaneously analyzing cardiac and respiratory data. In [
28], the authors classified electroencephalogram (EEG) signals using CWT and a long short-term memory (LSTM) model, similar to the study in [
29], in which a dual scalogram comprising the Stockwell transform and a CWT scalogram was employed for fault diagnosis in centrifugal pumps. Furthermore, recent studies have explored different ML and DL techniques for binary-class (normal vs. abnormal) classification and multi-class classification of respiratory diseases [
30].
To achieve improved performance in multi-class and binary classification tasks, Nguyen and Pernkopf [
31] developed approaches that include sample padding, feature splitting, an ensemble of CNNs, and a focal loss objective. Acharya and Basu [
15] introduced a hybrid deep CNN and recurrent neural network (RNN) framework for detecting respiratory sounds using mel spectrograms. Concurrently, Demir et al. [
32] identified four different lung sounds by combining deep CNN features with a linear discriminant analysis and random subspace ensemble classifier. Additionally, to resolve imbalances in the training data, Petmezas et al. [
33] employed a model combining a CNN with an LSTM network, trained using the focal loss function. Each respiratory sound cycle was transformed into a time-frequency representation and processed by the CNN.
In this study, we propose a hybrid DL framework combined with signal processing techniques for detecting various lung disorders. We introduce a parallel transformation that yields rich features via parallel convolutional autoencoders (CAEs). Initially, the auscultation recordings undergo preprocessing through segmentation of respiratory cycles, followed by a padding technique that brings each respiratory cycle to a fixed length. Each respiratory cycle is then transformed into two time-frequency representations using CWT and a mel spectrogram. Two parallel CAEs extract rich features from these scalograms, the features are concatenated into a hybrid pool, and the pool is subsequently fed into an LSTM model that classifies the different respiratory diseases.
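To make the pipeline concrete, the following is a minimal sketch of the preprocessing and parallel transformation stage, assuming the librosa and PyWavelets packages are available. The sampling rate, the fixed cycle length, the repetition-based padding, the mel band count, the scale range, and the Morlet wavelet are illustrative assumptions, not the exact settings of our experiments.

```python
import numpy as np
import librosa  # assumed dependency for mel spectrograms
import pywt     # assumed dependency (PyWavelets) for the CWT scalogram

SR = 4000            # illustrative sampling rate (Hz)
CYCLE_LEN = 5 * SR   # pad or trim every respiratory cycle to a fixed 5 s window

def fix_length(cycle: np.ndarray) -> np.ndarray:
    """Pad (by repetition, an assumed scheme) or truncate one segmented cycle."""
    if len(cycle) < CYCLE_LEN:
        reps = int(np.ceil(CYCLE_LEN / len(cycle)))
        cycle = np.tile(cycle, reps)
    return cycle[:CYCLE_LEN]

def to_mel_scalogram(cycle: np.ndarray) -> np.ndarray:
    """Mel branch of the parallel transformation (log-power mel spectrogram)."""
    mel = librosa.feature.melspectrogram(y=cycle, sr=SR, n_mels=64)
    return librosa.power_to_db(mel, ref=np.max)

def to_cwt_scalogram(cycle: np.ndarray) -> np.ndarray:
    """CWT branch: magnitude of a Morlet-wavelet scalogram."""
    scales = np.arange(1, 65)  # 64 illustrative scales
    coeffs, _ = pywt.cwt(cycle, scales, "morl", sampling_period=1.0 / SR)
    return np.abs(coeffs)
```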
Our principal contributions are as follows:
- (1)
We present a novel method that combines deep learning and signal processing for enhanced lung auscultation analysis and classification. This approach addresses the limitations of traditional techniques utilized for lung auscultation.
- (2)
This approach applies a parallel transformation using both CWT and mel scalograms. Parallel CAEs extract rich features from the two scalogram representations in their latent spaces.
- (3)
A hybrid feature pool is created by fusing the features extracted from the CWT and mel scalograms via the CAE latent spaces. These latent spaces provide an extensive and enriched representation of the lung sound features, enhancing the analysis and classification approach.
- (4)
An LSTM network is employed to classify the various lung sounds, leveraging its proficiency in recognizing complex patterns in sequential, time-dependent data such as lung sound recordings (a minimal architecture sketch follows this list).
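As a concrete illustration of contributions (2)–(4), the Keras sketch below wires together two convolutional encoders (the encoding halves of the CAEs; the decoder halves used for reconstruction pretraining are omitted), a concatenated hybrid feature pool, and an LSTM classifier. The 64x64 input shape, layer widths, and the reshape into 32 time steps are illustrative assumptions rather than the exact configuration of our model.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_encoder(name: str) -> Model:
    """Encoding half of one CAE; filter counts and input shape are illustrative."""
    inp = layers.Input(shape=(64, 64, 1), name=f"{name}_scalogram")
    x = layers.Conv2D(16, 3, activation="relu", padding="same")(inp)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(32, 3, activation="relu", padding="same")(x)
    x = layers.MaxPooling2D(2)(x)
    latent = layers.Flatten(name=f"{name}_latent")(x)
    return Model(inp, latent, name=f"{name}_encoder")

cwt_encoder = build_encoder("cwt")
mel_encoder = build_encoder("mel")

# Hybrid feature pool: concatenate the two latent representations.
pool = layers.Concatenate(name="hybrid_pool")([cwt_encoder.output,
                                               mel_encoder.output])

# Reshape the fused vector into a short sequence so the LSTM can process it.
seq = layers.Reshape((32, -1))(pool)
x = layers.LSTM(128)(seq)
out = layers.Dense(8, activation="softmax", name="disease")(x)  # eight-class head

model = Model([cwt_encoder.input, mel_encoder.input], out)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```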
The rest of this paper is organized as follows.
Section 2 provides background information on the dataset and comprehensive details of the proposed model.
Section 3 describes the experimentation and the model’s performance. Finally,
Section 4 summarizes the proposed study along with the future expansions and enhancements planned for this work.
3. Results and Discussion
In this study, a publicly available dataset of respiratory sounds was chosen to evaluate the performance of the proposed framework [
11]. The proposed DL framework was developed and implemented in Python 3.9.18, leveraging TensorFlow 2.15.0 as the backend for the Keras library. All experiments were conducted on a desktop computer with an AMD Ryzen 9 5900X 12-core 3.70 GHz CPU, 64 GB of RAM, and an NVIDIA GeForce RTX 3080 GPU. The respiratory sound dataset encompasses four sub-tasks. The binary-class task distinguishes between normal (N) and abnormal (Ab) samples. The three-class and four-class tasks categorize respiratory cycles using the classes wheezes (W), crackles (C), normal (N), and both crackles and wheezes (B); the three-class tasks use three of these classes, and the four-class task uses all four. The eight-class task includes healthy samples and seven distinct lung diseases: pneumonia, LRTI, asthma, bronchiectasis, URTI, bronchiolitis, and COPD. The dataset was split, allocating 80% for training and 20% for testing. After evaluating the proposed model on the binary-class problem, the experimentation was extended to the three-class, four-class, and eight-class problems.

We used several metrics to evaluate the performance of respiratory sound classification: accuracy, F1-score, precision, and sensitivity. These metrics collectively provide a nuanced view of the model’s ability to correctly identify and differentiate between the various respiratory diseases. In the classification framework, a true positive (TP) is an instance accurately identified as positive, and a true negative (TN) is an instance accurately identified as negative. A false positive (FP) is an instance incorrectly identified as positive, and a false negative (FN) is a positive instance incorrectly labeled as negative. The following equations are used to calculate these metrics:
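$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$$

$$\text{F1-score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Sensitivity}}{\mathrm{Precision} + \mathrm{Sensitivity}}$$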
The proposed model, based on a hybrid approach combining digital signal processing and DL, was evaluated across classification tasks ranging from simple binary problems to three-class, four-class, and eight-class problems, to assess its effectiveness in distinguishing various respiratory diseases. The specific scenarios for each classification problem are as follows:
Binary-class problems: N-Ab, C-W, B-C, B-W, C-N, and W-N.
Three-class problems: B-C-W and C-N-W.
Four-class problems: C, W, N, and B.
Eight-class problems: Healthy (H), pneumonia (P), LRTI (L), asthma (A), bronchiectasis (B1), URTI (U), bronchiolitis (B2), and COPD (C).
3.1. Binary Classification
In the binary classification problems, the proposed model demonstrated remarkable accuracy in identifying key respiratory sounds, reliably differentiating between C, W, N, and B. Several experiments were conducted on both the official and the augmented datasets to validate the effectiveness of our proposed model. For the N-Ab problem, our model achieved an average accuracy of 85.61%, an F1-score of 84.21%, a precision of 85.36%, and a sensitivity of 83.44%. Similarly, for the B-C problem, the results were 94.41%, 93.65%, 93.57%, and 93.74% for accuracy, F1-score, precision, and sensitivity, respectively. For the C-W problem, the results were 93.57%, 93.51%, 93.50%, and 93.53%, respectively. The results for the remaining binary-class problems are shown in
Table 6.
Figure 6 depicts the confusion matrices, showing the predicted versus the true labels for different binary-class problems.
3.2. Three-Class Classification
After achieving promising results for the binary-class problems, we extended our evaluation to three-class classification problems. We further examined and compared the internal relationships and variations for the B-C-W and C-N-W problems. On the official and augmented datasets, our proposed model achieved an average accuracy of 89.45%, an F1-score of 88.41%, a precision of 88.68%, and a sensitivity of 88.16% for the B-C-W problem. For the C-N-W problem, the results were 82.04%, 82.15%, 81.94%, and 82.41%, respectively, as shown in
Table 7.
Figure 7 presents the confusion matrices for the B-C-W and C-N-W three-class problems.
3.3. Four-Class Classification
To evaluate the model’s ability to handle the four-class respiratory sound problem, both datasets were used to compare the C, W, N, and B categories. The model demonstrated promising performance across all scenarios, as illustrated by the confusion matrix in
Figure 8a. The proposed model achieved an average accuracy of 79.61%, an F1-score of 78.67%, a precision of 78.86%, and a sensitivity of 78.55% on the augmented dataset, as shown in
Table 8.
3.4. Eight-Class Classification
Finally, the evaluation of the proposed framework for eight-class problems included healthy samples and seven distinct lung diseases (P, L, A, B1, U, B2, and C), as shown in
Table 9. The confusion matrix in
Figure 8b illustrates that the model yielded an overall accuracy of 94.16%, a sensitivity of 89.56%, an F1-score of 89.56%, and a precision of 89.87%. In summary, these findings demonstrate the proposed model’s robust and reliable performance across various respiratory sound classification scenarios, including binary-class, three-class, four-class, and eight-class problems, even on unbalanced datasets.
3.5. Discussion
We proposed a novel approach to evaluating various adventitious lung sounds by employing a hybrid model that combines parallel CAEs and an LSTM network. The model’s performance was evaluated across multiple classification problems: binary-class, three-class, and four-class problems, as well as eight-class problems involving healthy samples and seven distinct diseases. In this study, lung sound signals were not fed directly into the classification model; instead, all lung sound signals were transformed into time-frequency representations. For feature extraction, the dual CWT and mel transformations were fed into parallel CAEs, and the features extracted from the CAE latent spaces were concatenated to create a hybrid feature pool. This parallel transformation allows for more precise extraction of rich features, while the fusion improves classification by efficiently capturing diverse signal characteristics. The sequential modeling capacity of the LSTM is then utilized for the classification of the various diseases. To assess the impact of the hybrid features drawn from the CAE latent spaces of both the CWT and mel spectrogram representations, we conducted an ablation study using the eight-class classification framework. The results of training the LSTM network with various feature sets are shown in
Table 10. When only the CWT CAE latent space features were used, the LSTM model achieved an average accuracy of 78.50%, an F1-score of 82.14%, a precision of 85.34%, and a sensitivity of 80.42%. In contrast, training with only the latent space features from the mel spectrogram resulted in an average accuracy of 90.83%, an F1-score of 85.70%, a precision of 88.31%, and a sensitivity of 84.59%. However, the model’s performance improved significantly when both sets of CAE latent space features were combined, with the accuracy rising to 94.69%, the F1-score to 90.69%, the precision to 91.89%, and the sensitivity to 89.78%. This shows that fusing the two sets of CAE latent space features significantly improves the LSTM network’s capacity to classify and detect various respiratory disorders in multi-class problems.
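A minimal sketch of this ablation protocol is given below. The random arrays merely stand in for the real extracted latent features, and the feature dimensions, network sizes, and epoch count are illustrative assumptions.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

rng = np.random.default_rng(0)
# Random placeholders standing in for the real CAE latent features (one row per cycle).
cwt_feats = rng.normal(size=(500, 8192)).astype("float32")
mel_feats = rng.normal(size=(500, 8192)).astype("float32")
labels = rng.integers(0, 8, size=500)   # eight-class labels

feature_sets = {
    "cwt_only": cwt_feats,
    "mel_only": mel_feats,
    "hybrid":   np.concatenate([cwt_feats, mel_feats], axis=1),
}

def build_lstm(dim: int) -> tf.keras.Model:
    """Small LSTM classifier over a reshaped feature vector; sizes are illustrative."""
    model = tf.keras.Sequential([
        layers.Input(shape=(dim,)),
        layers.Reshape((32, dim // 32)),   # view the vector as a 32-step sequence
        layers.LSTM(64),
        layers.Dense(8, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

for name, feats in feature_sets.items():
    clf = build_lstm(feats.shape[1])
    clf.fit(feats[:400], labels[:400], epochs=3, verbose=0)   # tiny run for illustration
    _, acc = clf.evaluate(feats[400:], labels[400:], verbose=0)
    print(f"{name}: test accuracy = {acc:.3f}")
```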
Table 11 illustrates the overall performance of our proposed model in multiple-class tasks using a publicly available respiratory disease dataset.
The overall accuracy, sensitivity, specificity, and F1-score for the eight-class problems were 94.16%, 89.56%, 99.10%, and 89.56%, respectively. Similarly, for the four-class problems, the overall results were 79.61%, 78.55%, 92.49%, and 78.67%, respectively, and for the three-class problems, the overall results were 89.45%, 88.16%, 94.54%, and 88.41%, respectively. Meanwhile, for the binary-class problems, the overall results for normal vs. abnormal were 85.61%, 83.44%, 83.44%, and 84.21% for accuracy, sensitivity, specificity, and F1-score, respectively, and for crackles and wheezes, they were 84.21%, 93.57%, 93.53%, and 93.15%, respectively. To further validate the robustness of our framework, we also conducted experiments using another public dataset, the SJTU Paediatric dataset [
53], covering healthy samples and six adventitious sound classes: coarse crackle (C), fine crackle (F), rhonchi (R), stridor (S), wheeze (W), and both wheeze and crackle (B). The results, presented in
Table 11, demonstrate that our findings are not confined to a single dataset; this additional validation underscores the generalizability of our model and reinforces its effectiveness on diverse data. The variations in the error rates are associated with the imbalanced nature of the dataset, where some classes are over-represented, biasing the model’s learning. Furthermore, the inherent acoustic similarities across various respiratory disorders make it more difficult for the model to correctly identify the lung sound. For example, sounds such as crackles and wheezes pose a particular challenge, since their subtle acoustic differences are masked by similar spectral patterns.
Several experiments were performed to optimize the proposed model. Specifically, performance was evaluated while varying the learning rate and the number of epochs.
Figure 9 shows the classification accuracies across different learning rates ranging from 0.00001 to 0.01 over 200 epochs. The results indicate that for the binary-class problems, the accuracy remained high as the learning rate increased from 0.00001 to 0.001. For the three-class problems, a slight decline in accuracy was observed as the learning rate increased. For the four-class problems, increasing the learning rate noticeably reduced the model’s accuracy after the initial increase, and for the eight-class problems, increasing the learning rate to 0.001 gradually increased the accuracy.
Figure 9 indicates that a learning rate of 0.001 over 200 epochs achieved the highest scores across all classification problems. The hybrid approach, combining DL with digital signal processing techniques such as parallel CAEs and dual scalograms, achieved promising results, even on imbalanced datasets.
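The sweep itself can be scripted along the following lines; the stand-in model and random data are placeholders, and only the loop over learning rates reflects the protocol described above (the real experiments ran for 200 epochs).

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

rng = np.random.default_rng(0)
# Placeholder sequence data standing in for the hybrid latent features.
x_train = rng.normal(size=(400, 32, 64)).astype("float32")
y_train = rng.integers(0, 8, size=400)
x_val = rng.normal(size=(100, 32, 64)).astype("float32")
y_val = rng.integers(0, 8, size=100)

def build_model() -> tf.keras.Model:
    """Stand-in for the full CAE+LSTM classifier; only the sweep logic matters here."""
    return tf.keras.Sequential([
        layers.Input(shape=(32, 64)),
        layers.LSTM(64),
        layers.Dense(8, activation="softmax"),
    ])

best_val_acc = {}
for lr in (1e-5, 1e-4, 1e-3, 1e-2):        # the range explored in Figure 9
    model = build_model()
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    history = model.fit(x_train, y_train, validation_data=(x_val, y_val),
                        epochs=5, verbose=0)   # 200 epochs in the real experiments
    best_val_acc[lr] = max(history.history["val_accuracy"])

print(best_val_acc)   # the learning rate with the highest validation accuracy wins
```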
4. Conclusions and Future Work
Our study introduced an advanced, intelligent lung sound recognition framework for detecting respiratory diseases. We applied a dual transformation using mel scalograms and the continuous wavelet transform (CWT) to generate detailed time-frequency scalograms. Parallel convolutional autoencoders were trained to extract essential features from the CWT and mel representations. This framework integrates the parallel convolutional autoencoders with an LSTM network, reducing the possibility of misclassification while extracting rich features. The features extracted from both latent spaces are concatenated into a hybrid feature pool and processed through the LSTM model, addressing multiple-class problems. We evaluated our method on the ICBHI 2017 dataset, and the experimental results showed that our proposed model achieved promising results across multiple classification problems. For the eight-class problems involving healthy samples and seven distinct lung diseases (asthma, bronchiectasis, bronchiolitis, COPD, LRTI, pneumonia, and URTI), the proposed model achieved an average accuracy of 94.16%, an average sensitivity of 89.56%, an average specificity of 99.10%, and an average F1-score of 89.56%. For the four-class problems, including crackles, wheezes, normal, and both crackles and wheezes, the model achieved an average accuracy of 79.61%, an average sensitivity of 78.55%, an average specificity of 92.49%, and an average F1-score of 78.67%. The results for the three-class problems were an average accuracy of 89.45%, an average sensitivity of 88.16%, an average specificity of 94.54%, and an average F1-score of 88.41%. Finally, for the normal vs. abnormal binary-class problem, the model achieved an average accuracy of 85.61%, an average sensitivity of 83.44%, an average specificity of 83.44%, and an average F1-score of 84.21%, outperforming comparable studies. In future work, we will deploy the proposed framework in a clinical setting. Additionally, we plan to enhance the robustness of the framework by increasing the number of sound samples through the integration of multiple datasets.