2.2. Pre-Processing
Figure 1 shows the preprocessing steps applied to the ICBHI 2017 Respiratory Sound Database, which was used in this work. The recordings were captured at sampling rates ranging from 4 kHz to 44.1 kHz. Resampling all recordings to 4 kHz was deemed a safe approach because the frequency range of the signal of interest remains below 2 kHz. The resampled dataset was subsequently divided into individual respiratory cycles and annotated. The duration of the respiratory cycles in the dataset ranges from 0.2 to 16 s, with a mean value of 2.7 s. The methodology reported by [
15] was deployed to extract respiratory cycles of consistent duration. Specifically, cycles exceeding 2.7 s were truncated, retaining only the initial 2.7 s.
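As an illustration, the resampling and fixed-duration cycle extraction described above can be sketched as follows. This is a minimal sketch under stated assumptions: the function and variable names are our own, the linear-interpolation resampling is a simplification (a real pipeline would low-pass filter first), and the zero-padding of cycles shorter than 2.7 s is an assumption, since the text only specifies truncation of longer cycles.

```python
import numpy as np

TARGET_SR = 4000        # resample everything to 4 kHz
CYCLE_SECONDS = 2.7     # fixed cycle duration (mean of the dataset)

def standardize_cycle(cycle: np.ndarray, sr: int) -> np.ndarray:
    """Resample a respiratory cycle to 4 kHz and force it to exactly
    2.7 s by truncating (keeping the initial 2.7 s) or zero-padding."""
    # Naive resampling by linear interpolation; illustrative only.
    n_out = int(round(len(cycle) * TARGET_SR / sr))
    resampled = np.interp(
        np.linspace(0, len(cycle) - 1, n_out),
        np.arange(len(cycle)),
        cycle,
    )
    target_len = int(TARGET_SR * CYCLE_SECONDS)  # 10,800 samples
    if len(resampled) >= target_len:
        return resampled[:target_len]            # keep the initial 2.7 s
    return np.pad(resampled, (0, target_len - len(resampled)))  # assumption

# Example: a 5 s cycle recorded at 44.1 kHz becomes 10,800 samples at 4 kHz.
cycle = np.random.randn(5 * 44100)
out = standardize_cycle(cycle, 44100)
```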
In addition, segmentation was used to extract the necessary wheeze and crackle information from the audio signal. During model training, the signals were divided into samples of 20 to 25 ms, each containing either a crackle or a wheeze sound. The purpose of this segmentation was to enable a more detailed analysis and to verify the quality of the sound signals. To further validate signal quality, a spectrogram of each signal was generated. Any audio segment deemed invalid, i.e., lacking wheezing or crackling as confirmed by its spectrogram, was discarded. This process ensures the integrity and reliability of the data. This step is shown in
Figure 2. The absence of wheezing or crackling sounds in typical respiratory patterns was confirmed through waveform analysis: the consistent recurrence of smooth, periodic waveforms indicated a normal respiratory cycle. To handle the combined crackles and wheezes category, we extracted the relevant segments containing both crackles and wheezes from the audio samples. These segments were then included in the training and testing datasets alongside the other categories.
This comprehensive approach to segmentation and validation not only optimized the model performance but also ensured the integrity of the dataset for subsequent analyses and interpretations.
The crackle signal, characterized by bubbling, rattling, or clicking sounds, is shown in
Figure 3.
Figure 4 shows wheezes within a respiratory cycle during both inhalation and exhalation. The wave plots and spectrograms of these two categories exhibit distinct characteristics.
Figure 5 shows the wave plot for an audio segment along with its spectrogram, containing both wheezes and crackles.
After segmentation and validation, the audio segments were split into two distinct sets: a training set (70%), used to train the 1D-CNN model, and a testing set (30%), reserved for evaluating its performance on unseen data. To further assess the robustness and generalization capability of our model, we utilized the k−fold cross-validation technique. This involves segmenting the training data into k subsets (folds), training the model on k−1 folds, and evaluating it on the remaining fold. The process is repeated k times, with each fold serving as the evaluation set once.
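The 70/30 split and the k−fold partitioning described above can be sketched in a few lines. This is an illustrative sketch: k = 5 is our own choice (the text does not fix k at this point), and all names are assumptions rather than the authors' code.

```python
import numpy as np

def train_test_split_idx(n, test_frac=0.3, seed=0):
    """Shuffle sample indices and split them 70/30 (fractions from the text)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_test = int(round(n * test_frac))
    return idx[n_test:], idx[:n_test]   # train indices, test indices

def kfold_indices(n, k=5):
    """Yield (train_idx, val_idx) pairs; each fold is the validation set once."""
    folds = np.array_split(np.arange(n), k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val

train_idx, test_idx = train_test_split_idx(1000)
folds = list(kfold_indices(len(train_idx), k=5))
```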
2.3. Convolution Neural Network (CNN)
The CNN algorithm has become a widely recognized and extensively utilized method in the domain of Deep Learning (DL). Compared to its predecessors, one notable feature of the CNN is its ability to identify relevant components or features within the data autonomously, without human intervention [
28]. While traditional Convolutional Neural Networks (CNNs) are widely used and perform well in many deep learning tasks, their inherently two-dimensional (2D) nature can make them computationally demanding in certain applications. 1D-CNNs emerge as a powerful alternative, offering improved computational efficiency and enhanced performance in processing sequential data [
20,
23,
24,
25,
26].
The mathematical representation of the convolution performed by a CNN for a given signal is expressed as Equation (1):

y[n] = ∑_{k=0}^{K−1} w[k] x[n + k],  n = 0, 1, …, N − K,  (1)

where x is the input sequence, such that x = (x[0], x[1], …, x[N − 1]), w is the convolution kernel of length K, and for each output index n, y[n] is the corresponding value of the resulting feature map.
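This valid-mode form of the 1D convolution (technically cross-correlation, as implemented in most CNN libraries) can be checked numerically with a short sketch; names and values are illustrative.

```python
import numpy as np

def conv1d_valid(x, w):
    """Valid-mode 1D convolution as used in CNNs:
    y[n] = sum_{k=0}^{K-1} w[k] * x[n + k], for n = 0 .. N - K."""
    N, K = len(x), len(w)
    return np.array([np.dot(w, x[n:n + K]) for n in range(N - K + 1)])

# Example: x of length N = 4 and a kernel of length K = 3
# give an output of length N - K + 1 = 2.
y = conv1d_valid(np.array([1., 2., 3., 4.]), np.array([1., 0., -1.]))
# y[0] = 1*1 + 0*2 - 1*3 = -2;  y[1] = 1*2 + 0*3 - 1*4 = -2
```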
The generic architecture of a CNN is shown in
Figure 6. The CNN’s convolutional layers are the basic structural elements responsible for extracting features from the time series or uni-dimensional signals. A series of filters, commonly known as kernels, are used to accomplish this task. These kernels, whose values are obtained through the training process, act as the impulse response of the filters [
18], effectively capturing the salient characteristics of the input signal. In the proposed 1D-CNN architecture, we employ multiple convolutional layers with increasing filter sizes to capture both local and global patterns in the lung sound signals. The first convolutional layer uses a smaller filter size to learn local features, while subsequent layers gradually increase the filter size to capture more global patterns [
29]. This hierarchical approach allows the network to learn a robust representation of the lung sounds at different scales.
To further enhance the performance of our 1D-CNN, residual connections inspired by the ResNet architecture are incorporated [
9,
30]. Residual connections allow the network to learn residual functions with reference to the layer inputs, enabling the training of deeper networks without degradation. By adding skip connections between convolutional layers, we facilitate the flow of information and gradients throughout the network, improving its ability to learn complex patterns in the lung sound data.
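A toy numpy sketch of such a residual block follows. This is not the trained model: the weights, shapes, and 'same' padding choice are illustrative assumptions, and real residual blocks also carry multiple channels and batch normalization.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv1d_same(x, w):
    """'Same'-padded 1D convolution, so the skip addition is shape-compatible
    (output length equals input length)."""
    pad = len(w) // 2
    xp = np.pad(x, (pad, pad))
    return np.array([np.dot(w, xp[n:n + len(w)]) for n in range(len(x))])

def residual_block(x, w1, w2):
    """y = relu(F(x) + x): two convolutions plus a skip connection,
    in the spirit of ResNet. Weights are placeholders, not trained values."""
    h = relu(conv1d_same(x, w1))
    return relu(conv1d_same(h, w2) + x)   # the skip connection adds x back

x = np.random.randn(32)
w = np.array([0.1, 0.2, 0.1])
y = residual_block(x, w, w)
```

With zero weights the convolutional path contributes nothing and the block reduces to relu(x), which is exactly why residual connections make deeper networks easier to train: each block only has to learn a correction to the identity.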
The pooling layers serve as an additional element placed directly after each convolutional layer. Pooling layers perform nonlinear down-sampling on the extracted feature maps, reducing their spatial dimension while preserving the essential information. This reduces the network’s computational complexity, making it more efficient for sequential data processing. In our 1D-CNN, we employ max pooling layers to down-sample the feature maps and retain the most prominent features.
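Non-overlapping max pooling can be sketched in two lines (the pool size of 2 is illustrative):

```python
import numpy as np

def max_pool1d(x, pool=2):
    """Non-overlapping max pooling: keep the largest value in each window,
    halving (for pool=2) the feature-map length."""
    n = len(x) // pool * pool          # drop a trailing remainder, if any
    return x[:n].reshape(-1, pool).max(axis=1)

fm = np.array([0.1, 0.7, 0.3, 0.2, 0.9, 0.4])
pooled = max_pool1d(fm)                # windows: [0.1,0.7] [0.3,0.2] [0.9,0.4]
```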
Furthermore, dropout layers are used to reduce over-fitting by randomly deactivating a subset of neurons during training, which promotes better generalization. In addition, batch normalization layers enhance the training procedure by normalizing the activations of the previous layer, reducing internal covariate shift.
Finally, it is important to note that fully connected layers are a type of feedforward neural network (FNN) that is usually connected at the end of a network. These layers establish weighted connections between the overall functionality of the previous layers and the distribution of class probabilities, simplifying the network’s ability to make predictions. In our 1D-CNN, we utilize fully connected layers to map the learned features to the desired output classes (normal, crackles, wheezes, and a combination of both crackles and wheezes).
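The final dense-plus-softmax mapping to the four classes can be sketched as follows; the feature-vector size and the weights are random placeholders, not trained values from the model.

```python
import numpy as np

CLASSES = ["normal", "crackles", "wheezes", "crackles+wheezes"]

def dense_softmax(features, W, b):
    """Fully connected layer followed by softmax: maps the learned feature
    vector to a probability distribution over the 4 classes."""
    logits = W @ features + b
    e = np.exp(logits - logits.max())   # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
features = rng.standard_normal(16)      # toy pooled feature vector
W, b = rng.standard_normal((4, 16)), np.zeros(4)
probs = dense_softmax(features, W, b)
pred = CLASSES[int(np.argmax(probs))]   # the predicted class
```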
To address the class imbalance present in the lung sound dataset, the Synthetic Minority Over-sampling Technique (SMOTE) was applied to the training data [
31]. SMOTE generates synthetic examples of the minority classes by interpolating between existing examples, effectively balancing the class distribution. The SMOTE algorithm can be mathematically represented as follows:
1. For an individual minority class sample x_i, select one of its k-nearest neighbors, x_nn.
2. Generate a new synthetic sample x_new using the following equation:
x_new = x_i + r (x_nn − x_i),
where r is a random number between 0 and 1.
3. Repeat steps 1 and 2 until the desired balance between the minority and majority classes is achieved.
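These SMOTE steps can be sketched in numpy as a minimal single-sample version (library implementations such as imbalanced-learn add class-balancing bookkeeping on top; names here are our own):

```python
import numpy as np

def smote_sample(X_min, k=5, seed=0):
    """Generate one synthetic minority-class sample:
    x_new = x_i + r * (x_nn - x_i), with r drawn uniformly from [0, 1)."""
    rng = np.random.default_rng(seed)
    i = rng.integers(len(X_min))
    x_i = X_min[i]
    # Find the k nearest minority-class neighbors of x_i (excluding itself).
    d = np.linalg.norm(X_min - x_i, axis=1)
    nn_idx = np.argsort(d)[1:k + 1]
    x_nn = X_min[rng.choice(nn_idx)]
    r = rng.random()
    return x_i + r * (x_nn - x_i)       # interpolate between the two samples

X_min = np.random.randn(20, 8)          # toy minority-class feature vectors
x_new = smote_sample(X_min)
```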
By applying SMOTE to the training data, we ensured that the 1D-CNN model was exposed to a balanced distribution of samples from each class, mitigating the bias towards the majority class.
To find the optimal hyperparameters for our 1D-CNN model, we used a grid search with k−fold cross-validation. Grid search is an exhaustive search technique that evaluates model performance for different combinations of hyperparameters. The hyperparameter grid is defined in
Table 2.
The k−fold cross-validation technique segments the training data into k subsets or folds, trains the model on k−1 folds, and evaluates it on the remaining fold. This process is repeated k times, with each fold serving as the validation set once. The average performance across all folds provides a robust estimate of the model’s generalization ability. For k−fold cross-validation, the training data X are divided into k disjoint subsets X_1, X_2, …, X_k of approximately equal size. For each fold i = 1, …, k:
- (a)
Train the model on X \ X_i (all subsets except X_i).
- (b)
Evaluate the model on X_i and compute the performance metric.
Afterward, the average performance across all k folds is calculated. By combining grid search with k−fold cross-validation, we can identify the optimal hyperparameters that yield the best performance on unseen data, ensuring the robustness and generalization capability of our 1D-CNN model.
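The combination of grid search and k−fold cross-validation can be sketched as follows. The grid shown is illustrative (the actual grid is given in Table 2), and the scoring function is a placeholder standing in for full 1D-CNN training on each fold.

```python
import numpy as np
from itertools import product

# Illustrative hyperparameter grid; the real grid is the one in Table 2.
grid = {"lr": [1e-3, 1e-4], "batch_size": [32, 64]}

def evaluate_fold(params, train_idx, val_idx):
    """Stand-in for training the 1D-CNN on one fold and scoring it on the
    held-out fold; returns a dummy metric so the loop is runnable."""
    return 1.0 / (1.0 + params["lr"])

def cross_val_score(params, n_samples, k=5):
    """Average the per-fold scores: each fold is the validation set once."""
    folds = np.array_split(np.arange(n_samples), k)
    scores = []
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(evaluate_fold(params, train_idx, folds[i]))
    return float(np.mean(scores))

# Exhaustive grid search: score every combination, keep the best one.
best = max(
    (dict(zip(grid, combo)) for combo in product(*grid.values())),
    key=lambda p: cross_val_score(p, n_samples=100),
)
```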
The modelled 1D-CNN architecture comprises 1,610,466 parameters. The model summary of the 1D-CNN is depicted in
Table 3.
The 1D-CNN model effectively categorizes the signals into the four classes (normal, crackles, wheezes, and combined crackles and wheezes) and yields a classification accuracy of 0.95. The identification of wheezing and crackles in lung sounds is typically performed by pulmonologists, who distinguish them from extraneous environmental noises such as chairs being dragged, fans in motion, or rustling paper, among others. Subsequently, a meticulously organized repository of categorized lung sound samples is generated. The classification method based on CNN is employed to discern the lung sounds and categorize them into their respective classes.
In summary, our proposed 1D-CNN architecture leverages the strengths of convolutional neural networks in capturing local and global patterns, residual connections for improved information flow, and focal loss to address class imbalance. This combination of techniques enables the network to learn robust representations of lung sounds and accurately classify them into the four categories of normal, crackles, wheezes, and combined crackles and wheezes.
In this method, the initial step involves classifying the data into the following categories: normal, crackles, wheezes, and combined crackles and wheezes. This classification was performed by pulmonologists using the existing dataset. Environmental artifact noises were removed from the dataset so that only the sounds of interest, wheezing and crackles, remained. Subsequently, the model was trained on the processed, classified dataset. The 1D-CNN demonstrated encouraging outcomes in distinguishing between crackle and wheeze respiratory sounds. The deep learning-based classification method achieved a high accuracy of 0.95 in detecting lung sounds, and the categorization of abnormal sounds into distinct subtypes, such as crackles and wheezes, yielded similar results. These findings are noteworthy, particularly considering the diverse range of auditory stimuli used for analysis. In the context of screening and subsequent testing for respiratory infections, an accuracy of 0.95 is deemed satisfactory. The main contribution of this research is the use of a 1D-CNN prediction model to categorize respiratory sounds.