1. Introduction
Heart valve diseases (HVDs), which include conditions such as Aortic Stenosis (AS), Mitral Regurgitation (MR), Mitral Stenosis (MS), and Mitral Valve Prolapse (MVP) [1], are continually increasing in prevalence and mortality and have become a major threat to human health worldwide [
2,
3]. Early detection of underlying heart diseases is crucial for improving patient survival rates. A variety of tests, such as blood tests [
4], coronary angiograms [
5], electrocardiograms (ECGs) [
6], and various imaging techniques [
7], have been used effectively to detect heart diseases. These methods, however, often require specialized equipment and expertise, as well as high costs. In contrast, auscultation, a low-cost, rapid, and straightforward diagnostic method for assessing a patient's heart condition, has been widely employed by physicians. However, auscultation requires considerable professional knowledge and experience, and training to diagnose heart conditions from the analysis of heart sounds (HS) takes a long time [
7].
To address this issue, many researchers have proposed the use of artificial intelligence (AI) algorithms to automatically analyze HS signals and diagnose HVDs [
8,
9,
10,
11,
12,
13,
14,
15,
16,
17,
18,
19,
20,
21]. Previous research mainly focused on the dichotomous classification of HSs into normal and abnormal. However, for physicians to make a more detailed and in-depth diagnosis of a patient's heart condition, a quantitative assessment of HS signals is necessary. Therefore, the accuracy and efficiency of multi-classification systems need to be further improved. In addition, most existing HS multi-classification systems use traditional neural network models, such as convolutional neural networks (CNNs), long short-term memory (LSTM) networks, recurrent neural networks (RNNs), and their variants, all of which suffer from limitations such as vanishing and exploding gradients and long training or prediction times [
22,
23,
24].
To achieve this goal, we propose a self-attention-based method for the multi-classification of HS signals, together with a data augmentation technique that improves model robustness. The proposed model fine-tunes a pretrained Transformer encoder on HS signals. Our method offers several advantages over traditional neural network algorithms. Firstly, the self-attention mechanism allows the network to automatically learn important features from HS signals and reduces the dependence on large training datasets. Secondly, the self-attention mechanism helps the network better handle incomplete or noisy HS signals. Finally, our data augmentation technique takes into account various situations that may arise in practical applications and improves model generalization. We applied our method to a five-category dataset [
20] including AS, MR, MS, MVP, and normal HS signals (N), and achieved excellent results, as described in
Section 4.
The paper is organized as follows:
Section 2 summarizes the existing work on HS classification.
Section 3 describes the dataset, data preprocessing, data augmentation, model architecture, and experiments.
Section 4 presents the experimental results and comparison with other state-of-the-art methods.
Section 5 discusses the results, and
Section 6 is the conclusion.
2. Literature Review
Over the past several years, a variety of methods have been proposed to detect and diagnose HVDs based on HS signals and artificial intelligence (AI) methods [
8,
9,
10,
11,
12,
13,
14,
15,
16,
17,
18,
19,
20,
21]. One common method for detecting or diagnosing HVDs from HS signals involves recording the sounds produced by the heart with a specialized microphone or stethoscope and analyzing the recording for abnormalities or changes that may indicate the presence of an HVD. In 2012, Jia et al. [8] used fuzzy neural networks, fed with features such as discrete wavelet transform coefficients and Shannon entropy, to classify normal and abnormal HS signals that they had collected themselves. Subsequently, more researchers began to focus on the analysis of open-source HS datasets such as the PASCAL dataset [
9] as well as the 2016 PhysioNet/CinC Challenge database (PNC) [
25], both of which are well-known and popular datasets in the field of HS classification. In 2017, Zhang et al. [
10] proposed a binary HS classification (normal or abnormal) method using spectrograms and a support vector machine (SVM), working on the PASCAL dataset. Afterwards, the same researchers [
11] extended the work to the PhysioNet dataset and improved feature extraction using a tensor decomposition method, obtaining accuracy, sensitivity, and specificity of 93.85%, 97%, and 92%, respectively. In 2018, Hamidi et al. [
12] proposed curve-fitting feature extraction combined with the k-nearest neighbors (kNN) algorithm for HS classification and reached a highest accuracy of 98%. Recently, Krishnan et al. [
16] obtained a balanced accuracy of 85.7% with a sensitivity of 86.73% using deep neural networks (DNNs). Based on the same dataset, Deng et al. [
15] developed a framework for HS signal classification using Mel-frequency cepstral coefficients (MFCCs) combined with convolutional recurrent neural networks and achieved an accuracy of 98%; Deng's model took only 0.9–1.2 s to predict a single HS signal. More recently, Dahr et al. [21] used AlexNet to classify HS signals, achieving 98% in accuracy, precision, sensitivity, and specificity alike. Zeinali et al. [
13] made further contributions on the PASCAL and PhysioNet datasets. They used dedicated filters to remove noise and improve the sound quality and the accuracy of feature mining, and subsequently achieved accuracies of 87.5% and 95% for multi-class data (3 classes) and 98% for binary classification (normal or abnormal) using machine learning (ML) algorithms.
Admittedly, previous researchers have contributed greatly to automatic HS classification based on AI algorithms, but most of their work has focused on the simple dichotomous task of classifying HSs into normal and abnormal categories. A quantitative assessment of HS signals is necessary for physicians to make a more detailed and in-depth diagnosis of a patient's heart condition. Yaseen et al. [
20] proposed a five-category dataset containing four categories of abnormal HSs (AS, MR, MS, MVP) and one category of normal HSs. Furthermore, they proposed a multi-classification method based on DNN and ML techniques and obtained 97% accuracy, 94.5% sensitivity, and 98.2% specificity. Oh et al. [
17] put forward a deep WaveNet-based framework for categorizing HSs and obtained 97% accuracy, 92.5% sensitivity, and 98.1% specificity. Based on the same dataset, Turker et al. [
19] further improved the average accuracy, achieving an excellent result of 98.38%. However, the average computation time for predicting one HS signal was 9.69 s, which is relatively long. Khan [
14] carried out further work on Yaseen's dataset, subdividing the HS classification problem into five-, four-, three-, and two-class problems and achieving accuracies of 98.53%, 98.84%, 99.07%, and 99.70%, respectively, using the Fourier–Bessel Series Expansion-based Empirical Wavelet Transform (FBSE-EWT) and the Salp Swarm Optimization Algorithm.
In summary, the field of HS classification using AI algorithms has made significant progress in recent years. Various methods have been proposed to detect and diagnose HVDs using HS signals and AI techniques. Researchers have worked on open-source HS datasets and proposed classification methods based on feature extraction, deep neural networks, and machine learning algorithms. While most previous work focused on simple dichotomous classification, recent studies have proposed more detailed classification methods on multi-class datasets, achieving high accuracy, sensitivity, and specificity. However, further research and development are still needed to improve the accuracy and efficiency of HS multi-classification systems. Advanced techniques such as transformers and self-attention-based models are promising directions for obtaining even more accurate and efficient HS classification systems in the future.
3. Materials and Methods
3.1. Dataset Introduction
The dataset proposed by Yaseen et al. [20] contains four types of abnormal HS signals, namely AS, MR, MS, and MVP, as well as normal signals, with 200 recordings for each type. The sampling rate is 8 kHz, and all signal amplitudes are normalized between −1 and 1. The length of the HS signals ranges from 1.16 to 3.99 s. Representative examples of each type of HS signal are shown in
Figure 1a–e.
Figure 1a shows the waveform of AS,
Figure 1b the waveform of MR,
Figure 1c the waveform of MS,
Figure 1d the waveform of MVP,
Figure 1e the waveform of standard HS signal, and
Figure 1f how the HS signal is collected.
3.2. Data Preprocessing
Data preprocessing is an essential preliminary step in deep learning. In this work, to ensure the consistency of the HS signals and to avoid noise interference in the subsequent work, we first denoised the audio signals using adaptive filters. Next, we resampled the HS signals to 16 kHz so that the HS transformer model could learn as many details as possible from the input data while minimizing the amount of computation. To allow HS signals of any length to be input into the model, the signals were trimmed or padded: the excess part of an original HS signal was trimmed if its length exceeded the input threshold (400 ms), and when an original signal was shorter than the threshold, a blank signal was appended at the end to fill the length.
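A minimal sketch of the resampling and length-normalization steps is given below, assuming torchaudio is used; the adaptive-filter denoising stage is omitted, and the target-length constant is our assumption derived from the 400 ms threshold stated above.

# Minimal preprocessing sketch (assumption: torchaudio; the adaptive-filter
# denoising step described above is not shown).
import torch
import torchaudio

TARGET_SR = 16_000                      # resampling target from the text
TARGET_LEN = int(0.4 * TARGET_SR)       # assumed 400 ms input threshold

def preprocess(path: str) -> torch.Tensor:
    wav, sr = torchaudio.load(path)                       # (channels, samples)
    wav = wav.mean(dim=0, keepdim=True)                   # collapse to mono
    if sr != TARGET_SR:
        wav = torchaudio.functional.resample(wav, sr, TARGET_SR)
    if wav.shape[1] > TARGET_LEN:                         # trim the excess part
        wav = wav[:, :TARGET_LEN]
    else:                                                 # pad with blank signal
        wav = torch.nn.functional.pad(wav, (0, TARGET_LEN - wav.shape[1]))
    return wav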
3.3. Data Augmentation
To improve the model's generalization and to prevent over-fitting, we performed data augmentation on the existing dataset. Given the nature of the original audio data, we augmented the dataset in three domains: the frequency domain, the time domain, and the time-frequency domain. Eight data augmentation methods were used, namely noise adding, loudness adjustment, amplitude clipping, volume amplification, waveform displacement, pitch adjustment, partial erasure, and speed adjustment. During the augmentation process, each HS was processed with each of the eight methods ten times. In particular, the noise adding method employed two types of noise: white noise and ambient noise. Furthermore, each of the eight methods randomly selected a value from a specified range during processing. After the data augmentation, the number of HS signals increased from 1000 to 90,000. The reasons for using these eight data augmentation methods are as follows.
Noise adding. Adding artificial noise (white noise or ambient noise) to the HS data can simulate variations that might occur due to external factors and environmental conditions. The signal-to-noise ratio is between 5 and 15 dB. This helps make the model more robust and better able to deal with real-world variability.
Volume reduction. This refers to reducing the overall volume or loudness of the HSs, which is achieved by reducing the amplitudes of the waveforms. The volume reduction range is 0.5–1.0 times the original volume. This helps the model learn to recognize HSs at low volumes, which may result from factors such as the distance between the microphone and the sound source or the level of ambient noise.
Volume amplification. This involves increasing the overall volume of the HSs, either by increasing the amplitudes of the waveforms or by applying a gain to the audio signal. The volume amplification range is 1.0–1.5 times the original volume. This helps our model learn to recognize HSs that are louder or more dominant in the audio mix.
Waveform displacement. This involves shifting the waveform of the HSs either forward or backward in time, with a displacement range of −0.75 to 1.25. This can help simulate the effects of delays or latency in the recording equipment, or create examples of HSs that are out of synchronization with other audio sources.
Pitch adjustment. Adjusting the pitch of the HS data can simulate variations in heart rate or other physiological processes that might affect the frequency of the HSs. The pitch adjustment ranges from −0.8 to 1.2. This helps the model better handle changes in heart rate or other physiological conditions.
Partial erasure. This involves removing or erasing part of the original HS data, either by deleting a section of the waveform or by applying a mask to the audio signal. The erased portion ranges from 0 to 50% of the HS. This can help simulate the effects of missing or incomplete data, which might be caused by issues with the recording equipment or other factors.
Speed adjustment. This involves changing the speed at which the HSs are played back, either by speeding up or slowing down the waveform or by applying a time stretch to the audio signal. The speed adjustment range is 0.5–1.5 times the original speed. This helps the model learn to recognize HSs recorded at different speeds, which might be affected by factors such as the age, health, or exercise status of the person producing the sounds.
Overall, these techniques can be used to create a large and diverse dataset of artificially generated HS samples that covers a wide range of variations in the data. This is useful for training our model to recognize and classify different types of HSs, as the model is exposed to a wider range of examples and is better able to handle real-world variations in the data.
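The sketch below illustrates four of the eight augmentations (noise adding, volume scaling, waveform displacement, and partial erasure) using the parameter ranges stated above; the NumPy implementation and helper names are ours, not the original code.

# Illustrative augmentation sketch (NumPy); ranges follow the text, helper
# names are illustrative only.
import numpy as np

rng = np.random.default_rng()

def add_noise(x: np.ndarray, snr_db=(5.0, 15.0)) -> np.ndarray:
    # Mix in white noise at a random SNR between 5 and 15 dB.
    target_snr = rng.uniform(*snr_db)
    noise = rng.standard_normal(len(x))
    scale = np.sqrt(np.mean(x ** 2) / (np.mean(noise ** 2) * 10 ** (target_snr / 10)))
    return x + scale * noise

def scale_volume(x: np.ndarray, lo=0.5, hi=1.5) -> np.ndarray:
    # Volume reduction (0.5-1.0) or amplification (1.0-1.5) by amplitude scaling.
    return x * rng.uniform(lo, hi)

def shift_waveform(x: np.ndarray, max_frac=0.25) -> np.ndarray:
    # Shift the waveform forward or backward in time.
    return np.roll(x, int(rng.uniform(-max_frac, max_frac) * len(x)))

def erase_segment(x: np.ndarray, max_frac=0.5) -> np.ndarray:
    # Zero out a random segment covering up to 50% of the signal.
    length = int(rng.uniform(0.0, max_frac) * len(x))
    start = int(rng.integers(0, max(1, len(x) - length)))
    y = x.copy()
    y[start:start + length] = 0.0
    return y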
The representative HS signals after data augmentation are shown in
Figure 2a–i.
Figure 2a shows the waveform of the original standard HS signal;
Figure 2b the waveform of the standard HS signal after adding noise;
Figure 2c the waveform of the standard HS signal after amplitude clipping;
Figure 2d the waveform of standard HS signal after volume amplification;
Figure 2e the waveform of standard HS signal after partial erasure;
Figure 2f the waveform of standard HS signal after speed adjustment;
Figure 2g the waveform of standard HS signal after the pitch adjustment;
Figure 2h the waveform of standard HS signal after volume reduction;
Figure 2i the waveform of standard HS signal after waveform displacement. These eight methods were applied to all HS signals for data augmentation. As can be seen from
Figure 2b–i, the red parts represent the original HS signals, while the blue parts represent the HS signals after data augmentation, which introduces differences without changing their essential characteristics.
3.4. Model Architecture
In this work, we abandoned the decoder of the encoder–decoder Transformer architecture and adopted only the encoder, as in the ViT architecture [
26], for the multi-classification of HS signals. We also added a SoftMax layer at the end of the classification model. To avoid overfitting, we introduced a dropout mechanism (with a dropout rate of 0.5) before the output of the transformer unit. Finally, we randomly froze 25% of the parameters in each training epoch to speed up the convergence of the network.
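One possible reading of this random parameter-freezing scheme is sketched below; the selection rule is not specified in the text, so this version, which freezes a quarter of the parameter tensors at the start of each epoch, is an assumption rather than the authors' implementation.

# Sketch of randomly freezing 25% of the parameters at the start of each
# training epoch (freezing is done per parameter tensor here, a simplification).
import random
import torch.nn as nn

def refreeze_random_quarter(model: nn.Module, frac: float = 0.25) -> None:
    params = list(model.parameters())
    random.shuffle(params)
    cut = int(frac * len(params))
    for i, p in enumerate(params):
        p.requires_grad = i >= cut        # first `frac` of the tensors are frozen

# Intended to be called once per epoch, before rebuilding the optimizer over
# the parameters that still have requires_grad=True.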
Figure 3 shows the workflow of the model we proposed.
In
Figure 3, first, to obtain the input of our model, the HS signals were converted into 128-dimensional log Mel filter bank features every 10 ms, using a 25 ms Hamming window to prevent spectral leakage. Mel-frequency analysis approximates the auditory characteristics of the human ear. By converting the HS signals into spectrograms, we represented the audio data in a 2D matrix form, which allowed us to leverage the power of Transformer-based models for audio processing. Several studies have utilized spectrograms for HS classification [
15,
27]. Our work extends these studies by utilizing a transformer-based architecture capable of modeling long-range dependencies and capturing complex patterns in the HS signals. By considering the advantages and potential effects of using spectrograms, we provide a more comprehensive discussion of the core ideas and contributions of our study.
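A sketch of this spectrogram front end is shown below, computing 128-dimensional log-Mel filter bank features with a 10 ms shift and a 25 ms Hamming window via torchaudio's Kaldi-compatible fbank; the mean-variance normalization at the end is our assumption.

# Log-Mel filter bank front end: 128 bins, 25 ms Hamming window, 10 ms shift.
import torchaudio

def to_log_mel(wav, sr=16_000):
    # wav: (1, samples) tensor produced by the preprocessing step
    fbank = torchaudio.compliance.kaldi.fbank(
        wav,
        sample_frequency=sr,
        num_mel_bins=128,
        frame_length=25.0,        # window length in ms
        frame_shift=10.0,         # hop in ms
        window_type="hamming",
    )                             # -> (frames, 128)
    return (fbank - fbank.mean()) / (fbank.std() + 1e-8)   # assumed normalization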
The spectrogram was divided into 16 × 16 patches along the time and frequency axes, with an overlap of 6. Because the transformer architecture does not capture spatial and temporal relationships in the way traditional neural networks do, a trainable positional embedding of the same size as the input sequence was added to each patch embedding so that the model could handle the spatial information of the audio spectrogram. Finally, we added a class token at the beginning of the input sequence.
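The patch-embedding step can be sketched as below: a 16 × 16 convolution with stride 10 realizes overlapping patch extraction (overlap of 6) and linear projection in one operation, followed by a learnable positional embedding and a prepended class token; the default patch count is illustrative and depends on the spectrogram size.

# Patch embedding sketch: 16x16 patches with stride 10 (overlap 6), learnable
# positional embedding, and a class token. `num_patches` is illustrative and
# depends on the spectrogram dimensions.
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    def __init__(self, embed_dim=768, num_patches=36):
        super().__init__()
        self.proj = nn.Conv2d(1, embed_dim, kernel_size=16, stride=10)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

    def forward(self, spec):                      # spec: (batch, 1, freq, time)
        x = self.proj(spec)                       # (batch, dim, f', t')
        x = x.flatten(2).transpose(1, 2)          # (batch, f'*t', dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)            # prepend the class token
        return x + self.pos_embed[:, : x.size(1)]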
In this way, the preprocessed HS signal was converted into the input sequence of the proposed model. The transformer encoder had an embedding dimension of 768, 12 layers, and 12 heads [
26,
28]. As a final step, the transformer encoder's output was passed through a linear layer with a Sigmoid activation function to produce the classification.
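Assembled from standard PyTorch modules, a sketch of the encoder and classification head with the stated dimensions (768-dimensional embedding, 12 layers, 12 heads, dropout of 0.5, Sigmoid output) might look as follows; it is not the authors' exact implementation.

# Encoder and classification head sketch: 12 layers, 12 heads, dim 768,
# dropout 0.5 before the output, linear layer with Sigmoid for 5 classes.
import torch
import torch.nn as nn

class HSTransformer(nn.Module):
    def __init__(self, embed_dim=768, depth=12, num_heads=12, num_classes=5):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads,
            dim_feedforward=4 * embed_dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.dropout = nn.Dropout(0.5)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens):                    # tokens: (batch, seq, dim)
        x = self.encoder(tokens)
        logits = self.head(self.dropout(x[:, 0])) # class-token representation
        return torch.sigmoid(logits)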
Figure 4 explains our workflow in detail. Firstly, the model was pre-trained on the AudioSet dataset [
29], which contains over 2 million audio clips (10 s each) collected from YouTube videos and tagged with 527 labels. Then we transferred the model to the five-class classification task for HSs, during which parameters were randomly frozen and discarded. After that, we validated the model and continuously adjusted the parameters until it achieved a satisfactory classification level. In the final step, we selected the best trained model and tested it on the testing dataset (independent of the training and validation datasets) to evaluate its performance. Specifically, the pretrained model, trained on a very large number of audio spectrograms, was first loaded. Then, the network was modified to suit the five-class classification task for HSs. During training, HSs were loaded according to the prescribed cross-validation strategy and the given parameters, such as batch size. Measures to prevent overfitting, such as dropout and parameter freezing, were applied during the training process. After each round of training, the validation set was used to verify the classification performance of the model, and the training was adjusted as needed to fine-tune the parameters. Once the training and verification process was completed, the optimal model was selected and used for the final testing, which characterized the actual performance of the model and its potential practical applications.
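One way this transfer step could be set up is sketched below: AudioSet-pretrained encoder weights are loaded with strict=False so that the replaced five-class head stays freshly initialized, and Adam is built over the currently trainable parameters; the checkpoint filename and state-dict keys are hypothetical.

# Transfer-learning sketch: load pretrained encoder weights, keep the new
# 5-class head randomly initialized, and fine-tune with Adam (lr = 0.000125).
import torch

model = HSTransformer(num_classes=5)                    # sketch from Section 3.4
state = torch.load("audioset_pretrained.pth",           # hypothetical checkpoint
                   map_location="cpu")
missing, unexpected = model.load_state_dict(state, strict=False)  # head keys differ
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1.25e-4)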
3.5. Experiment
First, we randomly selected a portion of the original dataset as the testing set, i.e., 10% of the data from each category. After this selection, there were 20 testing recordings and 180 training recordings per category. Then, each HS signal was processed using the data augmentation methods mentioned above, yielding 1800 signals per category in the testing set and 16,200 per category in the training set.
We ran a series of simulations with different sets of hyperparameters. For each simulation, 80% of the 1000 original HS signals were randomly selected for training and 20% for validation before data augmentation; the data used for training and validation were independent of each other. The hyperparameter set included the initial learning rate, batch size, and number of training epochs.
Figure 5a–c show how precision and loss varied with the training epoch under different hyperparameter settings. Formulas (
1) and (
2) show the loss function. In addition, we used the Adam optimizer [
30] in the proposed model. The final initial learning rate, batch size, and training epoch were 0.000125, 12, and 25, respectively.
where N denotes the number of batches and n the number of labels.
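The formulas themselves are not reproduced in this excerpt. Under the stated definitions (N batches, n labels) and the Sigmoid output described in Section 3.4, one standard form the loss could take is the averaged binary cross-entropy; this is an assumed reconstruction, not necessarily the authors' exact expression:

\hat{y}_{ij} = \sigma(z_{ij}) = \frac{1}{1 + e^{-z_{ij}}}, \qquad
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{n} \left[ y_{ij} \log \hat{y}_{ij} + (1 - y_{ij}) \log\left(1 - \hat{y}_{ij}\right) \right],

where z_{ij} is the model output and y_{ij} the ground-truth label for class j in batch i.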
After determining the most appropriate set of hyperparameters, the mixed dataset of original and augmented HS signals was used for the final model training. In addition, both 5-fold and 10-fold cross-validation were used in this work.
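A sketch of the k-fold evaluation (k = 5 or 10) using scikit-learn's StratifiedKFold is shown below; train_model and evaluate are hypothetical helpers standing in for the training and scoring code described above.

# k-fold cross-validation sketch (k = 5 or 10); folds are made from the
# original recordings, and augmentation would be applied inside each fold.
from sklearn.model_selection import StratifiedKFold

def run_cross_validation(signals, labels, k=5):
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    scores = []
    for train_idx, val_idx in skf.split(signals, labels):
        model = train_model([signals[i] for i in train_idx],     # hypothetical helper
                            [labels[i] for i in train_idx])
        scores.append(evaluate(model,                            # hypothetical helper
                               [signals[i] for i in val_idx],
                               [labels[i] for i in val_idx]))
    return scores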
5. Discussion
From the results obtained above, the merits of our model can be discussed as follows.
Firstly, the proposed framework improves both the accuracy of automatic multi-classification of HS signals and the speed of prediction. On the one hand, one of the main advantages of using a self-attention Transformer for HS classification is improved performance: self-attention mechanisms allow the model to capture long-range dependencies in the input data, which is particularly useful for analyzing sequential data such as HSs. On the other hand, another advantage is efficiency when processing long sequences of data. Traditional CNN models can struggle with long sequences, as the number of operations required to process them increases linearly with the length of the input data. In contrast, a self-attention Transformer can process long sequences more efficiently, as the number of sequential operations is constant regardless of the input length. These properties can lead to better classification accuracy compared with other methods. Our method can not only distinguish between normal and abnormal HSs but also refine the classification of abnormal HSs into AS, MR, MS, and MVP. When medical resources, experience, and hardware are insufficient, doctors can use the specific abnormal results to make a quick diagnosis. In the meantime, the classification results can be used to determine whether a patient has recovered, reducing the waste of medical resources.
Secondly, the proposed approach achieves outstanding and promising performance compared with current work. A diverse set of audio data augmentation methods is used to improve the generality and robustness of the model. Furthermore, the spectrograms of the HS signals contain a large amount of detailed information that can be extracted as specific features of different types of HSs. In addition, self-attention mechanisms allow the model to attend to different parts of the input data at different times, which helps it learn more robust and transferable features. This can lead to better performance on unseen data, making the model more useful. Based on these two points, our model has shown promising results.
It can be seen that the proposed method outperforms existing works in terms of both classification accuracy and computing speed, demonstrating that the proposed model is effective for HS multi-classification. Despite these achievements, some limitations are still worth discussing.
One potential limitation of our model is its applicability to different measuring systems. Our model was trained on Yaseen’s dataset [
31], which was collected from books and websites and does not contain extremely noisy recordings. The results may differ if the model is applied to HSs recorded with a different type of measuring system. For example, a model trained on HSs recorded with an electronic stethoscope may not perform as well on HSs recorded with a chest-mounted sensor. To address this limitation, future research should evaluate the performance of our model on different types of measuring systems. This could involve collecting additional data using a range of measuring systems and retraining the model on this expanded dataset. In addition, practical considerations related to the use of different systems should be taken into account; for example, chest-mounted sensors may be more costly or less portable than stethoscopes, which could affect the feasibility of using these systems in different settings. Moreover, regarding the potential impact of collecting HSs from different parts of the body, our model is currently based on the best results achieved using open-source datasets, which are themselves sourced from official medical websites or textbooks. Further optimization is required to account for the variability introduced by HSs collected at different auscultation locations. For example, the mitral valve is usually auscultated at the left intercostal space, medially from the midclavicular line, whereas the aortic valve is usually auscultated at the second intercostal space along the right parasternal line. This is an ongoing area of research for us.
Another potential limitation of our model concerns its intended users. Our model was designed for use by medical professionals, and the results may differ if the model is used by laypeople, even though it can specifically and accurately output the type of underlying cardiac problem (AS, MR, MS, MVP, Normal). For example, medical professionals have more training and experience in interpreting HSs, which could affect the performance of the model when it is used by other people. To address this limitation, future research should evaluate the performance of our model when it is used by different groups of users. This could involve collecting additional data from a range of users and retraining the model on this expanded dataset. It may also be necessary to fine-tune the model or develop new models specifically for different user groups. In addition, practical considerations related to the use of the model by different groups should be taken into account; for example, medical professionals may require additional training or education to use the model effectively, while laypeople may need more guidance or support to understand its results. Furthermore, it should be noted that a heart disease may present with multiple concurrent HS abnormalities, whereas our current model demonstrates superior classification performance for individual HSs. However, the primary objective of our study is to facilitate disease diagnosis through HS classification; even when the accompanying HSs cannot be distinguished accurately, our model can still provide valuable assistance. Hence, we recognize the importance of identifying potential concomitant HSs, and this will be a focus of our ongoing research efforts.
Last but not least, the feasibility of implementing our model in different settings must be considered when deploying it. Our model relies on self-attention transformer technology, which can be computationally intensive and may require specialized hardware and software to run effectively. This could affect the feasibility of deploying the model in certain settings, such as resource-limited environments or settings with limited access to specialized hardware. To address this limitation, future research should consider the feasibility of deploying our model in different settings. This could involve optimizing the model for different hardware platforms or developing more efficient implementations. It may also be necessary to consider alternative approaches, such as cloud-based services or edge computing, to facilitate deployment in settings such as wearable devices or Internet of Things (IoT) devices. In addition, practical considerations related to the use of the model should be taken into account, including the cost and availability of the necessary hardware and software, the training or education required for different groups of users, and any potential ethical issues.
Overall, in future research, it will be important to address the applicability of our model to different measuring systems and its intended users. We will also consider the feasibility of deploying the model in various situations. By addressing these limitations, we can improve the performance and usability of our model for a range of users and measuring systems.
6. Conclusions
HVDs are associated with high morbidity and mortality, and HS signals obtained through auscultation provide important information about a patient's heart condition. Our study used these signals to train the proposed model to classify heart valve conditions into AS, MR, MS, MVP, and normal types. We validated our model using 5-fold and 10-fold cross-validation and trained it on the augmented datasets. In the 5-fold cross-validation, our model achieved the highest ACC of 98.74% and a mean AUC of 0.99; the highest PRE of 99.16% was obtained for the MR label, while the highest ACC of 99.67% was obtained for the N label. In the 10-fold cross-validation, our model achieved the highest ACC, SENS, SPEC, PRE, and F1 score, all at 100%. To our knowledge, this study is the first to classify HS signals into five classes using a transformer model operating on HS spectrograms, achieving extremely high diagnostic reliability. It is expected that the proposed model can assist medical experts in the diagnosis of HVDs.