1. Introduction
Sleep has been shown to play an important role in maintaining balance, preventing disease, and healing [1]. One of the most common sleep disorders is sleep apnea (sleep apnea syndrome; SAS), which is characterized by episodes of inadequate respiratory activity during sleep [2]. According to the American Academy of Sleep Medicine (AASM), sleep apnea is defined as the cessation of airflow that lasts for at least 10 s, accompanied by a decrease in blood oxygen saturation by at least 3%. Hypopnea is defined as a 30% decrease in general ventilation, with the same minimum decrease in blood oxygen saturation (3%) for at least 10 s. The most common form of sleep apnea is obstructive sleep apnea syndrome (OSAS) [1]. Research has suggested that the brain damage associated with OSAS could increase the risk of developing dementia [3,4]. The severity of sleep apnea is described by the apnea hypopnea index (AHI), which is the average number of apnea and hypopnea episodes per hour of sleep [5]. Classification according to the AHI uses four groups: physiological standard (AHI < 5), mild sleep apnea (AHI 5–15), moderate sleep apnea (AHI 15–30), and severe sleep apnea (AHI > 30) [6]. About 2% of middle-aged women and 4% of middle-aged men are affected by sleep apnea, and the initial examination of patients is carried out using a polysomnography system [7]. The disease therefore has a large impact on middle-aged people, who are considered the most economically productive part of the population.
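The AHI definition and the four severity groups above can be illustrated with a short sketch. The function names are hypothetical, and the handling of the boundary values (an AHI of exactly 15 or 30 is counted as moderate here) is an assumption, as the text does not specify it.

```python
def apnea_hypopnea_index(n_apneas, n_hypopneas, sleep_hours):
    """Average number of apnea and hypopnea episodes per hour of sleep."""
    if sleep_hours <= 0:
        raise ValueError("sleep_hours must be positive")
    return (n_apneas + n_hypopneas) / sleep_hours


def ahi_severity(ahi):
    """Map an AHI value to one of the four groups named in the text.

    Boundary handling is an assumption: 15 and 30 are treated as moderate.
    """
    if ahi < 5:
        return "physiological"
    if ahi < 15:
        return "mild"
    if ahi <= 30:
        return "moderate"
    return "severe"


# Made-up example: 40 apneas and 56 hypopneas during 8 h of sleep.
ahi = apnea_hypopnea_index(40, 56, 8.0)
print(ahi, ahi_severity(ahi))  # 12.0 mild
```

The counts in the example are invented for illustration only and do not come from the study data.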
Polysomnography (PSG) is used for monitoring sleep, particularly for patients with suspected OSAS [7]. PSG can record respiratory signals (RS), electroencephalographic (EEG) data, electro-oculographic (EOG) data, electromyographic (EMG) data, electrocardiographic (ECG) signals, and pulse oximetry (PO) data [8]. In order to assist in the time-consuming scoring [9] of PSG records, efforts have been made to automate the detection of SAS [5,8,10,11]. At present, neural networks are among the most innovative and widely used classifiers in the field of biomedicine [12]. There exist methods that can accurately distinguish sleep with apnea from physiological sleep [5,13], but the automatic detection of individual sleep apnea events remains an open topic [11]. A recent study by Zhao et al. [11] utilized a support vector machine (SVM), a k-nearest neighbors (k-NN) model, and a random forest (RF) model for the apnea classification problem. Although studies using k-NN and SVM classifiers predominate, the trend of using artificial neural networks (ANN) has also reached sleep apnea event detection [14].
At least one laboratory [15] is currently attempting to detect apnea and hypopnea automatically by using neural networks (NN) to analyze simple signals (e.g., the respiratory signal), such that the approach can be used in clinical practice. The trend of automating apnea detection while simplifying measurements, such that the entire PSG is not required, has been applied at various levels [16], including the use of motion sensors outside the human body [17]. Overall, apnea detection research has been shifting towards the use of machine learning, which is more flexible for working with quasi-stationary biological data [18,19]. The disadvantage of some machine learning approaches is the need to extract features manually. In deep learning methods (deep neural networks; DNNs), features are extracted from a large database automatically, without the subjective choice of calculating a specific feature (e.g., amplitude or signal frequency). There are many types of DNNs; the two most commonly used are long short-term memory (LSTM) networks [20,21], which have a recurrent neural network (RNN) [22] architecture, and convolutional neural networks (CNNs) [23]. These can be applied to various signals, such as ECG [24,25], airflow in the nasal cavity [26,27], EEG [14], or chest movements [28].
As not all published methodologies have publicly available software, and as no appropriate implementation had an interface that would be usable for physicians in our sleep laboratory, we designed, tested, validated, and implemented a convolutional neural network with a graphical interface. The main goal of our research was to create automatic software for detecting periods of possible sleep apnea or hypopnea for future use in our sleep laboratory (it is also free for use in other laboratories). The proposed methodology imitates decisions made by a physician, according to American Academy of Sleep Medicine (AASM) scoring [6]. This software can alert physicians to areas of interest and speed up the process of scoring PSG recordings.
4. Discussion
Deep learning techniques have been increasingly used to diagnose sleep apnea [38], including convolutional neural networks, such as the one used in our study. As in many neurodiagnostic methods, a common problem is the small number of measured subjects (i.e., small data set size). The main benefit of this work can therefore be considered to be the training of a deep convolutional network on a large database. PSG records were obtained with a sampling frequency of 1000 Hz. Such a high frequency is due to the acquisition of EEG recordings, in which it is necessary to evaluate brain frequencies up to 30 Hz as standard. Only airflow and SpO2 signals were processed to detect apnea and desaturation in this study. Therefore, it was possible to down-sample these signals. The shapes of the airflow and SpO2 signals are similar to a sinusoid with a frequency of about 0.2 Hz. It was therefore possible to sample at a frequency of 50 Hz. The signal was down-sampled five times, which significantly reduced the computational complexity and memory usage without loss of information. Even so, the design and training of the network were very time-consuming; for example, to determine the ideal number of neurons in the fully connected layers and the number of filters for the convolutional layers, 79 random combinations were evaluated. In total, the calculation took 92 h. Additionally, hardware limitations prevented testing a batch size larger than 2000, as the computing unit used had limited parallel computation ability.
We designed a neural network and then performed its learning, validation, and testing stages. Thanks to the use of a large database, we adapted the network and its parameters to detect apnea. Pre-trained networks have also been used in the literature [39], but we still assume that the greatest effectiveness can be obtained by training for a specific pattern (in this case, apnea). The baseline approximate neural network depth was determined from a previous study [31]. The parameters of the NN were further estimated to achieve the highest validation accuracy.
The hyperparameters of the CNN were estimated separately, not taking into account the fact that some parameters depend on each other; however, if the parameters were tested as dependent variables, it would not be possible to evaluate their effect separately. For the desaturation detection, the same NN design with the same hyperparameters as in the case of apnea detection was used, based on the knowledge that the analyzed signals were generated from the same biosignals. For future research, it may be interesting to test and optimize the hyperparameters for each of the events separately.
The resulting convolution kernel sizes in the successive layers were 1, 15, and 45. Gradual enlargement of the convolution kernel is an expected phenomenon, which guarantees that features of different lengths gradually flow between layers. Our test results indicated that a size of 1 for the first layer kernel gave the best results; see Table 11. It is possible that this layer could be replaced by a simpler layer in the future, which may speed up the calculation and the detection itself. The batch size parameter was set to 2000. A larger batch size decreases the accuracy of a NN on the training data set [40]. In our case, the use of a larger batch ensured greater regularization and, thus, greater accuracy on the validation set. A higher batch size could not be tested, due to hardware limitations.
The training of the NN was performed four times on randomly generated training, validation, and testing data sets. The graph in Figure 8 shows the course of the training of the proposed NN for apnea detection. According to the results, it can be concluded that the NN behaved similarly for the different data sets (different curve colors). The graph in Figure 10 shows the course of the training of the NN for desaturation detection. The accuracy and error had similar characters when comparing the training and testing phases, so it should be easy to train the NN for desaturation detection.
During the training of the proposed CNN for apnea detection, the validation error increased with the training steps while, at the same time, the validation accuracy decreased. This indicates over-fitting of the neural network. To avoid deploying an over-fitted network, early stopping was used: rather than the final epoch, the epoch with the smallest validation error during training was selected. For the network trained for desaturation classification, the setting with the smallest validation error was likewise selected. The final accuracy of the designed neural network was 84% for apnea detection and 74% for desaturation detection. For the k-NN reference method, the accuracy was 83% in the case of apnea detection, while that for desaturation detection was 64%; that is, both were lower than those of the proposed CNN. Studies comparing the results of classification using k-NN and ANNs vary in the difference between the accuracy of these classifiers (accuracy of ANN minus k-NN). For example, some studies [14,41,42] have described a positive difference (i.e., higher accuracy for the ANN) of 9.7% on average. In contrast, other studies [43,44] have reported a negative difference (i.e., higher accuracy for k-NN) of 2% on average. Furthermore, the study by Mendel et al. [45] reported zero difference (identical accuracy for both methods). When detecting desaturation, the difference in accuracy in our study was similar to that of [14,42]. Although the k-NN classifier is much simpler than the ANN, its accuracy on the testing data set (83%) was similar to that obtained by the ANN (84%) in our case; this was also comparable with the previous papers. This result assured us that k-NN is a meaningful reference method for ANN-based classifiers. Nevertheless, detection using a deep neural network has a great advantage over the use of a k-NN classifier, in terms of the significantly lower time requirements of the detector. Apnea detection using the CNN processed up to 8 h of recording in 36 s, while k-NN required 45 min (as measured on an AMD Ryzen 9 3900X processor). Calculating k-NN with high-volume data is very demanding. With a lower-class processor (i.e., standard sleep department equipment), the calculation would take even longer. This is an essential parameter for practical implementation in a clinical setting. The disadvantage of using a neural network, in terms of hardware, is that it requires a processor that supports the Advanced Vector Extensions (AVX) instruction set. AVX support has been introduced for most processors since 2013; so, for implementation in practice, the proposed detector can only be installed on devices containing such processors, which may not be available in all sleep labs.
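The model selection rule described at the start of this paragraph (keeping the epoch with the smallest validation error instead of the final one) can be sketched as follows. This is a minimal illustration, not the training code used in the study; the function name and the validation error values are hypothetical.

```python
def best_epoch(val_errors):
    """Return (epoch index, error) of the epoch with the smallest validation error.

    If several epochs share the minimum, the earliest one is returned.
    """
    if not val_errors:
        raise ValueError("val_errors must not be empty")
    idx = min(range(len(val_errors)), key=val_errors.__getitem__)
    return idx, val_errors[idx]


# Made-up validation errors: they fall at first and then rise again,
# which is the over-fitting pattern described in the text.
val_errors = [0.62, 0.48, 0.41, 0.39, 0.42, 0.47, 0.55]
epoch, error = best_epoch(val_errors)
print(epoch, error)  # 3 0.39
```

In practice, the weights saved at the selected epoch would be restored before testing; how checkpoints are stored depends on the framework used and is not shown here.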
For the purposes of training, as well as the detection itself, the original continuous signal was segmented. For subsequent display by the software used in the sleep laboratory, it is necessary to convert the segments back to a continuous signal after classification. A segment is considered positive if it contains 80% of an event, and a sequence of several such positive segments in a row indicates an event (apnea/hypopnea). In the study by Choi et al. [31], valid apnea events were considered as five consecutive events in a row. We also differ from other studies, in that we detected apnea with a time resolution of 1 s (10 s window with an overlap of 90%). For example, in the study by Varady et al. [19], a 16 s window was used without overlap, such that they detected apnea with a resolution of 16 s. Furthermore, in the study by McClure et al. [46], a 15 s window was used, also without overlap.
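The segmentation and the conversion back to continuous events can be sketched as follows, assuming a 50 Hz signal, a 10 s window, and 90% overlap (i.e., a 1 s step). The function names, the minimum run length `min_run`, and the rule used for the event end time are assumptions made for illustration; the text only states that several positive segments in a row indicate an event.

```python
def window_starts(n_samples, fs, window_s=10.0, overlap=0.9):
    """Start indices of sliding windows, plus window and step length in samples."""
    win = int(round(window_s * fs))
    step = int(round(win * (1.0 - overlap)))
    if step < 1:
        raise ValueError("overlap too large for this window length")
    return list(range(0, n_samples - win + 1, step)), win, step


def runs_to_events(flags, step_s=1.0, window_s=10.0, min_run=3):
    """Merge runs of at least `min_run` consecutive positive windows into events.

    Each event is reported as (start_s, end_s), spanning from the start of the
    first positive window to the end of the last one (an assumed convention).
    """
    events = []
    run_start = None
    for i, flag in enumerate(list(flags) + [False]):  # sentinel closes a final run
        if flag and run_start is None:
            run_start = i
        elif not flag and run_start is not None:
            if i - run_start >= min_run:
                events.append((run_start * step_s, (i - 1) * step_s + window_s))
            run_start = None
    return events


# One minute of signal at 50 Hz gives 51 windows with a 1 s step.
starts, win, step = window_starts(n_samples=3000, fs=50)
print(len(starts), win, step)  # 51 500 50

# Made-up per-window decisions: the run of two is dropped, the run of four is kept.
print(runs_to_events([0, 1, 1, 0, 1, 1, 1, 1, 0, 0]))  # [(4.0, 17.0)]
```

Because consecutive windows start 1 s apart, the onset of a detected event is located with a resolution of 1 s, in contrast to non-overlapping windows, where the resolution equals the window length.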
The event detection process takes place on data with the same pre-processing as was used to train the CNN. Apnea and desaturation detection are performed separately, and the detected events are saved to a newly copied file. The output of the neural network is the probability that a given segment contains an event. This allows the threshold to be selected, according to whether greater specificity or sensitivity is desired.
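The effect of the decision threshold on sensitivity and specificity can be illustrated with a short sketch. The function names, probabilities, and labels below are made up for illustration and are not outputs of the proposed network.

```python
def confusion_at_threshold(probs, labels, threshold):
    """Count TP, FP, TN, FN when a segment is positive if prob >= threshold."""
    tp = fp = tn = fn = 0
    for p, y in zip(probs, labels):
        predicted = p >= threshold
        if predicted and y:
            tp += 1
        elif predicted and not y:
            fp += 1
        elif not predicted and y:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def sensitivity_specificity(probs, labels, threshold):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp, fp, tn, fn = confusion_at_threshold(probs, labels, threshold)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


# Hypothetical segment probabilities and reference labels (1 = event).
probs = [0.1, 0.4, 0.35, 0.8, 0.65, 0.2, 0.9, 0.55]
labels = [0, 0, 1, 1, 1, 0, 1, 0]

for threshold in (0.3, 0.5, 0.7):
    print(threshold, sensitivity_specificity(probs, labels, threshold))
# 0.3 (1.0, 0.5)
# 0.5 (0.75, 0.75)
# 0.7 (0.5, 1.0)
```

Lowering the threshold favors sensitivity (fewer missed events), whereas raising it favors specificity (fewer false alarms), which is the trade-off referred to above.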
Various biological signals can be used to identify apnea. Neural networks have been applied to EEG [14], ECG [47], respiratory [48], and SpO2 [49] signals. Apnea detection is often performed using random forest, support vector machine, and k-NN methods, which have accuracies of about 80–90% [
7]. Some new studies have reported high classification accuracy (sensitivity and specificity
%), but the data sample was very small (17 subjects) [
50]. Our proposed NN had lower classification accuracy, but we trained and tested the NN on a data set that was 15 times larger. Some previous studies have used an existing NN topology and used transfer learning to recompute the ideal parameters; for example, Singh et al. [51] have used the AlexNet NN with different types of classifiers. Table 17 summarizes the results of apnea detection for the different classifiers and signals used in previous studies.
Note that the SpO2 and nasal airflow signals were used in our study to detect apnea, in order to follow the standard scoring procedure of the AASM scoring manual. Studies which specifically used SpO2 or nasal airflow as the reference signal reported accuracies ranging from 79.6% to 97.64%; see Table 17.
Among these studies, the highest accuracy was obtained by Mostafa et al. [49]; namely, 97.64%. A deep belief network (DBN) classifier was used in this case, with SpO2 as the reference signal. The first two layers of the DBN classifier were constructed using restricted Boltzmann machines (RBM), and the final layer was a softmax layer. The methodology was tested on two publicly available data sets consisting of 32 and 25 records, respectively. The structure of the DBN was fixed, and the hidden layer neurons were only optimized for the UCD database. When the same DBN classifier was used on a different data set, the accuracy decreased to 85.26%. Lower accuracy values can be expected when different data sets are used for training and testing. In the first case, of training and testing on the same data set, the accuracy values were higher than in the proposed methodology. On the other hand, the sensitivity reached lower values in the study of Mostafa et al. [49]. In the case of testing on a different database, the accuracy was comparable, and the sensitivity was lower than that reported in our study. Furthermore, AUC values were not presented in the study by Mostafa et al. [49].
In the study by Biswal et al. [21], SaO2, airflow, and signals from the chest and abdomen belts were used as reference. Two publicly available data sets were used in this case. They utilized an RCNN for classification. Different accuracies were reached in the case of training and testing on the same data set; namely, 85.7% and 81.7%, respectively. When testing on different data sets, it decreased to 78.7% in the first case while, in the second case, the accuracy increased to 83.3%. This study used more reference signals than the proposed methodology, with comparable results.
In the study of Pathinarupothi et al. [20], SpO2 and instantaneous heart rate (IHR) were used for apnea classification. The LSTM method was used in this case. The accuracy when using SpO2 reached 95.5% and, in the case of using a combination of SpO2 and IHR, the accuracy decreased to 92.1%. This study proposed minute-to-minute apnea classification. It involved the analysis of 35 subjects but, in the case of the SpO2 reference signal, it involved only 8 subjects. In comparison to the proposed methodology, the study of Pathinarupothi et al. [20] was performed on a data set which was approximately 30 times smaller, and with a time resolution that was 60 times coarser.
In a study by Cen et al. [28], a 2D CNN was used for apnea detection, based on the SpO2, oronasal airflow, and ribcage and abdomen movement reference signals. The study was performed using 25 patients. The method was based on feature extraction, and the concrete features were validated by the CNN. An average accuracy value of 79.61% was reached across all classes; however, the average classification accuracies in the normal, hypopnea, and apnea classes were 82.20%, 53.61%, and 66.24%, respectively. In comparison to the proposed methodology, the study by Cen et al. [28] was performed on a smaller data set, with lower accuracies and lower time resolution; however, it was based on more reference signals than our proposed methodology.
The study by Mostafa et al. [52] used an SpO2 reference signal for classification by a 1D CNN. Three publicly available databases were used, consisting of the records of 8, 25, and 70 subjects, respectively. The largest database was used for training purposes, and then the model was tested on the others. This study also applied a transfer learning technique to retrain the CNN on the other databases. All of the tests were performed considering 1 min-, 3 min-, and 5 min-long segments. The accuracy was in the range of 84.53–92.65% in the first case, and the sensitivity was in the range of 56.72–91.64%. Lower values were reached when testing on databases other than the training set. In the second case (i.e., transfer learning), the accuracy values were in the range of 84.85–94.24%, the sensitivity was in the range of 58.32–92.04%, and the specificity was in the range of 93.32–96.78% (see Table 17). Therefore, transfer learning improved both the accuracy and sensitivity. In comparison to the proposed methodology, the study of Mostafa et al. [52] was performed on a smaller data set, with comparable accuracies and lower time resolution. The sensitivity was lower in the study by Mostafa et al. [52], which used only one reference signal; namely, SpO2.
Overall, the use of ECG signals generally corresponds with high apnea classification accuracy. On the other hand, in the study of Pathinarupothi et al. [20], the success of the classification decreased with the use of the ECG signal. In this paper, the proposed solution was designed to mimic the work of a physician who scores apnea according to AASM standards.
A large number of approaches have been tested to detect apnea; however, a major limitation of most studies is the size of the data set used. For example, Varady et al. [19] conducted research on a set of 18 PSG recordings, but with an average recording time of only 4 h. The study of Almazaydeh et al. [53] had a sample of 50 subjects, but these were healthy individuals who consciously simulated respiratory pauses. In the study of Janbakhshi et al. [54], 35 records were used. Probably the most extensive study we were able to find is that of Steenkiste et al. [48], who used a data set [55] (available online at https://physionet.org/content/shhpsgdb/1.0.0/, accessed on 2019) from which they selected 2100 individuals, of which 100 individuals were used for training, and 2000 for testing. However, this is a data set [55] from 1997, where PSG is presented as a “home” variant and apnea was marked from saturation and EEG; the primary orientation of the data set is on cardiovascular disease.
A limitation of this study is that its accuracy was lower than the highest values reported in the current literature. The lower accuracy was due to the effort to imitate the AASM manual while, at the same time, keeping the network as streamlined as possible. The established network was tuned within a limited number of iterations, so it is possible that a higher accuracy could be achieved. At the same time, unlike other studies, we did not use segment-level information (from each patient) to train the network, instead working at the record level from each patient. Consequently, the network was tested on new patients about whom it had no previous information, and the reported classification accuracy reflects this approach. Another limitation lies in the usage of only the k-NN reference method to assess the performance of the proposed CNN. Further work can be carried out to interpret the weights across kernel layers and to bring insights into the features differentiating respiration events from the baseline. The current paper relied on the scoring guidelines, and more information about the respiration events could be obtained from the trained network. In the case of finding the optimal kernel length, the number of filters, and the number of neurons, it was observed that the best-performing structures only slightly differed in their accuracy. For this reason, a mean value was used. It is possible that some other metric may be more suitable for identifying the best solution.
We presented how the convolutional network behaved when trained on such a large database. In addition, we offer open access to our code for further use. The proposed approach mimics the work of a physician, such that it is possible to use the methodology in the preparation of training for physicians and laboratory nurses in sleep laboratories. Another advantage is the high time resolution that the proposed network offers.
Table 17.
Comparison of different types of classifiers and their accuracies.
Ref. | Signal | Classifier | Accuracy | Sensitivity | Specificity |
---|---|---|---|---|---|
[21] | CB, AB, SaO2, AF | ut. RCNN | 85.7%/81.7% | - | - |
[21] | CB, AB, SaO2, AF | ut. RCNN | 78.7%/83.3% | - | - |
[49] | SpO2 | DBN | 97.64% | 78.75% | 95.89% |
[49] | SpO2 | DBN | 85.26% | 60.36% | 91.71% |
[20] | SpO2, IHR | LSTM | 92.1% | 84.7% | - |
[20] | SpO2 | LSTM | 95.5% | 92.9% | - |
[28] | SpO2, ONAF, RIB, AB | CNN2D | 79.6% | - | - |
[52] | SpO2 | CNN1D | 84.53–92.65% | 56.72–91.64% | 90.19–94.60% |
[52] | SpO2 | CNN1D | 84.85–94.24% | 58.32–92.04% | 93.32–96.78% |
[31] | NP | CNN | 96.6% | 81.1% | 98.5% |
[27] | THO | CNN | 70.7% | - | - |
[27] | ABD | CNN | 72.3% | - | - |
[27] | NAF | CNN | 77.6% | - | - |
[27] | THO, ABD | CNN | 77.7% | - | - |
[27] | NAF, THO | CNN | 82.0% | - | - |
[27] | NAF, ABD | CNN | 82.6% | - | - |
[27] | NAF, ABD, THO | CNN | 83.5% | - | - |
[26] | NAF | CNN | 74.70 ± 1.43% | - | - |
[24] | ECG | CNN | 98.91% | 97.82% | 99.20% |
[56] | ECG | AB | 87.33% | 81.99% | 90.72% |
[57] | ECG | LS-SVM | 84.74% | 84.71% | 84.69% |
[44] | ECG | BA | 85.97% | 84.14% | 86.83% |
[58] | ECG | SLBP | 89.80% | 88.46% | 90.63% |
[59] | ECG | 1-D CNN | 87.90% | - | - |
[60] | ECG | CNNLSTM | 86.25% | - | - |
[51] | ECG | AlexNet CNN | 86.22% | 90.00% | 83.82% |
[14] | EEG | k-NN | 75% | 70% | 92% |
[14] | EEG | SVM (RBF) | 95% | 90% | 100% |
[14] | EEG | SVM (polynomial) | 99% | 100% | 98% |
[14] | EEG | ANN | 86% | 75% | 100% |
[5] | EEG | DFA + SVM | 95.1% | 93.2% | 98.6% |
[10] | EEG | CNN + LSTM | 76.5–84.5% | - | - |