1. Introduction
Obstructive Sleep Apnea (OSA) is a syndrome characterized by the partial or complete obstruction of the upper airways during sleep. This blockage leads to frequent awakenings to reopen the airway, disrupting sleep, causing excessive daytime sleepiness, and triggering a stress response in the body. The obstruction can also result in lowered blood oxygen levels during sleep [1], increased carbon dioxide levels, and potential damage to the cardiovascular system. OSA is also linked to a variety of health issues including stroke, high blood pressure, and even death [2,3,4,5,6]. These health problems are especially pronounced in individuals who are overweight and vary based on gender and age.
The prevalence of OSA varies widely, with estimates between 9% and 38% of the Italian population and a higher likelihood in older adults, men, and those who are obese [1,7,8]. Among older individuals, its prevalence may rise to 84% [1]. Despite an increase in research and medical attention towards OSA in recent years, it remains frequently undiagnosed. This underdiagnosis can be attributed to the lack of biomarkers capable of identifying the disease [9,10,11,12,13].
In 2019, CERGAS (Research Center on Health and Social Care Management at the Bocconi University) released data estimating that the annual costs associated with OSA in Italy are approximately 31 billion euros. On average, the cost for each patient with severe OSA was calculated to be around 3850 euros. Although an estimated 12 million people in Italy have moderate to severe OSA, only about 460,000 have been formally diagnosed, and merely half of these diagnosed patients have received treatment. This situation places Italy at the bottom among major countries in terms of the number of individuals diagnosed with OSA [14]. Considering that patients are typically diagnosed many years after the onset of the disease, the direct and indirect healthcare costs impose a significant burden on the National Health System (NHS), which affects every single citizen. Prevention and early diagnosis are the only ways to achieve an improved quality of life and cost containment [9,15].
For the diagnosis of OSA, polysomnography (PSG) is considered the gold standard, and the severity of OSA is typically measured using the apnea–hypopnea index (AHI), with thresholds set at ≥5/h for OSA diagnosis, ≥15/h for moderate to severe OSA, and ≥30/h for severe OSA [1]. However, this method is expensive [9] and requires the patient to be monitored continuously by healthcare professionals [16], leading to a scarcity of available testing and, consequently, delays in diagnosis and an increase in the burden of disease [17,18,19]. Therefore, Home Sleep Apnea Testing (HSAT) is often used as an alternative. HSAT offers several advantages over traditional PSG. One of the foremost benefits of HSAT is the convenience it provides; patients can undergo testing in the familiar and comfortable setting of their own home. This not only reduces the anxiety and discomfort often associated with spending a night in an unfamiliar sleep lab environment, but also removes the logistical challenges of arranging for an overnight stay away from home. Furthermore, HSAT stands out for its cost-effectiveness. Generally costing less than laboratory-based PSG, it becomes a more accessible option for a broader range of patients, breaking down financial barriers to obtaining a diagnosis.
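As an illustration of the AHI cut-offs quoted above, the following minimal sketch maps an AHI value to a severity band. The function name `ahi_severity` and the label strings are assumptions made for this example; the "mild" and "moderate" labels for the 5–15/h and 15–30/h ranges follow the usual clinical convention and are inferred from the stated thresholds rather than taken from the text.

```python
def ahi_severity(ahi: float) -> str:
    """Map an apnea-hypopnea index (events per hour) to a severity band.

    Thresholds follow the cut-offs quoted in the text:
    >= 5/h OSA, >= 15/h moderate to severe, >= 30/h severe.
    Label strings are illustrative assumptions.
    """
    if ahi < 0:
        raise ValueError("AHI cannot be negative")
    if ahi >= 30:
        return "severe"
    if ahi >= 15:
        return "moderate"
    if ahi >= 5:
        return "mild"
    return "no OSA"
```

For instance, `ahi_severity(17.2)` falls in the moderate band, so it would count as a positive case at the ≥15/h threshold but not at the ≥30/h threshold.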
Recent advancements in software technologies and Machine Learning (ML) methods have significantly enhanced the development of effective predictive and diagnostic tools, becoming increasingly prevalent in various fields of medical research and applications, including for OSA [11,12,20,21,22,23,24,25,26,27]. The prediction models described in existing research primarily utilize clinical data, such as demographic information (age and gender), comorbid conditions, anthropometric measures (Body Mass Index (BMI), waist and neck circumferences), symptoms of OSA, and physiological parameters (blood pressure, overnight pulse oximetry, and lung function tests). The effectiveness of these models in predicting OSA, as indicated by an AHI ≥ 5/h, has shown sensitivity rates ranging from 66% to 100% and specificity rates ranging from 30.8% to 76.2%. For predicting more severe OSA (AHI ≥ 15/h), the sensitivity ranges from 60.3% to 92.7%, with the specificity ranging between 33.3% and 90.7% [24]. The variability in these models’ ability to discriminate between cases may be due to factors such as the complexity of the models, sample size, OSA prevalence, and the proportion of cases with different severities of OSA. It is noted that most OSA prediction models prioritize higher sensitivity over specificity to facilitate early diagnosis, although this approach may result in a higher rate of false positives and potentially lead to unnecessary PSG testing [24].
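The trade-off between sensitivity and specificity discussed above can be made concrete with a short sketch that derives both quantities, together with accuracy, from confusion-matrix counts. The function name `screening_metrics` and the example counts are hypothetical and do not correspond to any data reported in this paper.

```python
def screening_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute sensitivity, specificity, and accuracy from confusion counts.

    tp/fn: diseased subjects flagged / missed by the screening tool.
    tn/fp: healthy subjects correctly cleared / wrongly flagged.
    """
    positives = tp + fn
    negatives = tn + fp
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one diseased and one healthy subject")
    return {
        "sensitivity": tp / positives,          # true positive rate
        "specificity": tn / negatives,          # true negative rate
        "accuracy": (tp + tn) / (positives + negatives),
    }


# Hypothetical counts: a tool tuned for sensitivity flags many healthy subjects.
example = screening_metrics(tp=90, fp=60, tn=40, fn=10)
```

With these invented counts, the tool detects 90% of diseased subjects but clears only 40% of healthy ones, which is the pattern that leads to unnecessary PSG referrals.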
The Berlin questionnaire (BQ) [28] stands out as one of the simplest and most widely implemented non-invasive screening tools for diagnosing OSA, demonstrating a sensitivity of 86% and a specificity of 95% for OSA diagnosis. Originally introduced in the United States (US), the BQ consists of a concise set of questions focused on the risk factors and symptoms associated with OSA, aimed at identifying patients at high risk who might benefit from undergoing PSG to facilitate increased diagnosis rates. While the standard BQ comprises 10 questions, we previously introduced a streamlined questionnaire version by using a trained classifier [22], reducing the questionnaire to just two questions (“simplified Berlin questionnaire”, or BQ-2). This abbreviated version has been shown to achieve results comparable to the original BQ, offering an efficient means of rapidly screening high-risk OSA patients.
The main aim of this research was to enhance the sensitivity, specificity, and accuracy of the conventional BQ by incorporating ML techniques. For this purpose, we developed an ML-enhanced BQ model (ML-10) capable of predicting the risk of OSA using the BQ items as model features. Additionally, we explored a simplified version of ML-10, called ML-2, based on BQ-2 [22], to determine whether it yields comparable results. The predictive performance of these models was evaluated against the conventional BQ approach, which does not incorporate ML techniques. Furthermore, we utilized the ML-10 and the ML-2 models to identify patients with OSA at two different AHI thresholds, ≥15/h and ≥30/h, thereby assessing their efficacy across a spectrum of OSA severity.
In conclusion, the integration of an ML algorithm into the conventional BQ demonstrated a significant enhancement in the ability to predict the risk of OSA across various severity thresholds. This advancement underscores the potential of ML-enhanced diagnostic tools in improving the early detection of OSA. The findings of this research validate the application of innovative ML approaches in enhancing the diagnostic processes for OSA, potentially leading to more timely and effective interventions for this widely prevalent but underdiagnosed condition.
The remaining sections of this paper are organized as follows:
Section 2 details the participants and methods used in this study, including the study design, OSA diagnosis process, and ML predictive models.
Section 3 presents the results of our experiments, comparing the performance of the conventional BQ, the ML-10 model, and the simplified ML-2 model.
Section 4 discusses the implications of our findings, situates our work within the broader context of existing research, and outlines the limitations of our study.
Finally, Section 5 concludes the paper with a summary of our contributions and suggestions for future research.
2. Participants and Methods
2.1. Design
From January to December 2023, an observational multicenter study was conducted across two Italian hospitals: the Otorhinolaryngology Unit at the “Vito Fazzi” Hospital in Lecce and the Otorhinolaryngology Head & Neck Surgery Unit at the IRCCS Humanitas Research Hospital in Milan. A total of 462 subjects, including 112 from Lecce and 350 from Milan, were screened due to suspected symptoms of OSA and underwent HSAT.
2.2. Participants
The inclusion criteria for this study were as follows: (1) age ≥ 18 years and (2) completion of an HSAT recording. Before the HSAT examination, a baseline screening questionnaire was used to assess each participant’s basic information, medication history, and surgical history. Height and weight were measured, and BMI (kg/m²) [28] was calculated, at the time of registration.
2.3. OSA Diagnosis
All the sleep-related signals were obtained using an HSAT device (Embletta Gold Portable Testing Device®, RemLogicE® Software v3.4.4 (2015), Embla System Inc., Broomfield, CO, USA, used in Lecce, and the Embletta® Multi Parameter Recorder-Polygraph (MPR-PG), RemLogicE® 3.4.1, Embla Systems, Kanata, ON, Canada, used in Milan). This study adhered to the guidelines set forth by the American Academy of Sleep Medicine (AASM) [29,30].
2.4. The Berlin Questionnaire and the Simplified Berlin Questionnaire
The BQ [28] is structured into three categories that assess the risk of sleep apnea. Patients are classified as either high risk or low risk for OSA based on their responses to individual items and their cumulative scores within these symptom categories. Category 1, comprising five items, focuses on snoring behaviors. Category 2, with three items, investigates daytime somnolence. Category 3 consists of a single item that evaluates the presence of hypertension. A positive score in the first two categories requires frequent symptom occurrence, defined as more than 3–4 times per week. In contrast, a positive score in the third category results from either a history of hypertension or a BMI greater than 30 kg/m² [28]. The overall assessment is based on the collective responses across these categories, with patients categorized as high risk for OSA if they have positive scores in two or more categories; otherwise, they are deemed low risk [28].
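The category-level decision rule described above can be sketched as follows. This is a simplified illustration only: the function names are hypothetical, and the item-level scoring inside Categories 1 and 2 is not reproduced, so their outcomes are passed in as booleans that are assumed to have been determined beforehand.

```python
def bq_category3_positive(hypertension: bool, bmi: float) -> bool:
    """Category 3 is positive with a history of hypertension or BMI > 30 kg/m^2."""
    return hypertension or bmi > 30.0


def bq_high_risk(cat1_positive: bool, cat2_positive: bool, cat3_positive: bool) -> bool:
    """A patient is high risk when two or more of the three categories are positive."""
    return sum([cat1_positive, cat2_positive, cat3_positive]) >= 2


# Hypothetical patient: frequent snoring (Category 1 positive), no daytime
# somnolence (Category 2 negative), no hypertension, BMI of 31 kg/m^2.
high_risk = bq_high_risk(True, False, bq_category3_positive(False, 31.0))
```

In this invented case the patient is classified as high risk, because Category 1 and Category 3 are both positive.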
Our previous research showed that, among the ten questions in the standard BQ, two questions were sufficient to closely approximate the BQ output using a trained classifier. Further details are available in [22]. In summary, the first critical question assesses high blood pressure, asking, “Do you have high blood pressure?”. This inquiry is followed by one of two options regarding fatigue: “How often do you feel tired or fatigued after your sleep?” or “During your waking time, do you feel tired, fatigued or not up to par?” Either fatigue-related question can be paired with the blood pressure question. We arbitrarily opted to use the first one (“How often do you feel tired or fatigued after your sleep?”), because the models using one or the other yielded comparable results, suggesting that favoring one fatigue-related question over the other offers no significant advantage in the context of our study.
2.5. Statistical Analysis
The baseline characteristics and BQ items for all participants, encompassing patients with confirmed OSA and those without, underwent descriptive statistical analysis. Continuous variables were summarized using the mean and standard deviation (SD), whereas categorical variables were described using frequencies and percentages. Fisher’s exact test was employed to explore the associations between two categorical variables. Additionally, the Mann–Whitney U-test was utilized to assess the statistical significance of differences between the distributions of continuous variables in two groups of participants defined by their AHI values: those who are not at risk of OSA (AHI < 5) and those who are (AHI ≥ 5), according to the threshold defined in the BQ [28]. A p-value of less than 0.05 was considered statistically significant. The scoring of the BQ and all statistical analyses, including evaluations of both qualitative and quantitative variables, were performed using MATLAB software, version R2023b.
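For readers unfamiliar with Fisher’s exact test, the sketch below computes a two-sided p-value for a 2×2 contingency table from the hypergeometric distribution. It is an independent Python illustration, not the MATLAB routine used in this study; the function name `fisher_exact_two_sided` is an assumption, and the two-sided p-value is defined here as the sum of the probabilities of all tables that are no more likely than the observed one.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    With row and column totals fixed, the top-left cell follows a
    hypergeometric distribution. The p-value sums the probabilities of all
    tables that are at most as likely as the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance so that equally likely tables are included.
    p_value = sum(
        prob(x) for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)
```

As a usage example with invented counts, `fisher_exact_two_sided(3, 1, 1, 3)` gives 34/70, so such a small table would not reach the 0.05 significance level.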
2.6. Machine Learning Predictive Value
Calculating group statistics is crucial in establishing the statistical relevance of variables within a diagnostic context, allowing for the assessment of risk factors and relationships with comorbidities. However, it is widely recognized that statistical relevance does not equate to discriminant power, which is more critical for classification and prediction tasks. Variables that are statistically significant in a model do not necessarily guarantee superior prediction performance, and attributes deemed non-significant might be predictive. Therefore, we opted to investigate the predictive capabilities of the BQ using ML techniques. To this end, seven distinct classifiers were evaluated for their suitability in the predictive task: Naive Bayes, Support Vector Machine (SVM), Decision Trees, Error-correcting Output Codes (ECOCs), Discriminant Analysis, Ensemble of decision trees, and Artificial Neural Networks (ANNs). Among these, the Ensemble of decision trees demonstrated the best performance. This model was trained first with the ten responses from the standard BQ and then, separately, with only the two responses from the simplified version, BQ-2, resulting in the development and evaluation of two distinct models designated as ML-10 and ML-2, respectively.
We employed a 10-fold cross-validation (CV) approach for the training and quality assessment. For both models, features were normalized to a 0–1 range using min–max normalization on the training dataset in each CV iteration, with identical normalization parameters applied to the corresponding validation set.
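The fold-wise normalization described above can be sketched as follows. This is a minimal pure-Python illustration rather than the MATLAB code used in the study; the helper names (`kfold_indices`, `fit_minmax`, `apply_minmax`, `cv_splits`), the random shuffling, and the fixed seed are assumptions. The key point it shows is that the minimum and maximum are estimated on the training folds only and then reused unchanged on the held-out fold.

```python
import random


def kfold_indices(n_samples: int, k: int = 10, seed: int = 0) -> list:
    """Shuffle sample indices and deal them into k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_minmax(rows: list) -> list:
    """Return one (min, max) pair per feature column, from training rows only."""
    return [(min(col), max(col)) for col in zip(*rows)]


def apply_minmax(rows: list, params: list) -> list:
    """Scale each feature with previously fitted (min, max) pairs."""
    scaled_rows = []
    for row in rows:
        scaled = []
        for value, (lo, hi) in zip(row, params):
            span = hi - lo
            # A constant training column carries no information; map it to 0.
            scaled.append((value - lo) / span if span > 0 else 0.0)
        scaled_rows.append(scaled)
    return scaled_rows


def cv_splits(features: list, k: int = 10, seed: int = 0):
    """Yield (train, validation) feature sets, normalized per CV iteration."""
    for held_out in kfold_indices(len(features), k, seed):
        held = set(held_out)
        train = [features[i] for i in range(len(features)) if i not in held]
        valid = [features[i] for i in held_out]
        params = fit_minmax(train)  # fitted on the training folds only
        yield apply_minmax(train, params), apply_minmax(valid, params)
```

Because the parameters come from the training folds, validation values can occasionally fall slightly outside the 0–1 range; this is expected and avoids leaking information from the validation set into training.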
The Receiver Operating Characteristic (ROC) curve was used to illustrate the diagnostic capability of the models at various decision thresholds, providing a graphical representation of the trade-off between sensitivity (true positive rate) and 1 − specificity (false positive rate). Initially, we identified the specific operating point in the ROC space corresponding to the conventional BQ, indicating the combined sensitivity (ability to correctly identify cases at high risk of OSA) and specificity (ability to correctly identify low or non-OSA cases) achieved without integrating ML techniques. Subsequently, we compared this point with the performance of the ML-enhanced models (both ML-10 and ML-2) at equal specificity and at equal sensitivity, by moving vertically and horizontally, respectively, from the BQ point until the ROC curve of the ML-10 model was intersected. This approach allowed us to evaluate how ML-10 and ML-2 could enhance sensitivity while maintaining the specificity of the conventional BQ, and vice versa.
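The comparison at equal specificity and at equal sensitivity can be illustrated with the following sketch. It works on the empirical ROC points only (one per distinct score threshold) and does not interpolate between them, which is a simplifying assumption; the function names and the toy labels and scores are hypothetical and unrelated to the study data.

```python
def roc_points(y_true: list, scores: list) -> list:
    """Return (threshold, sensitivity, specificity) for each distinct score.

    A subject is predicted positive when its score is >= the threshold.
    """
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    points = []
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((thr, tp / n_pos, 1 - fp / n_neg))
    return points


def best_sensitivity_at_specificity(points: list, min_specificity: float) -> float:
    """Vertical move in ROC space: highest sensitivity with specificity kept."""
    eligible = [sens for _, sens, spec in points if spec >= min_specificity]
    return max(eligible, default=0.0)


def best_specificity_at_sensitivity(points: list, min_sensitivity: float) -> float:
    """Horizontal move in ROC space: highest specificity with sensitivity kept."""
    eligible = [spec for _, sens, spec in points if sens >= min_sensitivity]
    return max(eligible, default=0.0)


# Toy data: 1 = OSA, 0 = no OSA; scores are hypothetical classifier outputs.
labels = [1, 1, 1, 0, 1, 0, 0, 0]
outputs = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
curve = roc_points(labels, outputs)
```

Given a reference point such as that of a questionnaire, calling `best_sensitivity_at_specificity(curve, reference_specificity)` answers the question of how much sensitivity a classifier offers without giving up any specificity, and the second helper answers the converse.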
Subsequently, we extended our analysis to evaluate the ML-10 and ML-2 models across two different AHI thresholds (AHI ≥ 15 and AHI ≥ 30) referenced in the literature to classify OSA as moderate to severe, or severe, respectively [1]. For this purpose, the ROC curve was utilized to assess the classifier performance and to determine an “optimal” prediction threshold that maximizes accuracy. Binary classifiers were derived from this optimal operating point. Performance metrics including the Area Under the Curve (AUC), accuracy, sensitivity, and specificity were used to measure the models’ effectiveness. All computational analyses were performed using MATLAB software, version R2023b.
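A minimal sketch of selecting the accuracy-maximizing threshold and computing the AUC is given below. The text does not state how ties between thresholds are resolved or how the AUC is estimated, so both choices here are assumptions: ties are broken in favor of the lowest threshold, and the AUC is computed as the probability that a randomly chosen positive case scores higher than a randomly chosen negative case. Function names and toy data are hypothetical.

```python
def accuracy_optimal_threshold(y_true: list, scores: list) -> tuple:
    """Return (threshold, accuracy) for the cut-off with the highest accuracy.

    A subject is predicted positive when its score is >= the threshold.
    An infinite threshold (nobody predicted positive) is also considered.
    """
    n = len(y_true)
    best_thr, best_acc = None, -1.0
    for thr in sorted(set(scores)) + [float("inf")]:
        correct = sum(
            1 for y, s in zip(y_true, scores) if (s >= thr) == (y == 1)
        )
        acc = correct / n
        if acc > best_acc:  # strict comparison keeps the lowest tied threshold
            best_thr, best_acc = thr, acc
    return best_thr, best_acc


def auc_rank(y_true: list, scores: list) -> float:
    """AUC as the fraction of positive-negative pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data: 1 = AHI at or above the chosen cut-off, 0 = below it.
labels = [1, 1, 1, 0, 1, 0, 0, 0]
outputs = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
threshold, accuracy = accuracy_optimal_threshold(labels, outputs)
```

Maximizing accuracy weights false positives and false negatives equally; in a screening setting where missed cases are costlier, a different criterion might be preferred.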
2.7. Ethical Considerations
The experimental protocol received approval from the Bioethics Committees of the Local Health Authorities of Lecce (Protocol Number 74, dated 22 April 2022) and Milan (Protocol Number CET Lombardia 5-PIO X-153 /23, dated 19 September 2023). Conducted in full compliance with the Helsinki Declaration for Human Research, this study ensured the ethical treatment and protection of all participants. Written informed consent was secured from each subject who agreed to partake in the study, underscoring our commitment to ethical research practices. The ethical considerations of the study were meticulously outlined in the questionnaire introduction, designed in alignment with the principles established by the Italian Data Protection Authority (DPA). Participants were informed of their right to voluntary participation, with the explicit option to withdraw from the study at any point should they choose to. The process of obtaining informed consent was structured to emphasize the voluntary nature of participation, while highlighting that the confidentiality and anonymity of all collected information would be ensured. This approach ensured that participants were fully aware of their rights and the ethical standards of the study, fostering an environment of trust and respect for individual autonomy.
4. Discussion
OSA is increasingly recognized as a significant concern within global health and economic contexts, underlining the importance of its early detection and diagnosis in the realm of preventive medicine [1,17,31]. The prompt identification of OSA is essential for initiating timely interventions, which can mitigate a broad range of associated health risks and enhance patient outcomes. Given that the standard diagnostic test for OSA, namely in-laboratory PSG, is expensive and often subject to long wait times due to high demand, there is a clinical imperative to identify the key factors and develop a simple yet reliable tool for estimating the OSA risk [17,18]. In general, the BQ has an expectedly high sensitivity, as this tool has been developed for the identification of patients at a high risk of OSA in primary care settings. Despite this advantage, the BQ’s low specificity and consequent high misclassification rate reveal its limited discriminatory capability, rendering its utility comparable to subjective clinical judgments [30,32]. In the quest for a straightforward questionnaire to ascertain OSA risk, clinicians are demanding enhancements to existing tools. Arunsurat et al. [33] posited that with certain modifications, the BQ could serve effectively as an OSA screening instrument. Furthermore, Stelmach-Mardas et al. [34] added to the growing body of evidence indicating the BQ’s inadequacy in distinguishing between high- and low-risk patients, suggesting the need for the development of alternative protocols to heighten the diagnostic precision for such individuals.
In this research, we sought to advance the capabilities of the traditional BQ through the integration of ML techniques. Our research integrates ML models with the standard BQ to harness Artificial Intelligence capabilities for analyzing patterns and correlations in data that might not be immediately apparent to human evaluators. This method facilitates a more detailed assessment of risk factors, potentially identifying the subtle signs of OSA risk overlooked by conventional approaches. To determine whether our ML-10 and the simplified two-item version ML-2 outperform the traditional BQ in predicting patients at a high versus low risk of OSA, we conducted a comparative analysis using the established threshold used in the standard BQ (AHI ≥ 5) and by comparing points in the ROC space. The findings underscore the efficacy, in terms of sensitivity and specificity, of the ML-10 model when contrasted with the conventional BQ. A sensitivity of 93% at the same specificity as the conventional BQ indicates that the model correctly identifies 93% of individuals with OSA (AHI ≥ 5) while operating at the same true negative (TN) rate. This result is significant as it demonstrates that, while maintaining the same rate of false alarms (1 − specificity), the ML-10 model is more effective in detecting OSA risk cases compared to the conventional BQ. On the other hand, a specificity of 73% at the same sensitivity as the conventional BQ emphasizes that the ML-10 model reduces the number of false positives (healthy individuals erroneously identified as at risk of OSA) compared to the conventional BQ, while still correctly detecting 82% of true positives. In this way, the ML-10 model shows excellent performance in identifying non-risk cases, surpassing the conventional BQ.
These results indicate that the ML-10 model surpasses conventional BQ both in terms of sensitivity (when specificity is maintained) and specificity (when sensitivity is maintained). This implies that, depending on clinical or screening needs, the ML-10 model can be adjusted to optimize the ability to detect OSA risk cases (by maximizing sensitivity) or the ability to reduce false positives (by maximizing specificity), offering a more flexible and accurate approach in the diagnosis of OSA.
In the comparative evaluation between conventional BQ and the classifier based on its simplified version, the results indicate that ML-2, despite the significant reduction in the number of questions to only two, slightly outperforms BQ in terms of sensitivity and specificity (fixing one of the two variables at the BQ value). Additionally, the use of ML-2 offers the flexibility needed to adjust the operating point on the ROC curve depending on the specific needs of clinical or screening applications, thus providing a potential advantage in terms of customizing the diagnostic approach.
After assessing the ML-10 and ML-2 performance against the traditional BQ using a single cutoff, we expanded our analysis to include two clinically relevant AHI cutoffs. This step involved utilizing two AHI thresholds (AHI ≥ 15 and AHI ≥ 30) commonly used in the literature to categorize the OSA severity as moderate to severe, and severe, respectively [1]. The decision to employ these specific AHI thresholds is rooted in their widespread acceptance and use in clinical practice and research for defining the severity of OSA. Such a differentiated approach allows for a more detailed assessment of the models’ performance, providing insights into their predictive capabilities across a spectrum of OSA severity. This is particularly relevant for clinicians and healthcare providers seeking to tailor interventions and management strategies based on the severity of the condition. By choosing the optimal threshold for maximum accuracy, the ML-10 model performance consistently demonstrated its strength at both AHI thresholds.
These results highlight the potential for a more streamlined and efficient screening process. By examining whether a simplified model can retain or surpass the full BQ predictive accuracy, this study suggests the possibility of more accessible and less cumbersome OSA screening approaches. This is especially pertinent in primary care environments or areas with limited access to specialized sleep medicine services, where a rapid and dependable screening tool could significantly improve the early detection of individuals at risk of OSA. However, we should consider that using only two questions likely makes the test sensitive but not specific, as various diseases could present with the same broad symptoms.
The present study is subject to several limitations that merit consideration. Firstly, the participant cohort was drawn exclusively from two hospitals in Italy, limiting the data set representativeness of the broader population. Consequently, the predictive model developed herein might not possess widespread generalizability, potentially limiting its applicability to populations beyond the initial study setting or to diverse ethnic groups [24,35]. Secondly, this observational study did not account for undiagnosed medical conditions commonly associated with OSA, such as neurological, cardiovascular, and pulmonary disorders. The absence of these variables could impact the model’s predictive accuracy. Furthermore, our model lacked detailed anthropometric imaging or measurements, which might have restricted its ability to identify disease-specific causes of OSA accurately.
In light of these limitations, there is a clear need for further research to enhance the model’s robustness and applicability. To this end, we are planning a prospective clinical trial aimed at evaluating ML-10 and ML-2 across a more representative sample of the general population. This forthcoming trial is expected to address the current study limitations by incorporating a broader range of demographic and clinical variables, thereby improving the model’s predictive performance and generalizability.
5. Conclusions
Given the substantial proportion of individuals still undiagnosed with OSA, coupled with the current absence of definitive diagnostic biomarkers for the condition, there is a pressing need for improved screening methodologies. The BQ, when enhanced with ML techniques, stands out as a significant advancement in this regard. This study discovered that the ML-10 model was particularly effective in identifying individuals at risk of OSA with greater accuracy than the traditional BQ. By integrating ML techniques, we achieved a notable improvement in sensitivity and specificity, highlighting the potential of ML to refine diagnostic processes. This suggests that the ML-10 model can more effectively distinguish between high-risk and low-risk individuals, thereby reducing the likelihood of false positives and negatives. Furthermore, ML-2, with its reduced question set, also showcased its utility by maintaining slightly better diagnostic accuracy than the full BQ while offering a more streamlined and accessible screening tool. This adaptation could facilitate wider screening efforts, particularly in primary care settings or areas with limited access to sleep medicine specialists. Additionally, the flexibility of the classifier allows for adjustments across different operating points, enabling the selection of an optimal threshold that best balances sensitivity and specificity for the targeted population. This adaptability is crucial in tailoring the screening process to diverse clinical environments and patient needs, optimizing the early detection and management of OSA.
Moreover, the application of the ML-10 model extends beyond the commonly used AHI threshold of the standard BQ (AHI ≥ 5), demonstrating its utility across other clinically relevant AHI thresholds, specifically ≥15 and ≥30, which are frequently used in the literature to categorize the severity of OSA as moderate to severe, and severe, respectively. This versatility underscores the model’s ability to adapt to varying clinical requirements, offering a nuanced approach to diagnosing OSA across its spectrum. Such adaptability ensures that the ML-10 model is not only a tool for preliminary screening but also a significant asset in stratifying OSA severity, thus enhancing the precision of diagnostic decisions and subsequent management plans.
By leveraging these insights, healthcare professionals can better stratify individuals based on their risk levels, paving the way for more tailored diagnostic and management strategies for sleep apnea. ML-10 embodies the potential to transform the approach to diagnosing OSA, offering a more individualized assessment of risk. Looking forward, the insights gained from this research could serve as a foundation for further innovations in the field, ultimately leading to earlier detection, improved patient outcomes, and a reduction in the healthcare burden associated with OSA. These results can be achieved with minimal effort, because no modification to the BQ itself is necessary. The approach does not necessitate developing new questions or methodologies; instead, it leverages AI techniques to optimize an existing, widely used tool. This means that new screenings could achieve greater accuracy, and previously administered questionnaires could be easily re-examined using the ML-10 model. Consequently, more cases of OSA could be identified, and more healthy individuals could be correctly reassured. In the end, this study underscores the value of combining traditional clinical assessment tools with cutting-edge technology to address complex health challenges, marking a significant stride towards the future of personalized medicine in sleep health.