1. Introduction
Cardiotocography (CTG) is a non-stress diagnostic method for monitoring fetal well-being during the third trimester or during labor [
1]. CTG continuously records maternal uterine contractions (UC) via a pressure transducer placed on the abdominal wall, and the fetal heart rate (FHR) via an external ultrasound probe on the maternal abdominal wall. The simultaneous readouts can be displayed in real time. Based on expert criteria [
1], CTG is typically interpreted by clinicians as Normal, Suspect or Pathologic. In developed countries, CTG is one of the most popular methods of assessing fetal well-being [
2]. Some authors even argue that CTG is overused in low-risk cases [
2]. There is a connection between CTG and perinatal mortality and morbidity, as a pathological CTG result is linked to a low APGAR score and admission to neonatal intensive care units (NICU) [
3]. The status of the fetus can also be used to observe fetal distress. Depending on the underlying causes, the degree of the distress, and the promptness of medical interventions, fetal distress can result in a variety of outcomes. If fetal distress is temporary, it can be resolved by changing the mother’s position, administering oxygen to the mother, adjusting intravenous fluids, or, if necessary, performing an emergency cesarean section (around the end of the third trimester). All these steps can help improve the baby’s condition and lead to a positive outcome. However, if fetal distress is prolonged, it can lead to long-term negative outcomes such as cognitive impairments, learning disabilities, motor impairments, conditions such as cerebral palsy, or, in rare cases, even stillbirth. Lack of oxygen usually leads to prolonged fetal distress [
4]. In some cases, it also results in birth asphyxia (which accounts for approximately 900,000 neonatal deaths annually) [
5]. Fetal mortality is more common in low-income nations than in high-income nations overall, underscoring the differences in healthcare access and resources across these areas. Although the global neonatal mortality rate (per 1000 live births) has decreased from 36.7 (1990) to 17 (2020) in the past three decades, it is still comparatively higher for low-income regions [
6]. Even in high-income regions, one of the most common causes of fetal death is complications of the placenta (which are also related to fetal distress). In the District of Columbia, USA, 24.4% of fetal deaths in 2020 were due to complications of the placenta [
7]. Hence, recognizing the status of the fetus is important in assessing fetal well-being. CTG can provide an early indication of fetal distress. CTG tests are time- and resource-efficient; thus, they mitigate patient discomfort, especially when patient numbers are high. CTG tracing patterns such as fixed FHR baselines, loss of FHR variability, and absence of accelerations are indicative of a non-reassuring case [
8,
9]. CTG is visually interpreted by an expert, and automated mechanisms are being proposed to supplement this activity. Machine learning can be used to detect fetal hypoxia and the status of the fetus [
10,
11,
12,
13]. This research proposes a diagnostic model that classifies and predicts the fetus status as well as the CTG morphological patterns. Missing data in CTG recordings can have a significant impact on the interpretation of the fetal well-being and can lead to suboptimal decision-making in managing labor and delivery. Missing data can lead to the following issues:
An incomplete assessment of the fetal well-being because the CTG recording contains crucial information regarding the fetal heart rate and uterine contractions. As a result, chances to detect fetal distress or hypoxia early may be lost.
Misinterpretation of the CTG pattern. As a result, unnecessary interventions such as emergency cesarean sections may be performed.
A delay in the decision-making and the proper management of labor and delivery. As a result, this may have negative effects on the well-being of the mother and the fetus.
Issues such as missing values can be resolved during the preprocessing stage. Thus, preprocessing of the CTG dataset is quite necessary. In [
14], an algorithm is described that involves two iterative steps for filling in missing data. In the first “reconstruction step”, an adaptive dictionary is used to reconstruct the signal that leads to estimation of missing data, and then, in the second step, a new dictionary is calculated using the KSVD (k-singular value decomposition) algorithm based on the reconstructed signal from the first step. These two steps are repeated until convergence is achieved. The algorithm displayed good results particularly for consecutive missing samples. The dataset [
15] considered for this research was the result of an automated analysis of the SisPorto 2.0 program [
16]. The SisPorto program addressed the missing data problem. The hypothesis for this research is that by using a machine learning model based on feature extraction, feature selection, and Bayesian optimization, it is possible to accurately diagnose and classify the various fetal conditions (Normal, Suspect, Pathologic), as well as the CTG morphological patterns, offering a potential decision support tool for managing pregnancies. Elaborating on the hypothesis, the following objectives are proposed for diagnosing fetal well-being: to counter the imbalanced nature of the CTG dataset; to propose an encoder-bottleneck information variable (discussed in the Methodology section); to implement feature extraction (to counter the comparatively larger size of the CTG dataset obtained after implementing the first objective); to implement feature selection; to perform Bayesian optimization (to further increase the performance of the proposed model); to implement classification; and to formulate a method to integrate all the above-mentioned modules.
2. Related Work
Several comparative studies [
17,
18,
19,
20,
21,
22,
23,
24,
25] have been conducted to evaluate the performance of various classifiers on the CTG dataset [
15]. These studies have utilized a variety of classifiers such as Artificial Neural Network (ANN), Long Short-Term Memory (LSTM), eXtreme Gradient Boosting (XGB), Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Light Gradient Boosting Machine (LGBM), Random Forest (RF), AdaBoost, Bagging and Stacking, Decision Tree (DT), Naïve Bayes (NB), Logistic Regression (LR), Classification and Regression Trees (CART), Levenberg–Marquardt (LM) backpropagation, Resilient Backpropagation (RP), and Gradient Boosting Machine (GBM). The studies achieved accuracies ranging from 83.65% to 96.61% and generally concluded that RF is the best-performing classifier. The NB classifier combined with the Firefly algorithm and random feature selection resulted in an accuracy of 86.54% (8 features) [
26]. A stacked model approach was used in [
27], which included a combination of multiple models, to counter the imbalance in the CTG dataset [
15] with its anti-interference traits. The results showed an accuracy of 96.08%. An AutoML approach with Synthetic Minority Oversampling Technique (SMOTE) was implemented [
28] for the CTG dataset [
15]. Out of all the models used in PyCaret, LGBM had an accuracy of 95.61%. Authors in [
29] proposed their own model (95% accuracy) for feature selection after implementing SMOTE on the imbalanced CTG dataset [
15]. The Differential Privacy (DP) framework-based neural network model (91% accuracy) [
30] had two binary classifiers that classified the CTG dataset [
15]. An Apriori algorithm-based classification model was proposed in [
31]. The proposed model (with AdaBoost and RF) had feature selection as well. In addition, the suspect class of the CTG dataset [
15] was split into normal and pathological classes to increase overall model accuracy. Relevant CTG features of the CTG dataset [
15] were selected via Principal Component Analysis (PCA) and then fed to an SVM-AdaBoost model (93% accuracy [
32]). The adjustment parameters were tweaked via a self-learning algorithm in a Fuzzy C means clustering-based ANFIS model [
33], and model accuracy was 96.39% when 9 features were manually selected from the CTG dataset [
15]. In [
34], it was observed that the two outputs (of the CTG dataset [
15]) have shared representations which allowed the model to utilize shared features between the two outputs.
The inspiration for using different modules (discussed in
Section 3.6) came from the above-mentioned related literature. Hence, the proposed model of this study includes modules such as a method for balancing the dataset, feature extraction, feature selection, and hyperparameter optimization. The main difference between the proposed model and the related literature is that none of those studies combines all the modules used in the proposed model in this manner. The methods for balancing the dataset, feature extraction, feature selection, hyperparameter optimization, and classification were selected based on their respective performances in the related literature. The dataset was balanced using SMOTE (
Appendix A.4), feature extraction was implemented using an autoencoder (
Section 3.1), feature selection was implemented using Recursive Feature Elimination (
Section 3.2), hyperparameter optimization was implemented using Bayesian optimization (
Appendix A.1), and classification was implemented using Random Forest (
Section 3.3).
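To make the balancing step more concrete, the following is a minimal sketch of the interpolation idea behind SMOTE, written in plain Python. It is not the implementation used in this study; the function name `smote_like_oversample`, the neighbour count `k`, and the toy data are illustrative assumptions.

```python
import random


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to the base sample.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        # Place the new point at a random position on the segment base -> neighbour.
        lam = rng.random()
        synthetic.append(tuple(a + lam * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Toy minority class with two hypothetical features (illustrative values only).
minority_samples = [(1.0, 2.0), (1.5, 2.5), (2.0, 2.0), (1.2, 3.0)]
new_points = smote_like_oversample(minority_samples, n_new=6)
```

Because every synthetic point lies on a segment between two existing minority samples, the new points stay inside the region already covered by that class, which is why the technique enlarges the dataset without adding real recordings.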
4. Results
The simulations were performed in a Python 3.8 environment and were divided into two parts. Part 1 covers the fetal status aspect of the CTG dataset, whereas Part 2 covers the CTG morphological pattern aspect of the CTG dataset. For comparison purposes, RF (without the proposed algorithm) was also used on the CTG dataset. For both parts, the training-to-testing ratio was set to 75:25. RF naturally supports multiclass classification, so it was used directly on this multiclass dataset.
4.1. Fetal Status Classification
The performance analysis (using the performance metrics given in
Table 1) of the proposed model for the fetal status is given in
Table 5. For an easier comparison, the table also contains entries from the case in which only basic RF (without the proposed algorithm) was used.
The accuracy of the proposed model for CTG fetal status was 96.62% (with 13 features), whereas basic RF applied to the same dataset (with all 21 features) achieved an accuracy of 93.61%. The confusion matrix of the proposed model for fetal status is shown in
Table 6. For ease of comparison, the entries in the confusion matrix are depicted as percentages and the table also contains entries from the case in which only basic RF (without the proposed algorithm) was used.
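As an illustration of how a confusion matrix of counts can be expressed as percentages, the sketch below normalizes each row by its total. It assumes that rows correspond to the true class, which is a common convention but is not stated explicitly here; the counts are hypothetical and are not taken from Table 6.

```python
def row_percentages(confusion_counts):
    """Convert a confusion matrix of counts into row-wise percentages (rows = true class)."""
    percentages = []
    for row in confusion_counts:
        total = sum(row)
        percentages.append([100.0 * value / total if total else 0.0 for value in row])
    return percentages


# Hypothetical counts for (Normal, Suspect, Pathologic); not taken from Table 6.
counts = [
    [90, 8, 2],
    [3, 45, 2],
    [1, 1, 18],
]
percent_matrix = row_percentages(counts)
# The diagonal entry of each row is the percentage of that class predicted correctly.
correct_per_class = [percent_matrix[i][i] for i in range(len(percent_matrix))]
```

Under this convention, the diagonal percentage of a row coincides with the recall of that class, which makes the table directly comparable across classes of different sizes.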
The ROC (with AUC) and PR were measured for all three classes (Class 1 = Normal, Class 2 = Suspect, and Class 3 = Pathologic) individually, as observed in
Figure 5 and
Figure 6, respectively.
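Per-class ROC analysis of a multiclass problem is commonly done in a one-vs-rest fashion, where each class in turn is treated as positive and the remaining classes as negative. The sketch below computes the AUC for each class from its rank interpretation (the probability that a positive sample receives a higher score than a negative one). The labels and probabilities are hypothetical, and the helper names are illustrative rather than part of the proposed model.

```python
def binary_auc(scores, is_positive):
    """AUC as the probability that a positive outranks a negative (ties count as 0.5)."""
    positives = [s for s, flag in zip(scores, is_positive) if flag]
    negatives = [s for s, flag in zip(scores, is_positive) if not flag]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def one_vs_rest_auc(class_scores, labels):
    """Compute one AUC per class, treating that class as positive and all others as negative."""
    return {
        cls: binary_auc(scores, [label == cls for label in labels])
        for cls, scores in class_scores.items()
    }


# Hypothetical true labels and per-class predicted probabilities for six recordings.
true_labels = ["Normal", "Normal", "Suspect", "Suspect", "Pathologic", "Pathologic"]
predicted = {
    "Normal":     [0.80, 0.50, 0.30, 0.10, 0.10, 0.00],
    "Suspect":    [0.10, 0.45, 0.60, 0.40, 0.20, 0.30],
    "Pathologic": [0.10, 0.05, 0.10, 0.50, 0.70, 0.70],
}
auc_by_class = one_vs_rest_auc(predicted, true_labels)
```

In this toy example the Suspect class is the only one with an imperfect ranking, because one Normal recording receives a higher Suspect probability than one of the true Suspect recordings.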
The variation in the model accuracy during the full run of the proposed model for fetal status can be observed in
Figure 7. The highest accuracy, 96.62%, was achieved by the proposed model, when 13 features were selected.
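The idea of tracking accuracy while features are eliminated can be sketched as a generic backward-elimination loop. The code below is a simplified illustration of recursive feature elimination and is not the implementation used in this study: `importance_fn` and `score_fn` are hypothetical callables standing in for refitting a classifier and evaluating it, and the toy values are chosen only to show the mechanics.

```python
def recursive_feature_elimination(features, importance_fn, score_fn, min_features=1):
    """Drop the least important feature one at a time and keep the best-scoring subset."""
    current = list(features)
    history = []
    while True:
        history.append((score_fn(current), list(current)))
        if len(current) <= min_features:
            break
        # Importances are recomputed for the remaining features at every round.
        importances = importance_fn(current)
        weakest = min(current, key=lambda f: importances[f])
        current.remove(weakest)
    best_score, best_subset = max(history, key=lambda item: item[0])
    return best_subset, best_score, history


# Toy stand-ins: each feature has a fixed usefulness, and the score is their sum.
usefulness = {"a": 5, "b": 3, "c": -2, "d": -1}


def toy_importances(subset):
    return {f: usefulness[f] for f in subset}


def toy_score(subset):
    return sum(usefulness[f] for f in subset)


best_subset, best_score, history = recursive_feature_elimination(
    ["a", "b", "c", "d"], toy_importances, toy_score
)
```

The `history` list plays the role of an accuracy-versus-feature-count curve: the subset with the highest score is retained, which may be neither the full feature set nor the smallest one.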
4.2. CTG Morphological Pattern Classification
The performance analysis (using the performance metrics given in
Table 1) of the proposed model for the CTG morphological pattern is given in
Table 7. For an easier comparison, the table also contains entries from the case in which only basic RF (without the proposed algorithm) was used.
The accuracy of the proposed model for the CTG morphological pattern was 94.96% (with 14 features), whereas basic RF applied to the same dataset (with all 21 features) achieved an accuracy of 87.22%. The confusion matrix of the proposed model for the CTG morphological pattern is shown in
Table 8. For ease of comparison, the entries in the confusion matrix are depicted as percentages, and the table also contains entries from the case in which only basic RF (without the proposed algorithm) was used.
The ROC (with AUC) and PR were measured for all ten classes (Class 1 = A, Class 2 = B, Class 3 = C, Class 4 = D, Class 5 = E, Class 6 = AD, Class 7 = DE, Class 8 = LD, Class 9 = FS, and Class 10 = SUSP) individually, as observed in
Figure 8 and
Figure 9, respectively.
The variation in the model accuracy during the full run of the proposed model for the CTG morphological pattern can be observed in
Figure 10. The highest accuracy, 94.96%, was achieved by the proposed model, when 14 features were selected.
Figure 7 displays the complete run for fetal status (in which 3 conditions of the fetus were used as target output), whereas
Figure 10 displays the complete run for the CTG morphological pattern (in which 10 CTG classes were used as target output). The difference in accuracy between the two graphs stems from the number of target classes: for the fetal status case, the target output had only three classes, so classification was easier, whereas for the CTG morphological pattern case, the target output had 10 classes (refer to the dataset subsection), which made accurate classification comparatively more difficult. Still, the proposed model produced good results for the latter case compared to using only the basic RF classifier.
4.3. Overview of Bayesian Optimization
The main reasons for using Bayesian optimization in this study are to efficiently explore the hyperparameter space, to reduce the computational cost of fine-tuning the hyperparameters, and to improve the overall performance of the proposed model. For instance, for the fetal status part, when 13 features were selected (after the RFE module) and no Bayesian optimization was used, the accuracy was 96.54%. When Bayesian optimization was applied after the RFE module, the accuracy for 13 features rose to 96.62%. In essence, Bayesian optimization fine-tunes the proposed model and yields better results. The performance metrics table and confusion matrix for both cases are given in
Appendix A.2 for comparison. The optimum hyperparameters of the proposed model for fetal status and for the CTG morphological pattern obtained after the Bayesian optimization module are given in
Table 9.
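Bayesian optimization explores the hyperparameter space efficiently by scoring candidate settings with an acquisition function computed from a surrogate model. The sketch below shows one common choice, expected improvement under a Gaussian prediction. The surrogate and acquisition function actually used in this study are not specified here, so this choice, the candidate names, and all numbers are assumptions made purely for illustration.

```python
import math


def expected_improvement(mean, std, best_so_far, xi=0.0):
    """Expected improvement of a candidate under a Gaussian surrogate (maximisation)."""
    improvement = mean - best_so_far - xi
    if std <= 0.0:
        # With no predictive uncertainty the improvement is deterministic.
        return max(improvement, 0.0)
    z = improvement / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return improvement * cdf + std * pdf


# Hypothetical surrogate predictions (mean accuracy, standard deviation) for three
# candidate hyperparameter settings; the numbers are illustrative only.
candidates = {
    "setting_a": (0.960, 0.002),
    "setting_b": (0.955, 0.020),
    "setting_c": (0.966, 0.000),
}
best_accuracy = 0.9654
scores = {
    name: expected_improvement(m, s, best_accuracy)
    for name, (m, s) in candidates.items()
}
next_to_try = max(scores, key=scores.get)
```

In this toy case the candidate with the lowest predicted mean but the largest uncertainty obtains the highest expected improvement, which illustrates how the method balances exploring uncertain regions against exploiting settings that are already known to perform well.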
4.4. SHAP Analysis
The SHAP summary plot is a beeswarm-type plot, in which the features are represented on the y-axis (sorted by importance) and the SHAP values are represented on the x-axis (the SHAP value measures the contribution of each feature to the predicted output). The SHAP output differed between the two cases, as for fetal status the target output consisted of 3 classes, whereas for the CTG morphological pattern the target output consisted of 10 classes. After the implementation of the autoencoder, the new features were labeled as New Extracted Features (NEFs), which ranged from NEF1 to NEF14. A low feature value is depicted as a blue dot, whereas a high feature value is depicted as a red dot. For non-binary cases (as in this research), the color range runs from blue to red, with purple representing the middle feature value. The dots represent individual SHAP values for each data point in the test set. The horizontal bars, along the x-axis, represent the range of the SHAP values for each feature, whereas the length of those bars depicts the extent of the effect each feature has on the model.
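To give some intuition for what a SHAP value measures, the sketch below computes exact Shapley values for a tiny hypothetical scoring function by averaging each feature's marginal contribution over all feature orderings. This brute-force approach is only feasible for a handful of features and is not how SHAP values were obtained in this study; `toy_model`, the instance, and the baseline are assumptions for illustration.

```python
from itertools import permutations


def exact_shapley_values(model, instance, baseline):
    """Average each feature's marginal contribution over all feature orderings."""
    n = len(instance)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous_output = model(current)
        for index in order:
            # Switch this feature from its baseline value to its actual value.
            current[index] = instance[index]
            output = model(current)
            totals[index] += output - previous_output
            previous_output = output
    return [total / len(orderings) for total in totals]


def toy_model(x):
    """Hypothetical scoring function with one interaction term (not the paper's model)."""
    return 2.0 * x[0] - 1.0 * x[1] + 0.5 * x[0] * x[2]


instance = [1.0, 2.0, 4.0]
baseline = [0.0, 0.0, 0.0]
shapley = exact_shapley_values(toy_model, instance, baseline)
```

The resulting values sum to the difference between the model output for the instance and for the baseline, and the contribution of the interaction term is shared equally between the two features involved. A positive value pushes the prediction up and a negative value pushes it down, which is what the sign on the x-axis of the summary plot conveys.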
For the case of fetal status, in
Figure 11, it can be observed that NEF5 has the highest positive impact on the model and NEF4 has the highest negative impact. It should be noted, however, that NEF9, NEF6, and NEF14 also have a high negative impact on the model.
For the case of the CTG morphological pattern, in
Figure 12, it can be observed that NEF14 has the greatest positive impact on the model; however, the strength of the impact lies between low to medium (as observed by the blue and purple colored dots, respectively). NEF11 has the greatest negative impact on the model. Moreover, NEF4, NEF6, and NEF12 have a high positive impact on the model.
The main difference between the two graphs is that for the fetal status part, the NEFs had a high impact, both positive and negative, on the model (the maximum impact reached around 1.0), and comparatively fewer features had a significant impact on the model. In contrast, for the CTG morphological pattern part, the NEFs had a comparatively higher impact (the maximum impact reached around 2.0), and more features had a significant impact on the model.
In [
39], when SHAP was implemented, there were some original features (such as NZEROS: Number of Histogram Zeros, and DS: severe decelerations) that had no impact on the model whatsoever. In this proposed model, all those irrelevant features had been removed via the proposed algorithm. Thus, all new features had an impact on the model output.
5. Discussion
In general, the number of features and the accuracy of the proposed model are positively related, with fewer features leading to lower model accuracy (as observed in
Figure 7 and
Figure 10).
The performance analysis metrics (
Table 5) of the proposed model ranged from 0.92 to 0.99. This is a significant improvement from when only RF was used on the CTG dataset. When basic RF was used (without the proposed algorithm), the Precision and Recall values of the suspect case were very low (0.83 and 0.69, respectively), whereas in the proposed model, those values were 0.92 and 0.98, respectively. For the confusion matrix (
Table 6) of the proposed model, a great reduction was achieved in the “incorrect” predictions of suspect and pathological cases. When basic RF was used (without the proposed algorithm), 26.4% of the suspect cases were incorrectly predicted as normal, whereas with the proposed model this fell to only 1.5%, a relative decrease of 94.31% in the incorrect predictions between normal and suspect cases. For a sensitive field such as fetal well-being, this reduction in incorrect predictions is a valuable aspect of the proposed model. The ROC (
Figure 5) and PR (
Figure 6) curves for the fetal status case provide good insight into the ability of the proposed model to predict all three classes accurately and with good confidence (as all AUC values are above 0.99). The most important conclusion from the ROC and PR curves is that the model performs very well in classifying and predicting the pathological cases. In the medical context, the pathological cases are more concerning than the normal cases. This is because pathological cases need immediate care (as observed from
Table 4), so that the well-being of the fetus can be restored. Although the basic RF classifier (without the proposed algorithm) produced good results for the normal cases, the suspect and pathological cases were not predicted with a good confidence level. Many suspect cases were incorrectly predicted as either Normal or Pathologic. Considering the medical implications, such an error is more harmful than a normal case being incorrectly predicted as either suspect or pathological. In fetal status classification, the accurate classification of pathological and suspect cases is therefore more important than that of normal cases, and the proposed model increased the confidence levels for predicting both suspect and pathological cases.
The model accuracy of the proposed model for the CTG morphological pattern case was 94.96%. This was a relative increase of 8.87% over the 87.22% accuracy obtained when only RF was used without the proposed algorithm. For the basic RF classifier, only A, B, and LD had comparatively better predictions, whereas there were significant incorrect predictions for the rest of the classes in the CTG morphological pattern case, as observed in
Table 7. Moreover, for class E (shift pattern between Calm Sleep, CLASS A and Suspect pattern, CLASS SUSP), the correct predictions when using only the basic RF classifier were only 45.8% (with a recall value of 0.45). However, in the proposed model, the incorrect predictions were significantly reduced across all the classes. Another improvement was observed in the F1-score, where all morphological patterns displayed good metrics. All classes (except class A) had an F1-score above 0.91. The good performance of the proposed model can also be highlighted via a confusion matrix (
Table 8). For instance, for class E, the correct predictions increased to 98.7%, a relative increase of 115.5%. In addition, the recall value of class E increased to 0.98. Moreover, the important pathological and suspect-related classes (such as FS and SUSP) have comparatively fewer incorrect predictions in the proposed model than with only the basic RF classifier. The average percentage of correct predictions of the CTG morphological pattern using the basic RF classifier (without the proposed algorithm) was 80.94%, whereas with the proposed model the average across all classes was 94.99%.
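The percentage changes quoted in this discussion are relative changes with respect to the basic RF value, not differences in percentage points. The short sketch below recomputes them from the rounded figures given in the text; the helper name is illustrative, and small deviations in the last digit can arise because the inputs are already rounded.

```python
def relative_change_percent(old, new):
    """Relative change from old to new, expressed as a percentage of old."""
    return 100.0 * (new - old) / old


# Values quoted in the text (all in percent).
accuracy_gain = relative_change_percent(87.22, 94.96)     # basic RF -> proposed model
class_e_gain = relative_change_percent(45.8, 98.7)        # class E correct predictions
suspect_error_drop = -relative_change_percent(26.4, 1.5)  # suspect predicted as normal

print(f"{accuracy_gain:.2f}% {class_e_gain:.1f}% {suspect_error_drop:.1f}%")
```

For comparison, the absolute gain in accuracy for the CTG morphological pattern case is 94.96 - 87.22 = 7.74 percentage points.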
Before discussing the ROC and PR curves of the proposed model for CTG morphological pattern, the relationship between fetal status types and the CTG morphological pattern classes should be discussed. As observed from
Figure 13, the fetal status classes are distributed across all the CTG morphological pattern classes. Classes A, B, C, D equate to the normal fetal case. Classes AD and DE equate to the mostly normal fetal case with a minority of suspect cases. Classes SUSP and E equate to the suspect case. Class E has a shifting pattern that shifts between a normal calm sleep and a suspected pattern. Moreover, classes LD and FS equate to the pathological case. Although the ROC (
Figure 8) and PR (
Figure 9) curves of the proposed model for the CTG morphological pattern are better than those of basic RF (without the proposed algorithm), class A shows a decrease in Recall and F1-score. This is an acceptable compromise, as most of the incorrect predictions for class A fell into other normal case-related morphological patterns. The pathological case-related morphological patterns (LD and FS) and suspect case-related morphological patterns (E and SUSP) had very good performance analysis metrics.
Another aim of this research is to make it easier for future authors to reuse the tuned hyperparameters from this work in their own work on cardiotocography with machine learning. Reusing these hyperparameters would increase the productivity of future related work in this field.
A major issue of using this CTG dataset [
15] is that this dataset was derived from subjects of a developed country. Moreover, the sociological, demographic, and medical characteristics (such as maternal nutritional data, maternal health, etc.) of the subjects are not provided in this CTG dataset. All these variables affect third-trimester events and can potentially be used to fine-tune the proposed model. Further research is needed to verify the actual performance of the proposed model for subjects from developing countries. Instead of solely relying on the fixed CTG database, future research can address direct hardware integration with the proposed mechanism, which would facilitate real-world clinical trials of on-device CTG classification. The accuracy of the proposed mechanism can be further improved by utilizing a combination of more classifiers in future work. In this research, SMOTE synthetically increased the size of the dataset and an improvement in results was achieved. However, if more real entries are added into the CTG dataset [
15], then a further improvement can be achieved. Future work for this research can include a larger and real-time CTG dataset. Moreover, future work for this model can include deployment during multiple stages of labor, as inspired by [
40].
As this research used the CTG dataset that was sourced from SisPorto 2.0, the proposed model can be generalized to work with CTG datasets that have been sourced from the SisPorto programs. The current version, SisPorto 4.0 [
41], is also adapted to the 2015 FIGO guidelines for intrapartum fetal monitoring. Related research [
42] also highlights the benefits of utilizing computerized CTG (specifically SisPorto) by concluding that SisPorto has many advantages in clinical practice as compared to traditional CTG analysis. Another research paper [
43] corroborates the notion that the inclusion of SisPorto in health care results in reductions in the incidence of hypoxic-ischemic encephalopathy (HIE) and cesarean-based deliveries. Hence, in the domain of CTG, SisPorto and the CTG dataset related to it provide a good standard. The CTG dataset is widely used in experiments and research relating to CTG (a fact that is also depicted in
Table 10).
The comparison of the results of the proposed model with prior related work is given in
Table 10. All of the research work displayed in the table also used the same CTG dataset [
15], which was used in this research as well. This was done to highlight the merit of this research by linking it with reputed prior related works and also for providing a better comparison. The defining feature of this research is that it proposed a new model that utilized SMOTE, feature extraction, feature selection, and Bayesian optimization to classify and predict (and hence diagnose) both fetal status as well as CTG morphological patterns. Although there are previous studies that utilize multiple machine learning algorithms for classifying and predicting fetal condition, utilizing multiple machine learning algorithms to achieve this task along with countering the CTG dataset [
15] class imbalance issue while utilizing the same CTG dataset (and the model) for classifying and predicting the fetal status as well as the CTG morphological pattern, can be considered a novelty of this research. In terms of clinical applicability, the study (also backed by results in
Section 4) suggests that the proposed model has the potential to serve as a decision support tool for managing pregnancies. By accurately diagnosing and classifying fetal conditions and CTG morphological patterns, the model can aid healthcare professionals in making informed decisions and providing appropriate therapeutic interventions when necessary. This clinical applicability implies that the model could be integrated into existing healthcare systems (versions of SisPorto or SisPorto-inspired systems) to support prenatal care and delivery management, potentially leading to improved outcomes. The hypothesis (in
Section 1) was substantiated by the results. Thus, the proposed model can be used in tandem with the healthcare system to reduce adverse fetal outcomes. It can be inferred from the results that, through the accurate diagnosis and classification of fetal conditions, particularly the identification of suspect and pathological cases with a good confidence margin, the proposed model could help in timely intervention and appropriate management of high-risk pregnancies. By providing healthcare professionals with a decision support tool to monitor high-risk pregnancies more effectively, there is potential to detect, diagnose, and address complications or adverse outcomes (for both the fetus and the mother) in a timely manner.