1. Introduction
The success of supervised machine learning algorithms in classification tasks depends on massive amounts of training data, despite alternatives under development such as few-shot learning [1,2,3] or methods designed for learning with label noise [4]. In the medical field, human-annotated labels are usually taken to be the ground truth [5]. However, noise in the labels undermines the robustness of any subsequent predictive model [6]. Label noise can arise in numerous ways. According to [4], its sources can be categorized into four aspects: (a) insufficient information; (b) non-expert labeling; (c) subjective labeling; and (d) data encoding problems. In our study, the label noise mainly comes from subjective labeling. Within subjective labeling, label noise can be further divided into two categories: random label noise and class-dependent label noise. Random label noise can be introduced when annotation fatigue sets in, leading to occasional random assignment of class labels. Class-dependent label noise, on the other hand, can be introduced when annotators hold an unconscious bias or misunderstand the annotation protocol for a given class definition. The potential sources of label noise are even more complex in medical data annotation. The rules for defining labels can vary enormously and may require physiological and clinical knowledge [6]. Label noise can worsen as more annotators join a project because of the variability in annotation quality achieved by different annotators [7], and the process is susceptible to high levels of inter- and intra-observer variability. One strategy to minimize annotation error is to have the same sample annotated by multiple annotators and then combine their labels to generate the final ground truth. Such an approach can mitigate subjective biases, but it is expensive and time-consuming; it is therefore rarely adopted for building large-scale training datasets and is typically reserved for the test set instead [4].
The major issue with label noise lies in its implications for the classification performance of supervised learning algorithms. Several studies have investigated its potential impact and have shown that noisy labels can adversely affect the classification accuracy of trained classifiers [8]. However, some deep learning models appear to be robust against even high levels of label noise [9]. In view of the possible impact of label noise on the final classification, a number of novel algorithms have been developed that can tolerate certain amounts of label noise [6]. These include semi-supervised strategies, where all of the training data are used as data samples but only the labels of selected samples are leveraged, with the remaining samples treated as unlabeled [10]. In [11], a gray zone was defined for intracranial pressure (ICP) values lying between the normal and high-risk ranges; samples in the gray zone were treated as unlabeled, and a semi-supervised approach was then applied. A similar framework was applied to false alarm detection in patient monitoring systems [12], indicating that semi-supervised learning with unlabeled samples can benefit both false alarm reduction and the true alarm recognition rate. Another approach is to identify and filter the label noise to improve the quality of the training data [13]. A major drawback of removing training samples with noisy labels is that the removed samples may also be informative. In the medical domain, label noise in training data may propagate through the trained model to downstream clinical decisions, leading to medical errors whose root causes are difficult to trace. Despite these significant implications, scant knowledge is available about the impact of label noise on biomedical data classification, with only a couple of studies dedicated to this topic [14,15]. In one study, a genetic algorithm-based method for mitigating label noise was developed to detect mislabeled samples during ECG signal classification [14]. Another study suggested using cross-validation to generate training data for five different classifiers; if the predictions of these five classifiers for a sample do not match its original label, the sample is flagged as mislabeled [15]. Both studies aim to identify and reduce label noise before deriving the final prediction model. To our knowledge, no studies have systematically investigated the impact of label noise on the downstream classification performance of various machine learning models in the area of biomedical data classification.
In this study, both random and class-dependent label noise were artificially introduced into the training dataset to emulate the errors associated with human fatigue and annotation bias, which are the main sources of label noise in medical data annotation. The focus of this paper was to assess the robustness of different types of AI-based models to label noise and to bring this relevant issue into the discussion of automatic approaches in the medical domain, which usually depend on annotated data and face the difficulties inherent in the labeling process. We investigate the impact of random and class-dependent label noise on the performance of machine learning (ML) and deep learning (DL) models using a binary classification task that assesses the quality of photoplethysmography (PPG) signals. PPG is a popular tool for assessing distal circulatory status and has gained popularity in recent years in cardiovascular research due to its ubiquitous presence in consumer wearable devices [16]. Its signal quality, however, can have a significant impact on downstream applications, which makes PPG quality assessment a good choice for examining the impact of label noise in the field of biomedicine.
2. Materials and Methods
This section describes the data collection and annotation process, the methods used to introduce artificial label noise into the data, and the learning models used for PPG quality classification. An overview of this pipeline is presented in Figure 1.
Figure 1 shows the main parts of the pipeline developed to study the impact of the different types of label noise induced in the data. The PPG data were collected from multiple patients, then segmented and normalized during the pre-processing stage; after that, each segment was labeled by the annotators. The labels of part of the data were then changed to simulate typical label noise, but in a controlled way, with specific percentages of flipped labels. Four different learning models were used to classify the PPG segments, and the performance degradation was assessed as the level of label noise introduced into the data increased.
2.1. Training Data
The training data set consists of PPG recordings (sampled at 240 Hz) from 3764 adult patients admitted to the intensive care unit (ICU) at the University of California, San Francisco (UCSF) Medical Center between March 2013 and December 2016. Continuous PPG recordings were divided into non-overlapping 30-s records. We randomly selected 25 of these 30-s PPG records from each patient, corresponding to a total of 78,278 30-s PPG records after preprocessing, as described in previous studies [17,18].
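As a minimal illustration (not the authors' code), the segmentation step described above can be sketched as follows, assuming the continuous recording is available as a NumPy array sampled at 240 Hz:

```python
# Sketch: splitting a continuous PPG recording into non-overlapping 30-s records.
# The 240 Hz sampling rate and 30-s window come from Section 2.1; everything else
# (array names, discarding the trailing partial window) is an assumption.
import numpy as np

FS = 240                 # sampling frequency in Hz
WIN_LEN = FS * 30        # 7200 samples per 30-s record

def segment_ppg(ppg: np.ndarray) -> np.ndarray:
    """Split a 1-D PPG recording into non-overlapping 30-s records,
    discarding any trailing partial window."""
    n_records = len(ppg) // WIN_LEN
    return ppg[: n_records * WIN_LEN].reshape(n_records, WIN_LEN)

# Example: a 5-minute synthetic recording yields 10 records of 7200 samples each.
records = segment_ppg(np.random.randn(5 * 60 * FS))
print(records.shape)  # (10, 7200)
```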
2.2. Test Data
The test set consists of PPG recordings (sampled at 240 Hz) from 13 patients (age range 19 to 91 years, median 73.5) admitted to the Neuro ICU at UCSF Medical Center between October 2016 and January 2018. The inclusion criteria were: (a) a diagnosis of acute ischemic stroke; (b) being at least 18 years old; and (c) being English-speaking. Patients with significant problems relating to their attention, alertness, cognitive function, or communication were excluded unless a legally authorized representative could give consent on their behalf. All enrolled patients provided written consent after being informed of the protocols approved by the UCSF Institutional Review Board [18]. These patients were prospectively studied to test the validity of using a wrist band to collect PPG signals compared with standard bedside devices, but the data analyzed in this study were from the standard device, which records PPG from the fingers.
2.3. Annotation
The annotation process followed the same rules as defined in previous studies [17,18]. The 30-s PPG and synchronously recorded electrocardiogram (ECG) waveforms were presented to annotators to determine the signal quality label ('Good quality', 'Bad quality', or 'Not sure') of each PPG record. Three strict conditions qualify a PPG record as a good quality signal: (i) its pulsatile waveform reflects blood volume changes in accordance with either physiological or pathophysiological characteristics of the cardiovascular system (we expect ECG arrhythmia to cause an irregular pulsatile PPG waveform); (ii) the waveform presents a consistent number of inflection points; and (iii) the waveform is free of irregularities that cannot be explained by changes in the synchronous ECG. Records that fail to meet any of these three conditions were labeled 'Bad quality'. Records labeled 'Not sure' were discarded from the subsequent analysis. In a trial step, we assigned 100 random 30-s PPG records to all five annotators. Cohen's kappa was used to assess inter-rater agreement, yielding a kappa coefficient of 0.87. The remaining PPG records were then assigned to each annotator without overlap for further annotation.
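For readers who wish to reproduce the agreement check, the sketch below shows one conventional way to compute it; since Cohen's kappa is defined for pairs of raters, we assume here that the reported value corresponds to an average over annotator pairs on the 100-record trial set (an assumption, as the pooling is not specified in the text):

```python
# Sketch: averaging pairwise Cohen's kappa over annotator pairs on the trial set.
# The annotation container and label strings are hypothetical.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def mean_pairwise_kappa(annotations: dict) -> float:
    """annotations maps annotator id -> list of labels for the same 100 records."""
    kappas = [
        cohen_kappa_score(annotations[a], annotations[b])
        for a, b in combinations(annotations, 2)
    ]
    return sum(kappas) / len(kappas)

# Example with two hypothetical annotators on five records.
print(mean_pairwise_kappa({
    "rater_1": ["good", "good", "bad", "good", "bad"],
    "rater_2": ["good", "bad", "bad", "good", "bad"],
}))
```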
2.4. Artifact Proportion
PPG records in the test set were also annotated in terms of artifact proportion, defined by selecting all segments in a PPG record that were considered corrupted and dividing their combined length by the total length (30 s) of the signal. The final distributions of labels in both training and test sets are shown in Table 1. The training set contained 30% bad quality records, and the test set presented a very similar distribution.
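The definition above is simple enough to state as a short computation; the sketch below, with hypothetical (start, end) spans in seconds, is one way to implement it:

```python
# Sketch: artifact proportion = combined length of annotated corrupted spans
# divided by the 30-s record length. The span format is an assumption.
def artifact_proportion(corrupted_spans, record_len_s: float = 30.0) -> float:
    """corrupted_spans: list of (start_s, end_s) tuples within one record."""
    total = sum(end - start for start, end in corrupted_spans)
    return total / record_len_s

print(artifact_proportion([(0.0, 3.0), (12.5, 15.5)]))  # 0.2
```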
2.5. Models
To investigate the impact of label noise on model performance, we adopted four different prediction models that take various representations of PPG signals as input. The learning models used in this work were selected from previous works that already made an extensive comparison of different approaches [17,18]. The continuous PPG signal was first segmented into 30-s segments and normalized between zero and one before being transformed into three signal representations: the 1D raw signal, a 2D signal plot, and expert-engineered features [17]. To generate 2D signal plots, each PPG record was transformed into a 224 × 224 × 3 RGB image with a pixel density of 96 dots per inch (DPI). Residual networks have demonstrated robustness to label noise in various image classification tasks (ImageNet, MNIST, CIFAR-100) [19]. Therefore, Resnet-18 was adopted as the model structure for samples represented as 2D signal plots, and a 1D Resnet-34 was selected as the classifier for samples represented as the 1D raw signal. Finally, two conventional machine learning algorithms, support vector machine (SVM) and XGBoost, were used for the expert-engineered features (both temporal-domain and spectral-domain). Details about these features can be found in our previous study [17].
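As an illustration of the 2D representation (a sketch under our own assumptions about plot styling, not the authors' rendering code), a normalized 30-s record can be rasterized to a 224 × 224 × 3 image at 96 DPI roughly as follows:

```python
# Sketch: rendering a min-max normalized 30-s PPG record as a 224 x 224 RGB image
# at 96 DPI for the 2D Resnet-18 input. Line width and axis removal are assumptions.
import io
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from PIL import Image

def ppg_to_image(segment: np.ndarray, size_px: int = 224, dpi: int = 96) -> np.ndarray:
    """Return a (size_px, size_px, 3) uint8 image of the normalized segment."""
    seg = (segment - segment.min()) / (segment.max() - segment.min() + 1e-12)
    fig = plt.figure(figsize=(size_px / dpi, size_px / dpi), dpi=dpi)
    ax = fig.add_axes([0, 0, 1, 1])
    ax.plot(seg, linewidth=0.5)
    ax.axis("off")
    buf = io.BytesIO()
    fig.savefig(buf, format="png", dpi=dpi)
    plt.close(fig)
    buf.seek(0)
    return np.asarray(Image.open(buf).convert("RGB").resize((size_px, size_px)))

image = ppg_to_image(np.sin(np.linspace(0, 60 * np.pi, 7200)))
print(image.shape)  # (224, 224, 3)
```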
2.6. Hyperparameter Selection and Training Procedure
In our study, we use the same hyperparameters as our previous studies [17,18]. For the SVM, we use the RBF kernel; the optimal values for the parameters C and sigma are 314.24 and 25.45, respectively. After tuning, we chose a max depth of 4 and 10 boosting rounds for XGBoost. For the two deep learning models, we use glorot_uniform as the initialization function for each convolution kernel. During training, cross-entropy is used as the loss function and Adam as the optimizer with a learning rate of 1 × 10⁻⁴. Each deep model is trained for 50 epochs, and the final network parameters are chosen from the epoch with the lowest loss on the validation set.
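The stated settings translate roughly into the following configuration sketch; note that scikit-learn parameterizes the RBF kernel by gamma rather than sigma, so the conversion gamma = 1/(2·sigma²) is our assumption, as are all variable names:

```python
# Sketch of the hyperparameters listed in Section 2.6 (not the authors' code).
from sklearn.svm import SVC
from xgboost import XGBClassifier

# RBF-kernel SVM with C = 314.24 and sigma = 25.45 (the gamma conversion is assumed).
svm = SVC(kernel="rbf", C=314.24, gamma=1.0 / (2 * 25.45 ** 2), probability=True)

# XGBoost with max depth 4 and 10 boosting rounds.
xgb = XGBClassifier(max_depth=4, n_estimators=10)

# Deep models (1D Resnet-34 / 2D Resnet-18): Glorot-uniform kernel initialization,
# cross-entropy loss, Adam with learning rate 1e-4, 50 epochs, and selection of
# the epoch with the lowest validation loss.
DEEP_TRAIN_CONFIG = {
    "kernel_initializer": "glorot_uniform",
    "loss": "cross_entropy",
    "optimizer": "adam",
    "learning_rate": 1e-4,
    "epochs": 50,
    "model_selection": "min_validation_loss",
}
```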
2.7. Artificial Label Noise
To simulate complex situations that can arise during the human labeling process, two common types of label noise were artificially introduced into the training data so that the impact of label noise on classification performance could be evaluated. One of the principal sources of label noise is fatigue or environmental distraction during the labeling process. Fatigue and environmental distractions can be considered random because they do not affect one class more than another, and their occurrence depends not on the data but on the characteristics of the annotator and/or the conditions under which the annotator is working. This kind of noise was simulated using random label noise, with different percentages of randomly selected labels flipped to their counterparts (namely random flipping, i.e., for a selected sample, if the true quality label is 'good', it is flipped to 'bad', and vice versa). To evaluate the impact of various levels of label noise, 10%, 20%, 30%, and 50% of the labels in the training dataset were flipped to generate different versions of the training set.
Another common type of label noise is class-dependent noise, which arises from the intrinsic bias of annotators [20]: the annotator tends to label one specific class incorrectly. To replicate noise of this nature, we selected a percentage of samples from only one class and flipped their labels. This procedure was repeated for the other class to see whether the impact on performance differed depending on which class carried the bias. Again, 10%, 20%, 30%, and 50% of the labels were flipped for each class (i.e., good-to-bad flipping and bad-to-good flipping) to produce different levels of label noise.
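Both noise-injection schemes reduce to flipping a controlled fraction of binary labels; a minimal sketch, assuming labels are encoded as a 0/1 NumPy array (1 = 'good', 0 = 'bad'), is given below:

```python
# Sketch of the two label-noise schemes in Section 2.7. The 0/1 label encoding
# and function names are assumptions.
import numpy as np

def flip_random(labels: np.ndarray, fraction: float, seed: int = 0) -> np.ndarray:
    """Random flipping: flip a `fraction` of all labels (good <-> bad)."""
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    idx = rng.choice(len(labels), size=int(fraction * len(labels)), replace=False)
    noisy[idx] = 1 - noisy[idx]
    return noisy

def flip_class(labels: np.ndarray, fraction: float, source_class: int,
               seed: int = 0) -> np.ndarray:
    """Class-dependent flipping: flip a `fraction` of the labels of one class only
    (good-to-bad or bad-to-good flipping)."""
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    candidates = np.flatnonzero(labels == source_class)
    idx = rng.choice(candidates, size=int(fraction * len(candidates)), replace=False)
    noisy[idx] = 1 - noisy[idx]
    return noisy

# Example: 30% random noise and 50% bad-to-good noise on a toy label vector.
y = np.array([1, 1, 1, 0, 0, 1, 0, 1, 1, 0])
print(flip_random(y, 0.3), flip_class(y, 0.5, source_class=0))
```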
2.8. Experimental Design
Three types of experiments are designed to evaluate the impact of label noise on model performance. The objective of the first experiment is to compare the performance deterioration caused by label noise across the four models. The four classifiers (SVM, XGBoost, 1D Resnet, and 2D Resnet) are trained on training sets with three types of label noise (random flipping, good-to-bad flipping, and bad-to-good flipping) and four levels (10%, 20%, 30%, and 50%) of label noise and tested on the independent test set (see Figure 1). For each combination, the performance of the trained prediction model is reported. We use accuracy as the main performance metric in this study; other metrics (sensitivity, specificity, positive predictive value, and negative predictive value) are provided in the Supplemental Materials. In the second experiment, the objective is to compare the impact of the three types of label noise on a single model. The same training process as in the first experiment is implemented, but only on one model. The third experiment is designed to investigate the influence of label noise on the model output probability. Besides predicting a binary label, the model output probability represents the model's confidence in its prediction. With annotated artifact proportions available, the third experiment examines the relationship between the artifact proportion and the output probabilities of different models. A 5-fold cross-validation was conducted in each experiment, and the mean and variance of accuracy are reported in the results section.
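The first experiment amounts to a grid over noise types and noise levels for each classifier; a minimal sketch of that evaluation loop is shown below (the 5-fold cross-validation and the model-specific training pipelines are abstracted behind placeholder callables):

```python
# Sketch of the noise-type x noise-level evaluation grid from Section 2.8.
# `train_model` and `inject_noise` are placeholders, not the actual pipelines.
from sklearn.metrics import accuracy_score

NOISE_TYPES = ["random", "good_to_bad", "bad_to_good"]
NOISE_LEVELS = [0.0, 0.1, 0.2, 0.3, 0.5]

def evaluate_grid(X_train, y_train, X_test, y_test, train_model, inject_noise):
    """Return {(noise_type, level): test accuracy} for one classifier."""
    results = {}
    for noise_type in NOISE_TYPES:
        for level in NOISE_LEVELS:
            y_noisy = inject_noise(y_train, noise_type, level)
            model = train_model(X_train, y_noisy)
            results[(noise_type, level)] = accuracy_score(y_test, model.predict(X_test))
    return results
```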
4. Discussion
Label noise is an inevitable issue in supervised learning tasks, and it can originate from many sources. In this study, the annotation process was undertaken by domain experts, and the potential reasons for erroneous labels can be identified and summarized as random label noise and class-dependent label noise. Random label noise simulates the errors that arise when fatigue sets in, leading to occasional random assignment of class labels, while class-dependent label noise simulates the labeling errors generated by annotator bias arising from misinterpretation of the annotation rules. In this study, a binary classification framework was adopted to assess and classify the signal quality of 30-s PPG records. Three types of label noise were simulated by controlling the type and amount of artificially flipped labels in the training sets, which were used to train different ML and DL classifiers. Such an experimental setup permits evaluation of how tolerant different prediction models are to label noise, how different types of label noise affect model performance, and the subsequent effect on the confidence of model-predicted labels.
Between the two conventional ML algorithms, the performance of XGBoost is less degraded by all three types of label noise than that of the SVM, although they share the same input (the hand-crafted features from our previous study [14]). The core of SVM is to use a kernel to project non-linearly separable features into a high-dimensional space where they become separable, and we do observe very similar performance when the labels are all correct. However, when noisy samples are introduced, the labels of the support vectors are likely to be polluted, and the decision boundary is therefore skewed. Compared with SVM, XGBoost shows better robustness because of its ensemble mechanism. A detailed mathematical derivation in [21] has demonstrated the relation between boosting and SVM, reporting that in high-dimensional space the margins of XGBoost and SVM are drastically different: XGBoost tends to have a much larger margin, so the L1 norm in the loss function can ignore irrelevant features, making it more robust to label noise.
Both deep and traditional machine learning algorithms suffer from label noise. However, the deep neural networks proved to be more accurate and robust against all three types of label noise we studied. As shown in Figure 2, when there was no label noise in the training data, the two deep classifiers performed slightly better than the SVM. When the level of label noise was increased, the deep classifiers not only maintained their superiority but became proportionally better than the SVM. The results support the broader hypothesis that deep learning models outperform traditional machine learning algorithms in terms of predictive power and robustness against label noise in the training data. They also suggest that deep models hold greater promise for future work when a large dataset is available and label noise is a clear concern in the training data. There are two potential reasons. The first, proposed in [9], is that gradient updates can easily cancel out for noisy samples, whereas gradients from clean samples are more likely to contribute to the learning process. The second is the early stopping mechanism, which terminates the training process if the loss on the validation set does not decrease for six consecutive epochs; this mechanism can largely avoid overfitting and the resulting performance degradation.
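For concreteness, the early-stopping rule described above can be sketched as follows (a framework-agnostic illustration with hypothetical names, not the authors' training code):

```python
# Sketch: stop training once the validation loss has failed to improve for six
# consecutive epochs.
def should_stop(val_losses, patience: int = 6) -> bool:
    """val_losses: validation loss per completed epoch, in order."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before
```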
Between the two deep learning models, the 2D model was more robust than the 1D model. When we increased the noise level from 20% to 30% for the bad-to-good type of noise, the 1D model showed an unexpected improvement in accuracy. Nonetheless, the 2D model remained slightly superior to the 1D model. One plausible reason could be that the 1D model has 34 layers, which is deeper than the 18 layers of the 2D model. Under the bias-variance tradeoff, a deeper model can more easily adapt to variance in the training data, reducing generalizability and increasing error on the test data. The better performance of the 1D model when there was no label noise further supports this hypothesis.
A significant difference in performance was also found among the three types of noise when the same classifier was adopted. As shown in Figure 3, when the absolute number of noisy samples in the training data was 10,000, bad-to-good flipping resulted in roughly 9% more reduction in accuracy than the other two types of noise. One plausible explanation is the class imbalance in the original training data, with only 30% of the samples labeled as bad quality; the bad-to-good flipping scheme exacerbates this imbalance, reducing classifier performance. Additionally, changing labels from bad to good quality breaks the restrictive definition of good quality signals. Specifically, introducing poor quality signals into the good-quality class sabotages the classifier's ability to learn discriminative features that truly separate the two quality conditions, which in turn leads to a dramatic decrease in classification performance.
The calibration of the classifiers was also vulnerable to label noise, an aspect that has been studied even less. We found that even 10% of each type of noise could significantly change the classifier output probability (see Figure 4). Because of the restrictive rules imposed for good quality signals in our study, a 30-s PPG record with even a small percentage of artifacts would still be labeled as bad quality. This restriction is reflected in Figure 4a, where most samples with a nonzero artifact percentage are assigned a small probability of being good quality signals. However, when 10% label noise was introduced into the training data (Figure 4b–d), many samples with a nonzero artifact percentage were assigned a higher probability of being good quality, resulting in a more spread-out pattern of the noise-probability map. The results suggest that label noise alters model probability calibration and that a more sophisticated strategy is required for recalibrating the classifier output.
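The noise-probability maps in Figure 4 can be reproduced, in spirit, with a simple scatter of annotated artifact proportion against the model's predicted probability of good quality; the sketch below uses placeholder arrays for the test-set annotations and one model's outputs:

```python
# Sketch: artifact proportion vs. predicted probability of 'good quality' for the
# test set. Input array names are placeholders.
import matplotlib.pyplot as plt

def noise_probability_map(artifact_proportion, prob_good, title=""):
    """Scatter plot relating annotated artifact proportion to model confidence."""
    plt.figure(figsize=(4, 4))
    plt.scatter(artifact_proportion, prob_good, s=5, alpha=0.4)
    plt.xlabel("Artifact proportion")
    plt.ylabel("Predicted probability of good quality")
    plt.title(title)
    plt.tight_layout()
    plt.show()
```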
Limitations
The current study evaluates the impact of label noise on the accuracy and calibration of binary classifiers, using PPG signal quality as a case study. Restrictive definitions for labeling physiological events are, however, highly common in the medical domain. In this study, the definition of the 'good quality' class is also strict, which skews the distribution of model output probabilities toward the bottom left rather than being evenly distributed. The conclusions of this work may therefore be limited to contexts where the class labels are based on a very specific definition. As the study was limited to binary classification, a similar study focused on multiclass classification would also expand the range of possible findings. The current study focused on several types of label noise that can be found in medical data; however, their combination was not studied here, since the main objective was to compare the robustness of different types of learning models to each individual type of label noise. The impact of combined label noise on the learning models' performance will be assessed in future work.
This study focused on using the quality assessment of individual 30-s PPG records as a problem to study label noise. This enabled us to explore the influence of different types of noise potentially arising from the human annotation process. The generalizability of our conclusions would be reinforced by examining a wider range of PPG-based tasks. It is also the case that different tasks may be susceptible to different types of noise in the data set. This suggests a need for a more task-specific investigation.
The current work was dedicated to assessing the robustness of the previously developed classification models; however, other promising approaches, such as transformers, have recently been proposed for sequential time series. Transformers and other AI-based models should be further explored in search of the best classifier for quality assessment. In addition, since our results show that deep learning methods are more robust to label noise, our future work will also focus on further improving the robustness of deep neural networks.