4.1. Consistency-Based Methods
Relation-driven Self-Ensembling Model (SRC-MT) [19]: This method employs the Mean Teacher framework with a consistency-enforcing approach. In addition, it explores the intrinsic relation between images, which is generally overlooked in consistency-based methods, including Mean Teacher. Studying the relation between images in an unsupervised fashion helps extract useful information from the unlabeled data [47,48]. SRC-MT proposes a novel Sample Relation Consistency (SRC) paradigm that enforces the consistency of the relation between two images after perturbation; if two images are similar before perturbation, this relation should be preserved after they are perturbed. In other words, the relationship between input samples $x_i$ and $x_j$ must be the same as the relationship between the perturbed samples $\tilde{x}_i$ and $\tilde{x}_j$. Thus, this method enforces consistency in both the predictions and the sample relations after perturbation. The total objective function of the framework is $\mathcal{L} = \mathcal{L}_s + \lambda\,\mathcal{L}_u$, wherein $\mathcal{L}_s$ is the supervised objective and $\mathcal{L}_u$ the unsupervised objective. $\mathcal{L}_u$ is made up of the standard consistency loss $\mathcal{L}_c$ and the relation consistency loss $\mathcal{L}_{src}$, i.e., $\mathcal{L}_u = \mathcal{L}_c + \beta\,\mathcal{L}_{src}$. $\lambda$ is the trade-off weight between the unsupervised and supervised losses, and $\beta$ is a hyperparameter to balance $\mathcal{L}_c$ and $\mathcal{L}_{src}$.
SRC-MT was tested on two public benchmark medical image datasets: ChestX-ray14 for multi-label thorax disease classification and ISIC2018: Skin Lesion Analysis for single-label classification. Perturbations such as random rotation, horizontal and vertical flips, and translation were applied to the input images for consistency enforcement. For the skin lesion dataset, the images were all resized to 224 × 224, and the dataset was divided into 70% training, 10% validation, and 20% testing. DenseNet121 [23], pre-trained on ImageNet [49], was used as the network backbone. For the ChestX-ray14 dataset, the images were resized to 384 × 384, and the same data split as for the previous dataset was used, with DenseNet169 [23], pre-trained on ImageNet, as the network backbone. For comparison, the upper-bound and baseline performances were calculated; the upper bound was the performance of the fully supervised model trained on 100% labeled data, and the baseline was the performance of the fully supervised model trained on 20% of the data. Additionally, the model was compared to other semi-supervised learning methods, including a self-training-based method [50], SS-DCGAN [51], Temporal Ensembling, and Mean Teacher. The training set used for the proposed method consisted of 20% labeled and 80% unlabeled data. The metrics used to gauge the performance of the model for single-label classification were AUC, sensitivity, specificity, accuracy, and F1 score. For the single-label classification task on the skin lesion dataset, SRC-MT achieved the highest AUC, sensitivity, accuracy, and F1 scores compared to the baseline and the other semi-supervised models, as well as the second-highest specificity. Its AUC score (93.58) was 3% higher than the baseline and 2% below the upper bound, and its F1 score (60.68) was 8% higher than the baseline and 10% below the upper bound. For multi-label classification on the ChestX-ray14 dataset, the metric used was AUC, and the upper bound, the baseline, and the GraphXNET [34] method were used for comparison. The AUC scores were calculated for each of the 14 labels and averaged. SRC-MT achieved a score of 79.23, approximately 0.4% higher than Mean Teacher, 1.2% higher than GraphXNET, 2% higher than the baseline, and 2% lower than the upper bound.
NoTeacher (NoT) [52]: In the Mean Teacher, the consistency target, which is the teacher, relies on the EMA of the student; in other words, the teacher's weights are an ensemble of the student's weights. As such, the weights of the student and teacher are tightly coupled. This leaves the model vulnerable to confirmation bias [53]; the teacher continues to reinforce what it already believes. NoTeacher is a consistency-based SSL framework that addresses this issue by employing two independent networks, thus removing the need for the time-averaging component. The NoTeacher framework works as follows: two random augmentations are applied to an input $x$, which can be a 2D or 3D sample. These augmentations result in samples $x_1$ and $x_2$, which become the inputs of $F_1$ and $F_2$, two networks with similar architectures. The outputs are named $f_1^L$ and $f_2^L$ for labeled inputs and $f_1^U$ and $f_2^U$ for unlabeled inputs. Next, a loss function is calculated to enforce prediction consistency between $F_1$ and $F_2$. This loss function consists of a consistency loss and a supervised cross-entropy loss; if $x_1$ and $x_2$ are augmented versions of the same input $x$, then the output resulting from $x_1$ and the output resulting from $x_2$ must be similar. Additionally, if $x$ is a labeled input, then the outputs of both networks should match the ground truth target $y$. Lastly, the total loss is backpropagated to update the network parameters. The NoTeacher method resembles the Mean Teacher in that both employ two networks with similar architectures. However, NoTeacher introduces two differences: first, it removes the EMA, making the networks entirely independent, and second, it bases its loss function on a graphical model. The graphical model involves a graph consisting of the two networks' outputs and the target as nodes, each connected to a consensus function $F_c$ which enforces mutual agreement of the outputs on labeled and unlabeled data. The three datasets used for testing this model were Chest X-ray14, Radiological Society of North America (RSNA) Brain CT, and the Knee MRNet dataset containing knee MRI exams. The backbone networks used for each dataset were DenseNet121 [23], DenseNet169 [23], and MRNet [54], respectively, all pre-trained on ImageNet. All inputs were normalized based on ImageNet statistics before training. For all the datasets, the experiment was repeated six or seven times with different percentages of the dataset being labeled. The Chest X-ray14 dataset was used for multi-label classification. It was divided into 70% training, 10% validation, and 20% testing. The AUROC scores of the method were compared to the supervised baseline as well as other semi-supervised methods, including Mean Teacher, VAT, Pseudo-labeling, and the previously discussed SRC-MT. The NoTeacher method consistently achieved higher scores than the previous methods. It achieved a score of 82.10 at 50% labeled data, 1% less than the supervised model at 100% labeled data. The SRC-MT method attained lower scores than MT at all labeling budgets below 20%, but a higher score at 20%. The RSNA Brain CT dataset was used for multi-label classification. It includes 19,520 CT brain scans from the RSNA 2019 challenge. Each slice in these scans is labeled with five binary labels corresponding to the absence or presence of five different forms of hemorrhage. 14.3% of the images include the presence of at least one form of hemorrhage, with 30.1% of those including multiple forms; 85.7% of the data are samples with no abnormalities. The dataset was divided into 60% training, 20% validation, and 20% testing. NoTeacher achieved consistently higher scores than both the supervised baseline and MT. It also achieved overall higher scores than the other semi-supervised methods. On the other hand, MT provided lower scores than the supervised baseline at higher percentages of labeled data. Lastly, the Knee MRNet dataset consists of 1370 data points labeled with the presence or absence of abnormality, leading to a single-label classification task. 80.6% of the data points contain abnormalities. 1130 training scans and 120 testing scans were used, with 20% of the training scans being randomly sampled for the validation set. Similar to the previous experiments, NoTeacher achieved higher scores than both the supervised baseline and the other semi-supervised methods. Additionally, at 50% labeled data, NoTeacher managed to achieve a score higher than the supervised baseline at 100% labeled data.
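The following is a simplified sketch of a NoTeacher-style training step, under the assumption that the consensus term can be approximated by a plain mean-squared error between the two networks' predicted probabilities; the actual NoT loss is derived from the graphical model described above, so this stand-in only illustrates the two-independent-networks idea.

```python
import torch
import torch.nn.functional as F

def noteacher_step(f1, f2, x1, x2, y=None, cons_weight=1.0):
    """One simplified step: two independent networks f1 and f2 receive two
    augmentations (x1, x2) of the same input and are trained to agree.
    An MSE between predicted probabilities stands in for the paper's
    graphical-model consensus term."""
    logits1, logits2 = f1(x1), f2(x2)
    p1, p2 = torch.softmax(logits1, dim=1), torch.softmax(logits2, dim=1)
    loss = cons_weight * F.mse_loss(p1, p2)          # enforce agreement between networks
    if y is not None:                                # labeled batch: both must match ground truth
        loss = loss + F.cross_entropy(logits1, y) + F.cross_entropy(logits2, y)
    return loss
```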
Self-supervised Mean Teacher for Semi-supervised Learning (S²MTS²) [55]: S²MTS² is a consistency-based SSL method for chest x-ray classification, which uses the Mean Teacher framework. This method consists of two learning stages. In the first stage, the student-teacher model is pre-trained in a self-supervised manner on the labeled and unlabeled data using joint contrastive learning (JCL) [41]. JCL involves learning an infinite number of key-query pairs generated from unlabeled data with the goal of enforcing dependencies between different key-query pairs containing the same query, which leads to more consistent representations in each instance-specific class [41]. Thus, each query $q$, paired simultaneously with multiple of its positive keys $k_m^+$, must return a low loss value. The loss of each query–key pair is a contrastive objective in which $\tau$ is the temperature hyperparameter, $k_m^+$ is the $m$-th positive key of $q$, and $k_j^-$ is the $j$-th negative key of $q$. The total JCL loss is obtained by averaging this pairwise loss over $\mathcal{D}$, the set of labeled and unlabeled images, with $M$ positive keys per query. In the second stage, the pre-trained student-teacher model is fine-tuned using the original Mean Teacher approach by maintaining an EMA following Equation (1). The datasets used for testing were Chest X-ray14, ISIC2018 Skin Lesion Analysis, and CheXpert. The CheXpert dataset contains 220,000 images labeled with 14 diseases, wherein each image can have more than one label. The backbone model used for all datasets was DenseNet121 [23], with the two-layer multi-layer perceptron (MLP) projection head replaced with a three-layer MLP. The Chest X-ray14 images were resized to 512 × 512, and the ISIC2018 and CheXpert images were resized to 224 × 224. The data augmentations used were random resize, random crop, random horizontal flipping, and random rotation. The Chest X-ray14 dataset was evaluated at different percentages of labeled data (2%, 5%, 10%, 15%, and 20%), resulting in AUC scores of 74.69, 78.96, 79.90, 80.31, 81.06, and 82.51. For the CheXpert dataset, the model achieved AUC scores of 66.15, 67.85, 70.83, 71.37, and 71.58 for 100, 200, 300, 400, and 500 labeled samples, respectively. Lastly, the results obtained on the ISIC2018 dataset with 20% labeled samples were an AUC score of 94.71 and an F1 score of 62.67.
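As an illustration of the first-stage contrastive pre-training, the snippet below implements a generic multi-positive, InfoNCE-style loss in which every (query, positive key) pair must score a low loss against a shared set of negatives. It is a hedged approximation: JCL's published objective is a closed-form bound over the positive-key distribution, so the averaged form here only conveys the intuition, and the temperature value is illustrative.

```python
import torch
import torch.nn.functional as F

def multi_positive_contrastive_loss(q, pos_keys, neg_keys, tau=0.07):
    """Generic multi-positive InfoNCE-style loss for a single query.
    q: (D,) query embedding; pos_keys: (M, D); neg_keys: (K, D)."""
    q = F.normalize(q, dim=0)
    pos = F.normalize(pos_keys, dim=1) @ q / tau      # (M,) positive logits
    neg = F.normalize(neg_keys, dim=1) @ q / tau      # (K,) negative logits
    # one cross-entropy term per positive key, averaged over the M positives
    losses = [-pos[m] + torch.logsumexp(torch.cat([pos[m:m + 1], neg]), dim=0)
              for m in range(pos.shape[0])]
    return torch.stack(losses).mean()
```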
A summary of the methods discussed in this section is provided in Table 1. Three consistency-based methods were reviewed: SRC-MT, NoT, and S²MTS². For each model, the table contains the datasets used for testing as well as the best performance score obtained on each dataset.
4.2. Adversarial Methods
Uncertainty-Guided Virtual Adversarial Training with Batch Nuclear-Norm Optimization [56]: This method incorporates batch nuclear-norm (BNN) optimization [57] to prevent overfitting to the labeled data and to improve the diversity and discriminability of the model. As proposed by Cui et al. [57], the nuclear norm $\|P\|_\star$ of the $m \times n$ prediction matrix $P$ is calculated as $\|P\|_\star = \sum_{i=1}^{\min(m,n)} \sigma_i(P)$, where $\sigma_i(P)$ represents the $i$-th largest singular value of $P$. As mentioned previously, the goal of incorporating BNN optimization is to improve generalization and avoid overfitting the labeled data. This is achieved by minimizing the BNN of the labeled data batch and maximizing the BNN of the unlabeled data batch. Thus, the labeled BNN loss $\mathcal{L}_{bnn}^{L}$ and the unlabeled BNN loss $\mathcal{L}_{bnn}^{U}$ are defined from the nuclear norms $\|P_L\|_\star$ and $\|P_U\|_\star$ of the labeled and unlabeled prediction matrices, scaled by $N_L$ and $N_U$, the sizes of the labeled and unlabeled datasets, with the unlabeled term entering with a negative sign since it is to be maximized. In addition to BNN, the model incorporates uncertainty guidance during VAT loss computation to filter out unlabeled samples close to the decision boundary. The uncertainty $u$ is calculated for each unlabeled sample in a batch, and predictions with high uncertainty are filtered out to ensure that only reliable targets are used for learning. $u$ is the predictive entropy $u = -\sum_{j=1}^{C} p_j \log p_j$, with $p_j$ representing the predicted probability of the sample for the $j$-th class and $C$ denoting the number of classes. Thus, the model is trained using the following losses: the cross-entropy loss of the supervised model ($\mathcal{L}_{ce}$), the VAT loss calculated from labeled data ($\mathcal{L}_{vat}^{L}$), the uncertainty-guided VAT loss computed from the unlabeled data ($\mathcal{L}_{uvat}^{U}$), and the BNN losses $\mathcal{L}_{bnn}^{L}$ and $\mathcal{L}_{bnn}^{U}$. The overall loss for the labeled data is the weighted combination of all losses calculated over the labeled data; similarly, the unlabeled loss is the weighted combination of the losses calculated over the unlabeled data, with weighting coefficients balancing the individual terms. The overall objective function is the sum of the supervised and unsupervised losses. The model was tested on the ISIC2018 and Chest X-ray14 datasets as well as an in-house collected hip x-ray dataset composed of 26,075 x-ray images labeled into four categories. The ISIC2018 and hip x-ray images were resized to 224 × 224. DenseNet121 [23], pre-trained on ImageNet, was used as the backbone network for both datasets. The Chest X-ray14 images were resized to 256 × 256, with DenseNet169 [23] as the backbone network. The model achieved an AUC score of 96.04 and an F1 score of 69.67 on the ISIC2018 dataset with 20% labeled data. On the Chest X-ray14 dataset, the method achieved AUC scores of 69.75, 74.50, 77.52, 79.49, and 80.69 at labeled percentages of 2%, 5%, 10%, 15%, and 20%. Lastly, on the hip x-ray dataset it achieved an AUC score of 92.43 and an F1 score of 72.39 with 20% labeled data. Other state-of-the-art methods such as MT, SRC-MT, and VAT were also tested on this dataset, achieving AUC scores of 92.00, 91.40, and 91.66, and F1 scores of 70.52, 70.90, and 70.57, respectively.
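The two auxiliary ingredients of this method, batch nuclear-norm optimization and entropy-based uncertainty filtering, can be sketched as follows (a minimal illustration; the normalization constants and the uncertainty threshold are assumptions, not the paper's exact settings).

```python
import torch

def batch_nuclear_norm_losses(probs_labeled, probs_unlabeled):
    """Batch nuclear-norm terms: minimize the nuclear norm of the labeled batch's
    prediction matrix and maximize it for the unlabeled batch (hence the sign)."""
    bnn_l = torch.linalg.svdvals(probs_labeled).sum() / probs_labeled.shape[0]
    bnn_u = -torch.linalg.svdvals(probs_unlabeled).sum() / probs_unlabeled.shape[0]
    return bnn_l, bnn_u

def entropy_uncertainty(probs, eps=1e-8):
    """Per-sample predictive entropy u = -sum_j p_j log p_j, used to filter out
    unlabeled samples lying too close to the decision boundary."""
    return -(probs * (probs + eps).log()).sum(dim=1)

def reliable_mask(probs, threshold=1.0):
    # keep only unlabeled samples whose uncertainty falls below the threshold
    return entropy_uncertainty(probs) < threshold
```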
Semi-supervised Adversarial Classification (SSAC) [58]: This GAN-based method involves a reconstruction network R and a supervised classification network C. Learnable transition (T) layers are used to transfer the image representation ability learned by network R to C. R is an adversarial autoencoder-based unsupervised network consisting of a generator G and a discriminator D. G contains an encoder and a decoder. The encoder, which has an architecture similar to ResNet [59], takes 64 × 64 patches, and the decoder generates reconstructed patches of the same size. D is a deep convolutional neural network containing four convolutional layers. C contains two parts, one with the same architecture as the encoder of R and one a fully connected layer with two neurons, separated by a global average pooling (GAP) layer. There is no parameter sharing between R and C; the learnable T layers, each consisting of a 1 × 1 convolutional layer, transfer the feature maps obtained by R to the corresponding blocks in C. In the experimentation, C was pre-trained on ImageNet, and R was pre-trained on both the labeled and unlabeled data. The loss function combines three terms, weighted by a parameter $\lambda$ and computed over the input samples $x_m$: the mean square reconstruction loss of G, the cross-entropy adversarial loss of D, and the supervised classification loss. Two datasets were used for experimentation. The first dataset was a database of lung nodules on CT scans from the Lung Image Database Consortium (LIDC) and the Image Database Resource Initiative (IDRI), called LIDC-IDRI. The second dataset was the Tianchi Lung Nodule Detection dataset. The LIDC-IDRI dataset contains 1018 chest CT scans from which the nodules of diameter 3 mm to 30 mm were identified and annotated. From the Tianchi Lung Nodule Detection dataset, 1000 patients' data was used to extract 1227 unlabeled nodules. In total, the combination of the two datasets resulted in 644 malignant, 1301 benign, and 1839 unlabeled samples, making this a single-label classification task. For this task, the MK-SSAC model was constructed, which consists of nine knowledge-based collaborative SSAC models (KBC-SSAC) trained on the patches extracted from the lung nodules, whose predictions are subsequently combined by two output neurons. The model was tested five independent times with 10-fold cross-validation. The metrics calculated to gauge the results were accuracy, sensitivity, specificity, and AUC. The performance of the model was compared to other SSL methods (MK-CatGAN, MK-AAE, MK-Ladder Network), hand-crafted feature-based methods (3D GLCM + SVM, MVF + SVM), and DCNN-based methods (Fuse-TSD, MV-KBC). The average scores achieved by the model were an accuracy of 92.53, a sensitivity of 84.94, a specificity of 96.28, and an AUC of 95.81. The model obtained the highest accuracy and AUC scores as well as the second-highest sensitivity and specificity among all the methods compared. Additionally, the MK-SSAC model was tested with different percentages of the labeled data (100%, 80%, 60%, 40%, and 20%). From 20% to 100%, the scores increased by about 2% for accuracy and 3% for AUC. In comparison, the supervised MK-C model was tested with the same percentages of labeled data and showed an increase of 7% in accuracy and 5% in AUC. These results show that the MK-SSAC model is relatively robust to variation in the amount of labeled data.
Retinal Image Synthesis and Semi-Supervised Learning for Glaucoma Assessment (SS-DCGAN) [51]: SS-DCGAN is a model created to synthesize retinal images and determine the presence or absence of glaucoma. This model is based on the DCGAN architecture with a modification in the last output layer of D: one neuron for synthesis and three neurons for training the glaucoma classifier. D thus becomes a classifier that has to label each sample as either Normal, Glaucoma, or Synthetic. The loss of the method combines two terms: the cross-entropy loss function over the $K$ classes, and the adversarial loss given by GAN's two-player minmax game, in which $D(x)$ is the probability of $x$ being from the real data and $D(G(z))$ is the probability of a sample being from the generator. The dataset used for experimentation was a combination of fourteen public databases with a total of 86,926 images. Samples with glaucoma or normal labels were divided into a 70% training set (669 glaucoma and 981 normal) and a 30% testing set (287 glaucoma and 420 normal). The labeled training set, as well as all unlabeled images (84,569 samples), were used to train the model. The metrics calculated were AUC, F1-score, specificity, and sensitivity. The final results were compared to three previous models: ResNet50 [59] as well as the CNNs proposed by Alghamdi et al. [60] and Chen et al. [61]. SS-DCGAN achieved the highest AUC (0.90) and F1-score (0.84), as well as the second-highest specificity and sensitivity scores among the models.
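A rough sketch of such a semi-supervised GAN discriminator objective is given below, assuming three output classes (Normal, Glaucoma, Synthetic) and a split into a supervised cross-entropy term and an adversarial real-versus-synthetic term; the exact weighting and formulation used by SS-DCGAN may differ.

```python
import torch
import torch.nn.functional as F

# Class indices assumed for illustration: 0 = Normal, 1 = Glaucoma, 2 = Synthetic.
def discriminator_loss(d, x_labeled, y_labeled, x_unlabeled, x_fake):
    """Sketch of a semi-supervised GAN discriminator objective: supervised
    cross-entropy on labeled retinal images plus adversarial terms that push
    real (unlabeled) images away from, and generated images toward, the
    'Synthetic' class."""
    synthetic = 2
    loss_sup = F.cross_entropy(d(x_labeled), y_labeled)
    p_real = torch.softmax(d(x_unlabeled), dim=1)[:, synthetic]
    p_fake = torch.softmax(d(x_fake), dim=1)[:, synthetic]
    loss_adv = -(torch.log(1.0 - p_real + 1e-8) + torch.log(p_fake + 1e-8)).mean()
    return loss_sup + loss_adv
```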
Bi-Modality Medical Image Synthesis Using Semi-Supervised Sequential Generative Adversarial Networks [62]: Multi-modality imaging involves the simultaneous usage of two or more imaging modalities for one examination [63]. For example, single photon emission computed tomography (SPECT), magnetic resonance imaging (MRI), and positron emission tomography (PET) can be used to combine optical, magnetic, and radioactive reporters to detect abnormalities in the epileptic brain. Common examples of bi-modality images are PET-SPECT and PET-CT [
63]. The model proposed by Yang et al. was intended to synthesize high-quality bi-modality medical images using GANs by creating two sequential generative networks, each corresponding to one modality. The first modality in the sequence is the one with the least complexity, measured automatically by a complexity measurer algorithm. The goal of this is to decrease the difficulty of generating the second, more complex modality by conditioning it on the first. The proposed generator network is trained via semi-supervised learning in order to generate realistic images with a large diversity. The supervised learning model learns the joint distribution of the different modalities, while the unsupervised learning model learns the marginal distribution of the different modalities via adversarial learning. The architecture of the generator is as follows: first, a real image of modality 1 is encoded into a low-dimensional latent vector, which is then mapped to a fake image of the same modality by a decoder. For the second image, an image-to-image translator is used to create a fake image of modality 2 with the help of information gained from the previously generated fake image of modality 1. In the supervised training, the original paired images are provided, and so for each pair of fake images generated by the generator, the corresponding true pair in the original data can be found. Thus, the loss function of the supervised training is the pixel-wise reconstruction loss:
the average Manhattan distance, calculated pixel-wise between image intensities, between each fake image $\hat{x}_1$ and $\hat{x}_2$ and its corresponding real image $x_1$ and $x_2$. In order to counter a severe overfitting issue present in the supervised learning model due to the small number of labeled images available, an unsupervised learning model is also deployed. In the unsupervised model, the generator is trained using unpaired images and noise vectors rather than encodings. The model aims to minimize the Wasserstein (W) distances between the fake and real images. As such, the loss function of the unsupervised generator is the sum of the W distance between the real and fake images of modality 1 and the W distance between the real and fake images of modality 2. The semi-supervised training is carried out as follows: in one iteration, the generator is trained in a supervised manner with a set of paired training images. In the next iteration, the decoder and image translator are trained in an unsupervised manner using a set of unpaired images. The supervised and unsupervised training are alternated for 40,000 iterations. The usage of supervised learning allows for the creation of correctly paired images, while unsupervised training allows for higher realism and diversity. The images generated by this model were used as real training data in a single-label prostate cancer classification task in which each pair of images is classified as clinically significant (CS) or non-clinically significant (nonCS). The classifier was trained using 483 synthetic multimodal images of type ADC-T2w, and the testing set contained 50 real CS images and 50 real nonCS images. The evaluation metrics used were the Inception Score (IS) and Fréchet Inception Distance (FID) for quality assessment of the synthetic images, Mutual Information Distance (MID) to assess the correctness of the pairings between ADC and T2w synthetic images, and finally, the accuracy of the classification task. When compared with four state-of-the-art GAN-based image synthesis methods (Costa et al. [64], CoGAN [65], CycleGAN [66], and pix2pix [67]), the proposed model achieved better results in all metrics, with a classification accuracy of 93%.
A summary of the methods discussed in this section is provided in Table 2. Four adversarial methods have been reviewed: Uncertainty-Guided VAT with BNN, SSAC, SS-DCGAN, and Bi-Modality Image Synthesis. For each model, the table contains the specific adversarial approach (VAT-based or GAN-based), the datasets used for testing, as well as the best performance score obtained on each dataset.
4.3. Graph-Based Methods
GraphXNET: GraphXNET is a semi-supervised graph-based framework that performs a classification task with an extremely small number of labeled samples and a large number of unlabeled samples. The energy function used in this model is based on the normalized graph p-Laplacian with p = 1. The algorithm is as follows: for each class $k$, the model assumes that there exists a set of labeled nodes $I_k$. For each class $k$, a variable $u_k$ is chosen that has values at all nodes of the graph. Assume the total number of classes is $L$. For all unlabeled nodes, the $L$ chosen variables are coupled by a constraint forcing their values to sum to zero, and a small positive number $\epsilon$ is used to constrain the variables at labeled nodes to be consistent with their known class. The goal of the model is then to minimize the sum of the normalized ratios associated with the class variables $u_k$. This model was evaluated using the ChestX-ray14 dataset. The results of the classification task were compared to two previously existing state-of-the-art deep learning methods for x-ray classification proposed by Wang et al. [68] and Yao et al. [69]. For both methods, 70% of the data was labeled, whereas for the GraphXNET method, only 20% was labeled. The metric used for comparison was the average of the AUC score over all 14 classes. GraphXNET achieved the highest average score (0.78) despite using a much lower number of labeled samples. Additionally, when tested with three different random partitions of the data, GraphXNET proved to be more stable to variations in the data partition.
Graph-Embedded Random Forest [70]: This method makes the standard random forest algorithm more reliable when dealing with a low number of labeled samples. With the traditional method, insufficient training data leads to a limited depth, an inaccurate prediction model for leaf nodes, and a sub-optimal splitting strategy [71]. In the graph-based model, Gu et al. [70] improved the splitting strategy by replacing the information gain criterion with a graph-embedded entropy. The goal is to utilize the local structure of unlabeled data to achieve higher reliability when using a small number of labeled data while maintaining the advantages of the random forest, such as low computational burden and robustness towards overfitting. The loss function is a sum of the supervised loss and a graph Laplacian regularization term. First, a graph $G$ is constructed using labeled and unlabeled data. The nodes represent the training samples, and $W$ is the symmetric weight matrix encoding the pairwise affinities between them. Based on the label information for unlabeled samples gained from the graph embedding, a new information gain is computed at each split node $v$, with left and right child nodes $v_l$ and $v_r$ and splitting threshold τ, over both the labeled instances $x_i$ with class labels $y_i$ and the unlabeled instances $x_j$ reaching that node. The medical image datasets used for the classification task were the Digital Retinal Images for Vessel Extraction (DRIVE) 2D retinal vessel dataset and the Big Neuron 3D neuronal dataset. For both datasets, 40,000 samples were randomly selected for the training and testing sets, respectively, each containing 20,000 positive and 20,000 negative samples. The results were compared with the standard random forest trained on labeled data only, as well as the RobustNode method [71] trained on both labeled and unlabeled data. The results showed considerable improvement over both the standard random forest and the RobustNode method. For the DRIVE dataset, training with a labeled dataset size of 400 achieved an accuracy of 79.42%. For the Big Neuron dataset, training with 1500 labeled samples achieved an accuracy of 74.16%.
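As an illustration of the graph embedding that underlies the modified splitting criterion, the snippet below builds a symmetric k-nearest-neighbor affinity matrix over labeled and unlabeled samples with a Gaussian kernel; the kernel choice, k, and sigma are assumptions for illustration and may differ from Gu et al.'s exact construction.

```python
import numpy as np

def knn_affinity(X: np.ndarray, k: int = 10, sigma: float = 1.0) -> np.ndarray:
    """Symmetric weight matrix W over labeled + unlabeled samples (rows of X),
    built here with a Gaussian kernel on k-nearest-neighbor distances."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # (n, n) pairwise distances
    W = np.zeros_like(dists)
    for i in range(X.shape[0]):
        nbrs = np.argsort(dists[i])[1:k + 1]                          # skip the sample itself
        W[i, nbrs] = np.exp(-dists[i, nbrs] ** 2 / (2 * sigma ** 2))
    return np.maximum(W, W.T)                                         # symmetrize
```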
Colorectal Cancer Tissue Classification Using Semi-Supervised Hypergraph Convolutional Network [72]: Bakht et al. [72] propose a hypergraph-based approach for colorectal cancer (CRC) classification. Hypergraphs allow more complex relationships between nodes compared to standard graphs by allowing one edge to join multiple nodes. The images used for classification were CRC Whole Slide Images (WSIs), high-resolution images captured from a microscope slide representing tissue structures that can be used to identify malignancy. First, patches of size 224 × 224 are extracted from the images, and a VGG-19 [73] model in feed-forward mode is used to extract the feature matrix $X$ from the set of $n$ patches. The hypergraph $G$ is then constructed from $X$ by connecting each vertex with its $k$ nearest neighbors. The hypergraph is represented by a vertex-edge probabilistic incidence matrix $H$ of size $n \times n$, in which each entry is a probability computed from $d$, the Euclidean distance between the current node and the neighbor node, $d_{avg}$, the average Euclidean distance between the $k$ neighbors, and $P_{max}$, the maximum probability. The degrees of each vertex $d(v)$ and edge $\delta(e)$ are calculated from $H$, and the combination of all node and edge degrees results in the diagonal matrices $D_v$ and $D_e$. In the classification step, $X$ and $H$ are fed to a hypergraph neural network (HGNN) consisting of three hidden convolutional layers and a SoftMax classification layer. Spectral graph convolution is used for representation learning, $X^{(L+1)} = \sigma\left( D_v^{-1/2} H W D_e^{-1} H^{\top} D_v^{-1/2} X^{(L)} \Theta^{(L)} \right)$, where $\sigma$ is the activation function, $\Theta^{(L)}$ is a parameter to be learned during training, and $W$ is a diagonal matrix of ones (the edge weights). Layer $L$ outputs $X^{(L+1)}$, which is received as an input by layer $L+1$. The model was tested on a dataset of CRC WSIs containing 4995 patches of size 224 × 224 representing seven different tissue types. 70% of the dataset was used for training and 30% for testing. The model achieved an F1 score of 0.94 for the seven-class classification task.
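A compact sketch of one such hypergraph convolution layer is shown below, assuming the standard HGNN propagation rule with identity edge weights; the tensor shapes and the clamping of degrees are illustrative choices.

```python
import torch

def hgnn_layer(X, H, Theta, act=torch.relu):
    """One spectral hypergraph convolution layer (standard HGNN form, assumed here).
    X: (n, d) node features, H: (n, e) incidence matrix, Theta: (d, d') learnable weights.
    Edge weights W are taken as the identity ('a diagonal matrix of ones')."""
    Dv = torch.diag(H.sum(dim=1).clamp(min=1e-6) ** -0.5)   # vertex-degree^(-1/2)
    De = torch.diag(H.sum(dim=0).clamp(min=1e-6) ** -1.0)   # edge-degree^(-1)
    return act(Dv @ H @ De @ H.t() @ Dv @ X @ Theta)
```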
A summary of the methods discussed in this section is provided in Table 3. Three graph-based methods have been reviewed: GraphXNET, Graph-Embedded RF, and CRC classification with HGNN. For each model, the table contains the datasets used for testing as well as the best performance score obtained on each dataset.
4.4. Other Methods
The following section provides reviews of methods that do not neatly fit into the previous three categories.
Anti-curriculum Pseudo-labeling for Semi-supervised Medical Image Classification (ACPL) [37]: Liu et al. [37] propose a pseudo-labeling-based image classification method for the Chest X-ray14 and ISIC2018 Skin Lesion Analysis datasets. The ACPL approach aims to address the disadvantages of traditional pseudo-labeling and develop a model capable of achieving state-of-the-art results on par with consistency-based methods. ACPL relies on the existence of a distribution shift between labeled and unlabeled data. The unlabeled samples chosen for pseudo-labeling are the ones located as far as possible from the labeled data's distribution, making them more likely to belong to the minority class and, thus, more informative, resulting in a more balanced training process. The informativeness of a sample is determined using the cross-distribution sample informativeness (CDSI) measure, which computes the closeness of the unlabeled sample to the set of most informative labeled samples, called the anchor set ($\mathcal{A}$). The CDSI is computed from a Gaussian Mixture Model (GMM), whose parameters are denoted by $\gamma$, over the information content random variable $H$ (low, medium, high). Once the set of most informative unlabeled samples is selected, the informative mixup (IM) method is used for pseudo-labeling. This method mixes the labels obtained from the K-nearest neighbor (KNN) classification with the labels obtained from the model, which maps the input image feature through a final activation function to produce an output in $[0, 1]$. The IM method carries out the pseudo-labeling process by calculating a linear combination of the model prediction and the KNN prediction, weighted by the density score. Following the pseudo-labeling stage, the Anchor Set Purification (ASP) algorithm is used to select the most informative pseudo-labeled samples to be added to the anchor set. The backbone model used for experimentation was DenseNet-121 [23] for both datasets. The images in the X-ray14 dataset were resized to 512 × 512. The training was performed with a batch size of 16 and a learning rate of 0.05 for 20 epochs in the first step and then 50 epochs, during which the anchor set was updated with new pseudo-labeled samples every 10 epochs. The model achieved AUC scores of 74.69, 79.96, 79.90, 80.31, and 81.06 for labeled dataset sizes of 2%, 5%, 10%, 15%, and 20%, respectively. For the ISIC2018 dataset, images were resized to 224 × 224. The training was conducted with a batch size of 32 and a learning rate of 0.001 for 40 epochs in the warm-up stage, followed by 100 epochs, with the anchor set being updated with ASP every 20 epochs. The model achieved an AUC score of 94.36 and an F1-score of 62.23 at a labeled set size of 20%.
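A minimal sketch of the informative-mixup idea is given below; the direction of the density weighting is an assumption for illustration, not necessarily ACPL's exact formula.

```python
import torch

def informative_mixup(p_model, p_knn, density):
    """Pseudo-label as a density-weighted linear combination of the classifier's
    prediction and the KNN prediction over the anchor set (sketch of the IM idea).
    p_model, p_knn: (B, C) probability vectors; density: (B,) scores in [0, 1]."""
    w = density.unsqueeze(1)                 # weight each sample by its density score
    return w * p_knn + (1.0 - w) * p_model
```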
Federated Semi-supervised Medical Image Classification via Inter-client Relation Matching [74]: Liu et al. [74] propose an SSL method based on FL. Existing SSL methods cannot be used reliably for FL as they depend on the labeled dataset being accessible during training; however, in the case of FL, the local data at a particular location may be entirely unlabeled. As such, Liu et al. [74] build an interaction between the learning of labeled and unlabeled data by leveraging inherent disease relationships, which are independent of a specific local dataset. This client-invariant disease relation information can be extracted from the labeled data and used to supervise the learning of unlabeled data to ensure it captures similar disease relationships. An uncertainty-based algorithm is used to filter out unreliable pseudo-labels. The backbone FL framework of this method follows the standard FL paradigm in which the central server sends the global parameters $\theta$ to every client $k$. The model is trained at each client $k$ with a local objective function $\mathcal{L}_k$, and the updated local parameters are sent back to the central server and aggregated to update the global model. In the proposed model, local parameters are aggregated using weights proportional to the size of each client's dataset, following the federated averaging (FedAvg) method [75]. The local learning objectives use the cross-entropy loss at labeled clients and the consistency regularization mechanism at unlabeled clients, with transformations such as rotation, translation, and flip. The inter-client relation matching (IRM) method is introduced for disease relation estimation to assist learning at unlabeled clients, in which a disease relation matrix is derived from the data's pre-SoftMax features at each labeled client and averaged to obtain the matrix representing the general disease relation. The model was tested on two datasets: ISIC2018 and the RSNA 2019 Brain CT dataset. From the RSNA dataset, 25,000 slices were sampled and divided into 70%, 10%, and 20% splits for training, validation, and testing. The ISIC2018 images were resized to 224 × 224 and split similarly. The FL setting was simulated by partitioning the training set into 10 local clients, of which 20% (two clients) were labeled. DenseNet121 was used as the backbone model. On the RSNA dataset, the model achieved an AUC score of 87.56 and an F1-score of 59.86. On the ISIC2018 dataset, it achieved an AUC score of 92.46 and an F1-score of 55.81.
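The server-side aggregation follows the FedAvg rule, which can be sketched as follows (the parameter dictionaries and client sizes are illustrative inputs).

```python
import torch

def fedavg(client_params, client_sizes):
    """Federated averaging: aggregate local parameter dicts with weights
    proportional to each client's dataset size."""
    total = float(sum(client_sizes))
    global_params = {}
    for name in client_params[0]:
        global_params[name] = sum(
            (n / total) * params[name] for params, n in zip(client_params, client_sizes)
        )
    return global_params
```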
Semi-HIC [76]: Su et al. [76] propose an SSL method called Semi-HIC for the classification of histopathological images, which are representations of biopsy samples captured by a microscope. This model uses a patch-wise CNN to learn embeddings of labeled and unlabeled images, wherein three convolutional blocks are used to extract low-level detail information, and four cascaded inception blocks are used to create discriminative representations for patches. Semi-HIC uses a novel loss function obtained by combining an association cycle consistency (ACC) loss and a maximal conditional association (MCA) loss. These losses are intended to address the two major challenges in histopathological image classification, which are inter-class similarity and intra-class variation. They are based on the walker loss and visit loss proposed by Haeusser et al. [39] in their LA method. The original visit loss is vulnerable to the inter-class similarity in histopathological images, as a cancerous labeled embedding and an unlabeled embedding with an underlying non-cancerous class may have high similarity. Su et al.'s MCA loss aims to address this by separately calculating the visit probability of non-cancerous embeddings, $P^{nc}$, and the visit probability of cancerous embeddings, $P^{c}$. If an unlabeled embedding has an underlying cancerous class, it is expected to have a high $P^{c}$ and a low $P^{nc}$, and vice versa. Thus, the MCA loss $\mathcal{L}_{mca}$ is defined as the conditional entropy between $P^{c}$ and $P^{nc}$, normalized by the number of unlabeled embeddings. The goal is to maximize the difference between $P^{c}$ and $P^{nc}$ for the model to learn discriminative representations for unlabeled patches. The ACC loss, on the other hand, applies modifications to the walker loss in order to make the model robust to intra-class variation. This is achieved by penalizing association cycles that start and end at labeled embeddings belonging to the same image. The ACC loss $\mathcal{L}_{acc}$ is the cross-entropy between the association cycle probabilities and a uniform target distribution, scaled by a penalty factor that is 1 if the association cycle does not start and end at the same image and 5 if it does. Finally, the total loss function of the model combines $\mathcal{L}_{bce}$, the binary cross-entropy loss of the CNN model, with $\mathcal{L}_{acc}$ and $\mathcal{L}_{mca}$, using per-loss weights set to 1, 1, and 0.75, respectively. This model was evaluated on the Bioimaging2015 and the Grand Challenge on BreAst Cancer Histology (BACH) datasets. The Bioimaging2015 dataset contains 249 training and 36 testing images of cancerous and non-cancerous histopathological images; the training set is split into 222 training and 27 validation images. The BACH dataset contains 400 cancerous and non-cancerous histopathological images split with the ratio 70:10:20 into the training, validation, and testing sets. After patch extraction, the size of the dataset is 8420, 1163, and 2369 patches for training, validation, and testing, respectively. 5-fold cross-validation was carried out on the BACH dataset, while the average of 5 runs was computed for the Bioimaging2015 dataset. For the BACH dataset, the model obtained error rates of 30.38, 24.87, and 15.46 for 20, 40, and 80 labeled images per class, respectively. For the Bioimaging2015 dataset, the model achieved error rates of 30.13, 28.39, and 25.23 for 20, 40, and 80 labeled images per class, respectively. The model proved to be more successful than state-of-the-art supervised models for histopathology image classification.
A summary of the methods discussed in this section is provided in Table 4. Three methods have been reviewed: ACPL, FL with IRM, and Semi-HIC. For each model, the table contains the type of approach to SSL, the datasets used for testing, as well as the best performance score obtained on each dataset.
4.5. Hybrid Methods
This section covers a few models which combine two or more of the previously reviewed approaches.
Local and Global Consistency Regularized Method [77]: This method is both consistency-based and graph-based. It uses the Mean Teacher framework and also enforces the local and global consistency of the data [78]. Local consistency is the tendency of instances from the same class to be in the same area of the feature space, and global consistency is the tendency of instances from the same global structure to have the same label. This method integrates label propagation (LP) to enforce local and global consistency. LP is a semi-supervised learning algorithm that involves the propagation of labels from labeled samples to unlabeled samples based on their closeness, which is determined by the affinity matrix. For an unlabeled instance $x_i$, the label is calculated by taking the weighted average of the labels of the labeled instances close to $x_i$; then, the new label of $x_i$ can be propagated to other neighboring unlabeled data. Next, a graph is constructed based on the labels created by the LP algorithm as well as the ground truth labels $y_i$ and $y_j$ of the data. A contrastive Siamese loss [79] is used in order to enforce local and global consistency by pulling the instances from the same class closer and pushing the ones from different classes farther apart; a margin hyperparameter controls the separation, and the feature vectors are extracted from the intermediate layers of the student network. The final loss is a weighted sum of the Mean Teacher loss and the graph-based losses, with one weight applied to the loss calculated on the labeled instances and another applied to the loss calculated on both labeled and unlabeled instances. The loss on the unlabeled instances alone is not calculated separately, as the labels predicted by the LP algorithm are noisy, and their inclusion may harm the performance of the method. Testing was conducted using two datasets, the Multi-organ Nucleus Segmentation (MoNuSeg) dataset containing 22,462 samples and the Ki-67 nucleus dataset containing 17,516 samples. In both datasets, each sample is labeled with one of four types, leading to a single-label classification task. For both datasets, 80% of the data was used for training and 20% for testing; from the training data, 20% was used for validation. The model architecture is a 13-layer convolutional neural network similar to the one proposed in the original Mean Teacher method [27]. The metric used to gauge the method's performance is the F1 score. Testing was conducted at different percentages of labeled data (5%, 10%, 25%, 50%, and 100% for MoNuSeg and 1%, 5%, 10%, and 100% for Ki-67). For the MoNuSeg dataset, the proposed method achieved higher scores than both the supervised baseline and the Mean Teacher at all percentages, notably at 5% and 10%, where it obtained scores approximately 2% higher than the Mean Teacher. The highest score obtained was 76.89 at 50% labeled data, and the lowest was 75.02 at 5% labeled data. In the case of the Ki-67 dataset, the proposed method also achieved higher scores than the supervised baseline and Mean Teacher, the largest increase over Mean Teacher being approximately 2.5% at the 5% labeled percentage. The highest score was 79.79 at 10% labeled data, and the lowest was 74.9 at 1% labeled data.
Deep virtual adversarial self-training with consistency regularization [80]: This model is both adversarial and consistency-based. It consists of three parts: self-training, consistency regularization, and virtual adversarial training. As previously mentioned, self-training involves using the model itself to generate labels for unlabeled samples, which then become part of the labeled training set in the next iteration. The model outputs a probability distribution over all classes, and the generated label is only kept when the highest probability is above a predetermined threshold. Consistency enforcement is applied to both labeled and unlabeled samples. For the labeled samples, weak augmentation is applied, and the generated label is encouraged to be consistent with the ground truth label. For unlabeled samples, weak augmentation is applied, then a pseudo-label is generated by the self-training model; the same input image is then strongly augmented, and the new prediction must be consistent with the previously obtained pseudo-label. Lastly, virtual adversarial training is applied to improve the robustness of the model and strengthen its generalization capability. Thus, the loss function of the model becomes the weighted sum of three losses: the supervised cross-entropy loss $\mathcal{L}_{sup}$ for labeled data, the regularization loss $\mathcal{L}_{reg}$ for unlabeled data, and the virtual adversarial training loss $\mathcal{L}_{vat}$ applied to both labeled and unlabeled data, i.e., $\mathcal{L} = \mathcal{L}_{sup} + \lambda_{1}\mathcal{L}_{reg} + \lambda_{2}\mathcal{L}_{vat}$, where $\lambda_{1}$ and $\lambda_{2}$
are the weighting coefficients. Experimentation was performed using two datasets; a breast ultrasound (BRUS) dataset and an optical coherence tomography (OCT) dataset. The BRUS dataset contains 39,904 images, 22,026 of which are labeled as malignant (14,557) or benign (7469). The other 17,878 images are unlabeled. The labeled data were randomly split into 34% training (7662), 34% validation (7579), and 32% testing (6785). The OCT dataset contains 109,309 images, each labeled with four different types (three diseases and one normal). There were 20,000 images taken for the validation and testing datasets, respectively, and 1% of the remaining data (691 images) was taken for the labeled dataset, whereas the rest was considered unlabeled. The network architecture used for this experimentation is DenseNet-121. The weak augmentations include horizontal and/or vertical flips with a 50% probability, as well as randomly applied horizontal and/or vertical translations by less than 20% of the image width or height. Strong augmentation includes transformations such as color channel reduction, contrast maximization, solarization, rotation, and posterization. The results are compared with Mean Teacher, VAT, MixMatch [
42], and FixMatch [
44], as well as the fully supervised DenseNet-121 as the baseline. The metrics calculated were the accuracy, F1, AUC, and Kappa scores. The proposed method achieved higher scores than all the compared methods on both datasets, with an F1-score of 0.88 on the BRUS dataset and a Macro F1-score of 0.91 on the OCT dataset.
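The unlabeled branch of this model, confidence-thresholded pseudo-labeling combined with weak-to-strong consistency, can be sketched as follows (the threshold value is illustrative, and the VAT term is omitted for brevity).

```python
import torch
import torch.nn.functional as F

def self_training_consistency_loss(model, x_weak, x_strong, threshold=0.95):
    """Predict on a weakly augmented view, keep the pseudo-label only if the top
    probability exceeds the threshold, then require the strongly augmented view
    to match it."""
    with torch.no_grad():
        probs = torch.softmax(model(x_weak), dim=1)
        conf, pseudo = probs.max(dim=1)
        mask = (conf >= threshold).float()               # keep only confident pseudo-labels
    loss = F.cross_entropy(model(x_strong), pseudo, reduction="none")
    return (mask * loss).mean()
```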
Pseudo-labeling generative adversarial networks (PLGAN) [81]: This model, proposed by Mao et al. [81], is based on pseudo-labeling and GANs. It additionally incorporates CL and MixMatch. The training process of this method comprises four steps: (1) pre-training, (2) image generation, (3) fine-tuning, and (4) pseudo-labeling. In step 1, the feature layer of ResNet50 [59] is pre-trained using CL to extract key image features. In step 2, GANs are used to generate images that simulate the real distribution of the labeled images from random Gaussian noise. In step 3, the cross-entropy loss is used to classify the images created by the generator in order to fine-tune the discriminator for classification. The model contains two classifiers, a global and a local classifier, both of which use two convolutional blocks to extract feature information. In step 4, the MixMatch approach is used for pseudo-labeling. The trained generator is used to create additional unlabeled samples in order to expand the unlabeled dataset, and these generated images are combined with the original dataset. In the same fashion, the pseudo-labels created for both the original and generated samples are combined to create the complete set of pseudo-labels. The overall loss function of the model contains four loss functions, one for each of the four steps. The loss function for the first step is the InfoNCE [82] loss for CL plus the reconstruction loss for intermediate vectors to avoid CL pattern collapse. For the second step, the loss function is the least squares loss. The loss of the semi-supervised fine-tuning step is the sum of the supervised cross-entropy loss and the unsupervised loss. Lastly, the fourth step uses the loss function of MixMatch. The model was tested on a dataset of optical coherence tomography (OCT) images for the classification of retinal degeneration, with each sample belonging to one of four categories (three disease labels and one normal label). Additionally, it was tested on a COVID-19 dataset, a brain tumor MRI dataset, and a chest x-ray dataset. For the OCT and chest x-ray datasets, 100 labeled and 1000 unlabeled images were used. In the MRI dataset, 100 labeled images and 480 unlabeled images were selected. Lastly, 200 labeled and 200 unlabeled images were used for the COVID-19 dataset. The results achieved by the model were classification accuracies of 87.06%, 89.50%, 80.50%, and 96.80% for the OCT, COVID-19, MRI, and x-ray datasets, respectively. In addition to PLGAN, the authors developed PLGAN+, which uses CoMatch [45] for pseudo-labeling. This model achieved accuracy scores of 97.10%, 89.30%, 86.80%, and 97.50% for the OCT, COVID-19, MRI, and x-ray datasets, respectively. These results were compared to the results obtained by several other state-of-the-art SSL models, including CoMatch, MixMatch, FixMatch, and VAT. Overall, PLGAN+ achieved more favorable results than all the other models.
Semi-supervised medical image classification based on CamMix [83]: Guo et al. [83] proposed an SSL model for medical image classification similar to MixMatch, which combines several SSL approaches. The consistency-based approach is used on unlabeled data to generate pseudo-labels that are robust to various augmentations. The authors argue that the MixUp method, which mixes samples by linearly interpolating the input samples and labels, results in unnatural mixed samples. Thus, they propose CamMix, a novel MSDA method that mixes pairs of input samples and labels based on the class activation mask generated from the predictions of both labeled and unlabeled samples. As in MixMatch, entropy minimization is achieved by sharpening the target distribution for unlabeled data. For each batch $b$ at each epoch, the class activation map is obtained as a weighted sum of the feature maps $A_k$, where the weight $w_k^b$ of feature map $k$ for batch $b$ is computed from the gradients of $S_b$, the maximum prediction score of batch $b$ generated by the classification model, with respect to the pixel values $A_k^{ij}$ at locations $(i, j)$ of feature map $k$, averaged over the $Z$ pixels of the map. A random threshold $\lambda$ is applied to the gray-level class activation map to obtain the binary mask CamMask: the pixels higher than $1 - \lambda$ are set to 1, and all others are set to 0. The CamMix algorithm takes a batch of labeled and unlabeled data and their corresponding predictions, including weak and strong augmentations of the data, and generates a mixed batch of original samples and shuffled samples, whose corresponding labels are mixed based on the number of pixels in CamMask. Thus, considering the original samples $x$ and the shuffled samples $x' = \mathrm{shuffle}(x)$, as well as the corresponding label targets $p$ and $p'$, the CamMask is obtained by thresholding the class activation map, and the mixing parameter $\lambda'$ is calculated as the fraction of pixels retained in the CamMask. The mixed batch is then obtained by combining $x$ and $x'$ through the CamMask, and the corresponding targets are mixed with the weights $\lambda'$ and $1 - \lambda'$. Lastly, the overall loss of the model is computed on the mixed batch, wherein logits = model(mixed_input). The model was tested with a ResNet-18 on two datasets: the Interstitial Lung Disease (ILD) dataset and the ISIC2018 dataset. Instead of the F1-score, the metric used for evaluation was $F1_{avg}$, the average F1-score over the different classes. The ILD dataset contains 109 CT scans with an average of 25 slices per case, for a total of 2795 slices. All regions of interest were cropped into patches of size 32 × 32, resulting in approximately 7000 labeled patches and 19,000 unlabeled patches. Six typical lung patterns were used for the classification task. Using CamMix, the model achieved an $F1_{avg}$ score of 95.34 on the ILD dataset, higher than previous state-of-the-art MSDA methods such as MixUp, CutMix [84], and FMix [85]. For the ISIC2018 dataset, the model was trained with 20% labeled and 80% unlabeled data, obtaining an AUC score of 94.04 and an $F1_{avg}$ of 78.15; these scores were higher than the scores obtained by the previously mentioned MSDA methods. The SRC-MT model was also used for comparison and achieved an $F1_{avg}$ score of 78.15, higher than the one obtained by CamMix; on the other hand, it obtained a lower AUC score.
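A minimal sketch of the CamMix mixing step is given below, assuming the masked region is taken from the original sample and the complement from the shuffled sample; the exact pasting direction and threshold handling in the paper may differ.

```python
import torch

def cammix(x, x_shuffled, p, p_shuffled, cam, lam_thresh):
    """Mix a batch with its shuffled copy using a binary CamMask (sketch).
    cam: (B, H, W) normalized class-activation maps of the batch; pixels above
    1 - lam_thresh form the mask; labels are mixed by mask area."""
    mask = (cam > (1.0 - lam_thresh)).float().unsqueeze(1)       # (B, 1, H, W) CamMask
    x_mixed = mask * x + (1.0 - mask) * x_shuffled               # paste the masked region
    lam = mask.mean(dim=(1, 2, 3)).unsqueeze(1)                  # fraction of retained pixels
    p_mixed = lam * p + (1.0 - lam) * p_shuffled                 # mix the label targets
    return x_mixed, p_mixed
```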
A summary of the methods discussed in this section is provided in Table 5. Four hybrid methods have been reviewed: Local and Global Consistency Regularized, Wan et al.'s deep virtual adversarial self-training, PLGAN, and SSL with CamMix. We have reported the SSL methods which have been combined to create each hybrid method, as well as the datasets used for testing each model and the best performance score obtained on each dataset.