3.1. Data
We use the COVIDx-US dataset v1.4 [1] as the data source. COVIDx-US is an open-access benchmark dataset of lung ultrasound (LUS) imaging data that contains 242 videos and 29,651 processed images of patients with COVID-19 infection, non-COVID-19 infection (mainly pneumonia), other lung conditions, and normal control cases. The dataset provides LUS images captured with two kinds of probes: linear probes, which produce a square or rectangular image, and convex probes, which allow for a wider field of view [28]. Because of this difference in the field of view, combining linear and convex probe data in training may increase noise and affect the performance of the network. As the dataset also contains few COVID-19-positive examples captured with linear probes, we exclude linear-probe images from this study. The remaining 25,262 convex LUS images are randomly split into a training set containing 90% of the images and an unseen test set containing the remaining 10%, with all frames from each video assigned to either the training or the test set to avoid data leakage. The training set is then split 80–20% into training and validation datasets. The validation dataset is used for hyperparameter tuning and performance assessment in the training phase. All images are re-scaled to a common size in pixels to keep the images across the entire dataset consistent. The dataset is further augmented by rotating each image by 90°, 180°, and 270°, resulting in a total of 101,048 images (25,262 × 4). This rotation technique is an appropriate method for increasing the dataset size, as it keeps the images and the areas of interest for clinical decisions unaltered and in-bound [29].
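The leakage-free split and the rotation augmentation described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names (split_by_video, rotate90, augment_with_rotations) and the frame-to-video mapping are hypothetical, and images are represented as nested lists purely for illustration. Because whole videos are assigned to one side of the split, the fraction of images in the test set is only approximately the requested fraction.

```python
import random


def split_by_video(frame_to_video, test_fraction=0.1, seed=0):
    """Split frames into train/test so that each video lands in exactly one set."""
    videos = sorted(set(frame_to_video.values()))
    rng = random.Random(seed)
    rng.shuffle(videos)
    n_test = max(1, round(len(videos) * test_fraction))
    test_videos = set(videos[:n_test])
    train, test = [], []
    for frame, video in frame_to_video.items():
        (test if video in test_videos else train).append(frame)
    return train, test


def rotate90(image):
    """Rotate a 2-D image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def augment_with_rotations(image):
    """Return the original image plus its 90, 180 and 270 degree rotations."""
    versions = [image]
    for _ in range(3):
        versions.append(rotate90(versions[-1]))
    return versions
```

Each image yields four versions after augmentation, which is consistent with the fourfold increase in dataset size reported above.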
3.2. Methods
COVID-Net USPro is a prototypical few-shot learning network that trains in an episodic learning setting. It uses a distance metric to assess similarities between a set of unlabelled data, i.e., the query set, and labelled data, i.e., the support set. Labelled data are used to compute a single prototype representation of each class, and unlabelled data are assigned to the class of the prototype they are closest to. A prototypical network [10] is based on the idea that there exists an embedding in which points in a class cluster around a single prototype representation for the class. During the training phase, a neural network is used to learn the non-linear mapping of the inputs to an embedding space, and a class prototype is computed as the mean of its support set data in the embedding space. Classification is then done by finding the nearest class prototype for each query point based on a specified distance metric. An episodic approach is used to train the model, where in each training episode, the few-shot task is simulated by sampling data points in mini-batches to make the training process consistent with the testing environment. The performance of the network is evaluated using the unseen test set. Both a quantitative performance analysis based on accuracy, precision, and recall and a qualitative explainability analysis are conducted. A high-level conceptual flow of the analysis is presented in Figure 1.
We define the classification problem as a K-way N-shot episodic task, where K denotes the number of classes present in the dataset, and N denotes the number of images for each class in each episode. For a given dataset, N images from each of the K classes are sampled to form the support set, and another M images from each class are sampled to form the query set. The network then aims to classify the images of the query set based on the K × N total images presented in the support set. In this work, we consider three classification scenarios and formulate the problem as 2-way, 3-way, and 4-way classification problems. Details are included under Section 3.3.3.
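The construction of a single K-way N-shot episode can be sketched as follows. This is an illustrative example under stated assumptions rather than the authors' sampler: sample_episode and data_by_class are hypothetical names, and items stand in for image identifiers.

```python
import random


def sample_episode(data_by_class, n_shot, n_query, rng=None):
    """Sample one K-way N-shot episode.

    data_by_class maps each of the K class labels to a list of items.
    Returns (support, query), each a list of (item, label) pairs, with
    n_shot support items and n_query query items per class and no overlap.
    """
    rng = rng or random.Random()
    support, query = [], []
    for label, items in data_by_class.items():
        if len(items) < n_shot + n_query:
            raise ValueError(f"class {label!r} has too few items for this episode")
        picked = rng.sample(items, n_shot + n_query)  # sampling without replacement
        support += [(item, label) for item in picked[:n_shot]]
        query += [(item, label) for item in picked[n_shot:]]
    return support, query
```

With K classes, the support set returned here holds K × N pairs and the query set K × M pairs, where N and M correspond to n_shot and n_query.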
The few-shot classification with a prototypical network can be summarized into three steps: (1) encoding of the images, (2) generating class prototypes, and (3) assigning labels to query samples based on distances to the class prototypes. Let $S = \{(x_i, y_i)\}_{i=1}^{KN}$ and $Q = \{(x_j, y_j)\}_{j=1}^{KM}$ be the support and query sets, respectively, where each $x_i \in \mathbb{R}^D$ is a D-dimensional example feature vector and $y_i \in \{1, \ldots, K\}$ is the label of the example. The prototypical network contains an image encoder $f_\phi : \mathbb{R}^D \to \mathbb{R}^H$ that transforms each image $x_i$ onto an H-dimensional embedding space where images of the same class cluster together. Class prototypes are then generated for each class by averaging the embedded image vectors in the support set, $c_k = \frac{1}{|S_k|} \sum_{(x_i, y_i) \in S_k} f_\phi(x_i)$, where $S_k$ is the subset of the support set with label k and $c_k$ denotes the prototype of class k [10]. To classify the query image, a distance metric is used where distances between the embedding vector of a query image and each of the class prototypes are computed. In this work, the squared Euclidean distance $d(q, c_i) = \lVert q - c_i \rVert^2$ is used, where q is the embedding vector of the query image, and $c_i$ is the embedding vector of the i-th prototype. The choice of the squared Euclidean distance over other distance metrics, e.g., cosine distance, is supported by Snell et al. [10], who showed that for Bregman divergences, such as the squared Euclidean distance, the class mean is the optimal prototype in the embedding space, and who reported better empirical performance with the squared Euclidean distance than with cosine distance. After distances are computed, a SoftMax function is applied over the negative distances to the prototypes to compute the probabilities of the query image being in each class. The class with the highest probability is then assigned to the query image.
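Steps (2) and (3) can be made concrete with a small sketch that operates on embeddings that have already been produced by an encoder. It is a minimal illustration, assuming embeddings are plain lists of floats; the function names (class_prototypes, class_probabilities, classify) are hypothetical and the encoder itself is omitted.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def squared_euclidean(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def class_prototypes(support):
    """support: list of (embedding, label) pairs. Returns {label: prototype}."""
    by_class = {}
    for embedding, label in support:
        by_class.setdefault(label, []).append(embedding)
    return {label: mean_vector(embs) for label, embs in by_class.items()}


def class_probabilities(query_embedding, prototypes):
    """SoftMax over the negative squared distances to each class prototype."""
    logits = {k: -squared_euclidean(query_embedding, c) for k, c in prototypes.items()}
    shift = max(logits.values())  # subtract the maximum for numerical stability
    exps = {k: math.exp(v - shift) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


def classify(query_embedding, prototypes):
    """Assign the class whose prototype receives the highest probability."""
    probs = class_probabilities(query_embedding, prototypes)
    return max(probs, key=probs.get)
```

Because the SoftMax is monotonic in the negative distance, the class with the highest probability is always the class of the nearest prototype.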
In the training phase, the network learns by minimizing a loss function, i.e., the negative log-SoftMax probability ($J = -\log p(y = k \mid x)$) of the true class k for a query image x. An Adam optimizer with an initial learning rate of 0.001 is used, and the learning rate is reduced if the validation loss has not improved after 3 epochs. In each episode, a subset of data points is randomly selected, forming a support and a query set. The loss terms on the training and validation sets are calculated at the end of each training episode. To facilitate an effective training process and prevent over-fitting, early stopping is implemented to stop the training process when the validation loss has not improved after 5 epochs. Each training run is set to a total of 10 epochs, with 200 episodes per epoch.
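The loss and the two patience rules can be sketched in a few lines. This is an illustration only, not the authors' training loop: PlateauTracker is a hypothetical helper, the learning-rate reduction factor of 0.1 is an assumption (the text does not state it), and "not improved after n epochs" is interpreted here as n consecutive epochs without a new best validation loss.

```python
import math


def negative_log_softmax(distances, true_class):
    """Loss for one query image.

    distances maps each class to the squared distance between the query
    embedding and that class prototype; logits are the negative distances.
    """
    logits = {k: -d for k, d in distances.items()}
    shift = max(logits.values())
    log_norm = shift + math.log(sum(math.exp(v - shift) for v in logits.values()))
    return log_norm - logits[true_class]


class PlateauTracker:
    """Tracks validation loss to trigger learning-rate reduction and early stopping."""

    def __init__(self, lr=1e-3, lr_patience=3, stop_patience=5, factor=0.1):
        self.lr = lr
        self.lr_patience = lr_patience
        self.stop_patience = stop_patience
        self.factor = factor  # assumed value, not given in the text
        self.best = math.inf
        self.stale = 0  # consecutive epochs without improvement

    def step(self, val_loss):
        """Update after one epoch; return True if training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.stale = 0
            return False
        self.stale += 1
        if self.stale % self.lr_patience == 0:
            self.lr *= self.factor
        return self.stale >= self.stop_patience
```

In a framework such as PyTorch, the same behaviour would normally be delegated to the built-in optimizer and scheduler classes rather than tracked by hand.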
Figure 2 presents an architecture design overview of the COVID-Net USPro network.
The trained model’s performance is evaluated quantitatively and qualitatively. In the quantitative analysis, the model’s accuracy, precision, and recall for each class are analyzed and reported. In the qualitative analysis, model explainability is investigated and visualized. Explainable artificial intelligence (XAI) has been an important criterion when assessing whether neural networks can be applied to clinical settings [30]. While AI-driven systems may show high accuracy and precision in analyzing medical images, a lack of reasonable explainability will spark criticism of the network’s adoption [30]. COVID-Net USPro’s explainability is assessed using two approaches, i.e., gradient-weighted class activation mapping (Grad-CAM) [14] and GSInquire [15], on a selected dataset containing COVID-19 and normal cases correctly classified with high confidence (i.e., >99.9% probability), as well as falsely predicted COVID-19 and normal cases. Grad-CAM generates a visual explanation of the input image using the gradient information flowing into the last convolutional layer of the convolutional neural network (CNN) encoder and assigns importance values to each neuron for making a classification decision [14]. The output is a heatmap-overlaid image that shows the image regions that impact the particular classification decision made by the network [14]. The other tool, GSInquire, uses a generative synthesis approach to identify the critical factors in an input image that are integral to the decisions made by the network [15]. The result is an annotated image highlighting the critical region, which could drastically change the classification result if removed [15]. Results from both tools are reviewed by our contributing clinician (A.F.), who has experience in ultrasound image analysis, to assess whether clinically important patterns are captured by the network.
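For the quantitative part of the evaluation, accuracy and per-class precision and recall can be computed from true and predicted labels as in the sketch below. The function name per_class_metrics is hypothetical, and returning 0.0 when a class has no predictions or no true samples is a convention chosen for this example.

```python
def per_class_metrics(y_true, y_pred):
    """Return overall accuracy and a {class: {"precision", "recall"}} mapping."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    pairs = list(zip(y_true, y_pred))
    metrics = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in pairs)  # true positives
        fp = sum(t != c and p == c for t, p in pairs)  # false positives
        fn = sum(t == c and p != c for t, p in pairs)  # false negatives
        metrics[c] = {
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
        }
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return accuracy, metrics
```

For example, with true labels ["covid", "covid", "normal", "normal"] and predictions ["covid", "normal", "normal", "normal"], the COVID-19 class has a precision of 1.0 but a recall of 0.5, which illustrates why recall on the positive class is reported separately from accuracy.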
3.3. Experiment Settings
We comprehensively assess the performance of COVID-Net USPro in detecting COVID-19 cases from ultrasound images by testing various training conditions, such as different image encoders, the number of shots available for training, and classification task types. Details are further discussed in this section.
3.3.1. Image Encoders
To leverage the power of transfer learning, we experiment with multiple encoders, including, but not limited to, the ResNet, DenseNet, and VGG networks [22,24,31]. Pre-trained models use parameters pre-trained on ImageNet [32]. To concisely summarize the main results, we report the four best-performing encoders with respect to our research objectives:
ResNet18L1: Pre-trained ResNet18 [24], with trainable parameters on the final fully connected layer only and the output features set to the number of classes. We consider this pre-trained network the baseline encoder, as it contains the fewest layers and retrained parameters.
ResNet18L5: Pre-trained ResNet18 [24], with trainable parameters on the last four convolutional layers and the final fully connected layer. The output features are set to the number of classes.
ResNet50L1: Pre-trained ResNet50 [24], with trainable parameters on the final fully connected layer only and the output features set to the number of classes.
ResNet50L4: Pre-trained ResNet50 [24], with trainable parameters on the last three convolutional layers and the final fully connected layer. The output features are set to the number of classes.
3.3.2. Number of Training Shots
The optimal number of shots for maximized performance is tested by training the model under 5-, 10-, 20-, 30-, 40-, 50-, 75-, and 100-shot scenarios. For selected models showing a steady increase in performance with an increasing number of shots, 150- and 200-shot conditions are also tested to further verify that the maximum performance is reached at 100 shots. To ensure the training process is faithful to the testing environment, the number of shots for each class presented in each episode is the same in the support and query sets in both the training and test phases. For example, in the 5-shot scenario, five images in each class are presented for both the support set and the query set in the training phase, and the same follows in the test phase.
3.3.3. Problem Formulation
In comparison to other classes, e.g., non-COVID-19 and normal cases, the ability of the model to correctly identify COVID-19-positive cases is valued the most. The classification problem for identifying COVID-19 is formulated in three different scenarios as follows, in ascending order of data complexity:
2-way classification: Data from all three other classes, namely, the ‘normal’ class, ‘non-COVID-19’ class, and ‘other’ class, are viewed as a combined COVID-19 negative class. The network learns from COVID-19 positive and COVID-19 negative datasets in this setting.
3-way classification: As the ‘other’ class contains data from multiple different lung conditions, it has the highest variation and may disrupt the network’s learning process due to the lack of uniformity in the data. In the 3-way classification, the ‘other’ class is excluded, and the network is trained to classify the remaining three classes.
4-way classification: As the dataset contains four classes, all four are retained in this setting, and the network is trained to classify the ‘COVID-19’, ‘normal’, ‘non-COVID-19’, and ‘other’ classes.
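The three scenarios above amount to a relabelling or filtering of the four original classes, which the following sketch makes explicit. The names CLASSES, remap_label, and build_task are hypothetical, as are the label strings used for the combined classes in the 2-way task.

```python
CLASSES = ("COVID-19", "non-COVID-19", "normal", "other")


def remap_label(label, n_way):
    """Map an original 4-class label to the label used in the n-way task.

    Returns None when the sample is excluded from the task.
    """
    if label not in CLASSES:
        raise ValueError(f"unknown class: {label!r}")
    if n_way == 4:
        return label
    if n_way == 3:
        return None if label == "other" else label  # 'other' class is dropped
    if n_way == 2:
        return "COVID-19 positive" if label == "COVID-19" else "COVID-19 negative"
    raise ValueError("n_way must be 2, 3 or 4")


def build_task(samples, n_way):
    """samples: list of (item, label) pairs. Returns the relabelled, filtered list."""
    task = []
    for item, label in samples:
        new_label = remap_label(label, n_way)
        if new_label is not None:
            task.append((item, new_label))
    return task
```

Only the 3-way task changes the number of samples; the 2-way and 4-way tasks use the same images with different labels.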
The network hyperparameters and training settings are listed in Appendix A.