1. Introduction
A brain tumor is an abnormal mass of cells in the brain that can develop at any stage of life. These cells divide and grow uncontrollably, occupying the limited and enclosed space of the brain or invading normal brain tissue, leading to various symptoms. This type of disease is relatively common, with approximately 250,000 new cases globally each year. In Australia, the government is investing AUD 126.4 million into brain cancer research. This investment is prompted by the roughly 2000 Australians diagnosed with brain cancer every year and the low 5-year survival rate of approximately 23% [
1].
Brain tumors are named based on their location and the classification of the tumor cells; they can be primary or metastatic, benign or malignant, and vary in size. Brain metastases and multifocal primary brain tumors such as glioblastoma (GBM) are two different pathophysiological entities requiring distinct therapeutic approaches. Primary brain tumors, including glioma, meningioma, and astrocytoma, originate within the brain parenchyma itself. In contrast, metastatic brain tumors arise from cells that spread from a primary malignancy elsewhere in the body, such as lung, breast, and colorectal cancers [
2,
3].
Patients with primary or metastatic brain tumors exhibit different symptoms depending on the size, location, and distribution of the lesion within the brain. As the tumor grows within the limited space of the skull, it increases intracranial pressure and endangers the patient’s life. This may cause brain edema due to plasma-like fluid leakage through impaired capillary endothelial tight junctions in tumors [
4]. The mass effect with midline shift, tumor dimensions, and mass-edema index may contribute to the differential diagnosis of brain metastases versus primary brain tumors [
5].
Brain tumors range from very small to very large and progress at different rates. The prognosis of a patient depends on several factors, including the type of brain tumor and its growth rate [
6]. Since growth speed is a major concern in treating brain tumors, it is best to detect them while they are still small. However, few papers currently discuss how to detect small brain tumors.
To diagnose brain tumors, neurologists and neurosurgeons inquire about the patient’s and their family’s medical history and conduct a comprehensive neurological examination. This includes assessments of consciousness, muscle strength, coordination, reflexes, and pain response to identify the cause of symptoms. Progressive visual field defect and visual acuity loss generally occur due to an intracranial tumor [
7].
Based on the results of physical and neurological examinations, further tests may be selected, such as a computer tomography (CT) scan, magnetic resonance imaging (MRI), electroencephalogram (EEG), cerebral angiography, and positron emission tomography (PET) [
8]. In this paper, we focus on the use of MRI to detect and segment small tumors.
To reduce the cost of examinations by experts, automated detection and/or segmentation is desirable. In the field of automated brain tumor segmentation, many methods have been proposed, such as region-based approaches [
9,
10,
11,
12], edge-based approaches [
11,
13], threshold-based approaches [
11,
12], and deep learning-based approaches [
12,
14,
15,
16,
17,
18,
19]. Additionally, several survey papers are available on these methods [
11,
20,
21,
22]. It can be observed that deep learning-based approaches are preferable for brain tumor segmentation, as evidenced by the vast number of recent publications [
21,
22]. Among the deep learning-based approaches, U-net [
16] has been one of the preferred models due to its good segmentation performance. Recent studies by other scholars [
17] indicate that adding the Transformer [
23] module to the U-net architecture can effectively improve segmentation results.
Despite their great success, deep learning methods require a large number of labeled samples, and labeling is costly: it typically requires professional physicians to invest significant time and effort. The high cost of obtaining labeled tumor samples further limits the performance of supervised learning methods.
To mitigate this problem, several efforts have been made to create public brain MRI datasets for researchers, such as those in [
24,
25,
26]. Among the datasets, the Brain Tumor Segmentation (BraTS) challenge 2021 dataset [
26,
27,
28,
29] is “utilized primarily by maximum researchers” [
22]. With over 1200 annotated cases, it is a highly valuable resource for segmenting brain tumors in MRI images. More details about the BraTS 2021 dataset will be provided in
Section 3.1.1.
Although the BraTS 2021 dataset is available to the general public, it contains only cases of primary brain tumors with annotations. In contrast, to the best of our knowledge, there is no public 3D MRI dataset that includes cases of metastatic brain tumors. While both primary and metastatic brain tumors develop in brain tissues, they can be distinguished using MRI scans [
30]. A previous study investigated the imaging differences between GBM and multiple brain metastases, aiming to develop a diagnostic algorithm for differentiation on initial MRI, utilizing apparent diffusion coefficient (ADC) values, surrounding T2-hyperintensity, and edema distribution.
To assess the segmentation performance in real-world scenarios, particularly for metastatic brain tumors, we randomly collected MRI cases from Cheng Hsin General Hospital (CHGH) in Taipei, Taiwan. The selection was based solely on medical records without prior viewing of the MRI images, resulting in a dataset that includes both large and small tumors. The definitions of large and small tumors will be provided in
Section 3.1.2. The visual distinction between large and small brain tumors is evident, as illustrated in
Figure 1. Given the time-consuming and costly process of annotating tumor regions, our goal is to develop a feasible approach for segmenting and detecting small brain tumors with very limited available data, supplemented by a public dataset containing primary tumors. This approach presents challenges, as the imaging features of primary and metastatic tumors may differ.
In this paper, the training methods explored include supervised learning, two transfer learning approaches, and self-supervised learning. Unlike existing research, which typically focuses on either segmentation or detection performance, our objective is to identify the most effective segmentation method and subsequently evaluate its detection performance.
2. Related Works
Ronneberger et al. [
16] proposed a neural network architecture called U-net, which is based on an autoencoder architecture. This architecture features skip connections that link the feature map of each encoder layer to the corresponding decoder layer’s feature map. These connections help the decoder retain more contextual information during upsampling, thereby improving the accuracy of segmentation results. It has also been demonstrated that the U-net architecture outperforms the previous best methods in the International Symposium on Biomedical Imaging (ISBI) challenge.
Hatamizadeh et al. [
17] proposed a variant of U-net called Swin UNETR, which primarily uses the Swin Transformer [
31] as the encoder. They added a residual network to the skip connections that link the encoder and decoder. The Swin UNETR architecture also achieved excellent results in the BraTS2021 segmentation challenge.
Chen et al. [
32] proposed a framework for image contrastive learning called SimCLR, which simplifies previously proposed self-supervised learning algorithms. It demonstrates that the combination of data augmentation in contrastive learning plays a crucial role in prediction tasks. The SimCLR method has been proven to outperform previous self-supervised and semi-supervised learning methods, achieving good accuracy even with only 1% of the labeled training samples.
Tang et al. [
33] proposed a self-supervised learning framework for 3D medical imaging, using the Swin Transformer encoder for contrastive learning, followed by fine-tuning segmentation tasks with Swin UNETR. The authors used their method to perform self-supervised pre-training on 5050 open-source CT images of different body organs, followed by fine-tuning on the Beyond the Cranial Vault (BTCV) and Medical Segmentation Decathlon (MSD) datasets. The results show that their model ranks first on the test leaderboards of the BTCV and MSD datasets.
Abdusalomov et al. used YOLOv7 to detect glioma, meningioma, and pituitary brain tumors [
19]. Using a fairly large dataset (3400+ samples per plane) together with data augmentation, they achieved very high detection sensitivity and specificity. However, their paper does not provide sufficient information about the size of the tumors in the study.
Mansur et al. [
12] investigated brain tumor segmentation using three different methods: threshold-based, region-based, and CNN-based approaches. Their results indicated that the threshold-based approach outperformed the other two methods. However, they utilized the Kaggle dataset [
25], which is a 2D dataset (i.e., one image per case), rather than a 3D dataset. Consequently, their findings cannot be directly applied to our study, which involves the use of 3D images.
Kaifi provided a comprehensive review of AI-based diagnostics for brain tumors [
20]. This paper reviews different types of brain tumors, introduces imaging modalities such as CT, MRI, and PET, and provides an overview of classification and segmentation methods. It includes a literature review and discussion and highlights some challenges.
Ahamed et al. [
22] conducted a review of deep learning methods for brain tumor segmentation, offering several valuable observations. Notably, most researchers predominantly utilize the BraTS dataset, thus dealing with only large primary tumors with a large number of training samples. Another conclusion is that the application of fusion and attention mechanisms has been shown to enhance segmentation performance. Therefore, it is expected that a model with attention would be preferable.
The review by Ahamed et al. [
22] highlights that detecting small tumors, particularly metastatic ones, has not received much attention. Additionally, the review indicates that recently published papers typically use datasets with at least several hundred cases. Unfortunately, medium-scale hospitals often cannot afford to collect and annotate a large number of brain MRI cases, as carried out in the BraTS challenge. Therefore, it is important to explore how a small dataset can be used to segment and detect small metastatic tumors and to examine the results.
3. Methodology
This section describes the experimental datasets, preprocessing steps, experimental models, and conducted experiments.
Section 3.1 covers the MRI datasets used. Since the obtained dataset cannot be directly used as training samples, preprocessing steps are detailed in
Section 3.2.
Section 3.3 discusses the architectures of the chosen models, namely U-net and Swin UNETR. Next,
Section 3.4 outlines the parameters used for data augmentation.
Section 3.5 describes the three conducted experiments in detail.
3.1. MRI Datasets
3.1.1. BraTS2021 Dataset
The BraTS challenge dataset has evolved since its inception in 2012. This paper utilizes the latest version, the BraTS 2021 dataset [
26,
27,
28,
29], for Task 1 (Segmentation). As noted by Ahamed et al. [
22], BraTS is the dataset most commonly used by researchers. Although other datasets are available, they have various limitations. For example, the Kaggle dataset [
25] includes only 2D images. The TCGA-GBM dataset [
34] contains 3D images but has only 262 cases. Additionally, the collection of openly available datasets by the University of Cambridge [
24] is not specifically designed for brain tumor segmentation.
The BraTS 2021 dataset includes 1251 multimodal MRI cases of primary brain tumors. The BraTS challenge is jointly organized by the Radiological Society of North America (RSNA), the American Society of Neuroradiology (ASNR), and the Medical Image Computing and Computer Assisted Interventions (MICCAI) Society. The data come from multiple medical institutions in countries such as the United States, Germany, Switzerland, Canada, Hungary, and India. All tumor data were manually annotated by one to four experts following the same protocol and finally certified by an experienced committee and approved by neuroradiologists. The dataset is publicly available in [
26].
Each case in the BraTS 2021 dataset includes four modalities: T1, T1Gd, T2-weighted, and T2 fluid-attenuated inversion recovery (T2-FLAIR). Each modality has a data size of 240 × 240 × 155 and shares tumor segmentation labels. The segmentation labels are (0, 1, 2, 4), where label 0 represents the background, label 1 represents the necrotic tumor core (NT), label 2 represents the peritumoral edema (ED), and label 4 represents the enhancing tumor (ET). We consider labels 1, 2, and 4 as the entire tumor area. All four modalities of the cases have undergone spatial registration, interpolation to the same resolution, and skull stripping. The final data are stored in the NifTI (.nii.gz) format. Due to usage restrictions stated on the website [
26], we officially declare that this dataset is only used for the publication of this paper and not for any other purposes.
3.1.2. The CHGH Dataset
Another set of brain MRI data used in this paper is provided by Cheng Hsin General Hospital (referred to as the CHGH dataset). It was approved by the Institutional Review Board of Cheng Hsin General Hospital, Taipei, Taiwan with the protocol code: (1123)113-53, approval date: 9 October 2024. The tumor areas of the patients were annotated/validated by a neurologist (Y.H.L, one of the authors). The provided MRI data are in the original DICOM (Digital Imaging and Communications in Medicine) [
35] format without any processing. The MRI data include images from different planes: axial, coronal, and sagittal, and contains multiple modalities. In this paper, we will use the axial T2-FLAIR modality for experiments, and the tumor area information is stored in the JSON (JavaScript Object Notation) format. The MRI DICOM files and JSON files require additional processing steps to be used in the experiments, which will be detailed in the next subsection.
Table 1 shows the CHGH dataset, which includes 18 large-tumor cases, 15 small-tumor cases, and 22 normal cases. The main task of this paper is to segment and detect small tumors. In the literature, large brain metastases are typically defined as lesions greater than 2 cm in diameter [
36]. However, some cases in our study present very long and thin sections of tumors, making it difficult to use diameter as a true representation of tumor size. Therefore, we use the tumor area in a slice, rather than diameter, to classify lesions as “large” or “small.” In this study, a small tumor is defined as a tumor with an area of less than 3.5 cm² in the slice with the largest tumor area in all slices of a case. This threshold is slightly greater than the area of a circle with a diameter of 2 cm. The median area of the small tumors is approximately 1.1 cm², while the median area of the large tumors exceeds 16 cm². The dataset containing only small-tumor cases is denoted as CHGH_S.
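As an illustration, this area-based rule can be expressed with a few lines of code; the in-plane pixel area is assumed to come from the DICOM pixel spacing, and the function name is ours.

```python
# Sketch of the area-based "small vs. large" rule described above.
# pixel_area_cm2 is the in-plane pixel spacing squared (assumed known from
# the DICOM metadata); label_volume is a binary H x W x D tumor mask.
import numpy as np

def is_small_tumor(label_volume, pixel_area_cm2, threshold_cm2=3.5):
    slice_areas = label_volume.sum(axis=(0, 1)) * pixel_area_cm2   # cm^2 per slice
    return slice_areas.max() < threshold_cm2
```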
3.2. Preprocessing of MRI Dataset
3.2.1. Preprocessing of the BraTS2021 Dataset
The dimensions of the input images for the experimental model must be multiples of 32. This means both the size of the images and the number of images in a case should be divisible by 32. The tensor size of the BraTS2021 dataset is 240 × 240 × 155, so we need to adjust it to 128 × 128 × 64. To decimate the slices, we use the following equation:
s = 2(d − 1) + 16,   d = 1, 2, …, 64,
where d represents one of the 64 target slices and s represents one of the 155 slices in the BraTS2021 source. Since the first 15 slices in the BraTS2021 dataset are blank, we start from the 16th slice and take every other slice until we have 64 slices.
The original BraTS2021 dataset has four labels: 0, 1, 2, and 4. Label 0 is the background, and the other labels represent different tumors. In our experiment, we do not perform detailed tumor segmentation, so we merge labels 1, 2, and 4 into label 1.
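As a minimal illustration of this preprocessing: the slice selection and label merging follow the text above, while the in-plane downsampling to 128 × 128 is implied by the target tensor size but not detailed, so the resize call (and the use of nibabel and scikit-image) is our assumption.

```python
# Minimal sketch of the BraTS2021 preprocessing described above.
import nibabel as nib
import numpy as np
from skimage.transform import resize

def preprocess_brats_case(image_path, label_path):
    image = nib.load(image_path).get_fdata()   # 240 x 240 x 155
    label = nib.load(label_path).get_fdata()

    # Slice decimation: s = 2(d - 1) + 16 for d = 1..64 (0-based: 15, 17, ..., 141)
    keep = np.arange(15, 15 + 2 * 64, 2)
    image, label = image[:, :, keep], label[:, :, keep]

    # In-plane downsampling from 240 x 240 to 128 x 128 (assumed implementation)
    image = resize(image, (128, 128, 64), order=1, preserve_range=True)
    label = resize(label, (128, 128, 64), order=0, preserve_range=True,
                   anti_aliasing=False)

    # Merge tumor labels 1, 2, and 4 into a single foreground label 1
    label = (label > 0).astype(np.uint8)
    return image.astype(np.float32), label
```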
3.2.2. Preprocessing of the CHGH Dataset
The CHGH MRI data are originally in DICOM format with multiple modalities. For the following experiments, we select the axial T2-FLAIR series as the experimental data. By stacking the processed FLAIR images in sequence, we create a three-dimensional data structure and save it as a NifTi (.nii.gz) file. The tumor location information, marked by the doctor, is originally stored in a JSON file. We need to convert this location information into images and save them as NifTi files as well. Therefore, the final output of each case is a pair of NifTi files, one for the brain image and one for the tumor label. The detailed processing steps are as follows (a consolidated code sketch follows these steps):
Parsing DICOM files: A DICOM file stores the pixel data of the image and related metadata, such as patient ID, study, series, equipment, and image information. We need the series metadata, which identifies the acquisition series each image belongs to. We use the Pydicom package: its dcmread() function reads a DICOM file and gives access to the series description attribute, making it very convenient to classify the series. A detailed description of all DICOM attributes can be found on the DICOM official website [
35].
Image pixel conversion: The grayscale range in the original DICOM file images is not uniform, as equipment from different manufacturers usually has minor differences in image details. To address this issue, we need additional steps, as shown in
Figure 2.
Parsing JSON annotation files and converting them to image files: In the JSON file we received, the tumor areas are represented by closed lines, composed of continuous segments. Each segment consists of a set of coordinates and width. Using the coordinate information, we draw lines on a blank image canvas, fill the inside of the closed lines with a gray value, and finally save the image. The pairing between the brain image files and the annotation files is matched using the DICOM SopInstanceUID attribute and the JSON file name.
Interpolation of brain images and tumor annotation images: Although most cases have between 48 and 51 brain slices, a few cases have only 24 or 25 slices. Therefore, we interpolate these brain slices and the paired annotation images to double the slice number, from 24–25 slices to 47–48 slices. This way, the number of slices for all cases will range from 47 to 51 slices. We use the high-order slice interpolation method for interpolation [
40,
41] because this method outperforms its linear and cubic counterparts.
Padding black images: Since the number of slices in all cases ranges from 47 to 51, we pad the end of the slices with black images to reach a total of 64 slices. Finally, we save the three-dimensional data structure as a NifTi file with dimensions 128 × 128 × 64.
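The consolidated sketch below illustrates these steps. The JSON field names, the FLAIR keyword test, the per-slice resizing to 128 × 128, and the 5th-order spline used as a stand-in for the cited high-order slice interpolation are illustrative assumptions, not the authors' exact implementation.

```python
# Consolidated sketch of the CHGH preprocessing pipeline described above.
import json
from pathlib import Path

import cv2
import numpy as np
import nibabel as nib
import pydicom
from scipy.ndimage import zoom


def load_flair_series(case_dir):
    """Read DICOM files, keep the axial T2-FLAIR series, sort by slice position."""
    slices = []
    for path in Path(case_dir).glob("*.dcm"):
        ds = pydicom.dcmread(path)
        if "FLAIR" in str(getattr(ds, "SeriesDescription", "")).upper():
            slices.append(ds)
    slices.sort(key=lambda ds: float(ds.ImagePositionPatient[2]))
    return slices


def normalize_pixels(ds):
    """Map raw pixel data to a uniform grayscale range (cf. Figure 2)."""
    img = ds.pixel_array.astype(np.float32)
    img = img * float(getattr(ds, "RescaleSlope", 1.0)) + float(getattr(ds, "RescaleIntercept", 0.0))
    img = (img - img.min()) / (img.max() - img.min() + 1e-8)
    return cv2.resize(img, (128, 128))


def json_to_mask(json_path, shape=(128, 128)):
    """Rasterize the closed tumor contours stored in one annotation JSON file."""
    mask = np.zeros(shape, dtype=np.uint8)
    with open(json_path) as f:
        for segment in json.load(f)["segments"]:          # assumed schema
            pts = np.array(segment["points"], np.int32).reshape(-1, 1, 2)
            cv2.fillPoly(mask, [pts], color=1)
    return mask


def build_case(case_dir, annotation_paths, out_prefix):
    # annotation_paths: one JSON per slice, matched in practice via SOPInstanceUID
    dicoms = load_flair_series(case_dir)
    image = np.stack([normalize_pixels(ds) for ds in dicoms], axis=-1)
    label = np.stack([json_to_mask(p) for p in annotation_paths], axis=-1)

    if image.shape[-1] < 40:                               # 24-25 slice cases
        image = zoom(image, (1, 1, 2), order=5)            # high-order stand-in
        label = zoom(label, (1, 1, 2), order=0)            # keep labels binary
    pad = 64 - image.shape[-1]                             # pad with black slices
    image = np.pad(image, ((0, 0), (0, 0), (0, pad)))
    label = np.pad(label, ((0, 0), (0, 0), (0, pad)))

    nib.save(nib.Nifti1Image(image.astype(np.float32), np.eye(4)), out_prefix + "_img.nii.gz")
    nib.save(nib.Nifti1Image(label.astype(np.uint8), np.eye(4)), out_prefix + "_lbl.nii.gz")
```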
3.3. Experimental Models
In this section, we introduce the experimental models. During the initial research stage, we explored various models, including convolutional neural networks, conventional autoencoders, and ResNet-based networks. However, these models did not yield satisfactory results. We ultimately found that U-net [
16,
42] and Swin UNETR [
17,
33] performed better. Therefore, we will focus on describing these two models to save space. For a comprehensive review of U-net and its variants in medical image segmentation, please refer to [
43].
3.3.1. U-Net Model
In the following experiments, our U-net model follows the work of Ronneberger et al. [
16], as shown in
Figure 3. The model input is an image with a size of 1 × 128 × 128 × 64 (Channel × Height × Width × Depth). It first passes through a convolutional layer with a stride of 1, increasing the number of channels to 16, resulting in a feature map size of 16 × 128 × 128 × 64.
Next, the feature map goes through the encoder blocks. Inside each encoder block, as shown at the bottom left of
Figure 3, the input first passes through a 3D convolutional layer with a stride of 2, followed by instance normalization and a ReLU (Rectified Linear Unit) activation function. Then, it passes through another 3D convolutional layer with a stride of 1, followed by instance normalization and a ReLU activation function before outputting. It is worth noting that there is a “bypass path,” similar to the design of a residual network [
44], connecting the input and output of the encoder block. In the encoding path, the number of channels doubles, and the height, width, and depth are halved from top to bottom blocks. After the 5th encoder block, we obtain a feature map size of 256 × 8 × 8 × 4, which is then fed into the 1st decoder block.
Inside each decoder block, as shown at the bottom right of
Figure 3, the input first passes through a 3D transposed convolutional layer with a stride of 2. The output of the transposed convolution is concatenated with the corresponding encoder output (as shown by the gray arrows connecting left and right in
Figure 3). Then, it passes through a 3D convolutional layer with a stride of 1, followed by instance normalization and a ReLU activation function. The output of the concatenated layer in a decoder block is connected to the final layer output inside a block, again similar to a residual network [
44]. With decoder blocks going upwards, the number of channels is halved, and the height, width, and depth are doubled. After the 4th decoder block, we obtain an output size of 16 × 128 × 128 × 64.
Finally, it passes through a final 3D convolutional layer with a stride of 1, reducing the number of channels to 2, resulting in a tensor size of 2 × 128 × 128 × 64. We then perform an argmax operation on the result, taking the index of the maximum value, and ultimately obtain a tumor segmentation image size of 1 × 128 × 128 × 64.
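To make the block structure concrete, a minimal PyTorch sketch of one encoder block and one decoder block is given below. It follows the description above; how the bypass path handles the change in resolution is not specified in the text, so a strided 1 × 1 × 1 convolution is assumed on that path.

```python
# Sketch of one encoder and one decoder block of the U-net variant described above.
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
            nn.InstanceNorm3d(out_ch), nn.ReLU(inplace=True),
            nn.Conv3d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.InstanceNorm3d(out_ch), nn.ReLU(inplace=True),
        )
        self.bypass = nn.Conv3d(in_ch, out_ch, kernel_size=1, stride=2)  # assumed

    def forward(self, x):
        return self.body(x) + self.bypass(x)

class DecoderBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose3d(in_ch, out_ch, kernel_size=2, stride=2)
        self.body = nn.Sequential(
            nn.Conv3d(2 * out_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.InstanceNorm3d(out_ch), nn.ReLU(inplace=True),
        )
        self.bypass = nn.Conv3d(2 * out_ch, out_ch, kernel_size=1)

    def forward(self, x, skip):
        x = torch.cat([self.up(x), skip], dim=1)   # concat with encoder feature map
        return self.body(x) + self.bypass(x)       # residual from the concatenated map

enc, dec = EncoderBlock(16, 32), DecoderBlock(32, 16)
f0 = torch.randn(1, 16, 128, 128, 64)              # after the initial stride-1 conv
f1 = enc(f0)                                        # (1, 32, 64, 64, 32)
out = dec(f1, f0)                                   # (1, 16, 128, 128, 64)
```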
3.3.2. Swin UNETR Model
In the experiments, the Swin UNETR model consists of five encoder blocks and five decoder blocks, as shown in
Figure 4, following the design of [
33]. The structure of these five encoder blocks is the same as the Swin Transformer architecture (shown at the bottom left of
Figure 4) [
31]. The input to the Swin UNETR model is an image with a size of 1 × 128 × 128 × 64 (Channel × Height × Width × Depth). Following the concept of the Swin Transformer, the input image is divided into multiple non-overlapping 3D patches through patch partition and then fed into a series of encoder blocks (shown at the bottom left of
Figure 4).
The first step of each encoder block is patch merging, which doubles the number of channels and halves the image resolution, similar to the setup of a convolutional neural network. Each encoder block implements a Swin Transformer, which contains two attention sub-blocks. The first sub-block has a layer normalization, windowed multi-head self-attention (W-MSA), a second layer normalization, and a multi-layer perceptron (MLP, essentially a fully connected multi-layer network). The second sub-block is similar to the first sub-block but uses SW-MSA (shifted windowed multi-head self-attention), instead of W-MSA. After passing through five encoder blocks, we obtain a feature map size of 768 × 4 × 4 × 2.
As shown in
Figure 4, there are five decoder blocks in the decoding path. The input to each decoder block first passes through a 3D transposed convolutional layer with a stride of 2, is then concatenated with the corresponding encoder feature map, and finally passes through a residual sub-block before outputting. The encoder feature map is reshaped to the same dimensions as the decoder input, passing through a residual sub-block, before this concatenation. After the 5th decoder block, we obtain an output size of 48 × 128 × 128 × 64.
Finally, it passes through a final 3D convolutional layer with a stride of 1, reducing the number of channels to 2. Applying the same argmax operation as in the U-net yields a tumor segmentation image of size 1 × 128 × 128 × 64.
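Since the paper follows the Swin UNETR design of [17,33], the model can be instantiated, for example, with the MONAI reference implementation. The snippet below is an illustrative sketch only: the use of MONAI and its argument names are our assumptions, and the img_size argument may be deprecated in newer MONAI versions.

```python
# Minimal sketch instantiating a Swin UNETR with the dimensions used above.
import torch
from monai.networks.nets import SwinUNETR

model = SwinUNETR(
    img_size=(128, 128, 64),   # H x W x D of the preprocessed volumes
    in_channels=1,             # single T2-FLAIR channel
    out_channels=2,            # background vs. tumor
    feature_size=48,           # gives the 48 x 128 x 128 x 64 decoder output
)

x = torch.randn(1, 1, 128, 128, 64)
logits = model(x)                       # (1, 2, 128, 128, 64)
segmentation = logits.argmax(dim=1)     # (1, 128, 128, 64) binary tumor mask
```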
3.4. Data Augmentation
We perform data augmentation in some experiments to expand the number of training samples. By applying various transformations to the original training data, additional samples are generated. Overall, we have used the following transformations:
Rotation: Rotate the image by 20°~50° or −20°~−50°.
Scaling: Scale the image by 0.6~0.9 times or 1.1~1.5 times.
Rotation and scaling: Apply both rotation and scaling with the same range as above.
Affine transformation: Apply a shear transformation of 0.4 to the image.
Once the augmented samples are generated, they are added to the training set used to train the models. For self-supervised learning, the samples for the pretext and downstream tasks differ; in this case, augmentation is applied only to the training data of the downstream task. The test set is never augmented. A minimal augmentation sketch is given below.
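The sketch assumes slice-wise 2D transforms with SciPy and nearest-neighbour interpolation for the labels; the parameter ranges follow the list above, but the implementation details are our assumptions.

```python
# Minimal sketch of the rotation/scaling augmentation described above.
import numpy as np
from scipy.ndimage import rotate, zoom

def augment(volume, label, rng=np.random.default_rng()):
    angle = rng.uniform(20, 50) * rng.choice([-1, 1])          # +/- 20-50 degrees
    scale = rng.choice([rng.uniform(0.6, 0.9), rng.uniform(1.1, 1.5)])

    vol = rotate(volume, angle, axes=(0, 1), reshape=False, order=1)
    lab = rotate(label, angle, axes=(0, 1), reshape=False, order=0)

    vol = zoom(vol, (scale, scale, 1), order=1)
    lab = zoom(lab, (scale, scale, 1), order=0)
    # cropping or padding back to 128 x 128 in-plane would follow here
    return vol, lab
```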
3.5. Conducted Experiments
Three experiments are conducted in this paper. The first experiment compares the segmentation performance of large and small tumors, measured by the Dice score. The second experiment evaluates the Dice scores of various approaches used to improve the segmentation performance of small tumors. The third experiment computes the sensitivity and specificity of the best model obtained in the second experiment.
3.5.1. Experiment One
We will use the models introduced in
Section 3.3.1 and
Section 3.3.2 to perform 3-fold cross-validation on the datasets. The training and test datasets are the BraTS2021 and CHGH datasets (only using the 33 tumor cases). The training and testing dataset combinations are as follows:
Training: BraTS2021; Testing: BraTS2021.
Training: BraTS2021; Testing: CHGH.
Training: CHGH; Testing: CHGH.
Training: BraTS2021 + CHGH; Testing: CHGH.
When the BraTS2021 dataset is used for both training and testing, conventional 3-fold cross-validation is carried out. When CHGH serves as the test set, we first divide the CHGH dataset into three subsets, O1, O2, and O3, each containing 11 cases. We then select O1 as the test set, use the remaining subsets (where applicable) as part of the training data, conduct complete supervised learning, and evaluate on O1. This process is repeated for O2 and O3, as sketched below.
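In the sketch, the case identifiers and the train()/evaluate() helpers are hypothetical placeholders.

```python
# Sketch of the 3-fold protocol for the CHGH test sets (O1, O2, O3).
import numpy as np

cases = np.array([f"chgh_{i:03d}" for i in range(33)])   # hypothetical case IDs
rng = np.random.default_rng(seed=0)
rng.shuffle(cases)
folds = np.split(cases, 3)                                # O1, O2, O3 (11 cases each)

for k in range(3):
    test_cases = folds[k]
    train_cases = np.concatenate([folds[j] for j in range(3) if j != k])
    # Optionally prepend the BraTS2021 cases to train_cases for settings 2 and 4
    # model = train(train_cases); dice = evaluate(model, test_cases)
```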
3.5.2. Experiment Two
The BraTS2021 training set consists of 1251 patients, primarily with large and prominent tumors. In contrast, the CHGH dataset includes 33 cases, of which 18 have large tumors and the remaining 15 have small tumors. With such a small number of small-tumor cases, it is difficult to train a model to achieve higher performance. We will show that Swin UNETR has a higher performance in Experiment One. Therefore, this experiment only uses Swin UNETR. As the number of small-tumor cases is limited, we perform leave-one-out cross-validation to report the segmentation results. To find an effective segmentation method for small tumors, we study the following five methods:
Supervised learning: This method uses the BraTS2021 + CHGH_S data as the training set with conventional supervised learning. Recall that leave-one-out cross-validation is used for CHGH_S during testing. Therefore, all cases but one in CHGH_S are in the training set. This method does not use an augmented dataset during training, owing to the already large number of training samples and the limits of the experimental hardware when training on an augmented dataset of this size.
Supervised learning with data augmentation: The training set is CHGH_S with different augmentation rates to observe whether data augmentation is useful for this problem.
Transfer learning with parameter finetuning: Pre-train the Swin UNETR model using the BraTS2021 dataset with supervised learning, then use transfer learning to fine-tune the parameters with augmented CHGH_S cases, as shown in
Figure 5. Specifically, we first use the BraTS2021 training set to conduct complete supervised learning to obtain a pretrained model. We then use the weights of the pretrained model to fine-tune the model with augmented CHGH_S training data. For finetuning, it is important to select the appropriate number of training epochs. Therefore, we use 2, 4, 6, 8, and 10 epochs to fine-tune the model with four-times augmentation rates and select the number of epochs that perform best. With this epoch number, the performance of other augmentation rates is examined.
Transfer learning with layer freezing: This method also pre-trains a model using the BraTS2021 dataset. However, during fine-tuning, we only train the bottom two layers, freezing the weights of all other layers, as shown in
Figure 6. We hope that the top layers of the model can learn to extract features from the BraTS2021 dataset while allowing the bottom two layers to adapt to segment small tumors. This approach aims to enhance the model’s generalization performance.
Self-supervised learning: Pre-train the model using the BraTS2021 dataset with self-supervised learning (SSL), then use different augmentation rates on small-tumor cases for downstream training. The training procedure is shown in
Figure 7. Following the procedure of [
32,
33], we use the encoder (i.e., Swin Transformer) part of Swin UNETR to perform contrastive learning on the BraTS2021 dataset. We apply an inner cutout and outer cutout to generate contrastive images as the inputs to the pretext task.
Figure 5. Fine-tuning of the pretrained model in the experiments. The encoder part consists of blue blocks and the decoder part consists of the orange blocks.
Figure 6. Layer-freezing approach of transfer learning in the experiments. The encoder part consists of blue blocks and the decoder part consists of the orange blocks.
Figure 7. Self-supervised learning approach used in the experiments.
The inner cutout images are obtained by randomly cropping the interior of the image with sizes ranging from a minimum of 5 × 5 to a maximum of 32 × 32 and adding noise. The number of crops is 6.
The outer cutout images are obtained by randomly cropping the outer edges of the image with sizes ranging from a minimum of 20 × 20 to a maximum of 64 × 64 and adding noise. The number of crops is also 6.
The resulting images, referred to as Aug 1 and Aug 2 in
Figure 7, are the inputs to the Swin Transformer model, producing reconstructed images Recon 1 and Recon 2. We calculate the contrastive loss between Recon 1 and Recon 2, as well as the L1 Loss between Recon 2 and the original, un-augmented data. The combined loss from the contrastive loss and L1 Loss is used to update the model’s weights. For the downstream task, only augmented data from small-tumor samples are used in training. During the downstream training, the encoder parameters are fixed and only the decoder parameters are trainable.
3.5.3. Experiment Three
We use the best-performing model from Experiment Two to predict cases with and without small tumors, evaluating the sensitivity and specificity of the trained model.
5. Discussions and Future Directions
The proposed approach's sensitivity of 100% ensures that all tumors are correctly detected. High sensitivity is expected, as the model was trained to perform segmentation, which requires accurately locating tumors. However, the low specificity of 54.5% may not be sufficient for many applications. In a typical situation, a threshold determined by, say, Youden's index can trade sensitivity for specificity. Determining such a threshold requires a validation set kept separate from the test set, so that no data (or knowledge) leakage occurs during the experiments; that is, no information from the test set is used during the testing phase. Unfortunately, in our case, it is difficult to carve a validation set out of the already small dataset.
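For reference, selecting such a threshold from a hypothetical validation set could look like the following sketch; y_true and y_score are placeholder per-case labels and tumor scores (e.g., predicted tumor volume).

```python
# Sketch of choosing a detection threshold by Youden's index on a validation set.
import numpy as np
from sklearn.metrics import roc_curve

def youden_threshold(y_true, y_score):
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    j = tpr - fpr                      # Youden's J = sensitivity + specificity - 1
    return thresholds[np.argmax(j)]
```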
The low specificity of 54.5% is partly due to some cases showing white spots (white matter hyperintensities) on the MRI slices. Excluding these subjects increases the specificity to 80%, which could be sufficient for screening purposes. Currently, we are investigating the use of PET-CT (positron emission tomography–computed tomography) images in conjunction with the MRI of the same subject to improve specificity. Still, the presented experiments confirm the feasibility of detecting small tumors with the proposed method using a small dataset.
In this paper, our dataset is relatively small. Since CHGH has many more cases available for study, expanding the experimental dataset would not be extremely difficult. However, our intention is to emulate the limited resources of a medium-scale hospital. Therefore, we decided to use a small dataset for training and testing, with cases randomly selected to include both large and small tumors. The advantage of training a model with an in-house dataset, despite its small size, cannot be matched by an openly available dataset such as BraTS 2021; relying only on such a dataset may lead to over-optimistic expectations of performance in real-world cases. For example, a model trained with the BraTS dataset reaches a Dice score of 0.90 for segmenting its own samples (in Experiment One), closely matching the latest works that also use BraTS, such as [
17,
45,
46]. Thus, the model is successfully trained. However, when this model is used to segment small tumors, the score drops to 0.0332, indicating that it has little hope of any useful medical application.
In our experiments, the model using the proposed self-supervised learning approach improved the Dice score to 0.1882. While this score is still low, the trained model is nevertheless useful, as demonstrated in the detection experiment with acceptable results. Therefore, the model could be used for the initial screening of MRI scans to detect tumors and pinpoint their locations, reducing the workload of neurologists.
The feasibility of using a small in-house MRI dataset for brain tumor detection and segmentation also opens up the possibility of exploring other types of diseases, such as stroke. There are open MRI datasets for stroke lesion segmentation, such as [
47]. By using a similar method presented in this paper, a hospital may develop a model tailored to a specific application on stroke with a small in-house dataset.
It is also important to note that the Dice score tends to be lower for small objects. The Dice score is calculated as follows:
Dice = 2|TP| / (2|TP| + |FP| + |FN|),   (4)
where TP is the region of true positives, FP is the region of false positives, FN is the region of false negatives, and |·| denotes the area (pixel count) of a region.
To illustrate that the Dice score favors larger regions over smaller ones, consider the following hypothetical example in
Figure 12: a task to segment a small region of size 3 × 3 and a large region of size 5 × 5. Due to technical reasons (such as labeling bias, misalignment, etc.), the segmented regions are offset by one pixel in both the horizontal and vertical directions. According to (4), the Dice score for the small region, whose overlap is a 2 × 2 region, is
Dice = (2 × 4) / (2 × 4 + 5 + 5) = 8/18 ≈ 0.44.
On the other hand, the Dice score for the large region, whose overlap is a 4 × 4 region, is
Dice = (2 × 16) / (2 × 16 + 9 + 9) = 32/50 = 0.64.
If the region is larger, the score is even higher. In practical cases, the labeled ground truth rarely coincides exactly with the tumor. Offsetting by one or two pixels in each direction is common. In our case, some small tumors have areas of less than 1 cm² in a slice. As we resample the images to a size of 128 × 128, an offset of a couple of pixels in segmentation (or annotation) significantly affects the Dice score, making a low Dice score reasonable.
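The two Dice values of the offset-square example above can be verified numerically with a few lines of code:

```python
# Numeric check of the offset-square example: a 3x3 and a 5x5 square, each
# shifted by one pixel in both directions relative to the ground truth.
import numpy as np

def dice(gt, pred):
    tp = np.logical_and(gt, pred).sum()
    return 2 * tp / (gt.sum() + pred.sum())

for size in (3, 5):
    gt = np.zeros((10, 10), bool); gt[1:1 + size, 1:1 + size] = True
    pred = np.zeros((10, 10), bool); pred[2:2 + size, 2:2 + size] = True
    print(size, round(dice(gt, pred), 2))   # 3 -> 0.44, 5 -> 0.64
```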
In the future, we plan to implement a two-step approach to improve the Dice score. The first step involves cutting a portion of an MRI slice containing the region of interest (ROI). The second step is to enlarge the ROI portion to increase the number of pixels representing the small tumors. A similar approach for lung cancer is used in [
48]. The segmented ground truth should be based on the enlarged ROI. This method should be able to increase the Dice score.
As mentioned in the introduction, detecting small tumors and continuously monitoring their growth rate is beneficial. Therefore, another future direction would be to develop an algorithm to automatically compute the growth rate of tumors based on several sets of MRIs taken on different dates.
6. Conclusions
This paper studies the segmentation and detection of small metastatic brain tumors using a small set of MRI scans. The available literature primarily focuses on segmenting or detecting large primary brain tumors with a large number of training cases. Our study closely resembles a real scenario that a medium-scale hospital may face: a wide variety of tumor sizes and types (primary or metastatic), but insufficient resources to create a large in-house dataset.
When comparing the performance of the U-Net and Swin UNETR models, the latter performs better in tumor segmentation tasks. Even so, our experimental results show that using the BraTS 2021 dataset to train a Swin UNETR model still yields poor performance in segmenting small tumors. To address this, we use self-supervised learning, which has two training phases: pretext task training and downstream task training. By using the BraTS dataset for pretext task training, we can leverage the large number of samples in the dataset to aid the downstream task. Overall, self-supervised learning with data augmentation improves the Dice score from 0.0332 to 0.1882.
When directly using the self-supervised model, designed for segmentation, to perform small tumor detection, experimental results show good sensitivity (100%) and acceptable specificity values (80% if excluding cases with white matter hyperintensities). Overall, the experiments confirm the feasibility of segmenting and detecting small tumors with only a small number of in-house samples. To further improve the model’s specificity for subjects with white matter hyperintensities, additional information beyond MRI scans is needed in the detection model.