1. Introduction
Gliomas are the most common primary brain tumors in adults, accounting for 70% of all malignant primary brain tumors [1]. They can be classified as high-grade gliomas (HGGs) and low-grade gliomas (LGGs), with HGGs being more aggressive and invasive than LGGs [2]. Magnetic Resonance Imaging (MRI) is commonly used for the diagnosis and treatment planning of brain tumors owing to its high resolution, soft tissue contrast, and non-invasive nature [3]. For gliomas, four MRI modalities (T1, T1ce, T2, and FLAIR) are typically employed. Multimodal imaging facilitates the capture of a wide range of histopathological parameters, effectively reducing informational uncertainty and enhancing the precision of clinical diagnoses [4].
Identifying gliomas in MRI is crucial for clinical diagnosis and the formulation of treatment plans. However, the traditional segmentation process, which involves manually inspecting MRI volumes slice by slice, is time-consuming and depends heavily on the experience of the radiologist [5]. Moreover, the substantial individual variability in tumor location, size, shape, margins, and density makes it challenging to manually distinguish gliomas in MRI data [2]. Additionally, morphological uncertainty complicates the process: the outer layers of brain tumors consist of edematous tissue, making the edges of the tissue surrounding the tumor ambiguous and the tumor contours difficult to define [1].
These challenges have prompted research on automatic brain tumor segmentation. With the advancement of deep learning, numerous studies have explored deep learning-based methods. In particular, the introduction of U-Net [6], a U-shaped convolutional network, has spurred a variety of brain tumor segmentation studies using convolutional neural networks (CNNs). Initially, two-dimensional (2D) convolution was predominant, treating three-dimensional (3D) MRI data as stacks of 2D slices. Previous studies proposed cascade structures or multitask methods that leverage 2D convolution for precise brain tumor segmentation [7,8,9,10,11]. However, these 2D-based models often lose crucial contextual information present in volumetric data and face challenges in accurately distinguishing pathological features as the volume of data increases. Recently, numerous methods that employ 3D data directly have been explored, leveraging 3D convolution to enhance segmentation performance [12,13,14]. Nonetheless, relying on 3D CNNs alone has limitations due to variability in the size and location of brain tumors. To address these limitations, studies have proposed atrous convolution [15] and hierarchical feature pyramid structures to detect tumors of varying sizes and exploit multi-scale features [16,17], and attention mechanisms have been integrated to concentrate on the areas where tumors are present [18,19,20,21].
However, despite significant advancements in automatic brain tumor segmentation research, researchers have often overlooked the full potential of the unique characteristics of 3D MRI data. While 3D-based models align more closely with the intrinsic volumetric nature of MRI data, processing 3D data with standard isotropic convolution kernels fails to fully exploit the unique characteristics of MRI. MRI inherently comprises three dimensions (axial, coronal, and sagittal), each providing unique and critical perspectives on brain anatomy and pathology, as illustrated in Figure 1. Focusing on these three dimensions is essential because each offers a different view of the brain's anatomy, revealing different aspects of tumors, such as their spread, volume, and interaction with surrounding tissues [22,23]. This highlights the need for a novel approach that not only preserves the rich contextual and spatial nuances inherent in MRI data but also enhances tumor segmentation precision by deepening the understanding of the multi-axis structure of MRI data.
In this paper, we propose the Tri-Axis based Context-Aware Reverse Network (TACA-RNet), a novel approach that explicitly considers the axial, coronal, and sagittal perspectives of MRI data. The method addresses the aforementioned challenges by leveraging the unique 3D spatial orientations inherent in MRI data, aiming to significantly enhance the accuracy and precision of brain tumor segmentation through a comprehensive understanding and utilization of the volumetric information provided by MRI. Our approach was validated using the Brain Tumor Segmentation Challenge (BraTS) 2018 and 2020 datasets [24,25,26]. The experimental results demonstrate that the proposed TACA-RNet outperforms other recent networks.
The main contributions of this research are summarized as follows:
We introduce the TACA-RNet, a novel framework specifically designed to leverage the axial, coronal, and sagittal MRI directions. This approach enables a deeper understanding of the complex spatial relationships inherent in volumetric MRI data.
Our approach integrates three specialized modules: a Tri-Axis Channel Reduction module (TACR), which targets dimension reduction and feature enhancement across MRI’s axial, coronal, and sagittal planes; a MultiScale Contextual Fusion module (MSCF), which integrates features from multiple scales to enhance spatial discernment; and a 3D Axis Reverse Attention module (ARA), which concentrates on essential details for precise tumor segmentation.
We evaluated the efficiency of the proposed network using the BraTS 2018 and 2020 datasets. The results demonstrate that our approach achieves superior segmentation performance, outperforming other recent CNN methodologies.
The remainder of this paper is organized as follows. Section 2 reviews existing research related to our study. Section 3 details the method and design of this study. Section 4 describes the datasets, preprocessing steps, evaluation metrics, and experimental configurations, followed by comparison and ablation experiments. Finally, Section 5 presents our conclusions.
3. Proposed Method
In this section, we present the overall network framework. We then introduce the designed components: the TACR, partial decoder (PD), Multi-Resolution Fusion (MRF), MSCF, and 3D ARA modules.
3.1. Overview of Network
As shown in Figure 2, the proposed network consists of six main components: an encoder composed of convolution blocks, a TACR module, a PD, MRF, an MSCF module, and a 3D ARA module.
The encoder, composed of convolution blocks, extracts high-dimensional semantic features related to gliomas. Each convolution block within the encoder consists of group normalization, a convolution layer, and a ReLU activation function. This setup is used to classify and analyze the sub-level local pixel values of gliomas on MRI. We consider only the high-level features $\{f_i,\ i = 3, 4, 5\}$ among the features $\{f_i,\ i = 1, \ldots, 5\}$ extracted from the encoder, because low-level features demand more computational resources owing to their larger spatial resolution, yet contribute less to performance [30]. To optimally leverage the unique 3D characteristics of the MRI data, including their axial, coronal, and sagittal orientations, we integrate the TACR module. This addition replaces the conventional Receptive Field Block, previously situated before the PD [30], with a mechanism tailored to the three-axis structure of MRI. The PD then generates a global map that serves as an initial guide. In parallel, to preserve the details of tumors of varying sizes while minimizing irrelevant information, our approach incorporates MRF. The MSCF module is then employed to integrate global contextual information and accommodate the diverse resolutions inherent in the volumetric data. Finally, guided by the features generated by the PD, the 3D ARA module transforms the details suppressed during downsampling into emphasized features, thereby enhancing the focus on locally important information. The network details of the TACA-RNet are provided in Appendix A.
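To make the block structure concrete, the following is a minimal PyTorch sketch of one encoder convolution block (group normalization, convolution, ReLU). The 3 × 3 × 3 kernel and the group count of 8 are illustrative assumptions; the exact hyperparameters are listed in Appendix A.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Encoder convolution block: GroupNorm -> Conv3d -> ReLU.

    The kernel size (3x3x3) and group count (8) are assumptions for
    illustration; `in_ch` must be divisible by `groups`.
    """
    def __init__(self, in_ch: int, out_ch: int, groups: int = 8):
        super().__init__()
        self.norm = nn.GroupNorm(groups, in_ch)
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.conv(self.norm(x)))
```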
3.2. Tri-Axis Channel Reduction Module
The TACR module refines the high-level feature maps $\{f_i,\ i = 3, 4, 5\}$ extracted by the encoder, directly addressing the unique 3D spatial orientations (axial, coronal, and sagittal) found in MRI data. This module decreases dimensionality while accentuating the salient features that are vital for accurate tumor segmentation.
As illustrated in Figure 3, the TACR module comprises four branches, each initially applying a convolution to adjust the number of channels. The first branch, denoted as $B_0$, uses a standard convolution to capture a broad spectrum of features across the data, establishing a baseline for feature extraction. The subsequent branches, denoted as $B_a$ for axial, $B_c$ for coronal, and $B_s$ for sagittal, employ specialized convolutions with axis-oriented kernel shapes to capture features pertinent to each orientation. After processing through branches $B_a$, $B_c$, and $B_s$, the outputs are concatenated, merging the unique spatial features captured in each direction (axial, coronal, and sagittal):

$t_{\mathrm{cat}} = C\big(B_a(f_i),\ B_c(f_i),\ B_s(f_i)\big),$

where $C$ denotes the concatenation operation. Following feature integration, an SE block [29], denoted as $\mathrm{SE}$, dynamically recalibrates channel-specific responses, enhancing important features and diminishing less relevant ones through global information analysis. Subsequently, the output of $B_0$ and the output enhanced by the SE block are unified, creating a feature map that incorporates both wide-ranging and orientation-specific characteristics. Finally, a shortcut connection integrates the initial input with this unified output, enhancing learning and feature representation while preventing information loss and gradient dissipation:

$t_i = \mathrm{Conv}\big(C\big(B_0(f_i),\ \mathrm{SE}(t_{\mathrm{cat}})\big)\big) + f_i.$

Additionally, to reduce model complexity, the channel count of the output of each TACR module is reduced to 32.
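As an illustration of this design, the sketch below implements the TACR pattern in PyTorch. The axis-oriented kernel shapes, the SE reduction ratio, and the channel-matching shortcut projection are assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation channel recalibration (Hu et al. [29])."""
    def __init__(self, ch: int, r: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(ch, ch // r), nn.ReLU(inplace=True),
            nn.Linear(ch // r, ch), nn.Sigmoid())

    def forward(self, x):
        w = x.mean(dim=(2, 3, 4))                 # global average pool -> (N, C)
        return x * self.fc(w)[:, :, None, None, None]

class TACR(nn.Module):
    """Tri-Axis Channel Reduction (sketch).

    The axis-specific kernel shapes below are assumptions for
    illustration; the paper only states that each branch uses a
    kernel oriented along one MRI axis.
    """
    def __init__(self, in_ch: int, out_ch: int = 32):
        super().__init__()
        def branch(kernel, pad):
            return nn.Sequential(
                nn.Conv3d(in_ch, out_ch, 1),              # channel reduction
                nn.Conv3d(out_ch, out_ch, kernel, padding=pad))
        self.b0 = nn.Sequential(nn.Conv3d(in_ch, out_ch, 1),
                                nn.Conv3d(out_ch, out_ch, 3, padding=1))
        self.ba = branch((3, 1, 1), (1, 0, 0))    # axial-oriented kernel
        self.bc = branch((1, 3, 1), (0, 1, 0))    # coronal-oriented kernel
        self.bs = branch((1, 1, 3), (0, 0, 1))    # sagittal-oriented kernel
        self.se = SEBlock(3 * out_ch)
        self.fuse = nn.Conv3d(4 * out_ch, out_ch, 1)
        self.short = nn.Conv3d(in_ch, out_ch, 1)  # assumed projection so the
                                                  # shortcut channels match

    def forward(self, f):
        t_cat = torch.cat([self.ba(f), self.bc(f), self.bs(f)], dim=1)
        unified = self.fuse(torch.cat([self.b0(f), self.se(t_cat)], dim=1))
        return self.short(f) + unified            # shortcut connection
```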
3.3. Partial Decoder
As mentioned in Section 3.1, we incorporate a PD [30] to significantly reduce information loss by enhancing the detail capture capability, particularly in the context of the encoder's downsampling process. This decoder processes only the high-level features $\{t_i,\ i = 3, 4, 5\}$ obtained from the TACR module. For the highest feature layer, we directly use the feature from the corresponding layer, setting $tc_5 = t_5$. For the remaining layers, each feature $t_i$ is updated by multiplying it element-wise with the deeper-layer features:

$tc_i = t_i \odot \prod_{j=i+1}^{5} \mathrm{Conv}\big(\mathrm{Up}(t_j;\ 2^{\,j-i})\big), \quad i \in \{3, 4\},$

where $\mathrm{Up}(\cdot\,;\ 2^{\,j-i})$ represents upsampling by a factor of $2^{\,j-i}$, $\mathrm{Conv}$ denotes a convolutional layer, and $\odot$ denotes element-wise multiplication. To integrate the multi-level features $\{tc_i,\ i = 3, 4, 5\}$, we employ an upsampling and concatenating strategy:

$S_g = \mathrm{Conv}_g\Big(\mathrm{Conv}\big(C\big(\mathrm{Up}\big(\mathrm{Conv}(C(\mathrm{Up}(tc_5),\ tc_4))\big),\ tc_3\big)\big)\Big),$

where $\mathrm{Up}$ denotes an upsampling operation that doubles the spatial dimensions through transposed convolution, $\mathrm{Conv}$ is a convolutional layer, $C$ represents the concatenation operation, and $\mathrm{Conv}_g$ is the final convolutional layer. This strategic use of high-level features via parallel connections in the partial decoder efficiently generates a global map $S_g$, which effectively guides the accurate determination of tumor shape, location, and size, enhancing the model's precision with optimized computational efficiency.
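A minimal sketch of this aggregation, following the cascaded partial-decoder pattern of [30], is given below; the refinement-convolution kernel sizes, channel widths, and output channel count are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def up(x: torch.Tensor, scale: int) -> torch.Tensor:
    """Trilinear upsampling by an integer factor."""
    return F.interpolate(x, scale_factor=scale, mode="trilinear",
                         align_corners=False)

class PartialDecoder(nn.Module):
    """Partial decoder over the three high-level features (sketch)."""
    def __init__(self, ch: int = 32, n_classes: int = 3):
        super().__init__()
        self.c54 = nn.Conv3d(ch, ch, 3, padding=1)   # refine t5 for level 4
        self.c53 = nn.Conv3d(ch, ch, 3, padding=1)   # refine t5 for level 3
        self.c43 = nn.Conv3d(ch, ch, 3, padding=1)   # refine t4 for level 3
        self.agg4 = nn.Conv3d(2 * ch, ch, 3, padding=1)
        self.agg3 = nn.Conv3d(2 * ch, ch, 3, padding=1)
        self.head = nn.Conv3d(ch, n_classes, 1)      # produces global map S_g

    def forward(self, t3, t4, t5):
        tc5 = t5                                     # highest layer kept as-is
        tc4 = t4 * self.c54(up(t5, 2))               # element-wise refinement
        tc3 = t3 * self.c53(up(t5, 4)) * self.c43(up(t4, 2))
        x = self.agg4(torch.cat([up(tc5, 2), tc4], dim=1))
        x = self.agg3(torch.cat([up(x, 2), tc3], dim=1))
        return self.head(x)                          # global map S_g
```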
3.4. Multi-Resolution Fusion
The MRF process utilizes the high-level features $\{t_i,\ i = 3, 4, 5\}$ obtained from the TACR module, scaling them up or down to match each layer's resolution. Following this adjustment, the features of the three distinct resolutions are made compatible for concatenation by aligning them with the scale of each layer:

$m_3 = C\big(t_3,\ \mathrm{Up}(t_4),\ \mathrm{Up}(\mathrm{Up}(t_5))\big),$

$m_4 = C\big(\mathrm{Down}(t_3),\ t_4,\ \mathrm{Up}(t_5)\big),$

$m_5 = C\big(\mathrm{Down}(\mathrm{Down}(t_3)),\ \mathrm{Down}(t_4),\ t_5\big).$

In these equations, $\mathrm{Up}$ represents an upsampling operation that doubles the spatial dimensions using transposed convolution, $\mathrm{Down}$ denotes a downscaling operation that halves the spatial dimensions using strided convolution, and $C$ indicates the concatenation operation. By harmonizing the features across different resolution layers, MRF enables the integration of diverse spatial information, thereby enhancing the segmentation capabilities of the network.
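The following sketch mirrors these equations; reusing a single transposed convolution for repeated upsampling and the channel width of 32 are simplifying assumptions.

```python
import torch
import torch.nn as nn

class MRF(nn.Module):
    """Multi-Resolution Fusion (sketch): align t3/t4/t5 to each scale
    and concatenate. Transposed-conv up and strided-conv down follow
    the paper's description; the channel width is an assumption."""
    def __init__(self, ch: int = 32):
        super().__init__()
        self.up = nn.ConvTranspose3d(ch, ch, 2, stride=2)       # x2 upsample
        self.down = nn.Conv3d(ch, ch, 3, stride=2, padding=1)   # /2 downsample

    def forward(self, t3, t4, t5):
        m3 = torch.cat([t3, self.up(t4), self.up(self.up(t5))], dim=1)
        m4 = torch.cat([self.down(t3), t4, self.up(t5)], dim=1)
        m5 = torch.cat([self.down(self.down(t3)), self.down(t4), t5], dim=1)
        return m3, m4, m5
```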
3.5. MultiScale Contextual Fusion Module
To enhance the network's ability to accurately segment brain tumors, we introduce the MSCF module to reflect the complexity and variability of tumor sizes and their spatial distribution in the MRI data. This module, inspired by Atrous Spatial Pyramid Pooling (ASPP) [31] and the Contextual Feature Pyramid (CFP) [32], captures and integrates contextual information at multiple scales through a hybrid approach.

As shown in Figure 4, the MSCF module employs a combination of ASPP and CFP to comprehensively represent the spatial and contextual details necessary for accurately identifying tumor boundaries. The ASPP component employs atrous convolution operations with a set of dilation rates, enabling the network to extract features from a wide range of receptive fields and to capture both local and global contextual information without a loss of resolution. Concurrently, the CFP component employs a series of convolutions with an increasing sequence of dilation rates, progressively capturing larger contextual features and thereby enhancing the network's ability to discern spatial relationships at various scales. Varying levels of 3D padding are employed in the ASPP and CFP components as necessary to ensure that the outputs have compatible resolutions for concatenation. After applying ASPP and CFP, the features are concatenated to form a comprehensive, multi-level feature map, which is then adjusted to the appropriate channel count using a convolution.
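A sketch of this hybrid design is shown below; the dilation rates (1, 2, 4) for the ASPP branch and (1, 2, 3) for the CFP chain are placeholders, not the paper's exact values.

```python
import torch
import torch.nn as nn

class MSCF(nn.Module):
    """MultiScale Contextual Fusion (sketch).

    Combines parallel atrous branches (ASPP-style) with a sequential
    chain of increasing dilations (CFP-style). Dilation rates are
    assumed for illustration.
    """
    def __init__(self, in_ch: int, out_ch: int = 32):
        super().__init__()
        # ASPP: parallel dilated convolutions over the same input.
        self.aspp = nn.ModuleList([
            nn.Conv3d(in_ch, out_ch, 3, padding=d, dilation=d)
            for d in (1, 2, 4)])
        # CFP: sequential dilated convolutions with growing rates.
        self.cfp = nn.ModuleList([
            nn.Conv3d(in_ch if i == 0 else out_ch, out_ch, 3,
                      padding=d, dilation=d)
            for i, d in enumerate((1, 2, 3))])
        self.proj = nn.Conv3d(4 * out_ch, out_ch, 1)   # channel adjustment

    def forward(self, x):
        aspp_feats = [conv(x) for conv in self.aspp]
        y = x
        for conv in self.cfp:                  # progressively larger context
            y = torch.relu(conv(y))
        return self.proj(torch.cat(aspp_feats + [y], dim=1))
```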
3.6. 3D Axis Reverse Attention Module
The 3D ARA module, designed to capture high-resolution details critical for delineating tumor boundaries, is closely aligned with the intrinsic characteristics of the MRI data, particularly considering its axial, coronal, and sagittal orientations. While the MSCF module effectively identifies tumor regions across various scales, it may lack the precision needed for the fine-grained delineation of tumor margins. The 3D ARA module complements this by focusing on the critical details to ensure more accurate segmentation outputs.
As shown in Figure 5, the 3D ARA module comprises two complementary mechanisms: axis attention (AA) and reverse attention (RA). The AA mechanism divides the 3D space into 2D planes, applying three separate 2D attentions across the dimensions of height, width, and depth. To facilitate this, the input is restructured into a 2D format in which one dimension corresponds to the axis of interest (height, width, or depth) and the other combines the remaining two spatial dimensions. For instance, for the height attention $A_h$, width and depth are merged into a single dimension while height is kept separate, tailoring the input to the requirements of 2D attention. This approach mirrors the unique anatomical orientations found in MRI data: the height dimension within the axial plane, the width dimension in the coronal plane, and the depth dimension through the sagittal plane. The AA mechanism combines the three attention outputs:

$\mathrm{AA}_i = A_h(s_i) + A_w(s_i) + A_d(s_i),$

where $A_h$, $A_w$, and $A_d$ were specifically designed to process features corresponding to height, width, and depth, respectively. For instance, $A_h$ focuses attention within each individual height plane, operating across the 2D planes of width and depth. The variable $s_i$ represents the output of the MSCF module for layer $i$. Following the AA phase, the RA mechanism reclaims and emphasizes features that may have been overshadowed during the initial focusing process:

$\mathrm{RA}_i = \mathbf{1} \ominus \sigma\big(\mathrm{Up}(S_{i+1})\big),$

where $\mathrm{Up}$ denotes an upsampling function that enhances the resolution of the feature map, $\sigma$ applies sigmoid activation for a non-linear effect, and $\ominus$ represents the subtraction of this activated output from a unitary matrix. The variable $S_{i+1}$ indicates the output generated from the preceding processing stage in the cascade structure, with $i = 3, 4$, and $5$ indicating the sequence of each layer. Notably, $S_6$ is designated as the global map $S_g$. The culmination of the 3D ARA process integrates the AA and RA mechanisms to produce a refined feature map that accurately delineates the tumor boundaries. The final representation of this process, which combines the focused and refocused features, is expressed as

$S_i = \mathrm{AA}_i \odot \mathrm{RA}_i,$

where $\odot$ denotes element-wise multiplication, merging the AA and RA maps to form a feature map that is rich in detail and effectively captures the true edges of the tumors. This mechanism of the 3D ARA module, emphasizing the height, width, and depth attentions in alignment with MRI's axial, coronal, and sagittal orientations, not only identifies the general area of the tumor but also maps its precise contours.
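The sketch below illustrates both mechanisms. The RA function follows the formula above, while the sigmoid-gated convolution inside each axis-attention branch is an assumed stand-in for the paper's 2D attention design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def reverse_attention(s_next: torch.Tensor, size) -> torch.Tensor:
    """RA_i = 1 - sigmoid(Up(S_{i+1})): highlight the regions that the
    coarser prediction suppressed, so the current layer can refine them."""
    upsampled = F.interpolate(s_next, size=size, mode="trilinear",
                              align_corners=False)
    return 1.0 - torch.sigmoid(upsampled)

class AxisAttention2D(nn.Module):
    """One axis of the AA mechanism (sketch): keep one spatial axis,
    merge the other two, and attend with a sigmoid-gated 2D convolution.
    The exact 2D attention design is an assumption."""
    def __init__(self, ch: int, axis: int):
        super().__init__()
        self.axis = axis                      # 2 = depth, 3 = height, 4 = width
        self.conv = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, x):
        n, c = x.shape[:2]
        perm = x.movedim(self.axis, 2)        # (n, c, kept, a, b)
        kept = perm.shape[2]
        flat = perm.reshape(n, c, kept, -1)   # 2D map: kept x (a*b)
        attn = torch.sigmoid(self.conv(flat))
        out = (flat * attn).reshape(perm.shape).movedim(2, self.axis)
        return out
```

Under this sketch, $\mathrm{AA}_i$ would be formed by summing the three axis outputs, and $S_i$ by multiplying the result element-wise with $\mathrm{RA}_i$.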
3.7. Deep Supervision
To ensure thorough learning across various abstraction levels and improve segmentation reliability, our network employs a deep supervision strategy: supervision is applied both to the global map $S_g$ and to the outputs of the $i$-th layer, $S_i$. To calculate the loss, each output is upsampled to the same size as the ground truth. The final loss can therefore be expressed as

$\mathcal{L}_{\mathrm{total}} = \lambda_g\, \mathcal{L}\big(\mathrm{Up}(S_g), G\big) + \sum_{i=3}^{5} \lambda_i\, \mathcal{L}\big(\mathrm{Up}(S_i), G\big),$

where $\lambda_g$ and $\lambda_i$ are the weights for the output of each layer, with $\lambda_g = 0.3$ and $\lambda_3, \lambda_4, \lambda_5 = 1, 0.5,$ and $0.4$, respectively. While deep supervision is employed during training to enhance learning at multiple levels, the final output of the network is $S_3$, which is optimized for the most detailed and accurate segmentation results.
Building on this foundation, our loss function is defined as the sum of the soft Dice loss [27] and the binary cross-entropy (BCE) loss [33], calculated in a voxel-wise manner as follows:

$\mathcal{L} = \mathcal{L}_{\mathrm{Dice}} + \mathcal{L}_{\mathrm{BCE}},$

$\mathcal{L}_{\mathrm{Dice}} = 1 - \frac{1}{J} \sum_{j=1}^{J} \frac{2 \sum_{i=1}^{I} G_{ij} P_{ij}}{\sum_{i=1}^{I} G_{ij} + \sum_{i=1}^{I} P_{ij}},$

$\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{IJ} \sum_{j=1}^{J} \sum_{i=1}^{I} \big[ G_{ij} \log P_{ij} + (1 - G_{ij}) \log (1 - P_{ij}) \big],$

where $G$ and $P$ are the ground truth and predicted values, respectively, and $I$ and $J$ are the numbers of voxels and classes, respectively.
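A compact sketch of this training objective is given below; the logit/mask tensor layout is an assumption, while the supervision weights follow the values stated above.

```python
import torch
import torch.nn.functional as F

def dice_bce_loss(pred: torch.Tensor, target: torch.Tensor,
                  eps: float = 1e-5) -> torch.Tensor:
    """Soft Dice + BCE, voxel-wise. `pred` holds logits of shape
    (N, J, D, H, W); `target` is a same-shaped binary float mask."""
    prob = torch.sigmoid(pred)
    dims = (0, 2, 3, 4)                        # sum over voxels, per class
    inter = (prob * target).sum(dim=dims)
    denom = prob.sum(dim=dims) + target.sum(dim=dims)
    dice = 1.0 - ((2.0 * inter + eps) / (denom + eps)).mean()
    bce = F.binary_cross_entropy_with_logits(pred, target)
    return dice + bce

def deep_supervision_loss(outputs: dict, target: torch.Tensor) -> torch.Tensor:
    """Weighted sum over the global map and the layer outputs S_3..S_5.

    `outputs` maps names to logits already upsampled to the label size;
    the weights follow the paper (0.3 for S_g; 1, 0.5, 0.4 for S_3..S_5).
    """
    weights = {"S_g": 0.3, "S_3": 1.0, "S_4": 0.5, "S_5": 0.4}
    return sum(w * dice_bce_loss(outputs[k], target)
               for k, w in weights.items())
```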
4. Experimental Results
4.1. Datasets
The datasets used in this study were sourced from the BraTS 2018 and 2020 challenges [24,25,26]. The BraTS 2018 and 2020 training sets consist of 285 and 369 cases, and the validation sets of 66 and 125 cases, respectively. The training sets of both datasets comprise 3D MRI data from four modalities (T1, T1ce, T2, and FLAIR), including voxel-wise ground truth labels. The labels distinguish the necrotic and non-enhancing tumor core (NCR/NET), peritumoral edema (ED), enhancing tumor (ET), and non-tumor regions. The four modalities play complementary roles in segmenting the different brain tumor regions (ED, NCR/NET, and ET). An example of multimodal MRI for brain tumor segmentation is shown in Figure 6. Segmentation accuracy was assessed using the Dice score and 95% Hausdorff Distance metrics for specific tumor regions: the enhancing tumor (ET), the whole tumor (WT, comprising NCR/NET, ED, and ET), and the tumor core (TC, comprising NCR/NET and ET). The volume of each MRI scan is 240 × 240 × 155 voxels. Labels for the validation data are not provided; hence, all segmentation results were evaluated through the Center for Biomedical Image Computing and Analytics (CBICA) Image Processing Portal (https://ipp.cbica.upenn.edu/ (accessed on 20 January 2024)).
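For reference, the composite regions can be derived from a label volume as follows, assuming the standard BraTS label encoding (1 = NCR/NET, 2 = ED, 4 = ET):

```python
import numpy as np

def brats_region_masks(label: np.ndarray) -> dict:
    """Map a BraTS label volume to the three evaluated regions.
    Assumes the standard encoding: 1 = NCR/NET, 2 = ED, 4 = ET."""
    return {
        "ET": label == 4,                 # enhancing tumor
        "TC": np.isin(label, (1, 4)),     # tumor core = NCR/NET + ET
        "WT": np.isin(label, (1, 2, 4)),  # whole tumor = all tumor labels
    }
```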
4.2. Implementation Details
The proposed model was implemented in PyTorch and trained on four GeForce RTX 3090 GPUs for 500 epochs with a batch size of 16. We used the AdamW optimizer with a learning rate of 0.001 and weight decay. The input to the network is represented as $C \times H \times W \times D$, where $C$ is the number of channels (modalities), $H \times W$ is the spatial resolution, and $D$ is the depth (number of slices). The raw resolution of the images is 240 × 240 × 155; these were reduced to the network input size via random cropping. We applied z-score normalization to standardize the input data, together with two data augmentation techniques: random mirror flipping and random intensity shifting. For inference, we used a sliding-window approach with an overlap of 0.5 between neighboring windows.
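A minimal sketch of this inference scheme follows; the 128 × 128 × 128 window size and the averaging of overlapping predictions are assumptions, while the 0.5 overlap follows the description above.

```python
import torch

def _positions(dim: int, win: int, stride: int):
    """Window start indices covering [0, dim), including the final edge."""
    last = max(dim - win, 0)
    ps = list(range(0, last + 1, stride))
    if ps[-1] != last:
        ps.append(last)
    return ps

@torch.no_grad()
def sliding_window_infer(model, volume, roi=(128, 128, 128), overlap=0.5):
    """Sliding-window inference, averaging overlapping predictions.
    `volume` is (1, C, D, H, W); the ROI size is an assumed window shape."""
    _, _, D, H, W = volume.shape
    strides = [max(1, int(r * (1 - overlap))) for r in roi]
    out = count = None
    for z in _positions(D, roi[0], strides[0]):
        for y in _positions(H, roi[1], strides[1]):
            for x in _positions(W, roi[2], strides[2]):
                patch = volume[..., z:z + roi[0], y:y + roi[1], x:x + roi[2]]
                pred = model(patch)
                if out is None:
                    out = volume.new_zeros(1, pred.shape[1], D, H, W)
                    count = volume.new_zeros(1, 1, D, H, W)
                out[..., z:z + roi[0], y:y + roi[1], x:x + roi[2]] += pred
                count[..., z:z + roi[0], y:y + roi[1], x:x + roi[2]] += 1
    return out / count
```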
4.3. Evaluation Metrics
Two widely used evaluation metrics in brain tumor segmentation, the Dice score and the 95% Hausdorff Distance, were employed to evaluate the proposed model. Both metrics assess the similarity between the predicted results and the ground truth: a higher Dice score and a lower Hausdorff Distance indicate greater similarity. The two metrics are defined as

$\mathrm{Dice} = \frac{2 \sum_{i=1}^{I} G_i P_i}{\sum_{i=1}^{I} G_i + \sum_{i=1}^{I} P_i},$

$\mathrm{HD95} = \max\Big\{ \mathrm{P}_{95,\, g \in \mathcal{G}}\ \min_{p \in \mathcal{P}} \lVert g - p \rVert,\ \mathrm{P}_{95,\, p \in \mathcal{P}}\ \min_{g \in \mathcal{G}} \lVert p - g \rVert \Big\},$

where $G_i$ and $P_i$ represent the ground truth and predicted values of voxel $i$, $\mathcal{G}$ and $\mathcal{P}$ denote the sets of surface points of the ground truth and predicted values, respectively, and $\mathrm{P}_{95}$ denotes the 95th percentile.
4.4. Comparison with Other Methods
To validate the efficiency of our method for brain tumor segmentation, we conducted extensive comparisons with recent 2D and 3D segmentation methodologies. For 2D-based models, our comparative analysis included MTAU [9], Probabilistic U-Net [34], and AGResU-Net [11], all of which employ attention mechanisms to enhance segmentation accuracy. For 3D-based models, we compared our method with several advanced architectures: the 3D U-Net [12], Huang et al. [13], CANet [14], and Deep Supervision CNN [35]; models leveraging multi-scale information, such as AFPNet [16], DenseAFPNet [17], and the MR Encoder–Decoder [36]; and models utilizing attention mechanisms, including NLCA-VNet [20], scSE-NLV-Net [21], Single-Level U-Net3D [19], and AMMGS [18].

Table 1 and Table 2 present the average Dice scores and Hausdorff Distances for these methods and our model on the BraTS 2018 and 2020 validation sets, respectively. The performance figures for all models were obtained either from the respective papers or through online validation on the official websites; bold font indicates the highest score in each category, and underlining marks the second highest.
For the BraTS 2018 validation set (Table 1), our model achieved Dice scores of 77.32%, 90.21%, and 84.66% for the ET, WT, and TC regions, respectively, with Hausdorff Distances of 3.441 mm, 5.101 mm, and 5.872 mm. This corresponds to an average Dice score of 84.06% and an average Hausdorff Distance of 4.805 mm, positioning our model at the forefront across all evaluated regions on the BraTS 2018 dataset. For ET, AGResU-Net was the leading contender in the literature; our model improved on it by 0.12% in Dice score and 0.129 mm in Hausdorff Distance. For WT and TC, CANet previously delivered the best results; our model improved on it by 0.41% and 1.26% in Dice score and by 1.584 mm and 1.802 mm in Hausdorff Distance for WT and TC, respectively, underscoring its proficiency in delineating small tumor regions.
For the BraTS 2020 validation set (Table 2), our method achieved Dice scores of 75.52%, 90.43%, and 84.51% for ET, WT, and TC, respectively, with Hausdorff Distances of 25.104 mm, 5.047 mm, and 5.410 mm. The proposed model thus attained an average Dice score of 83.49% and an average Hausdorff Distance of 11.854 mm across these regions. Aside from the ET region, our model demonstrated superior performance in both Dice score and Hausdorff Distance for the WT and TC regions compared with the other models. In the ET region, the AMMGS model attained the highest performance: relative to it, our model showed a decline of 2.51% in Dice score and 1.489 mm in Hausdorff Distance for ET but notable improvements in the WT and TC regions, with Dice score increases of 2.12% and 2.79% and Hausdorff Distance decreases of 2.122 mm and 2.578 mm, respectively. Averaged across all considered regions, our model exhibited the best performance in both Dice score and Hausdorff Distance.
The demonstrated efficacy of the TACA-RNet on the BraTS 2018 and 2020 validation sets underscores its capability to harness the full spectrum of the inherent 3D spatial orientations of MRI. By integrating the axial, coronal, and sagittal perspectives, the TACA-RNet develops a deeper understanding of complex spatial relationships within volumetric MRI data. This comprehensive approach allows precise identification of tumor boundaries across various tumor types and scales, facilitated by the model's modules for multi-scale contextual information processing and focused dimensionality reduction. Moreover, the model's performance in distinguishing small tumor regions and its ability to effectively integrate multi-scale contextual information affirm its advances over conventional methodologies. Our comparative experiments validated not only the model's efficiency in enhancing segmentation accuracy but also its potential to set a new benchmark for brain tumor segmentation research, surpassing existing approaches on average across all evaluated regions.
4.5. Ablation Study
In this section, we describe an in-depth ablation study evaluating the various modules in the segmentation task on the BraTS 2020 dataset. Because the official validation dataset lacks ground truth labels, we divided the training dataset with a randomized 9:1 split. This split produced a validation set constituting 10% of the training dataset, which was used primarily for visualization purposes; all visualizations were performed exclusively on this set. The quantitative results of the ablation studies, presented in the tables, were validated through the official website to confirm their robustness and ensure an accurate performance comparison.
4.5.1. Analysis of the Impact of Tri-Axis Integration within the TACR
The first ablation study validates the efficacy of utilizing all three axis orientations within the TACR module of the proposed TACA-RNet. The experiment assesses the performance when the TACR module is aligned exclusively along the axial, coronal, or sagittal axis, in addition to the configuration in which all three axes are integrated. We conducted four experiments under consistent conditions, in which the baseline branch $B_0$ of the TACR module was always used and the branches $B_a$ (axial), $B_c$ (coronal), and $B_s$ (sagittal) were selectively applied. For instance, a TACR module using only the axial orientation employs only $B_0$ and $B_a$. The results are presented in Table 3, in which the highest performance is highlighted in bold.
In our experiments on the BraTS 2020 dataset, integrating the axial, coronal, and sagittal orientations within the TACR module resulted in better segmentation performance than any configuration aligned along a single axis. While the axial orientation, the view typically used in 2D models, outperformed the individual coronal and sagittal orientations, it was in turn surpassed by the module that integrates all three. This improvement underscores the significance of TACR in the model architecture, particularly during channel reduction across the encoder's high-level features. By designing TACR to emphasize the unique 3D spatial orientations inherent in MRI data, our approach not only deepens the understanding of complex spatial relationships within volumetric MRI data but also significantly boosts the accuracy and precision of brain tumor segmentation. This advancement demonstrates the need to consider the three-axis structure of MRI to enhance segmentation performance, validating the effectiveness of our method in leveraging the full spectrum of MRI volumetric information.
Figure 7 illustrates the tumor segmentation outputs of the TACR module across the anatomical orientations on the BraTS 2020 dataset. From left to right, the columns show the segmentation results with the TACR module aligned separately along the axial, coronal, and sagittal orientations, followed by the integrated approach using all three orientations and, finally, the ground truth. In these visualizations, the NCR/NET regions (red) and the ED region (green) were inaccurately predicted by all models except the one integrating all three orientations. Specifically, in the first row, the models employing only the axial, coronal, or sagittal orientation incorrectly identified normal tissue as ED in the same slice, and a review of the 3D-rendered results revealed failures to predict parts of the NCR/NET regions. Furthermore, in the second row, all models except the integrated one incorrectly identified normal tissue as the ED region.
4.5.2. Assessment of Impact of Proposed Modules
In the second ablation study, we evaluated the individual contribution of each component module to the overall performance of the proposed model. This comparative analysis involved the base network (BN), comprising a CNN encoder and a PD, as well as variants of the base model enhanced with the TACR, MSCF, and 3D ARA modules, integrated methodically to examine their individual and collective impacts on segmentation accuracy. The results are presented in Table 4, with the best performances highlighted in bold.
In our experiments on the BraTS 2020 dataset, the introduction of our designed modules led to notable improvements in segmentation performance compared to the BN. Specifically, the Dice scores increased by 1.05%, 1.27%, and 3.09% for the ET, WT, and TC regions, respectively. Moreover, the Hausdorff Distances showed significant reductions of 5.791 mm, 3.11 mm, and 15.925 mm across these regions. These enhancements attest to the effectiveness of our modules in brain tumor segmentation tasks. The implementation of the TACR module improved performance in all tumor regions, in terms of both Dice scores and Hausdorff Distances. This improvement can be attributed to TACR's capability to handle the unique 3D spatial orientations of MRI data (axial, coronal, and sagittal), enabling the model to accurately capture and emphasize critical features. Further integration of the MSCF module yielded higher Dice scores and lower Hausdorff Distances, demonstrating the capacity of the MSCF module to reflect the complexity and variability of tumor sizes and their spatial distribution within the MRI data. When the 3D ARA module was added without the MSCF module, Dice scores decreased slightly compared to configurations using the MSCF module; however, the 3D ARA module's focus on capturing essential details for accurate tumor boundary delineation improved accuracy in terms of Hausdorff Distances, highlighting the importance of high-resolution feature capture for segmentation precision. Ultimately, integrating all modules resulted in the highest segmentation accuracy, confirming the synergistic benefits of our comprehensive module design.
Figure 8 presents the visualization results of the ablation study of the configuration modules on the BraTS 2020 dataset. From left to right, the columns show the base model, the model incorporating the TACR module, the model with the TACR and MSCF modules, the model with the TACR and 3D ARA modules, the model with all modules integrated, and the ground truth. In these visualizations, the ED region (green) was inaccurately predicted by all models except the one incorporating all modules. Specifically, in the first row, the other models failed to predict the ED region located on the right side that is present in the ground truth. Conversely, in the second row, normal tissue was erroneously identified as the ED region.
4.5.3. Effectiveness of MultiScale Contextual Fusion Module
The third ablation study demonstrates the efficacy of the MSCF module in harnessing the complexity and variability of tumor sizes, as well as their spatial distribution across MRI data. Inspired by ASPP and CFP, the MSCF module implements a hybrid approach that encapsulates contextual information across scales to achieve a synergistic effect. The ASPP component utilizes a set of dilation rates to extract features from a broad range of receptive fields without compromising resolution, while the CFP component applies a sequence of increasing dilation rates to progressively capture larger contextual features, enhancing the network's spatial discernment across scales.

Table 5 presents the comparative analysis on the BraTS 2020 validation dataset, focusing on the performance of the MSCF module, with the best scores highlighted in bold. The Full MSCF, which integrates all components, achieved the highest Dice scores and lowest Hausdorff Distances. These improvements are crucial because the Full MSCF module ensures a comprehensive representation of the spatial and contextual nuances essential for accurately delineating tumor boundaries.
Figure 9 presents a visual comparison of the tumor segmentation outputs across the MSCF configurations on the BraTS 2020 dataset. From left to right, the columns show the MSCF without the ASPP component (MSCF w/o ASPP), the MSCF without the CFP component (MSCF w/o CFP), the Full MSCF with all components integrated, and the ground truth. In these visualizations, the NCR/NET regions (red) were inaccurately predicted by all models except the Full MSCF. Specifically, in the first row, the MSCF without the ASPP component failed to predict the NCR/NET regions matching the ground truth, and the MSCF without the CFP component predicted only a small portion of this region. In the second row, the same slices show that both ablated variants failed to predict the NCR/NET regions. These outcomes were further confirmed by the 3D-rendered visualizations: while the Full MSCF predicts NCR/NET regions closely resembling the ground truth, both ablated variants fail to do so.
5. Conclusions
In this study, we introduced the TACA-RNet, which was designed to automate the segmentation of brain tumors from multimodal MRI data. The TACA-RNet uniquely incorporates the TACR module to leverage the distinct spatial orientations of MRI data (axial, coronal, and sagittal), thereby enhancing the model’s understanding and processing of volumetric information. This was further complemented by the integration of PD, MSCF, and 3D ARA modules, each contributing to the model’s ability to accurately delineate brain tumors. Our comprehensive evaluations using the BraTS 2018 and 2020 datasets demonstrated the superiority of the TACA-RNet over existing segmentation techniques, with significant advancements in the segmentation accuracy and precision. Ablation studies further validated the significance of the TACR module in the TACA-RNet for brain tumor segmentation using MRI data. This evaluation demonstrated the central role of TACR in harnessing the unique 3D spatial orientations of MRI data. Subsequently, the contribution of individual modules within the model was examined, emphasizing their collective impact on achieving precise segmentation across various tumor ranges and sizes. Finally, the analysis of the impact of the MSCF module in multi-scale fusion settings revealed its effectiveness in capturing the variability and complexity of tumor sizes and their spatial distribution.
Our approach not only aids the precise localization and diagnosis of brain tumors but also shows promise in addressing the challenges of multimodal brain tumor MRI segmentation. However, the issue of missing modalities in MRI data is often encountered in clinical practice, which significantly affects segmentation tasks. Therefore, it is crucial to develop segmentation methods capable of handling missing modalities. Future research could explore strategies to enhance the segmentation performance in the presence of missing modalities, aiming to bolster the robustness and clinical applicability of the TACA-RNet. This direction promises to refine the utility of the TACA-RNet for clinical applications, offering valuable tools for medical professionals in the diagnosis of brain tumors. Furthermore, TACA-RNet’s potential extends beyond diagnostic imaging. In the context of drug development for brain tumors, TACA-RNet could play a pivotal role in clinical trials by monitoring tumor responses to new therapeutic agents. This capability facilitates quicker assessment of drug efficacy, enhancing the overall efficiency of drug development.