Article

OAR-UNet: Enhancing Long-Distance Dependencies for Head and Neck OAR Segmentation

Digital Manufacturing Equipment and Technology Key National Laboratories, Huazhong University of Science and Technology, Wuhan 430074, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(18), 3771; https://doi.org/10.3390/electronics13183771
Submission received: 29 August 2024 / Revised: 12 September 2024 / Accepted: 21 September 2024 / Published: 23 September 2024
(This article belongs to the Section Bioelectronics)

Abstract

Accurate segmentation of organs at risk (OARs) is a crucial step in the precise planning of radiotherapy for head and neck tumors. However, manual segmentation methods using CT images, which are still predominantly applied in clinical settings, are inefficient and expensive. Additionally, existing segmentation methods struggle with small organs and have difficulty managing the complex interdependencies between organs. To address these issues, this study proposes an OAR-UNet segmentation method based on a U-shaped architecture with two key designs. To tackle the challenge of segmenting small organs, a Local Feature Perception Module (LFPM) is developed to enhance the sensitivity of the method to subtle structures. Furthermore, a Cross-shaped Transformer Block (CSTB) with a cross-shaped attention mechanism is introduced to improve the ability of the model to capture and process long-distance dependency information. To accelerate the convergence of the Transformer, we designed a Local Encoding Module (LEM) based on depthwise separable convolutions. In our experimental evaluation, we utilized two publicly available datasets, SegRap2023 and PDDCA, achieving Dice coefficients of 78.22% and 89.42%, respectively. These results demonstrate that our method outperforms both previous classic methods and state-of-the-art (SOTA) methods.

1. Introduction

Head and neck cancer treatment strategies are undergoing profound change. The introduction of Intensity-Modulated Radiotherapy (IMRT) in the 1990s [1] marked a major leap forward in therapeutic technology. Because it can account precisely for the highly radiation-sensitive organs within the complex anatomy of the head and neck region, IMRT has emerged as a pivotal therapeutic modality [2]. By optimizing the radiation dose distribution, this technique delivers highly tailored and uniform irradiation to the tumor target area while minimizing radiation exposure to the surrounding normal tissues and critical organs at risk (OARs) [3], which can significantly reduce the side effects of treatment. In radiation therapy (RT), the precise modulation of radiation dose levels to OARs is a central strategy for minimizing complications [4]. At the same time, optimizing the dose distribution within the Planning Target Volume (PTV) to achieve the best therapeutic outcome is indispensable. In the framework of Image-Guided Radiotherapy (IGRT) [5] in particular, achieving this goal depends on the precise segmentation of OARs in Radiotherapy Computed Tomography (RTCT) [6] or Cone-Beam Computed Tomography (CBCT) [7] images. However, OAR segmentation currently relies heavily on manual delineation by radiation oncologists. This process is not only extremely time-consuming (e.g., fully segmenting nine OARs often takes more than two hours) but also introduces significant inter-operator variability. Labeling is even more difficult and time-consuming given the large volume variation among OARs, especially for small structures. As treatment protocols become more complex and the number of OARs to be considered increases, the time cost of manual segmentation grows dramatically, which in turn limits patient access to timely radiotherapy. This situation strongly motivates the medical community to explore and develop efficient, accurate automated segmentation techniques to meet these growing challenges and improve the efficiency and effectiveness of treatment.
Owing to the rapid development of deep learning technologies, deep convolutional models have achieved significant success in biomedical image segmentation. The UNet method, proposed by Ronneberger et al. [8], is a classic medical image segmentation method that consists of an encoder and a decoder; it combines the features of convolutional neural networks (CNNs) with skip connections to make CNNs suitable for medical image segmentation. However, UNet often extracts insufficient features, resulting in lower accuracy. Zhou et al. [9] introduced UNet++, an extended and improved version of UNet that incorporates multi-scale skip connections and dense connections to fully extract feature information from medical images. Furthermore, Diakogiannis et al. [10] proposed ResUNet, a hybrid model based on residual connections and the UNet structure, to exploit the benefit of rich gradient flow when segmenting smaller targets. Limited by the size of their convolutional kernels, convolutional neural networks are inadequate at capturing long-range dependencies, which may prevent a model from fully understanding the spatial relationships between different organs and lead to inaccurate segmentation results.
To address this challenge, the Transformer architecture [11] has been widely employed in medical image segmentation models. For example, Chen et al. [12] introduced TransUNet, in which a Vision Transformer block [13] is incorporated into the bottleneck of the UNet architecture to improve the model's ability to capture long-range dependencies. Cao et al. [14] proposed SwinUNet, which replaces the convolutional blocks in UNet with Swin Transformer modules [15] to enhance the model's ability to learn global information. Li et al. [16] proposed UTAC-Net, a model that utilizes a dual-stream encoder integrating convolutional layers with Transformer modules; this encoder simultaneously captures global and local information from the target images, thereby improving precision and contextual awareness in segmentation tasks. Although these Transformer-based models have an advantage in capturing global dependencies, they may not be precise enough when dealing with the fine details of smaller organs. This is because the Transformer's self-attention mechanism inherently has a global receptive field, which can lead to the "dilution" of local detail information when processing fine structures. Furthermore, Transformer-based models are computationally complex and converge slowly. Therefore, for multi-organ segmentation in the head and neck area, Transformer-based models need further optimization.
In addition to classic medical image segmentation methods, several methods that specifically target head and neck OAR segmentation have drawn considerable attention. Gao et al. [17] proposed FocusNetv2 to simulate how radiologists delineate OARs. Their solution consists of two main components: a segmentation network and an adversarial autoencoder (AAE) that imposes organ-shape constraints. The AAE employs prior shapes of OARs to constrain the segmentation results, effectively addressing the challenge of detecting small targets. Zhong et al. [19] designed a "human-in-the-loop" strategy to improve OAR segmentation accuracy: they trained a two-dimensional (2D) U-Net-like network whose outputs were subjectively evaluated by two to three oncologists and used to further retrain the model. Zhang et al. [18] proposed SegReg, which registers training images and labels to standard templates before segmentation, thereby reducing the training difficulty. Although SegReg and FocusNetv2 achieve acceptable segmentation accuracy, these methods face challenges in practical clinical application. For example, the "human-in-the-loop" strategy requires radiologists to continuously refine the model's segmentation predictions; although this iterative refinement gradually enhances the model's precision, it does not simplify the radiologists' workflow but rather adds complexity to it. FocusNetv2, on the other hand, relies heavily on shape constraints as prior information, which is frequently unavailable. This limitation hinders its clinical application.
To address the issues mentioned above, we developed the OAR-UNet method for the segmentation of head and neck organs at risk (OARs). This method does not rely on extra prior information; instead, it converts image features into coordinate information and dynamically generates weight maps to highlight key positions, thereby improving the recognition of small head and neck organs. Additionally, we redesigned the Transformer block to allow the model to capture more comprehensive long-range dependencies while improving convergence speed. The main contributions of this paper are as follows:
To address the challenge of recognizing small organs in medical images of the head and neck, we developed the Local Feature Perception Module (LFPM). This module converts image features into coordinate information and employs a coordinate weight calculation network to dynamically generate weight maps during feature extraction. By leveraging these maps, the LFPM selectively enhances key gradient information at specific coordinate positions in the feature map through long skip connections, thereby improving the detection of small organs.
To enhance the model’s capability in handling long-distance dependencies, we have developed a novel Cross-shaped Transformer Block (CSTB). This block incorporates a cross-shaped self-attention mechanism within the Transformer architecture, effectively reducing the computational complexity associated with traditional self-attention mechanisms.
To accelerate convergence, we introduced a lightweight module based on depthwise separable convolution, named the Local Encoding Module (LEM). It serves as a pre-processing module placed before the CSTB, using the feature information obtained through local induction to improve the training efficiency of the Transformer.

2. Methodology

2.1. Overview of OAR-UNet

As a model specifically designed for head and neck OAR segmentation, the OAR-UNet architecture effectively addresses the limitations of traditional U-shaped networks by incorporating the LFPM and CSTB. The LFPM processes feature maps in the early stages of the U-shaped network, computes the importance of local features for coordinate information in both directions, captures detailed information in high-resolution feature maps, and efficiently transfers gradient information to the decoder. In the deeper layers of the network, where abundant semantic information is present, the CSTB employs Cross-Shaped Self-Attention for its attention computations. This approach is more efficient than traditional patch-based methods and significantly enhances the network's ability to perceive long-distance dependencies. Consequently, OAR-UNet can thoroughly interpret OAR images and generate highly accurate segmentation maps, improving accuracy in OAR segmentation tasks. The architecture of OAR-UNet is illustrated in Figure 1.

2.2. Designing OAR-UNet Model for OAR Segmentation

2.2.1. Local Feature Perception Module

From the perspective of image information, smaller organs within at-risk regions have less feature information and a weaker gradient flow, requiring more intensive perception [20]. Providing sufficient local information to the decoder during training can improve the performance of the U-shaped network and enable a better understanding of the features of smaller organs. Given that critical organ regions are often small, and their features are less apparent, the U-shaped network may lose crucial feature information of these organs during the downsampling process. In the U-shaped network, the early layers produce larger feature maps with smaller receptive fields, thereby allowing for more localized information extraction [21]. To enhance the network’s focus on OAR regions and improve local information extraction, we designed LFPM for OAR-UNet. This module is integrated into the long skip connections in the first three stages of the network, as shown in Figure 1, with a detailed diagram in Figure 2. The LFPM design incorporates coordinate decomposition, which splits input features into components along both the width (W) and height (H) directions. It uses a four-branch fully connected layer to apply attention weighting to these coordinate positions. The resulting attention map, obtained by multiplying the attention layer components, is then multiplied by the input feature map to pass the important local feature gradient information to the decoder. This process strengthens the network’s ability to perceive local information and enhances attention to smaller organs.
The LFPM is integrated into the long skip connections of the first three stages of OAR-UNet. After feature extraction by Encoder1 to Encoder3, the three feature maps fed into the LFPM have the original resolution, half of the original resolution, and a quarter of the original resolution, respectively. Assuming the input feature of the LFPM is $X \in \mathbb{R}^{H \times W \times C}$, the input feature map undergoes the following steps. To encourage long-distance interactions that capture accurate spatial location information, the local perceptual attention is first divided into two branches, one pooled along the H direction and the other along the W direction, yielding feature vectors of sizes (C, 1, W) and (C, H, 1). The decomposed vectors represent the horizontal and vertical coordinate components of the original feature map, with each coordinate carrying the information of a specific position. Given input X, the output of the $c$-th channel in the H direction can be expressed as follows:
$$Z_c^h(h) = \frac{1}{W} \sum_{0 \le m < W} X_c(h, m).$$
Similarly, the output of the $c$-th channel in the W direction can be expressed as follows:
$$Z_c^w(w) = \frac{1}{H} \sum_{0 \le n < H} X_c(n, w).$$
The two pooled maps are reshaped into feature vectors of sizes (C, H) and (C, W). Because H = W for the square input feature maps, these two vectors are concatenated into a feature vector of size (C, W + H). Four fully connected branches then reduce the channel dimension to one-fourth of the input, producing four feature maps of size (C/4, W + H); this follows the idea of channel attention weighting, and, drawing inspiration from SENetV2 [22], the use of multiple branches enhances the perception of feature information across different dimensions. The four branch outputs are concatenated back into a feature map of size (C, W + H). This map is then split into two parts along the spatial (W + H) dimension and reshaped into two feature maps of sizes (C, 1, W) and (C, H, 1), matching the dimensions obtained from the mean pooling operation. Finally, the (C, H, 1) and (C, 1, W) maps are multiplied by matrix multiplication to obtain an attention map $M_{att}$ of size (C, H, W). Once the attention map is acquired, the output Y of the LFPM can be expressed as follows:
$$Y = M_{att} \times X.$$
By leveraging the attention map $M_{att}$, the method enhances the rich local information of small organs within the input feature map, with important channels being weighted during the training process to highlight their significance.
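As a concrete illustration, the following PyTorch sketch follows the LFPM description above: directional average pooling, a four-branch fully connected weighting, and an outer-product attention map that reweights the input. The class name, the use of a sigmoid to bound the attention weights, and other implementation details are our assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn

class LFPM(nn.Module):
    """Minimal sketch of the Local Feature Perception Module (names are ours)."""
    def __init__(self, channels: int):
        super().__init__()
        # Four fully connected branches, each reducing C to C/4 along the channel axis.
        self.branches = nn.ModuleList(
            [nn.Linear(channels, channels // 4) for _ in range(4)]
        )
        self.act = nn.Sigmoid()  # assumption: a sigmoid bounds the attention weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Coordinate decomposition: average-pool along W and along H.
        z_h = x.mean(dim=3)               # (B, C, H) coordinate component along H
        z_w = x.mean(dim=2)               # (B, C, W) coordinate component along W
        z = torch.cat([z_h, z_w], dim=2)  # (B, C, H + W)
        # Four-branch channel weighting (inspired by SENetV2): each branch maps C -> C/4.
        z = z.transpose(1, 2)             # (B, H + W, C)
        z = torch.cat([branch(z) for branch in self.branches], dim=2)  # (B, H + W, C)
        z = self.act(z).transpose(1, 2)   # (B, C, H + W)
        # Split back into the two coordinate maps and form the attention map M_att.
        a_h, a_w = z[:, :, :h], z[:, :, h:]           # (B, C, H), (B, C, W)
        m_att = a_h.unsqueeze(3) * a_w.unsqueeze(2)   # (B, C, H, W) outer product per channel
        return m_att * x                  # reweight the input features
```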

2.2.2. Cross-Shaped Transformer Block

In OAR images, there is strong interdependence among organs, which requires the network to perceive long-distance dependencies. Traditional U-shaped networks struggle to capture long-distance dependencies using only convolutional structures. To address this challenge, we introduce a module that integrates the Local Encoding Module (LEM) with the Cross-Shaped Transformer Block (CSTB) in the final two stages of the OAR-UNet architecture to enhance long-distance dependency perception. Inspired by the Vision Transformer (ViT), our design adapts the concept of tokenization for image representation. In ViT, the image is segmented into small patches, which are vectorized into tokens and processed by the Transformer encoder to lower computational complexity. However, the patch size, a hyperparameter, can limit the model's long-range dependency perception. To address this, we propose a cross-shaped Transformer module that employs cross-shaped self-attention. Unlike ViT's patch-based tokens, our method uses horizontal and vertical stripes as tokens. This allows interdependencies to be computed more effectively and provides greater modeling capability, strengthening the associations between distant pixels. Consequently, our approach captures long-range dependencies more effectively than the patch-based token method.
To address the inherent convergence issues of Transformers, we place the LEM, a local enhancement module, before the cross-shaped Transformer. The LEM is built from depthwise separable convolution, which keeps the structure lightweight while offering a larger receptive field than traditional convolution, allowing it to capture more prior features. Combining the LEM with the cross-shaped Transformer not only accelerates convergence but also strengthens the model's ability to capture long-distance dependencies. The LEM integrated with the Cross-Shaped Transformer Block is shown in Figure 3.
As shown in Figure 3, given the input feature map $X \in \mathbb{R}^{H \times W \times C}$, the LEM operation can be defined as follows:
$$\mathrm{LEM}(X) = \mathrm{DConv}_{3 \times 3}\big(\mathrm{BN}(\mathrm{Conv}_{1 \times 1}(X))\big) + X.$$
In Equation (4), the LEM consists of a 1 × 1 convolution, Batch Normalization, and a 3 × 3 depthwise separable convolution [23]. The features processed by the LEM are fed into the Cross-Shaped Transformer, which consists of Layer Norm, cross-shaped self-attention, and MLP modules. Each sub-module is preceded by Layer Norm and wrapped in a residual connection. The MLP has two layers and uses the GELU activation function. The computation of the cross-shaped Transformer is expressed as follows:
$$\hat{X}^l = \mathrm{CSSA}\big(\mathrm{LN}(X^{l-1})\big) + X^{l-1},$$
$$X^l = \mathrm{MLP}\big(\mathrm{LN}(\hat{X}^l)\big) + \hat{X}^l,$$
where $X^{l-1}$ denotes the output of the previous layer and $X^l$ denotes the output of the current layer of this long-range dependency module. In the cross-shaped Transformer, cross-shaped self-attention (CSSA) is achieved by forming a cross-shaped window and performing self-attention computations in parallel within the horizontal and vertical strips that make up this window. The structure of the cross-shaped self-attention is illustrated in Figure 4.
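To make this structure concrete, the following PyTorch sketch implements the LEM of Equation (4) and the pre-norm residual layout of Equations (5) and (6). It is a minimal illustration rather than the authors' released code: the MLP expansion ratio of 4, the flattening of feature maps into a token sequence before the CSTB, and all class names are our assumptions, and the CrossShapedSelfAttention module it relies on is sketched at the end of this subsection.

```python
import torch
import torch.nn as nn

class LEM(nn.Module):
    """Sketch of the Local Encoding Module: LEM(X) = DConv3x3(BN(Conv1x1(X))) + X (Equation (4))."""
    def __init__(self, channels: int):
        super().__init__()
        self.pw = nn.Conv2d(channels, channels, kernel_size=1)   # 1x1 pointwise convolution
        self.bn = nn.BatchNorm2d(channels)
        # groups=channels makes the 3x3 convolution depthwise.
        self.dw = nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.dw(self.bn(self.pw(x))) + x

class CSTB(nn.Module):
    """Sketch of the Cross-Shaped Transformer Block following Equations (5) and (6)."""
    def __init__(self, dim: int, num_heads: int, strip_width: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = CrossShapedSelfAttention(dim, num_heads, strip_width)  # sketched below
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (B, H*W, C) token sequence; h and w give the spatial layout used by the strips.
        x = x + self.attn(self.norm1(x), h, w)   # X_hat^l = CSSA(LN(X^{l-1})) + X^{l-1}
        x = x + self.mlp(self.norm2(x))          # X^l     = MLP(LN(X_hat^l)) + X_hat^l
        return x
```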
According to the self-attention mechanism, the input feature $X \in \mathbb{R}^{H \times W \times C}$ is first linearly projected onto K heads. Each head then performs a localized self-attention computation within a horizontal or vertical strip. For the horizontal strip self-attention computation, X is uniformly partitioned into equal-width, non-overlapping horizontal strips $[X^1, X^2, \ldots, X^M]$, each containing $sw \times W$ tokens, where $sw$ denotes the strip width, which can be adjusted to balance learning capacity and computational complexity. Formally, assume that the projected query, key, and value of the $k$-th head have dimension $d_k$. The output of the horizontal strip self-attention of the $k$-th head, denoted $\mathrm{HAtt}_k(X)$, can be expressed as follows:
$$X = [X^1, X^2, \ldots, X^M],$$
$$Y_k^i = \mathrm{Att}\big(X^i W_k^Q,\; X^i W_k^K,\; X^i W_k^V\big), \quad i = 1, \ldots, M,$$
$$\mathrm{HAtt}_k(X) = [Y_k^1, Y_k^2, \ldots, Y_k^M],$$
where $X^i \in \mathbb{R}^{(sw \times W) \times C}$, $M = H / sw$, and $i = 1, \ldots, M$. $W_k^Q \in \mathbb{R}^{C \times d_k}$, $W_k^K \in \mathbb{R}^{C \times d_k}$, and $W_k^V \in \mathbb{R}^{C \times d_k}$ denote the projection matrices of the queries, keys, and values for the $k$-th head, respectively, and $d_k = C / K$. The self-attention of the vertical strips can be derived similarly, and the output of its $k$-th head is denoted as $\mathrm{VAtt}_k(X)$. It is assumed that there is no directional bias in head and neck medical images, so this study divides the $K$ heads into two groups (each with $K/2$ heads, where $K$ is usually an even number). The first group of heads performs horizontal strip self-attention, whereas the second group performs vertical strip self-attention. Finally, the outputs of the two groups are concatenated. The process can be expressed as follows:
$$\mathrm{CSSA}(X) = \mathrm{Concat}(\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_K)\, W^O,$$
$$\mathrm{head}_k = \begin{cases} \mathrm{HAtt}_k(X), & k = 1, \ldots, K/2, \\ \mathrm{VAtt}_k(X), & k = K/2 + 1, \ldots, K, \end{cases}$$
where CSSA denotes cross-shaped self-attention and $W^O$ is the projection matrix that maps the self-attention results to the target output dimension (set to C by default). As mentioned above, a key insight in the design of our self-attention mechanism is to divide the multiple heads into groups and apply a different self-attention operation to each group; grouping the heads extends the attention region of each token within each Transformer block. In contrast, existing self-attention mechanisms apply the same self-attention operation to all heads. The experimental section shows that this design achieves better performance.
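To complement the block sketch above, the following is a minimal PyTorch sketch of CSSA under the assumptions stated (H and W divisible by the strip width sw, an even number of heads K). The class and variable names, the softmax scaling, and the tensor layouts follow common multi-head attention practice and are ours, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossShapedSelfAttention(nn.Module):
    """Sketch of CSSA (Equations (7)-(11)): half of the heads attend within horizontal
    strips of width `strip_width`, the other half within vertical strips."""
    def __init__(self, dim: int, num_heads: int, strip_width: int):
        super().__init__()
        self.num_heads, self.sw = num_heads, strip_width
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)    # W^O projection back to C channels

    def _strip_attention(self, q, k, v, h, w, horizontal: bool):
        # q, k, v: (B, heads, H*W, d). Regroup the tokens into non-overlapping strips.
        b, n_heads, _, d = q.shape
        def to_strips(t):
            t = t.reshape(b, n_heads, h, w, d)
            if horizontal:      # strips of shape (sw, W), each with sw*W tokens
                return t.reshape(b, n_heads, h // self.sw, self.sw * w, d)
            # vertical: strips of shape (H, sw), each with sw*H tokens
            t = t.permute(0, 1, 3, 2, 4)
            return t.reshape(b, n_heads, w // self.sw, self.sw * h, d)
        q, k, v = map(to_strips, (q, k, v))
        attn = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
        out = attn @ v                                   # (B, heads, n_strips, strip_len, d)
        if horizontal:
            out = out.reshape(b, n_heads, h, w, d)
        else:
            out = out.reshape(b, n_heads, w, h, d).permute(0, 1, 3, 2, 4)
        return out.reshape(b, n_heads, h * w, d)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        b, n, c = x.shape                                # n == h * w
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)             # each (B, heads, N, d)
        half = self.num_heads // 2                       # first half: horizontal strips
        out_h = self._strip_attention(q[:, :half], k[:, :half], v[:, :half], h, w, True)
        out_v = self._strip_attention(q[:, half:], k[:, half:], v[:, half:], h, w, False)
        out = torch.cat([out_h, out_v], dim=1)           # concatenate the two head groups
        out = out.transpose(1, 2).reshape(b, n, c)
        return self.proj(out)
```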

3. Data Preparation and Implementation Details

3.1. Data Preparation

In this section, we aim to validate the accuracy and robustness of OAR-UNet by utilizing two publicly available head and neck organ-at-risk segmentation datasets, SegRap2023 and PDDCA. These datasets are used to assess the performance of the proposed method.
SegRap2023 dataset [24]: The SegRap2023 dataset provides CT images of head and neck organs at risk together with accurate segmentation annotations. It contains 200 cases: 120 for training, 20 for validation, and 60 for testing. This study used nine anatomical categories from the dataset (Brain, BrainStem, TemporalLobe, OralCavity, Mandible, Parotid, Mastoid, SpinalCord, and Larynx) and pre-processed the raw data by removing noise from the binarized images, extracting the maximally connected region, and generating a non-zero mask to identify the structural regions of interest. The images were then cropped based on the non-zero masks to focus on the regions of interest and reduce the data size. To give the images and labels a uniform spatial resolution, this study resampled them and normalized the images by their median, mean, and standard deviation before slicing, resulting in 15,312 training images, 2520 validation images, and 7656 test images.
PDDCA dataset [25]: PDDCA is a CT dataset for head and neck organ-at-risk segmentation that contains CT images from 48 patients; we used five anatomical categories from the dataset (BrainStem, Chiasm, Mandible, OpticNerve, and Parotid), together with the manually identified bony landmarks. The dataset was divided into 38 training and 10 test samples. During pre-processing, we first sliced the volumes into two-dimensional (2D) images. After slicing, the images were binarized, a common bounding box was computed from the labels, and the region of interest was derived from this bounding box. Morphological operations were then used to refine the bounding box so that the final extracted region contains the structures of interest in their entirety. Finally, the original and label images were cropped according to the expanded bounding-box coordinates, yielding 1250 training images and 313 test images.
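As an illustration of the bounding-box cropping step shared by both pipelines, the following NumPy sketch crops an image and its label to the labeled region of interest; the function name and the margin value are ours and are not taken from the paper.

```python
import numpy as np

def crop_to_roi(image: np.ndarray, label: np.ndarray, margin: int = 10):
    """Illustrative bounding-box crop: build a binary mask from the label,
    take its bounding box with a small margin, and crop image and label alike."""
    coords = np.argwhere(label > 0)                        # non-zero (foreground) pixels/voxels
    lo = np.maximum(coords.min(axis=0) - margin, 0)        # clamp to the array bounds
    hi = np.minimum(coords.max(axis=0) + 1 + margin, label.shape)
    slices = tuple(slice(int(l), int(h)) for l, h in zip(lo, hi))
    return image[slices], label[slices]
```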
In this study, the input images were scaled to 512 × 512 pixels before being fed into OAR-UNet for training. To broaden the data distribution seen during training, we applied data augmentation throughout the training process, consisting of random cropping and random warping of the images. Because augmentation is applied on the fly, it expands the data distribution without changing the amount of input data, improving the generalization ability of the model.
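A paired-augmentation sketch under these settings is shown below; it applies the same random crop and random affine warp to the image tensor and its mask. The parameter ranges and crop size are illustrative assumptions, not the values used in training.

```python
import random
import torchvision.transforms.functional as TF
from torchvision.transforms import InterpolationMode

def augment(image, mask, crop_size=(448, 448)):
    """Apply the same random crop and affine "warp" to an image tensor (C, H, W) and its mask."""
    # Random crop with shared parameters.
    top = random.randint(0, image.shape[-2] - crop_size[0])
    left = random.randint(0, image.shape[-1] - crop_size[1])
    image = TF.crop(image, top, left, *crop_size)
    mask = TF.crop(mask, top, left, *crop_size)
    # Random affine warp (small rotation, translation, scale) with shared parameters.
    angle = random.uniform(-10, 10)
    translate = [random.randint(-20, 20), random.randint(-20, 20)]
    scale = random.uniform(0.9, 1.1)
    image = TF.affine(image, angle, translate, scale, shear=[0.0],
                      interpolation=InterpolationMode.BILINEAR)
    mask = TF.affine(mask, angle, translate, scale, shear=[0.0],
                     interpolation=InterpolationMode.NEAREST)  # keep labels discrete
    return image, mask
```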

3.2. Loss Function

During training, this study used a hybrid loss function that combined Cross-Entropy (CE) loss with Dice loss. The effectiveness of this combined loss function was further validated through ablation experiments described in the experimental section.
Cross-Entropy Loss: The Cross-Entropy loss function was used to compare the predicted segmentation mask with the true mask in semantic segmentation tasks. The Cross-Entropy loss is defined as follows:
$$CE_{Loss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c}\, \log(\hat{y}_{i,c}),$$
where N is the number of samples (pixels), C is the number of classes, $y_{i,c}$ indicates whether the $i$-th sample belongs to class $c$, and $\hat{y}_{i,c}$ is the predicted probability that the $i$-th sample belongs to class $c$. The loss function sums the logarithms of the predicted probabilities weighted by the true labels, providing a measure of how well the model predictions align with the actual classes. In segmentation tasks, the Cross-Entropy loss is adapted to measure the difference between the predicted and actual pixel-wise classifications; since the goal of segmentation is to classify each pixel of an image into one of several classes, the Cross-Entropy loss is computed for each pixel across the entire image.
Dice Loss: The Dice loss function aims to quantify the overlap or similarity between the predicted segmentation and ground truth label. Dice loss is defined as follows:
$$Dice_{Loss} = 1 - \frac{2\,|y_i \cap \hat{y}_i|}{|y_i| + |\hat{y}_i|},$$
where $|y_i \cap \hat{y}_i|$ denotes the number of pixels that are predicted as positive samples (part of the object of interest) and are also positive in the ground-truth label, $|y_i|$ denotes the number of positive pixels in the ground-truth label, and $|\hat{y}_i|$ denotes the total number of pixels predicted as positive samples.
Hybrid Loss: The hybrid loss is formed by combining two loss functions, Cross-Entropy loss and Dice loss, which can be expressed as follows:
$$Hybrid_{Loss} = \alpha \, CE_{Loss} + \beta \, Dice_{Loss},$$
where α and β represent the weights of the Cross-Entropy loss and Dice loss, respectively, and the specific values are determined in the experiments.
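As an illustration, the following PyTorch sketch combines the two terms of Equation (14) with the 2:1 weighting selected later in the ablation study; the multi-class Dice formulation and the smoothing constant are implementation choices of ours, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridLoss(nn.Module):
    """Sketch of the hybrid loss: alpha * CE + beta * Dice (Equation (14))."""
    def __init__(self, alpha: float = 2.0, beta: float = 1.0, smooth: float = 1e-5):
        super().__init__()
        self.alpha, self.beta, self.smooth = alpha, beta, smooth

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # logits: (B, C, H, W); target: (B, H, W) with integer class indices.
        ce = F.cross_entropy(logits, target)
        probs = torch.softmax(logits, dim=1)
        one_hot = F.one_hot(target, num_classes=logits.shape[1]).permute(0, 3, 1, 2).float()
        inter = (probs * one_hot).sum(dim=(0, 2, 3))                 # per-class overlap
        union = probs.sum(dim=(0, 2, 3)) + one_hot.sum(dim=(0, 2, 3))
        dice = 1.0 - ((2.0 * inter + self.smooth) / (union + self.smooth)).mean()
        return self.alpha * ce + self.beta * dice
```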

3.3. Evaluation Criteria

The Dice similarity coefficient (DSC) and Intersection over Union (IoU) metrics were used to evaluate the segmentation results by measuring the degree of similarity between the predicted segmentation masks and the ground-truth masks. They are formulated as follows:
$$DSC = \frac{2\,|y_i \cap \hat{y}_i|}{|y_i| + |\hat{y}_i|} = \frac{2\,TP}{2\,TP + FP + FN},$$
$$IoU = \frac{|y_i \cap \hat{y}_i|}{|y_i \cup \hat{y}_i|} = \frac{TP}{TP + FP + FN},$$
where, consistent with the notation above, $y_i$ denotes the ground-truth mask and $\hat{y}_i$ denotes the predicted segmentation mask. True Positive (TP) refers to correctly identified positive cases, False Positive (FP) refers to incorrectly identified positive cases, and False Negative (FN) refers to missed positive cases.
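For concreteness, a small NumPy sketch of the two metrics computed from binary masks for a single anatomical class; the smoothing term eps is our addition to avoid division by zero.

```python
import numpy as np

def dsc_iou(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7):
    """DSC and IoU for one class, following the definitions above.
    `pred` and `gt` are boolean masks of the same shape."""
    tp = np.logical_and(pred, gt).sum()     # predicted positive and truly positive
    fp = np.logical_and(pred, ~gt).sum()    # predicted positive but truly negative
    fn = np.logical_and(~pred, gt).sum()    # missed positives
    dsc = 2 * tp / (2 * tp + fp + fn + eps)
    iou = tp / (tp + fp + fn + eps)
    return dsc, iou
```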

3.4. Implementation Details

OAR-UNet was implemented on a server with an Intel(R) Xeon(R) Silver 4216 CPU and two Quadro GV100 GPUs, using PyTorch 1.10 and MMSegmentation 0.29. All methods were implemented in MMSegmentation with the same training configuration and parameters. The batch size was set to 8, and a poly learning-rate schedule was applied, with an initial learning rate of 1 × 10⁻⁴ and a final learning rate of 1 × 10⁻⁶. The Adam optimizer was used, and training was conducted for 20,000 iterations.
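A minimal PyTorch sketch of this training configuration is given below; the polynomial decay power of 0.9 is a common MMSegmentation default and is assumed here rather than taken from the paper, and the placeholder model stands in for OAR-UNet.

```python
import torch

model = torch.nn.Conv2d(1, 1, 3)            # placeholder for OAR-UNet
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
max_iters, lr_base, lr_min, power = 20_000, 1e-4, 1e-6, 0.9

def poly_factor(it: int) -> float:
    # Multiplier on the base LR: decays polynomially from lr_base towards lr_min.
    decayed = (lr_base - lr_min) * (1 - it / max_iters) ** power + lr_min
    return decayed / lr_base

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=poly_factor)
# Training loop sketch (batch size 8):
# for it in range(max_iters):
#     ...forward/backward on one batch...
#     optimizer.step(); scheduler.step(); optimizer.zero_grad()
```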

4. Results

4.1. Ablation Experiments

In this section, an ablation study of OAR-UNet is conducted, including ablation experiments for the LFPM, CSTB, and LEM modules as well as for the hybrid loss function. The loss weights of the hybrid loss were also determined experimentally.
Table 1 presents the results of the ablation studies for the LFPM, CSTB, and LEM modules. As shown, using the LFPM alone improves the mIoU and mDice metrics by about three points. The metrics improve by only about two points with the CSTB module alone and by less than one point with the LEM module alone. However, when the LEM and CSTB modules are used together, the metrics improve by nearly five points over the baseline and by three to four points relative to using only the CSTB module. This suggests that the LEM module effectively reduces the training complexity of the CSTB module. The highest metric values are achieved when all three modules are used simultaneously.
Table 2 reports ablation experiments on the loss function, comparing the accuracy of using both the CE loss and the Dice loss with the accuracy of using only one of them. The results show that accuracy is higher when the CE and Dice losses are used together than when only one of them is used. We then designed experiments to determine the optimal loss weighting and obtained the highest metrics when the ratio of α to β was 2:1. The loss weight ratio of α to β was therefore set to 2:1.

4.2. Comparative Experiments

Table 3 presents a comparison of segmentation results among different methods. The results show that the proposed OAR-UNet outperforms the seven comparative methods in terms of mIoU and mDice across the two datasets. Specifically, this study compared OAR-UNet with classic convolution-based methods such as UNet [8], UNet++ [9], and ResUNet [10], with Transformer-based methods such as TransUNet [12] and SwinUNet [14], and with the state-of-the-art methods SegReg [18] and FocusNetv2 [17] for head and neck OAR segmentation. The experimental results demonstrate that our method achieved the highest mDice scores of 0.7822 and 0.8942 on the SegRap2023 and PDDCA datasets, respectively. OAR-UNet achieves higher accuracy than the latest SegReg and FocusNetv2 methods and does not require additional prior information or a "human-in-the-loop" strategy, making it suitable for clinical applications.
Figure 5 and Figure 6 illustrate the segmentation accuracy of the different methods across various anatomical structures using box plots. These box plots effectively capture the accuracy characteristics of OAR-UNet for each anatomical category and facilitate comparisons of the distribution characteristics across multiple experimental groups. The box plots show that across all anatomical categories and datasets, the distribution of IoU and Dice metrics for OAR-UNet is superior to that of other methods. Specifically, the OAR-UNet model consistently achieves higher overall performance metrics compared to all other evaluated methods, particularly exhibiting significant advantages when segmenting smaller organs, such as the parotid gland. These results substantiate the superior efficacy of the OAR-UNet in these specific contexts.

4.3. Visualization

This study visually compares OAR-UNet with seven other methods to illustrate the subjective advantages of the OAR-UNet method. Visual comparisons are displayed in Figure 7 and Figure 8, showing the performance differences between OAR-UNet and the other methods. We used two publicly available datasets, SegRap2023 and PDDCA, for visualization. By selecting several example cases and extracting relevant slices, we can generate visual representations of segmentation outcomes. The visual results indicate that our proposed OAR-UNet model yields segmentation results that are visually more aligned with the ground truth than the benchmark methods. The comparative methods exhibited notable instances of missed or erroneous identifications, particularly in the segmentation of smaller anatomical structures. For example, both the UNet and UNet++ models struggle to identify smaller organs. In the context of segmenting small targets, the OAR-UNet model demonstrates distinct superiority, with no instances of missed or incorrect identifications observed.

5. Discussion

5.1. Findings

The OAR-UNet model demonstrates significant improvements in segmentation accuracy for head and neck organs at risk (OARs), driven by the integration of the Local Feature Perception Module (LFPM) and the Cross-Shaped Transformer Block (CSTB). The introduction of LFPM enhances the network’s capability to perceive detailed local features, particularly for smaller organs within at-risk regions. By incorporating local perceptual attention mechanisms early in the network, LFPM effectively captures and preserves crucial spatial information that is otherwise prone to loss during the downsampling process in traditional U-shaped networks. The CSTB, on the other hand, addresses the challenge of capturing long-distance dependencies that are essential for accurate OAR segmentation. By employing cross-shaped self-attention mechanisms and LEM, CSTB improves the model’s ability to perceive relationships between distant pixels, surpassing the traditional patch-based methods in efficiency and accuracy. The combination of LFPM and CSTB not only boosts the model’s performance in interpreting complex OAR images but also significantly enhances segmentation maps, as evidenced by improved metrics across the datasets.
Our ablation studies show that integrating the LFPM, CSTB, and LEM modules together yields the highest performance metrics, underscoring the importance of combining local and long-range feature extraction strategies. Specifically, the LFPM alone improves the mean Intersection over Union (mIoU) and mean Dice coefficient (mDice) by approximately three points, the CSTB alone by about two points, and the LEM alone by less than one point. When the LEM and CSTB are combined, the improvement reaches nearly five points over the baseline and three to four points relative to the CSTB alone, highlighting the synergistic effect of pairing local feature perception with long-range dependency modeling. In comparative experiments, OAR-UNet outperforms classic methods, including UNet, UNet++, and ResUNet, Transformer-based methods such as TransUNet and SwinUNet, and the state-of-the-art SegReg and FocusNetv2. The proposed model achieves the highest mDice scores of 0.7822 and 0.8942 on the SegRap2023 and PDDCA datasets, respectively.
These results validate the effectiveness of the OAR-UNet architecture and demonstrate its strong performance in accurately segmenting small and challenging anatomical structures, such as the parotid gland. Overall, the design and feature extraction capabilities of the OAR-UNet model set a new benchmark for head and neck OAR segmentation and provide a valuable tool for clinical diagnosis and treatment planning.

5.2. Limitation and Future Works

We recognize that transitioning research findings from the laboratory setting to actual patient care in a clinical setting requires a thorough understanding and consideration of the real-world clinical environment. This means that although the segmentation results achieved in the laboratory setting may be impressive, further validation and adjustment may be necessary in actual clinical practice to ensure that the algorithm can adapt to the complex realities of clinical settings.
For example, the model’s increased computational complexity due to the integration of the Local Feature Perception Module (LFPM) and Cross-Shaped Transformer Block (CSTB) may limit its applicability in real-time scenarios with constrained resources. Future work should focus on optimizing the model to reduce computational requirements, making it more practical for real-time use. Furthermore, broadening the assessment of the model to encompass a variety of datasets from multiple institutions and imaging modalities will be crucial to examine its robustness and versatility. Finally, integrating the OAR-UNet model into clinical systems and workflows could enhance its utility in clinical decision-making and treatment planning.

6. Conclusions

Current manual segmentation techniques for head and neck organs at risk (OARs) are known for their inefficiency and high cost. Moreover, existing automatic segmentation methods struggle to segment small organs effectively and to manage the complex interdependencies among anatomical structures. To address these challenges, we propose the OAR-UNet segmentation method, which is based on a U-shaped network architecture. OAR-UNet incorporates three key design components: the Local Feature Perception Module (LFPM), the Local Encoding Module (LEM), and the Cross-shaped Transformer Block (CSTB). These innovations significantly enhance the model's efficacy in segmenting head and neck OARs, ensuring precise segmentation of smaller organs and maintaining stability and accuracy in the face of complex organ interactions. Evaluations on the SegRap2023 and PDDCA datasets demonstrate the superior performance of the OAR-UNet method. Specifically, OAR-UNet achieved an mIoU of 0.7478 and an mDice of 0.7822 on the SegRap2023 dataset, and an mIoU of 0.8545 and an mDice of 0.8942 on the PDDCA dataset. These results show that OAR-UNet outperforms both classic medical image segmentation methods and the latest state-of-the-art methods. It is anticipated that this advancement will provide robust technical support for the accurate segmentation of organs at risk in head and neck tumor radiotherapy planning. In future research, the method will be further optimized, and its potential applications in additional clinical scenarios will be explored to contribute more significantly to personalized and precise tumor radiotherapy.

Author Contributions

Conceptualization, K.P.; methodology, K.P. and D.Z.; software, K.P.; validation, K.P. and D.Z.; formal analysis, K.P.; investigation, K.P. and D.Z.; resources, K.P. and D.Z.; data curation, K.P. and D.Z.; writing—original draft preparation, K.P.; writing—review and editing, K.P., D.Z. and S.G.; visualization, K.P. and D.Z.; supervision, S.G.; project administration, S.G.; funding acquisition, S.G. All authors have read and agreed to the published version of the manuscript.

Funding

The research is supported in part by the National Natural Science Foundation of China under Grant #52175462 and in part by the National Key Research and Development Program of China under Grant #2016YFC0105306.

Institutional Review Board Statement

All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards.

Data Availability Statement

The SegRap2023 and PDDCA datasets, both publicly available, were used for the training and testing evaluations in this study. The SegRap2023 dataset was downloaded from https://segrap2023.grand-challenge.org/ (accessed on 10 March 2024), and the PDDCA dataset was downloaded from https://www.imagenglab.com/newsite/pddca/ (accessed on 10 March 2024).

Conflicts of Interest

The authors declare no conflicts of interest that could be perceived as prejudicing the impartiality of the research reported in this paper. All funding sources for the research, data collection, and analysis are disclosed transparently. The authors have no personal, financial, or professional affiliation with any organization or entity that could influence the findings, interpretations, or conclusions of this study. The authors adhered to all ethical guidelines and standards in the preparation of this manuscript to ensure the integrity and objectivity of the research.

References

  1. Gormley, M.; Creaney, G.; Schache, A.; Ingarfield, K.; Conway, D.I. Reviewing the Epidemiology of Head and Neck Cancer: Definitions, Trends and Risk Factors. Br. Dent. J. 2022, 233, 780–786. [Google Scholar] [CrossRef]
  2. Thomas, S.J.; Penfold, C.M.; Waylen, A.; Ness, A.R. The Changing Aetiology of Head and Neck Squamous Cell Cancer: A Tale of Three Cancers? Clin. Otolaryngol. 2018, 43, 999–1003. [Google Scholar] [CrossRef]
  3. Gujral, D.M.; Nutting, C.M. Patterns of Failure, Treatment Outcomes and Late Toxicities of Head and Neck Cancer in the Current Era of IMRT. Oral Oncol. 2018, 86, 225–233. [Google Scholar] [CrossRef] [PubMed]
  4. Gupta, T.; Agarwal, J.; Jain, S.; Phurailatpam, R.; Kannan, S.; Ghosh-Laskar, S.; Murthy, V.; Budrukkar, A.; Dinshaw, K.; Prabhash, K. Three-Dimensional Conformal Radiotherapy (3D-CRT) versus Intensity Modulated Radiation Therapy (IMRT) in Squamous Cell Carcinoma of the Head and Neck: A Randomized Controlled Trial. Radiother. Oncol. 2012, 104, 343–348. [Google Scholar] [CrossRef] [PubMed]
  5. Sapkaroski, D.; Osborne, C.; Knight, K.A. A Review of Stereotactic Body Radiotherapy–Is Volumetric Modulated Arc Therapy the Answer? J. Med. Radiat. Sci. 2015, 62, 142–151. [Google Scholar] [CrossRef] [PubMed]
  6. Srinivasan, K.; Mohammadi, M.; Shepherd, J. Applications of Linac-Mounted Kilovoltage Cone-Beam Computed Tomography in Modern Radiation Therapy: A Review. Polish J. Radiol. 2014, 79, 181. [Google Scholar]
  7. Reggiori, G.; Mancosu, P.; Tozzi, A.; Cantone, M.C.; Castiglioni, S.; Lattuada, P.; Lobefalo, F.; Cozzi, L.; Fogliata, A.; Navarria, P. Cone Beam CT Pre-and Post-daily Treatment for Assessing Geometrical and Dosimetric Intrafraction Variability during Radiotherapy of Prostate Cancer. J. Appl. Clin. Med. Phys. 2011, 12, 141–152. [Google Scholar] [CrossRef] [PubMed]
  8. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Berlin/Heidelberg, Germany, 2015; Volume 9351, pp. 234–241. [Google Scholar]
  9. Zhou, Z.; Rahman Siddiquee, M.M.; Tajbakhsh, N.; Liang, J. Unet++: A Nested u-Net Architecture for Medical Image Segmentation. In Proceedings of the Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, 20 September 2018; Proceedings 4. Springer: Berlin/Heidelberg, Germany, 2018; pp. 3–11. [Google Scholar]
  10. Diakogiannis, F.I.; Waldner, F.; Caccetta, P.; Wu, C. ResUNet-a: A Deep Learning Framework for Semantic Segmentation of Remotely Sensed Data. ISPRS J. Photogramm. Remote Sens. 2020, 162, 94–114. [Google Scholar] [CrossRef]
  11. Vaswani, A. Attention Is All You Need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  12. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar]
  13. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  14. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 205–218. [Google Scholar]
  15. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021. [Google Scholar]
  16. Li, W.; Zhang, W. UTAC-Net: A Semantic Segmentation Model for Computer-Aided Diagnosis for Ischemic Region Based on Nuclear Medicine Cerebral Perfusion Imaging. Electronics 2024, 13, 1466. [Google Scholar] [CrossRef]
  17. Gao, Y.; Huang, R.; Yang, Y.; Zhang, J.; Shao, K.; Tao, C.; Chen, Y.; Metaxas, D.N.; Li, H.; Chen, M. FocusNetv2: Imbalanced Large and Small Organ Segmentation with Adversarial Shape Constraint for Head and Neck CT Images. Med. Image Anal. 2021, 67, 101831. [Google Scholar] [CrossRef] [PubMed]
  18. Zhang, Z.; Qi, X.; Zhang, B.; Wu, B.; Le, H.; Jeong, B.; Liao, Z.; Liu, Y.; Verjans, J.; To, M.-S. SegReg: Segmenting OARs by Registering MR Images and CT Annotations. In Proceedings of the 2024 IEEE International Symposium on Biomedical Imaging (ISBI), Athens, Greece, 27–30 May 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–5. [Google Scholar]
  19. Zhong, Y.; Yang, Y.; Fang, Y.; Wang, J.; Hu, W. A Preliminary Experience of Implementing Deep-Learning Based Auto-Segmentation in Head and Neck Cancer: A Study on Real-World Clinical Cases. Front. Oncol. 2021, 11, 638197. [Google Scholar] [CrossRef] [PubMed]
  20. Ibragimov, B.; Xing, L. Segmentation of Organs-at-risks in Head and Neck CT Images Using Convolutional Neural Networks. Med. Phys. 2017, 44, 547–557. [Google Scholar] [CrossRef] [PubMed]
  21. Luan, S.; Wei, C.; Ding, Y.; Xue, X.; Wei, W.; Yu, X.; Wang, X.; Ma, C.; Zhu, B. PCG-Net: Feature Adaptive Deep Learning for Automated Head and Neck Organs-at-Risk Segmentation. Front. Oncol. 2023, 13, 1177788. [Google Scholar] [CrossRef] [PubMed]
  22. Narayanan, M. SENetV2: Aggregated Dense Layer for Channelwise and Global Representations. arXiv 2023, arXiv:2311.10807. [Google Scholar]
  23. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
  24. Luo, X.; Fu, J.; Zhong, Y.; Liu, S.; Han, B.; Astaraki, M.; Bendazzoli, S.; Toma-Dasu, I.; Ye, Y.; Chen, Z. SegRap2023: A Benchmark of Organs-at-Risk and Gross Tumor Volume Segmentation for Radiotherapy Planning of Nasopharyngeal Carcinoma. arXiv 2023, arXiv:2312.09576. [Google Scholar]
  25. Raudaschl, P.F.; Zaffino, P.; Sharp, G.C.; Spadea, M.F.; Chen, A.; Dawant, B.M.; Albrecht, T.; Gass, T.; Langguth, C.; Lüthi, M.; et al. Evaluation of Segmentation Methods on Head and Neck CT: Auto-Segmentation Challenge 2015. Med. Phys. 2017, 44, 2020–2036. [Google Scholar] [CrossRef] [PubMed]
Figure 1. The overall structure of OAR-UNet.
Figure 2. Local Feature Perception Module, with the rectangles representing network structure modules and the cubes representing feature maps or feature vectors.
Figure 3. Local Encoding Module and Cross-Shaped Transformer Block.
Figure 4. Cross-Shaped Self-Attention (CSSA).
Figure 5. Comparison of evaluation metrics for nine anatomical structures in the SegRap2023 dataset. (a) Box plots of IoU for different methods; (b) box plots of Dice for different methods.
Figure 6. Comparison of evaluation metrics for nine anatomical structures in the PDDCA dataset. (a) Box plots of IoU for different methods; (b) box plots of Dice for different methods.
Figure 7. Visual comparison of OAR-UNet with seven other segmentation methods on the SegRap2023 dataset. The figure showcases one slice from each of the four selected cases.
Figure 8. Visual comparison of OAR-UNet with seven other segmentation methods on the PDDCA dataset. The figure showcases one slice from each of the three selected cases.
Table 1. Ablation experiments with LFPM, CSTB, and LEM. The best results are shown in bold.
| LFPM | CSTB | LEM | SegRap2023 mIoU | SegRap2023 mDice | PDDCA mIoU | PDDCA mDice |
|------|------|-----|-----------------|------------------|------------|-------------|
| ×    | ×    | ×   | 0.6751          | 0.7014           | 0.7894     | 0.8145      |
| ✓    | ×    | ×   | 0.7043          | 0.7355           | 0.8164     | 0.8451      |
| ×    | ✓    | ×   | 0.6932          | 0.7251           | 0.8041     | 0.8355      |
| ×    | ×    | ✓   | 0.6843          | 0.7151           | 0.7918     | 0.8189      |
| ×    | ✓    | ✓   | 0.7241          | 0.7598           | 0.8411     | 0.8741      |
| ✓    | ✓    | ✓   | 0.7478          | 0.7822           | 0.8545     | 0.8942      |
Table 2. Ablation experiments with the loss function. CE denotes the cross-entropy loss, Dice denotes the Dice loss, and Ratio denotes the ratio of α to β in Equation (14). The best results are shown in bold.
| CE | Dice | Ratio (α:β) | SegRap2023 mIoU | SegRap2023 mDice | PDDCA mIoU | PDDCA mDice |
|----|------|-------------|-----------------|------------------|------------|-------------|
| ✓  | ×    | —           | 0.7255          | 0.7613           | 0.8378     | 0.8798      |
| ×  | ✓    | —           | 0.7135          | 0.7589           | 0.8276     | 0.8681      |
| ✓  | ✓    | 1:1         | 0.7372          | 0.7764           | 0.8436     | 0.8842      |
| ✓  | ✓    | 2:1         | 0.7478          | 0.7822           | 0.8545     | 0.8942      |
| ✓  | ✓    | 3:1         | 0.7391          | 0.7781           | 0.8492     | 0.8897      |
| ✓  | ✓    | 1:2         | 0.7295          | 0.7710           | 0.8398     | 0.8791      |
| ✓  | ✓    | 1:3         | 0.7218          | 0.7643           | 0.8341     | 0.8746      |
Table 3. Comparative results of OAR-UNet and seven other methods: the classic UNet, UNet++, ResUNet, TransUNet, and SwinUNet, and the state-of-the-art SegReg and FocusNetv2. The best results are shown in bold.
| Method          | SegRap2023 mIoU | SegRap2023 mDice | PDDCA mIoU | PDDCA mDice |
|-----------------|-----------------|------------------|------------|-------------|
| UNet [8]        | 0.6542          | 0.6851           | 0.7715     | 0.8074      |
| UNet++ [9]      | 0.6689          | 0.7007           | 0.7811     | 0.8123      |
| ResUNet [10]    | 0.6734          | 0.7015           | 0.7826     | 0.8193      |
| TransUNet [12]  | 0.6918          | 0.7202           | 0.7945     | 0.8310      |
| SwinUNet [14]   | 0.7101          | 0.7387           | 0.8068     | 0.8432      |
| SegReg [18]     | 0.7213          | 0.7674           | 0.8352     | 0.8752      |
| FocusNetv2 [17] | 0.7362          | 0.7695           | 0.8223     | 0.8785      |
| OAR-UNet        | 0.7478          | 0.7822           | 0.8545     | 0.8942      |