Article

Swin Transformer Assisted Prior Attention Network for Medical Image Segmentation

1 School of Information and Engineering, Nanchang University, Nanchang 330031, China
2 School of Mathematics and Computer Sciences, Nanchang University, Nanchang 330031, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2022, 12(9), 4735; https://doi.org/10.3390/app12094735
Submission received: 20 April 2022 / Revised: 5 May 2022 / Accepted: 6 May 2022 / Published: 8 May 2022
(This article belongs to the Special Issue Selected Papers from the ICCAI and IMIP 2022)

Featured Application

The proposed Swin-PANet can be utilized for computer-aided diagnosis (CAD) of skin cancer or cell cancer to improve segmentation efficiency and accuracy. Accurate segmentation is a key technique for screening diseased or abnormal regions in patients, helping doctors better evaluate disease and optimize prevention measures.

Abstract

Transformer-complemented convolutional neural networks (CNNs) have achieved better performance than purely CNN-based methods. In particular, Transformers have been combined with U-shaped structures, skip-connections, encoders, or all of them together. However, the intermediate supervision network based on the coarse-to-fine strategy has not yet been combined with a Transformer to improve the generalization of CNN-based methods. In this paper, we propose Swin-PANet, which applies the window-based self-attention mechanism of the Swin Transformer in an intermediate supervision network, called the prior attention network. A new enhanced attention block based on CCA is also proposed to aggregate the features from skip-connections and the prior attention network, and to further refine boundary details. Swin-PANet addresses the dilemma that traditional Transformer networks have poor interpretability in the attention calculation: it inserts its attention predictions into the prior attention network for intermediate supervision learning, which is humanly interpretable and controllable. Hence, the intermediate supervision network assisted by the Swin Transformer provides better attention learning and interpretability for accurate and automatic medical image segmentation. The experimental results confirm the effectiveness of Swin-PANet, which outperforms state-of-the-art methods on well-known medical segmentation tasks including cell and skin lesion segmentation.

1. Introduction

Medical image segmentation is a key technique for the accurate screening of diseased or abnormal regions in patients, helping doctors better evaluate disease and optimize prevention measures. The evaluation and analysis of pathologies based on lesion segmentation provide valuable information, such as the progression of disease, which helps physicians improve the quality of clinical diagnosis, monitor treatment strategies, and efficiently predict a patient's outcome. For instance, cell segmentation in microscopic images is a critical challenge in biological study, clinical practice, and disease diagnosis. Robust segmentation of plasma cells is the initial step towards detecting malignant cells in Multiple Myeloma (MM), a type of blood cancer. Given the voluminous data accessible, there is an increasing demand for automated methods and tools for cell analysis. Furthermore, due to variable intra-cellular and inter-cellular dynamics, as well as the structural features of cells, there is a constant need for more accurate and effective segmentation models. Hence, accurate medical image segmentation is of great significance for computer-aided diagnosis and image-guided clinical surgery [1,2,3].
Convolutional neural networks based on the U-shaped topology are widely popular in medical image segmentation. However, despite great breakthroughs in this field, CNN-based methods mostly show limitations in capturing features to model global dependencies, caused by the inherent locality of the convolution operation. The attention mechanism, inspired by human visual cognition and perception, is a useful solution to this problem. Recently, an excellent work called prior attention network [4] proposed a novel structure consisting of a U-shaped convolutional neural network and an intermediate supervision network between the encoder and decoder. It follows the coarse-to-fine strategy and the intermediate supervision strategy. In particular, it proposes an attention-based method called the attention guiding decoder in the intermediate supervision network, which takes the rich information of multi-scale features from the encoder to generate spatial attention maps that guide the final segmentation of the decoder. Moreover, the process of traditional attention learning is often not humanly interpretable, and the regions focused on by the network are usually different from the regions that we pay attention to. In the prior attention network, the intermediate supervision learning that guides the next step of segmentation provides a degree of interpretability. However, this non-local attention mechanism has a poor capability for aggregating multi-scale features from different modules and extracting boundary information.
Different from non-local attention-based methods, the Transformer has been proposed as an alternative for modeling long-range dependencies through its self-attention mechanism. Current research directions of Transformers for medical image segmentation are pure Transformer networks and hybrid Transformer networks. An interesting work called UCTransNet [5] investigates the potential limitation and semantic gap of the skip-connections between the encoder and decoder in U-Net, and proposes a channel-wise Transformer module to replace the skip-connections for better fusing multi-scale features from the encoder and reducing the semantic gap. Combining the above works, it is feasible to apply a Transformer-based method in the intermediate supervision network to further improve the performance of medical image segmentation, enhance the transferability of the prior attention network, and provide a degree of interpretability for the Transformer when performing the dual supervision strategy. In particular, the Swin Transformer [6] was proposed to expand the applicability of vision Transformers and to serve as a general backbone in computer vision. Its key design is the shifted window-based attention mechanism, in which shifted windows bridge the windows of consecutive layers and construct connections that further enhance the power of modeling long-range dependencies.
Motivated by the excellent design of the prior attention network and the Swin Transformer's success, we propose a novel structure called Swin Transformer Assisted Prior Attention Network (Swin-PANet) to further leverage the power of Transformers for medical image segmentation. To the best of our knowledge, Swin-PANet is the first dual supervision network structure assisted by a Transformer. It consists of two components: a prior attention network and a hybrid Transformer network. Both networks are Transformer-complemented CNN methods and follow the dual supervision strategy, in which the prior attention network performs intermediate supervision learning and the hybrid Transformer network then performs direct supervision learning. The proposed Swin-PANet follows the coarse-to-fine strategy, achieving the two steps of medical segmentation in a single network. A Swin Transformer block and an attention guiding decoder are combined to form the prior attention network, which generates the attention prediction. Swin-PANet then performs the second step of segmentation by inserting the attention prediction into the hybrid Transformer network, which further enhances attention learning by applying the window-based self-attention mechanism, compared with the traditional prior attention network. The hybrid Transformer network is essentially based on the U-Net structure, and its key component is the enhanced attention block, which is based on the CCA of UCTransNet and is used in the decoder layers of the hybrid Transformer network. Different from the traditional CCA, the enhanced attention block can aggregate the features from skip-connections, the previous level of decoder blocks, and the prior attention network to further improve attention learning.
This paper follows a structure of deduction, modeling design, and summary. The Introduction in Section 1 defines the research question and briefly introduces the research work and contributions. State of the Art and Related Work in Section 2 covers the contributions and outcomes of CNN-based methods, vision Transformers, and Transformer-complemented CNNs. Modeling, Methods, and Design in Section 3 provides a detailed introduction of the proposed Swin-PANet. Section 4 and Section 5 provide the experimental setup and results, and the conclusion of the research work, respectively.
The contributions of this work lie in three aspects:
  • A novel network structure called Swin-PANet is proposed for medical image segmentation, which consists of a prior attention network and a hybrid Transformer network. The proposed Swin-PANet follows the coarse-to-fine strategy and dual supervision strategy to improve the segmentation performance.
  • The proposed Swin-PANet addresses the dilemma that traditional Transformer networks have poor interpretability in the attention calculation: Swin-PANet inserts its attention predictions into the prior attention network for intermediate supervision learning, which is humanly interpretable and controllable.
  • An enhanced attention block is proposed and equipped in the hybrid Transformer network to fuse global and local contexts along the channel dimension for better attention learning.

2. State of the Art and Related Work

2.1. CNN-Based Methods

Early research on medical image segmentation predominantly focused on contour-based algorithms and traditional machine learning algorithms [7,8]. Subsequently, with the development of convolutional neural networks, the U-shaped [9,10,11] structure was proposed, where the encoder extracts features through the receptive fields of convolution operations and the decoder up-samples the extracted features to generate a prediction map with pixel-level scores. Ronneberger et al. [9] proposed a novel network called U-Net for medical image segmentation, which bridges the semantic information gap between encoder and decoder and has proven effective in recovering detailed features of the target [12,13]. Owing to the simplicity of the encoder-decoder U-shaped structure and its outstanding performance in various computer vision tasks, many novel U-Net-like models have been proposed, such as UNet++ [14], Res-UNet [15], Attention U-Net [16], nnUNet [17], and UNet 3+ [18], all of which achieve superior performance in medical image segmentation. U-Net-like models have also been applied to the segmentation of three-dimensional (3D) datasets, such as V-Net [19] and 3D U-Net [20]. Currently, CNN-based methods enjoy great popularity and have made eminent contributions to medical image segmentation due to their remarkable ability of feature extraction and representation. However, because of the inherent locality of the convolution operation caused by limited receptive fields, it is difficult for existing CNN-based methods to capture global contextual information [2]. Some difficult medical image tasks are therefore hard to address, and CNN-based methods cannot fully meet their requirements for segmentation accuracy. A great deal of effort has been devoted to addressing this problem by using atrous convolution layers [21,22], self-attention mechanisms [23,24], and image pyramids [25].

2.2. Vision Transformers

The Transformer [26] was initially proposed for machine translation and has made eminent contributions to natural language processing (NLP). Inspired by this design, researchers proposed the vision transformer (ViT) [27], which treats an image as a sequence of patches and models long-range dependencies using self-attention mechanisms [27,28,29]. ViT achieved top-1 performance on ImageNet classification by processing images with the Transformer self-attention mechanism. However, ViT has a disadvantage compared with CNN-based methods because it needs to be pre-trained on a large dataset before formal training. To address this problem, DeiT [30] applied several training strategies to help ViT train better on ImageNet, such as integrating a distillation method to enhance the robustness and scalability of the network. Recently, several excellent studies have built on the vision transformer [6,31,32]. In particular, the Swin Transformer, an efficient and effective hierarchical vision Transformer, adopts a hierarchical window scheme that imitates the way CNNs enlarge the receptive field, and is designed as a backbone for vision tasks such as ImageNet classification, object detection, and semantic segmentation based on its shifted windows mechanism. It outperforms state-of-the-art methods such as ViT/DeiT [27,30] and ResNet models [12] with similar parameters and latency on these tasks. The Swin Transformer obtains 53.5 mIoU on ADE20K, an improvement of +3.2 mIoU over state-of-the-art methods such as SETR, and achieves an 87.3% top-1 accuracy on ImageNet-1K image classification. Different from previous vision transformer works, the Swin Transformer makes two outstanding contributions [6]: (1) it constructs hierarchical feature maps of the image with linear computational complexity, making it suitable as a general backbone for various computer vision tasks; and (2) it proposes a key design in which shifted windows are applied between consecutive attention layers, enhancing modeling power while remaining computation-efficient. Motivated by the Swin Transformer's tremendous success in various vision tasks, Swin-Unet [3] was the first to apply a pure Transformer-based U-shaped architecture to replace the traditional U-shape composed of convolution blocks.

2.3. Transformer Complements CNNs

In the last few years, researchers have attempted to apply the self-attention mechanism to complement CNNs in order to further improve the generalization ability of the network [24]. Schlemper et al. [23] proposed appending an additive attention gate to the skip connections to perform accurate segmentation of medical images. In particular, PANet [4] reaches competitive segmentation performance with significantly lower additional computational cost by combining the two phases of a cascaded network via a spatial attention mechanism, making it much easier to train and apply in production. It also introduces a spatial attention-based strategy that integrates the attention of lesion sites with the features, allowing effective synchronous training of multiple modules in the network, and incorporates an intermediate supervision strategy into the proposed network to improve convergence and performance. Compared with the cascaded U-Net, the Prior Attention Network (PANet) achieves competitive results with substantially lower computational cost for 2D segmentation. Nevertheless, it is still essentially based on a convolutional neural network backbone and belongs to the CNN-based methods. More recently, TransUNet was proposed to use an integrated CNN-Transformer architecture to exploit rich spatial information from the CNN and global semantic information encoded by the Transformer. Similar to TransUNet [2], TransFuse [33], MedT [34], Swin-Unet [3], and UTNet [35] use the complementarity of CNNs and Transformers to improve the feature extraction and generalization ability of the model, and the results show that this hybrid architecture has superior performance. Various combinations of CNNs and Transformers have also been applied to 3D medical image segmentation, such as multi-modal brain tumor segmentation [36].

3. Modeling, Methods, and Design

In this section, we give a detailed introduction of Swin-PANet. First, Section 3.1 presents an overview of the proposed network, which consists of a prior attention network and a hybrid Transformer network. Section 3.2 then details the Swin Transformer block and the attention guiding decoder. Section 3.3 explains the hybrid Transformer network with the enhanced attention block, which receives multiple features for aggregation and refinement. Section 3.4 introduces the dual supervision strategy that achieves two steps of segmentation in a single network.

3.1. Overview of Network Structure

Swin-PANet consists of a prior attention network and a hybrid Transformer network, as illustrated in Figure 1. The prior attention network, assisted by the Swin Transformer, performs intermediate supervision learning, and the hybrid Transformer network with enhanced attention blocks performs direct supervision learning. The dual supervision strategy enhances the performance and interpretability of the Transformer attention mechanism and provides a humanly interpretable way to guide attention learning in the Transformer. In the prior attention network, the Swin Transformer block and the attention guiding decoder are cascaded to receive multi-scale features for shifted-window attention learning and feature fusion. The attention prediction output by the prior attention network is fed into the enhanced attention blocks to guide the subsequent direct supervision learning. The hybrid Transformer network is modified from the U-Net [9] structure: it adopts a U-shaped topology with skip-connections between encoder and decoder, and enhanced attention blocks that guide the information filtration and cross attention of the Transformer attention features along the channel dimension. The enhanced attention block at each decoder layer receives the multi-scale features of the previous block, the attention prediction generated by the prior attention network, and the features from the corresponding skip-connection, which ensures the consistency between encoder and decoder and recovers the information discarded by convolution operations. The enhanced attention block also fuses global and local contexts along the channel dimension for better attention learning. The coarse process of the Swin Transformer and attention guiding decoder and the finer process of the enhanced attention blocks are integrated into the proposed coarse-to-fine strategy.

3.2. Swin Transformer and Attention Guiding Decoder

Different from the typical multi-head self-attention (MSA) mechanism in the vision transformer, the Swin Transformer [6] implements self-attention based on shifted windows. The shifted windows are applied between consecutive attention layers, which enhances modeling power while remaining computation-efficient. At the same time, hierarchical feature maps of the image are constructed with linear computational complexity, making the Swin Transformer more suitable as a general backbone for various computer vision tasks. Since the non-local attention mechanism in the traditional prior attention network has a poor capability for aggregating multi-scale features from different modules and extracting boundary information, we insert the Swin Transformer block into the prior attention network to enhance the modeling power of the network. As shown in Figure 2, two cascaded Swin Transformer modules construct one complete block. Each Swin Transformer module consists of LayerNorm layers (LN), a multi-head self-attention module (MSA), a multilayer perceptron (MLP) with the GELU non-linearity, and two residual connections. The key difference [6] is that the window-based multi-head self-attention (W-MSA) module is applied in the first Swin Transformer module, and the shifted window-based multi-head self-attention (SW-MSA) module is applied in the second. Based on two successive Transformer modules with regular and shifted window partitioning, the attention learning process of the Swin Transformer block can be formulated as:
$\hat{z}^l = \text{W-MSA}(\text{LN}(z^{l-1})) + z^{l-1}$ (1)
$z^l = \text{MLP}(\text{LN}(\hat{z}^l)) + \hat{z}^l$ (2)
$\hat{z}^{l+1} = \text{SW-MSA}(\text{LN}(z^l)) + z^l$ (3)
$z^{l+1} = \text{MLP}(\text{LN}(\hat{z}^{l+1})) + \hat{z}^{l+1}$ (4)
where $\hat{z}^l$ and $z^l$ denote the outputs of the (S)W-MSA module and the MLP module of the $l$-th block, respectively. Window-based self-attention is calculated as:
$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d}} + B\right)V$ (5)
where $Q, K, V \in \mathbb{R}^{M^2 \times d}$ denote the query, key, and value matrices, $d$ denotes their dimension, and $M^2$ denotes the number of patches in a window. The values of the matrix $B$ are taken from the bias matrix $\hat{B} \in \mathbb{R}^{(2M-1) \times (2M-1)}$.
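For concreteness, the following is a minimal PyTorch sketch of the residual wiring in Equations (1)–(4); it is illustrative rather than the authors' implementation. nn.MultiheadAttention is used only as a stand-in for the window-based attention of Equation (5): a real Swin Transformer block partitions the tokens into M × M windows, adds the relative position bias B, and cyclically shifts the windows in the second module.

import torch.nn as nn

class SwinBlockPairSketch(nn.Module):
    """Sketch of two successive Swin Transformer modules (Equations (1)-(4))."""
    def __init__(self, dim, num_heads, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn1 = nn.MultiheadAttention(dim, num_heads, batch_first=True)  # stand-in for W-MSA
        self.norm2 = nn.LayerNorm(dim)
        self.mlp1 = nn.Sequential(nn.Linear(dim, int(dim * mlp_ratio)), nn.GELU(),
                                  nn.Linear(int(dim * mlp_ratio), dim))
        self.norm3 = nn.LayerNorm(dim)
        self.attn2 = nn.MultiheadAttention(dim, num_heads, batch_first=True)  # stand-in for SW-MSA
        self.norm4 = nn.LayerNorm(dim)
        self.mlp2 = nn.Sequential(nn.Linear(dim, int(dim * mlp_ratio)), nn.GELU(),
                                  nn.Linear(int(dim * mlp_ratio), dim))

    def forward(self, z):                                    # z: (B, L, C) token sequence
        h = self.norm1(z)
        z = self.attn1(h, h, h, need_weights=False)[0] + z   # Equation (1)
        z = self.mlp1(self.norm2(z)) + z                     # Equation (2)
        h = self.norm3(z)
        z = self.attn2(h, h, h, need_weights=False)[0] + z   # Equation (3)
        z = self.mlp2(self.norm4(z)) + z                     # Equation (4)
        return z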
In traditional cascaded networks, the first step is to perform a coarse segmentation and find the ROIs in medical images. In the typical prior attention network, this coarse ROI-finding is performed by the attention guiding decoder. In the proposed Swin-PANet, the attention guiding decoder is used to generate an ROI-related attention prediction from the outputs of the Swin Transformer. The attention guiding decoder refines the feature representations and improves the quality of segmentation. The refined features are then sent to the multi-level enhanced attention blocks for finer segmentation. The coarse process of the Swin Transformer and attention guiding decoder and the finer process of the enhanced attention blocks are integrated into the proposed coarse-to-fine strategy. As shown in Figure 3, the feature maps E1 to E4 extracted from the corresponding Swin Transformer blocks are fed into the attention guiding decoder for feature fusion.
Since the encoded feature maps have different spatial sizes, bilinear interpolation is first used to up-sample them to the same spatial size. The up-sampled feature map is then compressed to suppress irrelevant information and reduce computational cost. Next, the encoded feature of the previous level and the compressed feature are concatenated along the channel dimension for feature fusion. Proceeding in this way, the four encoded feature maps are finally fused together to generate the attention prediction after a convolution layer with a sigmoid.
We use $E_i \in \mathbb{R}^{C_i \times H_i \times W_i}, i \in (1,4)$, to represent the feature maps extracted from the outputs of the Swin Transformer blocks, where $E_1$ represents the shallowest feature and $E_4$ represents the deepest feature. The process of the attention guiding decoder can be formulated as:
$D_3 = \left(W_{c4}^T E_4\right) \oplus E_3$ (6)
$D_2 = \left(W_{c3}^T\left(W_3^T D_3\right)\right) \oplus E_2$ (7)
$D_1 = \left(W_{c2}^T\left(W_2^T D_2\right)\right) \oplus E_1$ (8)
where $D_3$ represents the fused feature of $E_4$ and $E_3$, $D_2$ represents the fused feature of $E_3$ and $E_2$, and $D_1$ represents the fused feature of $E_2$ and $E_1$; $W_{c4}$, $W_{c3}$, and $W_{c2}$ represent the corresponding channel-compression convolutions; $W_4$, $W_3$, and $W_2$ represent the fusion convolutions; and $\oplus$ represents feature concatenation.
At the end of the attention guiding decoder, the output $Y$ is computed and sent to the loss function of the prior attention network:
$Y = \sigma\left(W_{out}^T\left(W_1^T D_1\right)\right)$ (9)
where $W_1$ represents the fusion convolution that fuses $E_1$ and $E_2$, $W_{out}$ represents the output convolution, and $\sigma$ represents the sigmoid activation. Once the loss has been computed and the attention predictions have been inserted into the enhanced attention blocks, the forward propagation of the prior attention network is complete. In particular, the network parameters are saved and the parameter update for this loss is configured in the code, which realizes the intermediate supervision learning of the prior attention network.
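A schematic PyTorch sketch of this fusion path (Equations (6)–(9)) is given below. The kernel sizes, channel widths, and the exact position of the bilinear up-sampling are assumptions for illustration, not the authors' configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGuidingDecoderSketch(nn.Module):
    """Sketch of the attention guiding decoder: fuse E4..E1 top-down into an attention prediction Y."""
    def __init__(self, channels):                # channels = [C1, C2, C3, C4], shallow to deep
        super().__init__()
        c1, c2, c3, c4 = channels
        self.compress4 = nn.Conv2d(c4, c3, kernel_size=1)              # W_c4
        self.fuse3 = nn.Conv2d(2 * c3, c3, kernel_size=3, padding=1)   # W_3
        self.compress3 = nn.Conv2d(c3, c2, kernel_size=1)              # W_c3
        self.fuse2 = nn.Conv2d(2 * c2, c2, kernel_size=3, padding=1)   # W_2
        self.compress2 = nn.Conv2d(c2, c1, kernel_size=1)              # W_c2
        self.fuse1 = nn.Conv2d(2 * c1, c1, kernel_size=3, padding=1)   # W_1
        self.out = nn.Conv2d(c1, 1, kernel_size=1)                     # W_out

    @staticmethod
    def up(x, ref):                              # bilinear up-sampling to the reference spatial size
        return F.interpolate(x, size=ref.shape[2:], mode="bilinear", align_corners=False)

    def forward(self, e1, e2, e3, e4):
        d3 = torch.cat([self.compress4(self.up(e4, e3)), e3], dim=1)              # Equation (6)
        d2 = torch.cat([self.compress3(self.up(self.fuse3(d3), e2)), e2], dim=1)  # Equation (7)
        d1 = torch.cat([self.compress2(self.up(self.fuse2(d2), e1)), e1], dim=1)  # Equation (8)
        return torch.sigmoid(self.out(self.fuse1(d1)))                            # Equation (9)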

3.3. Hybrid Transformer Network

To better perform the dual supervision strategy and fuse multi-scale features of inconsistent semantics between the prior attention network and the hybrid Transformer network, the enhanced attention block is proposed to guide the information filtration and cross attention of the Transformer attention features along the channel dimension. The enhanced attention block is placed in the decoder layers of the hybrid Transformer network, and it fuses global and local contexts along the channel dimension for better attention learning. Compared with the traditional CCA module, the difference is that it receives multiple features: from the previous enhanced attention block, from the corresponding skip-connection between encoder and decoder, and from the prior attention network assisted by the Swin Transformer. As shown in Figure 4, we take the $i$-th level output of the enhanced attention block $D_i \in \mathbb{R}^{C \times H \times W}$, the attention prediction $M$, and the $i$-th level output of the skip-connections $\hat{O}_i$ as the input of the next-level enhanced attention block, where $C$, $H$, and $W$ denote the number of channels and the height and width of the features, respectively. Compared with traditional residual blocks, we set a learnable parameter $\alpha_i, i \in (1,4)$, in the residual paths, which is updated by back propagation and helps retain effective features while adding non-linearity to the process of integrating and refining the features. Owing to the channel-axis attention of the hybrid Transformer network, $\alpha_i$ can activate the effective channels and restrain the useless ones, increasing the convergence speed of the network while preserving segmentation performance.
The refined feature map $R_i, i \in (1,4)$, after the residual module is computed as follows:
$R_i = \alpha_i \hat{O}_i + M \odot \hat{O}_i, \quad i \in (1,4)$ (10)
Spatial squeeze is performed through global average pooling (GAP) [5], which produces a vector $\kappa(X) \in \mathbb{R}^{C \times 1 \times 1}$; the $k$-th channel of this vector is formulated as:
$\kappa(X)_k = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} X_k(i,j)$ (11)
Through this operation, we can embed global spatial information to generate the attention map:
$\hat{M} = L_1 \kappa(R_i) + L_2 \kappa(D_i) = L_1 \kappa\left(\alpha_i \hat{O}_i + M \odot \hat{O}_i\right) + L_2 \kappa(D_i)$ (12)
where $L_1$ and $L_2$ represent the weight matrices of the linear layers followed by ReLU layers. The operation in Equation (12) encodes the dependency in the channel dimension.
We follow ECA-Net [37], which experimentally shows the importance of avoiding dimensionality reduction for better attention learning. Hence, there is only one linear layer and a sigmoid function before the element-wise multiplication. The output $D_{i+1}$ can be formulated as:
$D_{i+1} = \sigma\left(\hat{M}\right) \odot \hat{O}_i$ (13)
where $\sigma(\cdot)$ represents the sigmoid activation, and $\sigma(\hat{M})$ gives the attention weight of each channel. The predicted feature $D_{i+1}$ is then fed into the next enhanced attention block together with the attention prediction and the next-level output of the multi-head cross fusion Transformer. After the first step of attention learning in the Swin Transformer and attention guiding decoder, the enhanced attention block receives the attention prediction and fuses it with the other features, acquiring semantic-rich features and recovering original image information lost during attention calculation. In the second step of attention learning, the enhanced attention block guides the information filtration and cross attention of the attention features along the channel dimension, which enhances the ability to extract global context features and model long-range dependencies channel-wise.
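A minimal PyTorch sketch of Equations (10)–(13) follows. It assumes the attention prediction M is a single-channel map broadcast over the channels of the skip-connection feature, and that L_1 and L_2 are single linear layers followed by ReLU; the layer shapes are illustrative, not the authors' exact configuration.

import torch
import torch.nn as nn

class EnhancedAttentionBlockSketch(nn.Module):
    """Sketch of the enhanced attention block: channel attention from O_i, M, and D_i."""
    def __init__(self, channels):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1))   # learnable residual weight alpha_i
        self.gap = nn.AdaptiveAvgPool2d(1)         # kappa(.): global average pooling, Equation (11)
        self.l1 = nn.Sequential(nn.Linear(channels, channels), nn.ReLU(inplace=True))  # L_1
        self.l2 = nn.Sequential(nn.Linear(channels, channels), nn.ReLU(inplace=True))  # L_2

    def forward(self, o_i, m, d_i):
        # o_i: skip-connection feature (B, C, H, W); m: attention prediction (B, 1, H, W);
        # d_i: previous-level decoder feature (B, C, h, w)
        r_i = self.alpha * o_i + m * o_i                                               # Equation (10)
        m_hat = self.l1(self.gap(r_i).flatten(1)) + self.l2(self.gap(d_i).flatten(1))  # Equation (12)
        weights = torch.sigmoid(m_hat).unsqueeze(-1).unsqueeze(-1)
        return weights * o_i                                                           # Equation (13)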

3.4. Dual Supervision Strategy

Traditional attention mechanisms in medical image segmentation generate attention maps automatically, and their internal attention process is not interpretable, which can cause the network to focus excessively on regions that are not of interest. The performance and generalization capacity of the network can be greatly enhanced if this process is humanly interpretable and regulated. Hence, a dual supervision learning strategy is proposed to train the network. The loss functions of the prior attention network and the hybrid Transformer network, $\mathcal{L}_1$ and $\mathcal{L}_2$, are both the weighted BCE-Dice loss. In a multi-lesion segmentation task with $C$ types of lesions, let $G_i, i \in (1,\dots,C)$, denote the $i$-th type of lesion, where the foreground represents the specific type of lesion and the background represents everything else. The binary ground truth map $G_b$ is formulated as:
$G_b = \bigcup_{i=1}^{C} G_i$ (14)
The loss function is then utilized to calculate the loss between attention prediction and binary ground truth:
$\mathcal{L}_{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[g_i \log(p_i) + (1 - g_i)\log(1 - p_i)\right]$ (15)
$\mathcal{L}_{Dice} = 1 - \frac{2\sum_i P_i G_i}{\sum_i P_i + \sum_i G_i}$ (16)
$\mathcal{L}_1 = \mathcal{L}_2 = \mathcal{L}(G_b, M) = w_{BCE}\mathcal{L}_{BCE} + w_{Dice}\mathcal{L}_{Dice}$ (17)
where $\mathcal{L}(\cdot)$ represents the binary loss function and $M$ represents the attention prediction from the attention guiding decoder; $g_i \in \{0,1\}$ and $p_i \in [0,1]$ denote the ground truth annotation and the probability map; $P_i$ denotes the probability that the $i$-th pixel belongs to the segmentation region, and $G_i$ denotes the ground truth of the $i$-th pixel; $w_{BCE}$ and $w_{Dice}$ are both set to 0.5 by default. The computed loss $\mathcal{L}_1$ is used to supervise the parameter update of the attention guiding decoder. The attention predictions of the prior attention network are then inserted into the enhanced attention blocks to guide the final step of segmentation. The dual supervision strategy enhances the performance and interpretability of the Transformer attention mechanism and provides a humanly interpretable way to guide attention learning in the Transformer.
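The following is a minimal PyTorch sketch of the weighted BCE-Dice loss in Equations (15)–(17), applied to a sigmoid probability map and a binary ground truth of shape (B, 1, H, W); the epsilon term and the flattening are assumptions for numerical stability, not the authors' exact implementation. In Swin-PANet the same loss is used twice: once on the attention prediction of the attention guiding decoder (intermediate supervision) and once on the final output of the hybrid Transformer network (direct supervision).

import torch

def weighted_bce_dice_loss(pred, target, w_bce=0.5, w_dice=0.5, eps=1e-6):
    """Weighted sum of binary cross-entropy and Dice loss, Equations (15)-(17)."""
    pred, target = pred.flatten(1), target.flatten(1)
    bce = -(target * torch.log(pred + eps)
            + (1.0 - target) * torch.log(1.0 - pred + eps)).mean()              # Equation (15)
    dice = 1.0 - (2.0 * (pred * target).sum(dim=1)
                  / (pred.sum(dim=1) + target.sum(dim=1) + eps)).mean()         # Equation (16)
    return w_bce * bce + w_dice * dice                                          # Equation (17)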

4. Experimental Results

In this section, experiments on different publicly available datasets are conducted to evaluate the effectiveness and performance of Swin-PANet. An ablation study is then designed to further explore the validity of the prior attention network and the enhanced attention block. The code is available via the link in the Supplementary Materials.

4.1. Experiment Setup

4.1.1. Datasets

For an overall comparison with other state-of-the-art methods in the direction of Transformer-complemented CNNs, the gland segmentation dataset [38] and the MoNuSeg dataset [39] are selected to evaluate the proposed method. Glands are important histological structures and the main mechanism for secreting proteins, and their nuclear images have been used to assess the degree of malignancy. Similarly, the MoNuSeg dataset is obtained by carefully annotating tissue images of patients' tumors from different organs. The gland segmentation dataset (GlaS) contains 85 images for training and 80 images for testing. The MoNuSeg dataset contains 30 images for training and 14 images for testing. Experiments on the GlaS and MoNuSeg datasets follow the same experimental protocols as recent research [5], in which the datasets are divided according to the training and test sets provided by the competition, and five-fold cross-validation is conducted for a fair comparison, as sketched below.
Furthermore, to investigate the performance of the proposed modules and further demonstrate the effectiveness of Swin-PANet, the skin lesion segmentation task of the International Symposium on Biomedical Imaging (ISBI) 2016 is implemented with the proposed Swin-PANet. The skin lesion segmentation datasets are acquired from different medical centers and archived by the International Skin Imaging Collaboration (ISIC). The ISIC conducts a challenging competition called Skin Lesion Analysis Toward Melanoma Detection to improve the quality of melanoma diagnosis. The ISIC 2016 dataset contains 900 images for training and 379 images for testing.
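As a minimal sketch of the five-fold cross-validation split mentioned above, the snippet below partitions a list of training image identifiers with scikit-learn; the random seed is an assumption, not a value reported by the authors.

from sklearn.model_selection import KFold

def five_fold_splits(image_ids, seed=42):
    """Return five (train_indices, val_indices) pairs over the training images."""
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    return [(train_idx.tolist(), val_idx.tolist()) for train_idx, val_idx in kf.split(image_ids)]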

4.1.2. Implementation Details

The proposed Swin-PANet is implemented on a single NVIDIA RTX 3060 Ti. The experimental setup includes PyCharm 2021, Python 3.5.7, and the PyTorch 1.9.0 framework on an Ubuntu 20.04 server. Code snippets of the network structure are given in Appendix A. For the final step of segmentation in the enhanced attention block, bilinear interpolation in the attention guiding decoder is applied with a scale factor of 2. All images in the above datasets are resized to 224 × 224 for memory and computation efficiency. To better explore the performance of Swin-PANet, the comparative experiments are conducted with data augmentation including vertical and horizontal flips, random rotation, and random scale change (limited to 0.9–1.1). Data augmentation changes the size and shape of images to enrich the samples, and also enhances the generalization ability of Swin-PANet in modeling global context dependencies. To achieve faster convergence in training, the Adam optimizer is employed with a learning rate of 0.005. The Adam optimizer [40] is straightforward to implement, memory- and computation-efficient, and well suited to networks with many parameters. The batch size is set to 2 and the model is trained for 300 epochs. The above dataset divisions follow the same experimental protocols as UCTransNet [5]. In addition, no pre-trained weights are used for training the proposed network. The cross-entropy loss and Dice loss are combined as the loss function of Swin-PANet. Note that the above settings and loss function are used for training all models in the experiments.
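The snippet below sketches the augmentation and optimizer settings described above. The torchvision transforms are stand-ins for whatever augmentation library was actually used, and the rotation range is an assumption; for segmentation, the same geometric transform must of course also be applied to the mask.

import torch
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((224, 224)),                          # all images resized to 224 x 224
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.RandomRotation(degrees=30),                  # random rotation (range assumed)
    transforms.RandomAffine(degrees=0, scale=(0.9, 1.1)),   # random scale change limited to 0.9-1.1
    transforms.ToTensor(),
])

def make_optimizer(model: torch.nn.Module) -> torch.optim.Adam:
    # Adam with the reported learning rate; batch size 2 and 300 epochs are set in the training loop.
    return torch.optim.Adam(model.parameters(), lr=0.005)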

4.1.3. Evaluation Criteria

Two widely used evaluation metrics, the Dice coefficient (Dice) and Intersection over Union (IoU), are applied. For the GlaS, MoNuSeg, and ISIC 2016 datasets, both the Dice and IoU metrics are used to evaluate the proposed Swin-PANet against other state-of-the-art methods. The metrics are defined as follows:
$Dice = \frac{2\sum_{i=1}^{N} g_i p_i + \varepsilon}{\sum_{i=1}^{N} g_i + \sum_{i=1}^{N} p_i + \varepsilon}$ (18)
$IoU = \frac{TP}{TP + FP + FN}$ (19)
where $g_i \in \{0,1\}$ and $p_i \in \{0,1\}$ denote the ground truth annotation and the prediction map, and $\varepsilon \in \mathbb{R}$ is used to prevent division by zero and improve numerical stability. $TP$, $FP$, $TN$, and $FN$ denote True Positive, False Positive, True Negative, and False Negative, respectively.
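A minimal sketch of how these two metrics can be computed for a single prediction is given below; thresholding the probability map at 0.5 is an assumption.

import torch

def dice_and_iou(pred, target, threshold=0.5, eps=1e-6):
    """Dice (Equation (18)) and IoU (Equation (19)) between a probability map and a binary mask."""
    p = (pred.flatten() > threshold).float()
    g = target.flatten().float()
    tp = (p * g).sum()
    fp = (p * (1.0 - g)).sum()
    fn = ((1.0 - p) * g).sum()
    dice = (2.0 * tp + eps) / (p.sum() + g.sum() + eps)      # Equation (18)
    iou = tp / (tp + fp + fn + eps)                          # Equation (19)
    return dice.item(), iou.item()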

4.2. Comparisons with State-of-the-Art Methods on GlaS and MoNuSeg Datasets

To verify the overall segmentation performance of the proposed Swin-PANet, we conduct a comparison with other state-of-the-art methods. Swin-PANet is compared with three types of methods for an overall evaluation: the traditional U-Net [9]; three improved U-Net methods, UNet++ [14], MultiResUNet [41], and Attention U-Net [16]; and four state-of-the-art Transformer-complemented CNN methods, MedT [34], TransUNet [2], UCTransNet [5], and Swin-Unet [3].
The experimental results on the GlaS and MoNuSeg datasets are shown in Table 1, which lists the quantitative performance of the proposed Swin-PANet and the other methods. Swin-PANet achieves the best segmentation performance: on the GlaS dataset, the improvement ranges from 3.7% (5.39%) to 5.08% (8.07%) in Dice (IoU) over the conventional U-Net-based methods, and from 1.24% (1.93%) to 3.79% (5.78%) in Dice (IoU) over the Transformer-complemented CNN methods. On the MoNuSeg dataset, the proposed Swin-PANet also consistently performs the best nuclear segmentation, even though the nuclei are much smaller and denser. The Dice metric of Swin-PANet improves from 79.87% to 81.59% compared with UCTransNet, the best competitor, and its IoU metric improves from 66.68% to 69.00%.
Furthermore, visual comparisons of segmentation on the GlaS and MoNuSeg datasets are shown in Figure 5. The first two rows show segmentation results on the GlaS dataset and the last row shows results on the MoNuSeg dataset. The proposed Swin-PANet generates better segmentation boundaries, which are more refined and closer to the ground truth than those of the above Transformer-based methods. The quantitative and qualitative comparisons demonstrate that the proposed Swin-PANet achieves excellent performance in nuclear segmentation tasks, suggesting that our method can extract fine boundary information and perform finer segmentation while preserving detailed shape information.

4.3. Ablation Studies

To further demonstrate the effectiveness of the proposed Swin-PANet, we conduct a series of ablation studies on the three major components of Swin-PANet: (1) the Swin Transformer applied in the prior attention network (Swin-Trans), (2) the attention guiding decoder used for attention fusion (AGD), and (3) the enhanced attention block used for aggregation and refinement (EAB). The ISIC 2016 dataset is used to train the proposed method for evaluating these three modules.
As shown in Table 2, integrating the Swin-Trans module into the baseline improves Dice (IoU) from 86.82% (78.23%) to 89.98% (83.43%). Integrating the AGD module into the baseline improves Dice (IoU) from 86.82% (78.23%) to 89.79% (82.81%). With AGD and EAB combined, the Dice and IoU metrics also increase by a large margin over the baseline, but do not exceed the performance of the baseline with Swin-Trans and EAB. This result indicates the importance of integrating the prior attention network with shifted-window attention for improving skin lesion segmentation. On the other hand, integrating the AGD and EAB modules into the baseline improves performance from 89.79% (82.81%) to 90.12% (83.74%) compared with the baseline with AGD, and integrating the Swin-Trans, AGD, and EAB modules improves performance from 89.98% (83.43%) to 90.68% (84.06%) compared with the baseline with Swin-Trans. These results confirm the effectiveness of the enhanced attention block (EAB) and the dual supervision strategy in Swin-PANet. The enhanced attention block can aggregate multiple features from the prior attention network (the attention prediction of the AGD) and extract contextual information to further refine boundaries.
Furthermore, visual comparisons of the ablation studies are shown in Figure 6. The first two rows represent cases with very ambiguous boundaries of the skin lesion areas. The third row represents a case where hair partially covers the skin lesion and destroys local contextual information. The last row represents a case with immense variability of the skin lesion, including irregular shape and an ambiguous boundary. The proposed Swin-PANet performs better segmentation than the other baseline combinations. As seen in the second row, the proposed method can model global dependencies between boundary pixels and accurately segment the ambiguous boundary.

4.4. Experimental Summary and Discussion

To combine the advantages of two excellent algorithms and enhance the attention ability of the network, the proposed Swin-PANet integrates both the attention guiding decoder and the Swin Transformer into the prior attention network for capturing both global contextual information and local features. Furthermore, an enhanced attention block is designed to better perform the coarse-to-fine strategy and enhance the attention ability of the network. Experiments and ablation studies on the GlaS, MoNuSeg, and ISIC 2016 datasets have been conducted for an overall evaluation. Since Swin-PANet can extract rich global contextual information with the shifted-window self-attention mechanism in the prior attention network, and the enhanced attention blocks in the hybrid Transformer network capture long-range dependencies along the channel dimension, the quantitative comparison on the GlaS and MoNuSeg datasets shows that Swin-PANet consistently outperforms the other state-of-the-art methods. Compared with UCTransNet, the proposed Swin-PANet improves Dice (IoU) from 89.84% (82.24%) to 91.42% (84.88%) on the GlaS dataset, and from 79.87% (66.68%) to 81.59% (69.00%) on the MoNuSeg dataset. On the other hand, the ISIC 2016 dataset is selected to evaluate the contribution of each module in Swin-PANet. The quantitative and visual comparisons of skin lesion segmentation confirm the effectiveness of the enhanced attention block and the dual supervision strategy. The enhanced attention block is applied in the hybrid Transformer network to achieve better fusion between global and local contexts along the channel dimension.
Although Swin-PANet performs better than some state-of-the-art methods, the proposed network still has limited transfer learning ability. As shown in Table 2, when Swin-PANet is applied to another dataset such as ISIC 2016, it achieves 90.68% Dice and 84.06% IoU. Compared with specially designed methods such as FAT-Net [42], Ms RED [43], and BAT [44], Swin-PANet still shows a considerable performance gap in skin lesion segmentation. Making Swin-PANet perform well across different segmentation tasks remains challenging, and we believe the backbone capability of the Swin Transformer and the potential of combining Transformers and CNNs can make this possible.

5. Conclusions

In this paper, we proposed a novel network structure combining two algorithms, called Swin-PANet, which follows the coarse-to-fine strategy and the dual supervision strategy and aims to perform accurate segmentation of medical images, including the challenging tasks of cell segmentation and skin lesion segmentation. The proposed Swin-PANet can be utilized for computer-aided diagnosis (CAD) of skin cancer to improve segmentation efficiency and accuracy, as accurate screening of diseased or abnormal regions helps doctors better evaluate disease and optimize prevention measures.
In conclusion, the proposed Swin-PANet integrates both the Swin Transformer and the attention guiding decoder into the prior attention network to perform intermediate supervision learning and capture global contextual information at the pixel level. Furthermore, an enhanced attention block is designed to fuse global and local contexts along the channel dimension, better perform the coarse-to-fine strategy, and enhance the attention ability of the network. Extensive experiments are conducted on three public datasets (GlaS, MoNuSeg, and ISIC 2016) for an overall evaluation of the proposed Swin-PANet. The quantitative and visual comparisons with state-of-the-art methods demonstrate the effectiveness of the proposed Swin-PANet and its excellent performance in segmentation tasks based on our coarse-to-fine and dual supervision strategies.
Although Swin-PANet achieves better performance than some state-of-the-art methods, the proposed network still has limitations in its transfer learning ability. Recent work [2] demonstrates that the Transformer has superior transferability for various downstream tasks when pre-trained. Our future work is to investigate the transferability of the combination of the Swin Transformer and CNNs, and to design a more powerful and reliable network structure for medical image segmentation.

Supplementary Materials

The code and basic information can be downloaded at https://github.com/liaozhihao2757/Swin-PANet (accessed on 27 December 2021).

Author Contributions

Conceptualization, Z.L., N.F. and K.X.; Data curation, Z.L.; Formal analysis, Z.L., N.F. and K.X.; Investigation, N.F.; Methodology, Z.L., N.F. and K.X.; Project administration, Z.L.; Resources, Z.L. and K.X.; Software, Z.L.; Supervision, Z.L., N.F. and K.X.; Validation, Z.L., N.F. and K.X.; Visualization, Z.L. and N.F.; Writing—original draft, Z.L., N.F. and K.X.; Writing—review and editing, Z.L., N.F. and K.X. These authors contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are publicly available at https://challenge.isic-archive.com/data/ (ISIC), https://monuseg.grand-challenge.org/Data/ (MoNuSeg), and https://warwick.ac.uk/fac/cross_fac/tia/data/glascontest (GlaS). The implemented code that supports the findings of this research are openly available at https://github.com/liaozhihao2757/Swin-PANet (accessed on 27 December 2021).

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. The Code Snippets of the Network Structure

n_channels = 3
n_labels = 1
epochs = 300
img_size = 224
vis_frequency = 10
early_stopping_patience = 100
pretrain = False
task_name = 'GlaS' # GlaS, MoNuSeg, and ISIC 2016
learning_rate = 0.005
batch_size = 2
Transformer_patch_sizes = [2, 4]
Transformer.dropout_rate = 0.1
base_channel = 32
import torch.nn as nn
# Note: DownBlock, UpBlock, ConvBatchNorm, and Intermediate_supervision are helper modules defined in the repository.
class SwinPANet(nn.Module):  # renamed from 'Swin-PANet', which is not a valid Python identifier
      def __init__(self, config,n_channels = 3, n_classes = 1,img_size = 224,vis = False):
        super().__init__()
        self.vis = vis
        self.n_channels = n_channels
        self.n_classes = n_classes
        in_channels = config.base_channel
        self.ConvBatchNorm = ConvBatchNorm(n_channels, in_channels)  # initial convolution block from the repository; the (in, out) signature is assumed
        self.down1 = DownBlock(in_channels, in_channels*2, nb_Conv = 2)
        self.down2 = DownBlock(in_channels*2, in_channels*4, nb_Conv = 2)
        self.down3 = DownBlock(in_channels*4, in_channels*8, nb_Conv = 2)
        self.down4 = DownBlock(in_channels*8, in_channels*8, nb_Conv = 2)
        self.prior_attention_network = Intermediate_supervision(in_channels*2, in_channels*4, in_channels*8, in_channels*8)
        self.up4 = UpBlock(in_channels*16, in_channels*4, in_channels*4, nb_Conv = 2)
        self.up3 = UpBlock(in_channels*8, in_channels*2, in_channels*4, nb_Conv = 2)
        self.up2 = UpBlock(in_channels*4, in_channels, in_channels*4, nb_Conv = 2)
        self.up1 = UpBlock(in_channels*2, in_channels, in_channels*4, nb_Conv = 2)
        self.outc = nn.Conv2d(in_channels, n_classes, kernel_size = (1,1), stride = (1,1))
        self.last_activation = nn.Sigmoid() # if using BCELoss
      def forward(self, x):
        x = x.float()
        x1 = self.ConvBatchNorm(x)
        x2 = self.down1(x1)
        x3 = self.down2(x2)
        x4 = self.down3(x3)
        x5 = self.down4(x4)
        attention_prediction = self.prior_attention_network(x1,x2,x3,x4)
        x = self.up4(x5, x4, attention_prediction)
        x = self.up3(x, x3, attention_prediction)
        x = self.up2(x, x2, attention_prediction)
        x = self.up1(x, x1, attention_prediction)
        logits = self.last_activation(self.outc(x))
        return attention_prediction, logits
images, masks = batch['image'], batch['label']
images, masks = images.cuda(), masks.cuda()
optimizer.zero_grad()
attention_prediction, preds = model(images)  # forward() returns (attention_prediction, logits)
intermediate_loss = intermediate_criterion(attention_prediction, masks.float())
out_loss = direct_criterion(preds, masks.float())
intermediate_loss.backward(retain_graph = True)  # keep the graph for the second backward pass
out_loss.backward()
optimizer.step()

References

  1. Hatamizadeh, A.; Tang, Y.; Nath, V.; Yang, D.; Myronenko, A.; Landman, B.; Xu, D. Unetr: Transformers for 3d medical image segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Online, 5 January 2022.
  2. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. Transunet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306.
  3. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation. arXiv 2021, arXiv:2105.05537.
  4. Zhao, X.; Zhang, P.; Song, F.; Ma, C.; Fan, G.; Sun, Y.; Zhang, G. Prior Attention Network for Multi-Lesion Segmentation in Medical Images. arXiv 2021, arXiv:2110.04735.
  5. Wang, H.; Cao, P.; Wang, J.; Zaiane, O.R. UCTransNet: Rethinking the Skip Connections in U-Net from a Channel-wise Perspective with Transformer. arXiv 2021, arXiv:2109.04335.
  6. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11 October 2021.
  7. Tsai, A.; Yezzi, A.; Wells, W.; Tempany, C.; Tucker, D.; Fan, A.; Grimson, W.; Willsky, A. A shape-based approach to the segmentation of medical imagery using level sets. IEEE Trans. Med. Imaging 2003, 2, 137–154.
  8. Held, K.; Kops, E.R.; Krause, B.J.; Wells, W.M.; Kikinis, R.; Muller-Gartner, H.W. Markov random field segmentation of brain mr images. IEEE Trans. Med. Imaging 1997, 16, 878–886.
  9. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5 October 2015.
  10. Isensee, F.; Jaeger, P.F.; Kohl, S.A.; Petersen, J.; Maier-Hein, K.H. nnu-net: A self-configuring method for deep learning-based biomedical image segmentation. Nat. Methods 2021, 18, 201–211.
  11. Jin, Q.; Meng, Z.; Sun, C.; Cui, H.; Su, R. Ra-unet: A hybrid deep attention-aware network to extract liver and tumor in ct scans. Front. Bioeng. Biotechnol. 2020, 8, 1471.
  12. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June 2016.
  13. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Bangalore, India, 17 August 2017.
  14. Zhou, Z.; Rahman Siddiquee, M.M.; Tajbakhsh, N.; Liang, J. Unet++: A nested u-net architecture for medical image segmentation. Deep Learn. Med. Image Anal. Multimodal Learn. Clin. Decis. Support 2018, 11045, 3–11.
  15. Xiao, X.; Lian, S.; Luo, Z.; Li, S. Weighted res-unet for high-quality retina vessel segmentation. In Proceedings of the 2018 9th International Conference on Information Technology in Medicine and Education, Hangzhou, China, 19 October 2018.
  16. Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.Y.; Kainz, B.; et al. Attention U-Net: Learning Where to Look for the Pancreas. arXiv 2018, arXiv:1804.03999.
  17. Isensee, F.; Petersen, J.; Kohl, S.A.; Jäger, P.F.; Maier-Hein, K.H. nnu-net: Breaking the spell on successful medical image segmentation. arXiv 2019, arXiv:1904.08128.
  18. Huang, H.; Lin, L.; Tong, R.; Hu, H.; Zhang, Q.; Iwamoto, Y.; Han, X.; Chen, Y.-W.; Wu, J. Unet 3+: A full-scale connected unet for medical image segmentation. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing, Online, 4 May 2020.
  19. Milletari, F.; Navab, N.; Ahmadi, S.A. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision, Stanford, CA, USA, 25 October 2016.
  20. Çiçek, Ö.; Abdulkadir, A.; Lienkamp, S.S.; Brox, T.; Ronneberger, O. 3D U-Net: Learning dense volumetric segmentation from sparse annotation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Athens, Greece, 17 October 2016.
  21. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. 2018, 40, 834–848.
  22. Gu, Z.; Cheng, J.; Fu, H.; Zhou, K.; Hao, H.; Zhao, Y.; Zhang, T.; Gao, S.; Liu, J. Ce-net: Context encoder network for 2d medical image segmentation. IEEE Trans. Med. Imaging 2019, 38, 2281–2292.
  23. Schlemper, J.; Oktay, O.; Schaap, M.; Heinrich, M.; Kainz, B.; Glocker, B.; Rueckert, D. Attention gated networks: Learning to leverage salient regions in medical images. Med. Image Anal. 2019, 53, 197–207.
  24. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18 June 2018.
  25. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21 July 2017.
  26. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
  27. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 4 May 2021.
  28. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Online, 23 August 2020.
  29. Prangemeier, T.; Reich, C.; Koeppl, H. Attention-based transformers for instance segmentation of cells in microstructures. In Proceedings of the 2020 IEEE International Conference on Bioinformatics and Biomedicine, Seoul, Korea, 16 December 2020.
  30. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, Online, 18 July 2021.
  31. Wang, W.; Xie, E.; Li, X.; Fan, D.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Online, 10 March 2021.
  32. Han, K.; Xiao, A.; Wu, E.; Guo, J.; Xu, C.; Wang, Y. Transformer in transformer. arXiv 2021, arXiv:2103.00112.
  33. Zhang, Y.; Liu, H.; Hu, Q. Transfuse: Fusing transformers and cnns for medical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Strasbourg, France, 27 September 2021.
  34. Valanarasu, J.M.J.; Oza, P.; Hacihaliloglu, I.; Patel, V.M. Medical transformer: Gated axial-attention for medical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Strasbourg, France, 27 September 2021.
  35. Gao, Y.; Zhou, M.; Metaxas, D.N. UTNet: A Hybrid Transformer Architecture for Medical Image Segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Strasbourg, France, 27 September 2021.
  36. Wang, W.; Chen, C.; Ding, M.; Yu, H.; Zha, S.; Li, J. Transbts: Multimodal brain tumor segmentation using transformer. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Strasbourg, France, 27 September 2021.
  37. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16 June 2020.
  38. Sirinukunwattana, K.; Pluim, J.P.W.; Chen, H.; Qi, X.; Heng, P.-A.; Guo, Y.B.; Wang, L.Y.; Matuszewski, B.J.; Bruni, E.; Sanchez, U.; et al. Gland Segmentation in Colon Histology Images: The GlaS Challenge Contest. Med. Image Anal. 2017, 35, 489–502.
  39. Kumar, N.; Verma, R.; Sharma, S.; Bhargava, S.; Vahadane, A.; Sethi, A. A Dataset and a Technique for Generalized Nuclear Segmentation for Computational Pathology. IEEE Trans. Med. Imaging 2017, 36, 1550–1560.
  40. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
  41. Ibtehaz, N.; Rahman, M.S. MultiResUNet: Rethinking the U-Net Architecture for Multimodal Biomedical Image Segmentation. Neural Netw. 2020, 121, 74–87.
  42. Wu, H.; Chen, S.; Chen, G.; Wang, W.; Lei, B.; Wen, Z. FAT-Net: Feature adaptive transformers for automated skin lesion segmentation. Med. Image Anal. 2022, 76, 102327.
  43. Dai, D.; Dong, C.; Xu, S.; Yan, Q.; Li, Z.; Zhang, C.; Luo, N. Ms RED: A novel multi-scale residual encoding and decoding network for skin lesion segmentation. Med. Image Anal. 2022, 75, 102293.
  44. Wang, J.; Wei, L.; Wang, L.; Zhou, Q.; Zhu, L.; Qin, J. Boundary-Aware Transformers for Skin Lesion Segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Strasbourg, France, 27 September 2021.
Figure 1. The illustration of the proposed Swin-PANet in the task of skin lesion segmentation. Swin-PANet follows a coarse-to-fine, dual-supervision strategy and consists of two components: a prior attention network assisted by Swin Transformer, which performs intermediate supervision learning, and a hybrid Transformer network with enhanced attention blocks, which performs direct learning.
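To make the coarse-to-fine, dual-supervision flow in Figure 1 concrete, the sketch below shows how an intermediate (coarse) prediction can both receive its own supervision and guide a second, directly supervised branch. The wrapper and loss names, the sigmoid attention prior, the input re-weighting, and the BCE loss are illustrative assumptions, not the exact Swin-PANet implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualSupervisionWrapper(nn.Module):
    """Coarse-to-fine wrapper: a coarse branch produces an intermediate prediction that
    both receives its own (intermediate) supervision and guides the fine branch, whose
    output receives direct supervision."""

    def __init__(self, coarse_branch: nn.Module, fine_branch: nn.Module):
        super().__init__()
        self.coarse_branch = coarse_branch  # stand-in for the prior attention network
        self.fine_branch = fine_branch      # stand-in for the hybrid Transformer network

    def forward(self, x):
        coarse_logits = self.coarse_branch(x)           # (B, 1, H, W)
        attention_prior = torch.sigmoid(coarse_logits)  # soft attention map
        guided_input = x * (1.0 + attention_prior)      # re-weight the input with the prior (illustrative)
        fine_logits = self.fine_branch(guided_input)
        return coarse_logits, fine_logits


def dual_supervision_loss(coarse_logits, fine_logits, target, alpha=0.5):
    """Weighted sum of intermediate and direct supervision (BCE used as a stand-in loss)."""
    loss_coarse = F.binary_cross_entropy_with_logits(coarse_logits, target)
    loss_fine = F.binary_cross_entropy_with_logits(fine_logits, target)
    return alpha * loss_coarse + (1.0 - alpha) * loss_fine


# Example with placeholder 1-channel branches standing in for the two sub-networks.
model = DualSupervisionWrapper(nn.Conv2d(3, 1, 1), nn.Conv2d(3, 1, 1))
x = torch.randn(2, 3, 224, 224)
y = torch.randint(0, 2, (2, 1, 224, 224)).float()
loss = dual_supervision_loss(*model(x), y)
```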
Figure 2. The illustration of two successive Swin Transformer blocks. W-MSA and SW-MSA are window-based multi-head self-attention modules using regular and shifted windowing configurations, respectively.
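As a rough PyTorch illustration of the W-MSA/SW-MSA pairing in Figure 2, the simplified sketch below implements two successive window-attention blocks. The relative position bias and the attention mask that the original Swin Transformer applies to shifted windows are omitted for brevity, and the embedding width, window size, and head count are placeholder values.

```python
import torch
import torch.nn as nn


def window_partition(x, ws):
    # (B, H, W, C) -> (num_windows * B, ws * ws, C)
    B, H, W, C = x.shape
    x = x.reshape(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)


def window_reverse(windows, ws, H, W):
    # inverse of window_partition: (num_windows * B, ws * ws, C) -> (B, H, W, C)
    B = windows.shape[0] // ((H // ws) * (W // ws))
    x = windows.reshape(B, H // ws, W // ws, ws, ws, -1)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, -1)


class WindowAttention(nn.Module):
    """Multi-head self-attention computed independently inside each local window."""

    def __init__(self, dim, ws, num_heads):
        super().__init__()
        self.ws = ws
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):  # x: (B, H, W, C)
        B, H, W, C = x.shape
        win = window_partition(x, self.ws)
        out, _ = self.attn(win, win, win)
        return window_reverse(out, self.ws, H, W)


class SwinBlock(nn.Module):
    """LayerNorm -> (shifted) window MSA -> residual -> LayerNorm -> MLP -> residual.
    shift > 0 gives the SW-MSA variant; the shifted-window attention mask of the
    original Swin Transformer is omitted in this simplified sketch."""

    def __init__(self, dim, ws=7, num_heads=4, shift=0):
        super().__init__()
        self.shift = shift
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = WindowAttention(dim, ws, num_heads)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):  # x: (B, H, W, C), H and W divisible by ws
        shortcut = x
        x = self.norm1(x)
        if self.shift:  # cyclic shift so that new windows straddle previous window borders
            x = torch.roll(x, shifts=(-self.shift, -self.shift), dims=(1, 2))
        x = self.attn(x)
        if self.shift:
            x = torch.roll(x, shifts=(self.shift, self.shift), dims=(1, 2))
        x = shortcut + x
        return x + self.mlp(self.norm2(x))


# Two successive blocks: W-MSA (regular windows) followed by SW-MSA (shifted windows).
blocks = nn.Sequential(SwinBlock(96, ws=7, shift=0), SwinBlock(96, ws=7, shift=3))
features = blocks(torch.randn(1, 14, 14, 96))  # (1, 14, 14, 96)
```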
Figure 3. The illustration of the attention guiding decoder. It receives multi-scale features from the Swin Transformer blocks and then performs attention feature fusion and up-sampling, refining the feature representations and improving the quality of the segmentation.
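A minimal sketch of a decoder of the kind described in Figure 3: multi-scale Transformer features are projected to a common width, up-sampled, and fused into a single full-resolution attention/prediction map. The channel widths and the 1 × 1-convolution plus bilinear up-sampling fusion are assumptions for illustration, not the exact design of the attention guiding decoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionGuidingDecoder(nn.Module):
    """Fuses multi-scale Transformer features and up-samples them into a
    full-resolution attention/prediction map (illustrative layer sizes)."""

    def __init__(self, in_channels=(96, 192, 384), out_channels=1, width=64):
        super().__init__()
        # project every scale to a common channel width before fusion
        self.proj = nn.ModuleList([nn.Conv2d(c, width, kernel_size=1) for c in in_channels])
        self.fuse = nn.Sequential(
            nn.Conv2d(width * len(in_channels), width, kernel_size=3, padding=1),
            nn.BatchNorm2d(width),
            nn.ReLU(inplace=True),
            nn.Conv2d(width, out_channels, kernel_size=1),
        )

    def forward(self, feats, out_size):
        # feats: list of (B, C_i, H_i, W_i) maps from successive Swin Transformer stages
        upsampled = [
            F.interpolate(p(f), size=out_size, mode="bilinear", align_corners=False)
            for p, f in zip(self.proj, feats)
        ]
        return self.fuse(torch.cat(upsampled, dim=1))


# Example: features at 1/4, 1/8, and 1/16 resolution of a 224 x 224 input.
decoder = AttentionGuidingDecoder()
feats = [torch.randn(1, 96, 56, 56), torch.randn(1, 192, 28, 28), torch.randn(1, 384, 14, 14)]
prior_attention = decoder(feats, out_size=(224, 224))  # (1, 1, 224, 224)
```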
Figure 4. The illustration of the enhanced attention block. Different from the traditional CCA module, it receives multiple features to aggregate attention information and refine boundaries, fusing global and local contexts channel-wise to achieve better attention learning.
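The sketch below illustrates one way such a block could fuse a skip-connection (local) feature with a prior-attention (global) feature along the channel dimension. The squeeze-and-excitation style gating is a stand-in for the CCA-based design, and all layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn


class EnhancedAttentionBlock(nn.Module):
    """Fuses a skip-connection (local) feature with a prior-attention (global) feature
    along the channel dimension; the gating here is a stand-in for the CCA-based design."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )
        self.refine = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, local_feat, global_feat):
        # channel descriptors of both inputs via global average pooling
        b, c, _, _ = local_feat.shape
        desc = torch.cat([local_feat.mean(dim=(2, 3)), global_feat.mean(dim=(2, 3))], dim=1)
        weights = self.gate(desc).reshape(b, c, 1, 1)  # channel-wise attention weights
        fused = local_feat * weights + global_feat     # re-weight local features, add global context
        return self.refine(fused)                      # small conv block to sharpen boundaries


# Example with two 64-channel feature maps of the same spatial size.
eab = EnhancedAttentionBlock(64)
refined = eab(torch.randn(1, 64, 56, 56), torch.randn(1, 64, 56, 56))  # (1, 64, 56, 56)
```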
Figure 5. Visual comparison of segmentation results produced by different methods on the GlaS and MoNuSeg datasets. The first two rows show results on the GlaS dataset and the last row shows results on the MoNuSeg dataset. The regions highlighted by red boxes show that Swin-PANet segments better than the other state-of-the-art methods, with results closer to the ground truth.
Figure 6. Visual comparison of skin lesion segmentation results produced by different baseline variants on the ISIC 2016 dataset. The baseline is the traditional U-Net. Method A, Method B, and Method C denote the baseline with AGD, with Swin-Trans, and with AGD + EAB, respectively. The regions highlighted by red boxes show that Swin-PANet can model global dependencies between boundary pixels and accurately segment ambiguous boundaries.
Table 1. Quantitative comparison with state-of-the-art methods on the GlaS and MoNuSeg datasets. For simplicity, AttUNet and MRUNet denote Attention U-Net and MultiResUNet, respectively. UCTransNet-pre denotes UCTransNet with pre-training.
Network            | GlaS Dice (%) | GlaS IoU (%) | MoNuSeg Dice (%) | MoNuSeg IoU (%)
U-Net (2015)       | 86.34         | 76.81        | 73.97            | 59.42
UNet++ (2018)      | 87.07         | 78.10        | 75.28            | 60.89
AttUNet (2018)     | 86.98         | 77.53        | 76.20            | 62.64
MRUNet (2020)      | 87.72         | 79.39        | 77.54            | 63.80
TransUNet (2021)   | 87.63         | 79.10        | 79.20            | 65.68
MedT (2021)        | 86.68         | 77.50        | 79.24            | 65.73
Swin-Unet (2021)   | 88.25         | 79.86        | 78.49            | 64.72
UCTransNet (2021)  | 89.84         | 82.24        | 79.87            | 66.68
UCTransNet-pre     | 90.18         | 82.95        | 77.19            | 63.80
Swin-PANet (ours)  | 91.42         | 84.88        | 81.59            | 69.00
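For reference, the Dice and IoU scores reported in the tables follow the standard overlap definitions, Dice = 2|A ∩ B| / (|A| + |B|) and IoU = |A ∩ B| / |A ∪ B| for a predicted mask A and ground-truth mask B; a minimal NumPy sketch (the smoothing constant is an implementation convenience, not taken from the paper):

```python
import numpy as np


def dice_and_iou(pred, target, eps=1e-7):
    """Dice = 2|A ∩ B| / (|A| + |B|); IoU = |A ∩ B| / |A ∪ B| for binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    inter = np.logical_and(pred, target).sum()
    dice = (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
    iou = (inter + eps) / (np.logical_or(pred, target).sum() + eps)
    return 100.0 * dice, 100.0 * iou  # reported as percentages, as in the tables


# Toy 4 x 4 masks: the prediction covers two columns, the ground truth covers one of them.
p = np.array([[1, 1, 0, 0]] * 4)
t = np.array([[1, 0, 0, 0]] * 4)
print(dice_and_iou(p, t))  # approximately (66.7, 50.0)
```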
Table 2. Ablation experiments on the ISIC 2016 dataset. Swin-Trans denotes the Swin Transformer block applied in the prior attention network, AGD denotes the attention guiding decoder used for attention fusion, and EAB denotes the enhanced attention block used for feature aggregation and boundary refinement.
Method                            | ISIC 2016 Dice (%) | ISIC 2016 IoU (%)
Baseline (U-Net)                  | 86.82              | 78.23
Baseline + Swin-Trans             | 89.98              | 83.43
Baseline + AGD                    | 89.79              | 82.81
Baseline + AGD + EAB              | 90.12              | 83.74
Baseline + Swin-Trans + EAB       | 89.56              | 82.50
Baseline + Swin-Trans + AGD + EAB | 90.68              | 84.06
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
