Article

PANDA: A Polarized Attention Network for Enhanced Unsupervised Domain Adaptation in Semantic Segmentation

1 Department of Applied Artificial Intelligence, Ming Chuan University, Taoyuan 320, Taiwan
2 Department of Computer Science and Information Engineering, National Central University, Taoyuan 320, Taiwan
* Author to whom correspondence should be addressed.
Electronics 2024, 13(21), 4302; https://doi.org/10.3390/electronics13214302
Submission received: 30 September 2024 / Revised: 25 October 2024 / Accepted: 30 October 2024 / Published: 31 October 2024
(This article belongs to the Special Issue Digital Signal and Image Processing for Multimedia Technology)

Abstract

Unsupervised domain adaptation (UDA) focuses on transferring knowledge from the labeled source domain to the unlabeled target domain, reducing the costs of manual data labeling. The main challenge in UDA is bridging the substantial feature distribution gap between the source and target domains. To address this, we propose Polarized Attention Network Domain Adaptation (PANDA), a novel approach that leverages Polarized Self-Attention (PSA) to capture the intricate relationships between the source and target domains, effectively mitigating domain discrepancies. PANDA integrates both channel and spatial information, allowing it to capture detailed features and overall structures simultaneously. Our proposed method outperforms current state-of-the-art UDA techniques for semantic segmentation. Specifically, it improves mean intersection over union (mIoU) by 0.2% on the GTA→Cityscapes benchmark and by 1.4% on the SYNTHIA→Cityscapes benchmark, attaining mIoU scores of 76.1% and 68.7%, respectively.

1. Introduction

With the rapid advancement of machine learning technologies, deep learning has become a key solution to numerous computer vision challenges, typically requiring large amounts of labeled data for model training. Semantic segmentation, a critical task in this domain, assigns each pixel in an image to a specific semantic category, relying on pixel-level annotation. This technique is widely applied across various fields, playing a vital role in enhancing automation, accuracy, and efficiency. In autonomous driving, semantic segmentation enables vehicles to recognize essential elements such as roads, pedestrians, and traffic signs, facilitating intelligent navigation. In medical imaging, it assists doctors in accurately identifying and analyzing tissues or lesions, improving the precision and efficiency of diagnosis and treatment. In industry, it empowers robots to reliably distinguish different objects in their operational environment, aiding in more accurate task execution [1].
Unsupervised domain adaptation (UDA) is increasingly recognized as an effective method to reduce the burden of data annotation. Traditional supervised learning relies on labeled datasets for model training, but acquiring such labeled data is both costly and time consuming. To circumvent the laborious process of dataset annotation, UDA has emerged as a solution. It typically leverages annotated source-domain data for training, while using unlabeled target-domain data for testing. The objective of UDA is to ensure that neural networks trained on the source domain can generalize well and perform effectively on the target domain.
Despite recent advancements, UDA for semantic segmentation still faces several challenges and limitations. First, in practical applications, variations in data due to different environments and conditions often lead to differences in feature distribution between the source and target domains, a phenomenon known as domain shift. This domain shift can lead to reduced model performance, as the model may overfit to the source-domain features and fail to generalize to the target domain. Additionally, the lack of labeled data in the target domain makes it more difficult to extract effective features. Another challenge lies in maintaining the balance between achieving high accuracy in the source domain and transferring this accuracy effectively to the target domain without performance degradation.
Moreover, current UDA methods often struggle with scalability and adaptability to real-world scenarios, where conditions and data distributions are highly dynamic and unpredictable. Developing robust UDA methods that can handle such variability is crucial for practical deployment. To address these challenges, we propose Polarized Attention Network Domain Adaptation (PANDA), a novel approach incorporating Polarized Self-Attention (PSA) [2] for UDA semantic segmentation. By combining two PSA blocks with a standard convolution block, PANDA effectively extracts complex channel and spatial features while maintaining high resolution, enhancing the understanding of objects with varying sizes and shapes by capturing more informative features.

2. Related Work

2.1. Semantic Segmentation

Semantic segmentation plays a critical role in intelligent vehicle perception, offering pixel-level semantic insights that are essential for tasks such as drivable area detection and object landmark identification. With the advent of deep learning, there has been rapid progress in developing more sophisticated segmentation techniques. In recent years, various methods have been proposed to improve the performance of semantic segmentation [3]. For instance, prototype learning [4,5] focuses on establishing a latent feature space where predictions are derived by measuring the distance between the test anchor and the prototypes for each class. On the other hand, pixel-wise contrastive learning [6] is applied to strengthen the similarity between pixels of the same semantic category.
Despite these advancements, the high cost of acquiring annotated datasets for training remains a challenge, leading researchers to explore simulated data as a cost-effective alternative. However, the large differences between synthetic and real-world environments, as well as varying weather conditions [7,8], still make it hard for models to perform well in all situations.

2.2. UDA

In typical machine learning problems, it is often assumed that the training and test datasets follow similar distributions. However, real-world applications frequently encounter significant differences between these datasets. As a result, models trained on the training dataset may perform poorly on the test dataset. Transfer learning has emerged as a solution for mitigating distributional differences. Within this field, UDA has become crucial for tackling such challenges. Various studies have been explored for UDA in different tasks like image classification [9,10,11,12], object detection [13,14,15,16,17], and semantic segmentation [18,19,20,21,22].
UDA methods are generally categorized into three primary approaches: discrepancy minimization, adversarial training, and self-training. Firstly, discrepancy minimization methods aim to narrow the gap between source and target domains by mapping their features into a shared space. Techniques like maximum mean discrepancy [23] and correlation alignment [24] are employed to minimize the distance between these domains. Secondly, adversarial training [25,26] utilizes generative adversarial networks (GANs) to align features across domains. In this approach, a generator produces domain-invariant features, while a discriminator distinguishes between features from the source and target domains. Through this process, the generator learns to generate features that align with the target domain. Finally, self-training employs a pretrained model to generate pseudo-labels for the target domain. Strategies like confidence thresholds [27] and pseudo-label prototypes [28] help mitigate the issue of pseudo-label drift.
Most state-of-the-art UDA methods rely on self-training. For instance, DAFormer [18] utilizes Transformer to capture dependencies over long ranges, while HRDA [19] integrates features from both high and low resolutions to capture contextual relationships across different scales. Masked Image Consistency (MIC) [20] encourages the model to learn contextual relationships by generating masked regions in target-domain images. However, accurately capturing fine-grained features remains a significant challenge in this field.

2.3. Attention Mechanism

The fundamental concept behind attention mechanisms is to assign varying levels of importance to different parts of input data, enabling models to selectively focus on relevant features essential for the task. Efficient Attention (EA) [29] modifies the order of multiplication to reduce the need for computing relationships between every position, achieving computational efficiency that scales linearly with input size. SENet [30] uses global average pooling to generate a channel representation, which is then processed through a fully connected layer to produce channel weights that are applied to the original feature map. GCNet [31] simplifies computation by calculating a general attention map for all positions in the input feature map, replacing the compression step in SENet [30] to reduce computation costs while preserving global information. CBAM [32] independently applies attention operations to both the channel and spatial dimensions, avoiding extensive convolutional operations typical in traditional attention mechanisms.
While these attention mechanisms maintain high resolution in the channel and spatial dimensions, they trade off with increased time complexity and sensitivity to noise. Although some alternative methods are more computationally efficient, they often compromise on preserving high-resolution features. To tackle challenges in complex semantic segmentation tasks, we employ Polarized Self-Attention (PSA) [2] to capture contextual information effectively, thereby enhancing adaptation between the source and target domains. As a result, PANDA preserves high-resolution features in both the channel and spatial dimensions, proving advantageous for practical applications in real-world scenarios.

3. Proposed Approaches

3.1. Overview

PANDA is designed to capture richer contextual information within images. As shown in Figure 1, the PANDA model consists of three main components: source-domain training, target-domain training, and cross-domain training. The abbreviations used in the paper are listed in the Abbreviations section.
PANDA includes a teacher model and a student model, both identical in structure. The teacher model generates pseudo-labels for the unlabeled target-domain data (blue), and the model is jointly optimized using the supervised source loss (red), the augmented cross-loss (purple), and the masked loss (yellow). Additionally, the parameters of the teacher model are updated using an exponential moving average (EMA) (gray). Inspired by [33], we integrate the Polarized Attention Network (PAN) into both the student network $g_\theta$ and the teacher network $g_\phi$, enabling the model to better capture key features within images. The network includes several key components: the MiT Encoder [34], embedding layer, context-aware fusion layer, bottleneck module, and classification layer. In Section 3.3, we provide a detailed description of each component, explaining their functionalities and contributions to the overall network.
Firstly, for the source domain, we employ rare-class sampling (RCS) to increase the sampling probability of images containing rare classes. This ensures that the model can still effectively learn from classes with a limited number of training samples. The selected RCS images are fed into both the student network and ImageNet Encoder, where respective predictions are made, and the source loss and feature distance loss are computed.
For the target domain, masked images are input into the student network to guide the learning process. This allows the model to confidently predict meaningful features, even under masked conditions. The predicted results, along with the pseudo-labels generated by the teacher network from the complete target-domain image, are used to compute the masked loss.
Lastly, in the cross-domain component, we adopt ClassMix [35] to combine the source and target domains images for generating mixed images. These mixed images provide a more comprehensive and diverse training set. The student network adopts the mixed images to make predictions, while the pseudo-labels and annotations from the source domain are similarly mixed and used to compute the mixed loss.
In conclusion, PANDA employs a combination of strategies tailored to source, target, and cross-domain training. This comprehensive approach allows the model to fully utilize the available information from each domain, improving its ability to generalize across varying conditions.

3.2. Self-Training for UDA

3.2.1. Source-Domain Training

Given the source-domain images and their corresponding annotations, denoted as $X_S = \{(x_S^i, y_S^i)\}_{i=1}^{N_S}$, the student network $g_\theta$ is trained by calculating the source loss to achieve good performance on the target-domain images $X_T = \{x_T^i\}_{i=1}^{N_T}$, which do not have annotations. Here, $N_S$ and $N_T$ represent the number of images in the source and target domains, respectively. However, models trained solely using the source-domain loss typically struggle to generalize to the target domain, leading to poor performance of the student network when applied to target-domain images. Additionally, adversarial training approaches [28,36] are often unstable and have been outperformed by self-training methods [18,19,20]. Therefore, we adopt a self-training approach by utilizing an additional teacher network $g_\phi$ to generate reliable pseudo-labels for the target domain.
Due to the long-tail distribution characteristics of the dataset, there is a possibility that random sampling might only select samples related to rare classes in the later stages of model training. Such a delay could result in the model becoming biased toward common classes early on, making it difficult to relearn these rare classes with only a few available samples. To prevent the early neglect of rare classes during training, we adopt an RCS strategy based on DAFormer [18]. This strategy increases the sampling frequency of samples containing rare classes, to mitigate the impact of class imbalance on the model’s training process. Specifically, the sampling frequency $f_c$ for each class $c$ is computed based on the percentage of pixels in the dataset belonging to class $c$:
$$f_c = \frac{\sum_{n=1}^{N_S} \sum_{k=1}^{H \times W} \left[\, y_S^{(n,k,c)} \,\right]}{N_S \cdot H \cdot W} \quad (1)$$
where $H \times W$ denotes the image height and width, and $[\cdot]$ represents the Iverson bracket, which equals 1 if the condition inside the brackets is true and 0 otherwise. The sampling probability $P(c)$ for class $c$ is calculated using a Softmax function of $f_c$, as shown in Equation (2). Classes with lower sampling frequencies have higher probabilities, with temperature $T$ controlling the smoothness of the distribution. A higher $T$ leads to a more uniform distribution, while a lower $T$ results in more frequent sampling of rare classes with smaller $f_c$.
$$P(c) = \frac{e^{(1 - f_c)/T}}{\sum_{c'=1}^{C} e^{(1 - f_{c'})/T}} \quad (2)$$
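To make the RCS procedure concrete, the following minimal sketch implements Equations (1) and (2); the inputs `pixel_counts` (per-class pixel counts over the source dataset) and `images_with_class` (indices of source images containing each class), as well as the default temperature, are our own illustrative assumptions rather than names from the released code.

```python
import numpy as np

def rcs_sample(pixel_counts, images_with_class, temperature=0.01, rng=None):
    """Rare-class sampling sketch: pick a class with probability inversely
    related to its pixel frequency, then pick an image containing it."""
    rng = rng or np.random.default_rng()
    freq = pixel_counts / pixel_counts.sum()       # f_c, Eq. (1): per-class pixel fraction
    logits = (1.0 - freq) / temperature
    probs = np.exp(logits - logits.max())          # numerically stable Softmax, Eq. (2)
    probs /= probs.sum()
    c = rng.choice(len(probs), p=probs)            # sample a class (rare classes favored)
    img_idx = rng.choice(images_with_class[c])     # sample an image containing that class
    return c, img_idx
```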
In practice, a class $c$ is randomly selected, and an image containing that class is then randomly sampled for training. The selected source-domain image is fed into the student network $g_\theta$ to generate pixel-wise predictions. The source loss $\mathcal{L}_S$ is subsequently computed by comparing these predictions with the ground truth using cross-entropy:
$$\mathcal{L}_S = -\sum_{k=1}^{H \times W} \sum_{c=1}^{C} y_S^{(k,c)} \log g_\theta(x_{RCS})^{(k,c)} \quad (3)$$
To prevent the model from overfitting when processing source-domain images and to preserve useful features obtained through ImageNet pretraining, we utilize the feature distance (FD) proposed in DAFormer [18]. This method uses the L2 distance between the bottleneck features $F_\theta$ of the student network $g_\theta$ and the bottleneck features $F_{ImageNet}$ of the ImageNet model:
$$d^{(k)} = \left\| F_{ImageNet}(x_{RCS})^{(k)} - F_\theta(x_{RCS})^{(k)} \right\|_2 \quad (4)$$
Given that ImageNet primarily trains on thing classes (objects with specific sizes or shapes, such as vehicles, riders, etc.) rather than stuff classes (background regions without specific shapes, such as sky, buildings, etc.), the loss is computed only for image regions containing the thing classes $C_{things}$. First, the labels $y_S$ are downsampled to match the size of the bottleneck features $H_F \times W_F$ by average pooling (GAP-style pooling applied per class channel with a kernel of size $H/H_F \times W/W_F$). If the pooled value exceeds a threshold ratio $r_{ImageNet}$, the category is retained, yielding the downsampled labels $y_{S,small}$:
$$y_{S,small}^{(c)} = \delta\!\left( \text{AvgPool}\!\left( y_S^{(c)}, \tfrac{H}{H_F}, \tfrac{W}{W_F} \right) \right) \quad (5)$$
$$\delta(x) = \begin{cases} 1, & \text{if } x > r_{ImageNet} \\ 0, & \text{otherwise} \end{cases} \quad (6)$$
Subsequently, to ensure that the feature map corresponds to the considered thing classes, only the portion of $y_{S,small}$ belonging to the thing classes is retained to generate the binary mask:
$$M_{things}^{(k)} = \sum_{c=1}^{C} y_{S,small}^{(k,c)} \cdot \left[\, c \in C_{things} \,\right] \quad (7)$$
Finally, the binary mask $M_{things}$ is applied to average the distance $d^{(k)}$ over the thing-class regions, which gives the feature distance loss:
$$\mathcal{L}_{FD} = \frac{\sum_{k=1}^{H_F \times W_F} d^{(k)} \cdot M_{things}^{(k)}}{\sum_{k=1}^{H_F \times W_F} M_{things}^{(k)}} \quad (8)$$
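A compact sketch of the feature-distance computation in Equations (4)–(8) is shown below; the tensor shapes, the helper name `feature_distance_loss`, and the default threshold `r` are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def feature_distance_loss(f_student, f_imagenet, labels_onehot, things_idx, r=0.75):
    """Sketch of Eqs. (4)-(8).
    f_student, f_imagenet: (B, C_feat, H_F, W_F) bottleneck features.
    labels_onehot: (B, num_classes, H, W) one-hot source labels.
    things_idx: list of 'thing' class indices; r is the pooling threshold."""
    B, _, Hf, Wf = f_student.shape
    # Downsample labels to bottleneck resolution (adaptive pooling approximates Eq. 5)
    y_small = F.adaptive_avg_pool2d(labels_onehot.float(), (Hf, Wf))
    y_small = (y_small > r).float()                            # Eq. (6)
    # Binary mask over thing-class regions (Eq. 7)
    mask = y_small[:, things_idx].sum(dim=1).clamp(max=1.0)    # (B, H_F, W_F)
    # Per-pixel L2 distance between ImageNet and student features (Eq. 4)
    d = torch.norm(f_imagenet - f_student, dim=1)              # (B, H_F, W_F)
    return (d * mask).sum() / mask.sum().clamp(min=1.0)        # Eq. (8)
```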

3.2.2. Target-Domain Training

To enhance the model’s ability to learn contextual relationships in the target domain, we use MIC [20], which incorporates random masks into the target-domain images so that the model must infer results close to the pseudo-labels from the information surrounding the masks. The mask generation involves randomly sampling a patch mask $\mathcal{M}$ from a uniform distribution:
$$\mathcal{M}_{\,ib+1:(i+1)b,\; jb+1:(j+1)b} = \left[\, \upsilon < r_{Mask} \,\right], \quad \upsilon \sim \mathcal{U}(0,1) \quad (9)$$
where $i$ and $j$ are patch indices with $i \in \{0, \ldots, H/b - 1\}$ and $j \in \{0, \ldots, W/b - 1\}$, $b$ is the patch size, and $r_{Mask}$ represents the mask ratio. The masked target image is obtained through element-wise multiplication of the mask and the target-domain image:
$$x_{Mask} = \mathcal{M} \odot x_T \quad (10)$$
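The masking step of Equations (9) and (10) can be sketched as follows; the default patch size and ratio are placeholders, and the function follows Equation (9) literally, keeping the patches whose uniform draw falls below $r_{Mask}$.

```python
import torch

def masked_target_image(x_t, patch_size=64, r_mask=0.7):
    """Sketch of Eqs. (9)-(10): a uniform draw per b x b patch decides whether
    that patch of the target image survives. H and W are assumed divisible by b."""
    B, C, H, W = x_t.shape
    v = torch.rand(B, 1, H // patch_size, W // patch_size, device=x_t.device)
    m = (v < r_mask).float()                           # Eq. (9): M = [v < r_Mask]
    m = m.repeat_interleave(patch_size, dim=2).repeat_interleave(patch_size, dim=3)
    return m * x_t, m                                  # Eq. (10): x_Mask = M ⊙ x_T
```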
To ensure that the model can accurately reconstruct labels even in the absence of complete target-domain images and utilize contextual information effectively, we use the masked loss $\mathcal{L}_M$ to maintain consistency between the outputs of the student network $g_\theta$ and the teacher network $g_\phi$. The prediction of $x_{Mask}$ is compared with the pseudo-label $p_T$ using cross-entropy:
$$\mathcal{L}_M = -\sum_{k=1}^{H \times W} \sum_{c=1}^{C} q_T \, p_T^{(k,c)} \log g_\theta(x_{Mask})^{(k,c)} \quad (11)$$
where $p_T$ represents the pseudo-label, as defined in Equation (12), derived from the class with the maximum Softmax value in the predictions obtained by inputting the target-domain image into the teacher network $g_\phi$. Importantly, gradients are not propagated back to the teacher network.
$$p_T = \left[\, c = \arg\max_{c'} g_\phi(x_T) \,\right] \quad (12)$$
Since the pseudo-labels may be incorrect, their quality is weighted based on the confidence estimate $q_T$. This is computed as the proportion of pixels whose maximum Softmax value in the prediction exceeds the threshold $\tau$, as defined in Equation (13). As training iterations proceed, the generated pseudo-labels become more accurate, Softmax values increase, $q_T$ grows larger, and $\mathcal{L}_M$ imposes a stronger constraint on the model, facilitating the learning of more precise contextual information.
$$q_T = \frac{\sum_{k=1}^{H \times W} \left[\, \max_{c'} g_\phi(x_T)^{(k,c')} > \tau \,\right]}{H \cdot W} \quad (13)$$
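As a reference, here is a minimal sketch of the pseudo-label and quality-weight computation in Equations (12) and (13); `teacher_logits` is assumed to be the teacher network's raw output of shape (B, C, H, W).

```python
import torch

@torch.no_grad()
def pseudo_labels_and_quality(teacher_logits, tau=0.968):
    """Sketch of Eqs. (12)-(13): hard pseudo-labels from the teacher's Softmax
    output and a per-image quality weight q_T (no gradients reach the teacher)."""
    probs = torch.softmax(teacher_logits, dim=1)        # (B, C, H, W)
    conf, p_t = probs.max(dim=1)                        # per-pixel confidence and arg max
    q_t = (conf > tau).float().mean(dim=(1, 2))         # fraction of confident pixels, Eq. (13)
    return p_t, q_t
```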
Typically, the teacher network is designed as an EMA teacher [37] to enhance prediction stability. The weights of the teacher network $g_\phi$ are updated as the exponential moving average of the weights of the student network $g_\theta$ with a smoothing factor $\alpha$:
$$\phi_{t+1} \leftarrow \alpha \phi_t + (1 - \alpha)\,\theta_t \quad (14)$$
where $t$ denotes a training step, and $\phi$ and $\theta$ represent the weights of the teacher and student networks, respectively. By implementing a temporal ensemble over past student models, the EMA teacher improves the reliability and temporal stability of the pseudo-labels. As it updates based on $g_\theta$, the teacher progressively enhances its feature learning capacity.
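The EMA update of Equation (14) amounts to a few lines; the sketch below assumes `teacher` and `student` are two structurally identical PyTorch modules.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, alpha=0.999):
    """Sketch of Eq. (14): phi_{t+1} <- alpha * phi_t + (1 - alpha) * theta_t."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(alpha).add_(p_s, alpha=1.0 - alpha)
```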

3.2.3. Cross-Domain Training

Following DACS [36], we use color jitter, Gaussian blur, and ClassMix [35] to produce the augmented image. From the RCS image $x_{RCS}$, we randomly select a half-set of classes $\mathcal{H}$. Then, we generate a binary mask:
$$M_{Mix} = \begin{cases} 1, & \text{if } y_S \in \mathcal{H} \\ 0, & \text{otherwise} \end{cases} \quad (15)$$
Then, we apply $M_{Mix}$ to the corresponding pixels of the target image $x_T$ to create the mixed image $x_{Mix}$, defined in Equation (16). To generate labels corresponding to $x_{Mix}$, we blend the pseudo-label $p_T$ with the source label $y_S$ in the same manner to obtain the mixed labels $y_{Mix}$, as shown in Equation (17).
$$x_{Mix} = M_{Mix} \odot x_{RCS} + (1 - M_{Mix}) \odot x_T \quad (16)$$
$$y_{Mix} = M_{Mix} \odot y_S + (1 - M_{Mix}) \odot p_T \quad (17)$$
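For illustration, a minimal single-image sketch of the mixing in Equations (15)–(17); the tensor shapes and function name are our assumptions, and the color jitter and Gaussian blur augmentations are omitted.

```python
import torch

def classmix(x_src, y_src, x_tgt, p_tgt):
    """Sketch of Eqs. (15)-(17): paste half of the classes present in the source
    label map onto the target image, and mix the labels the same way.
    x_*: (C, H, W) images; y_src, p_tgt: (H, W) integer label maps."""
    present = y_src.unique()
    chosen = present[torch.randperm(len(present))[: len(present) // 2]]  # half-set H
    m = torch.zeros_like(y_src, dtype=x_src.dtype)
    for c in chosen:                                    # binary mask M_Mix, Eq. (15)
        m[y_src == c] = 1.0
    x_mix = m[None] * x_src + (1 - m[None]) * x_tgt     # Eq. (16)
    y_mix = (m * y_src.to(x_src.dtype) + (1 - m) * p_tgt.to(x_src.dtype)).long()  # Eq. (17)
    return x_mix, y_mix
```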
The student network is trained using $x_{Mix}$, which enhances the learning of domain-robust features. The cross-loss $\mathcal{L}_C$ is used to further train the student network $g_\theta$ on the mixed data:
$$\mathcal{L}_C = -\sum_{k=1}^{H \times W} \sum_{c=1}^{C} y_{Mix}^{(k,c)} \log g_\theta(x_{Mix})^{(k,c)} \quad (18)$$
The overall PANDA loss $\mathcal{L}$ is calculated as the weighted sum of the individual loss components, expressed as $\mathcal{L} = \mathcal{L}_S + \lambda_{FD}\mathcal{L}_{FD} + \mathcal{L}_M + \mathcal{L}_C$, which incorporates the supervised source-domain loss $\mathcal{L}_S$, the feature distance loss $\mathcal{L}_{FD}$, the masked loss $\mathcal{L}_M$, and the cross-domain loss $\mathcal{L}_C$. The source-domain loss, masked loss, and cross-loss are all computed using cross-entropy, ensuring consistent learning across the source and target domains. The feature distance loss averages the per-pixel distances and introduces a balancing weight $\lambda_{FD}$ to adjust its impact within the overall objective; this component preserves effective features learned from ImageNet to guide model learning. The combined loss allows PANDA to progressively adapt to different domain data distributions, thereby improving its ability to handle the unlabeled target domain over time.

3.3. Polarized Attention Network

According to Figure 1, the network comprises key components such as the MiT Encoder [34], embedding layer, context-aware fusion layer, PAN, bottleneck module, and classification layer. The multi-level features from the MiT Encoder are first transformed into a consistent channel dimension through the embedding layer, then upsampled to match the target dimensions. These features are concatenated along the channel dimension and input into the context-aware fusion layer, which fuses the multi-level features. The output is then passed to PAN, which consists of two PSA blocks and a convolutional layer. The bottleneck module reduces the dimensionality, and final predictions are generated through the classification layer.
Since Polarized Self-Attention (PSA) is the core module of the PAN architecture, we first provide a detailed explanation of PSA. PSA comprises two attention mechanism structures: channel-only self-attention (CSA) and spatial-only self-attention (SSA), as shown in Figure 2. CSA focuses on the importance of each channel, extracting critical channel-specific information, while SSA highlights spatial feature information, corresponding to the location of the target within the feature map.
Given a feature map $X \in \mathbb{R}^{C \times H \times W}$ as input, the channel-wise attention weight $A^{ch}(X) \in \mathbb{R}^{C \times 1 \times 1}$ is calculated as follows:
$$A^{ch}(X) = F_{SG}\!\left[ F_{LN}\!\left( W_z\!\left( \sigma_1(W_v(X)) \times F_{SM}\!\left(\sigma_2(W_q(X))\right) \right) \right) \right] \quad (19)$$
where $X$ represents the tensor derived from either the context-aware fusion layer or the standard convolution module; $W_q$, $W_v$, and $W_z$ are $1 \times 1$ convolutional layers; $\sigma_1$ and $\sigma_2$ are tensor reshape operators; $\times$ is a matrix dot-product operator; $F_{SG}(\cdot)$ is a sigmoid operator; $F_{LN}(\cdot)$ is layer normalization; and $F_{SM}(\cdot)$ is a Softmax operator. The output of CSA is $Z^{ch} = A^{ch}(X) \odot X \in \mathbb{R}^{C \times H \times W}$, where $\odot$ is an element-wise multiplication operator.
In the SSA structure, the spatial-wise attention weight $A^{sp}(X) \in \mathbb{R}^{1 \times H \times W}$ is defined as follows:
$$A^{sp}(X) = F_{SG}\!\left[ \sigma_3\!\left( F_{SM}\!\left( \sigma_1\!\left(F_{GP}(W_q(X))\right) \right) \times \sigma_2(W_v(X)) \right) \right] \quad (20)$$
where $X$ denotes the tensor derived from either the context-aware fusion layer or the standard convolution module; $W_q$ and $W_v$ are $1 \times 1$ convolutional layers; $\sigma_1$, $\sigma_2$, and $\sigma_3$ are tensor reshape operators; $\times$ is a matrix dot-product operator; $F_{SG}(\cdot)$ is a sigmoid operator; $F_{SM}(\cdot)$ is a Softmax operator; and $F_{GP}(\cdot)$ is global average pooling (GAP). The output of SSA is $Z^{sp} = A^{sp}(X) \odot X \in \mathbb{R}^{C \times H \times W}$, where $\odot$ is element-wise multiplication.
These two core structures, CSA and SSA, can be arranged in two layouts: parallel and sequential, as illustrated in Figure 3. Within the sequential layout, there are two configurations: CSA→SSA and SSA→CSA. Preliminary experiments comparing these configurations on the GTA→Cityscapes and SYNTHIA→Cityscapes benchmarks indicate that the CSA→SSA configuration performs better on both, as detailed in Table 1. Therefore, we adopt the CSA→SSA sequential layout for PAN in this work.
In the parallel layout, $X$ is fed separately into CSA and SSA, and their outputs are added element-wise to produce the result, computed as Equation (21). In the sequential layout, $X$ is first passed through CSA, and the output of CSA is then passed to SSA, yielding the result expressed as Equation (22).
$$PSA_p(X) = Z^{ch} + Z^{sp} = A^{ch}(X) \odot X + A^{sp}(X) \odot X \quad (21)$$
$$PSA_s(X) = Z^{sp}(Z^{ch}) = A^{sp}\!\left(A^{ch}(X) \odot X\right) \odot A^{ch}(X) \odot X \quad (22)$$
where $+$ is an element-wise addition operator, and $PSA_p$ and $PSA_s$ denote the parallel and sequential fusion of the two structures, respectively.
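To make the two layouts concrete, the sketch below implements one PSA block with both branches (Equations (19) and (20)) and both fusion schemes (Equations (21) and (22)) in PyTorch; the internal reduction to C/2 channels follows the PSA design reported in Table 5, but the module is an illustrative approximation rather than the authors' exact code.

```python
import torch
import torch.nn as nn

class PolarizedSelfAttention(nn.Module):
    """Illustrative PSA block: channel-only (CSA) and spatial-only (SSA) branches."""
    def __init__(self, channels, sequential=True):
        super().__init__()
        c = channels // 2                                # internal C/2 reduction
        self.sequential = sequential
        # channel-only branch (CSA), Eq. (19)
        self.ch_wq = nn.Conv2d(channels, 1, kernel_size=1)
        self.ch_wv = nn.Conv2d(channels, c, kernel_size=1)
        self.ch_wz = nn.Conv2d(c, channels, kernel_size=1)
        self.ln = nn.LayerNorm(channels)
        # spatial-only branch (SSA), Eq. (20)
        self.sp_wq = nn.Conv2d(channels, c, kernel_size=1)
        self.sp_wv = nn.Conv2d(channels, c, kernel_size=1)

    def csa(self, x):
        b, ch, h, w = x.shape
        v = self.ch_wv(x).flatten(2)                         # sigma_1(W_v(X)): (B, C/2, HW)
        q = torch.softmax(self.ch_wq(x).flatten(2), dim=-1)  # F_SM(sigma_2(W_q(X))): (B, 1, HW)
        z = torch.matmul(v, q.transpose(1, 2))               # (B, C/2, 1)
        z = self.ch_wz(z.unsqueeze(-1))                      # W_z: (B, C, 1, 1)
        w_ch = torch.sigmoid(self.ln(z.flatten(1))).view(b, ch, 1, 1)  # F_SG(F_LN(.))
        return w_ch * x                                      # Z^ch = A^ch(X) ⊙ X

    def ssa(self, x):
        b, ch, h, w = x.shape
        q = self.sp_wq(x).mean(dim=(2, 3))                   # F_GP(W_q(X)): (B, C/2)
        q = torch.softmax(q, dim=-1).unsqueeze(1)            # (B, 1, C/2)
        v = self.sp_wv(x).flatten(2)                         # sigma_2(W_v(X)): (B, C/2, HW)
        w_sp = torch.sigmoid(torch.matmul(q, v)).view(b, 1, h, w)  # A^sp(X)
        return w_sp * x                                      # Z^sp = A^sp(X) ⊙ X

    def forward(self, x):
        if self.sequential:                                  # Eq. (22): CSA then SSA
            return self.ssa(self.csa(x))
        return self.csa(x) + self.ssa(x)                     # Eq. (21): parallel fusion
```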
Inspired by [33], PAN employs two PSA blocks and a standard convolution module in both the student and teacher networks. We discuss the performance of different PSA layout combinations in Section 4.2. To capture more comprehensive features, the point-depth convolution used in [33] is replaced with a standard convolution module (denoted as Conv), allowing simultaneous capture of both spatial and channel features and resulting in richer feature representations. Specifically, taking the output of the first PSA block as input, a $3 \times 3$ convolution, denoted as $W$, extracts features from the input. The extracted features are stabilized through batch normalization, represented as $F_{BN}$, which normalizes the features and aids training by reducing internal covariate shift. Following this, a non-linear transformation is introduced using ReLU, denoted as $F_{RL}$, which enhances the network’s ability to learn complex features. This sequence of operations prepares the features to be fed into the second PSA block, facilitating further refinement for downstream tasks. The output of PAN is computed as follows:
$$X_\mu = PSA\!\left( F_{RL}\!\left( F_{BN}\!\left( W\!\left( PSA(X) \right) \right) \right) \right) \quad (23)$$
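Building on the PSA sketch above, the PAN block of Equation (23) can be expressed as follows; the choice of a parallel first block followed by a sequential second block mirrors the PANDA(P-S) configuration discussed in Section 4.2 and remains an illustrative assumption.

```python
import torch.nn as nn
# PolarizedSelfAttention is the sketch defined in the previous listing.

class PolarizedAttentionNetwork(nn.Module):
    """Sketch of the PAN block, Eq. (23): PSA -> 3x3 Conv + BN + ReLU -> PSA."""
    def __init__(self, channels):
        super().__init__()
        self.psa1 = PolarizedSelfAttention(channels, sequential=False)  # first PSA block (parallel)
        self.conv = nn.Sequential(                                      # standard Conv-BN-ReLU (W, F_BN, F_RL)
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.psa2 = PolarizedSelfAttention(channels, sequential=True)   # second PSA block (sequential)

    def forward(self, x):
        return self.psa2(self.conv(self.psa1(x)))                       # Eq. (23)
```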
Having established the PAN architecture, we now provide a detailed explanation of other critical layers in the network. These layers complement the PAN module by facilitating robust feature extraction. The embedding layer consists of four linear layers. The context-aware fusion layer, a variant of ASPP [38], differs by excluding GAP. The bottleneck module comprises a 3 × 3 convolution, batch normalization, and ReLU, while the classification layer uses dropout and 1 × 1 convolution. These components work in tandem with the PAN module to ensure efficient feature extraction and refinement, ultimately enhancing segmentation performance.

4. Experiments

4.1. Implementation Details

Our model is implemented in PyTorch 1.7.1+cu110, Python 3.8.5, and CUDA 11.4, running on a system with Ubuntu 18.04.5 LTS, a single NVIDIA Tesla V100 GPU with 32 GB memory, an Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20 GHz, and 256 GB RAM. We evaluate our proposed method on two standard large-scale UDA segmentation benchmarks: GTA→Cityscapes and SYNTHIA→Cityscapes. The synthetic datasets GTA [39] and SYNTHIA [40] serve as the source domains, while Cityscapes [7] is used as the target domain. GTA [39] contains 24,966 synthetic images with a resolution of 1914 × 1052, and SYNTHIA [40] consists of 9400 synthetic images with a resolution of 1280 × 760. Cityscapes [7] provides 2975 training images and 500 validation images, each with a resolution of 2048 × 1024.
For evaluation, we employ the mean intersection over union (mIoU), as defined in Equation (24), which is a widely adopted metric for measuring segmentation performance.
$$\text{mIoU} = \frac{1}{n} \sum_{i=1}^{n} \frac{TP_i}{TP_i + FP_i + FN_i} \quad (24)$$
where $TP_i$, $FP_i$, and $FN_i$ are the numbers of true positive, false positive, and false negative pixels for class $i$, respectively, and $n$ is the number of classes. Referencing [18,19,20], we evaluate on 19 classes for GTA [39] and 16 classes for SYNTHIA [40], ensuring compatibility with the Cityscapes dataset [7].
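For completeness, a small sketch of Equation (24) computed from a confusion matrix; accumulating the confusion matrix over the validation set is assumed to happen elsewhere.

```python
import numpy as np

def mean_iou(conf_matrix):
    """Sketch of Eq. (24). conf_matrix[i, j] counts pixels of ground-truth
    class i predicted as class j."""
    tp = np.diag(conf_matrix).astype(np.float64)
    fp = conf_matrix.sum(axis=0) - tp
    fn = conf_matrix.sum(axis=1) - tp
    iou = tp / np.maximum(tp + fp + fn, 1)      # per-class IoU
    return iou.mean()
```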
We train PANDA with a MiT-B5 encoder [34], initializing the backbone with ImageNet pretraining weights. In the UDA settings, we adopt the training parameters from MIC [20] and use the AdamW [41] optimizer with a weight decay of 0.01. Specifically, we set the learning rate to $6 \times 10^{-5}$ for the encoder and $6 \times 10^{-4}$ for the decoder, applying a linear learning rate warmup during the first 1.5 k iterations. We also use an EMA factor $\alpha = 0.999$ and a quality threshold $\tau = 0.968$. During training, inputs are randomly cropped to a resolution of 1024 × 1024, with a batch size of 2, for a total of 40 k iterations.
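A hedged sketch of this optimizer setup is given below; the function name and the exact warmup schedule shape are our assumptions, while the learning rates, weight decay, and warmup length follow the values stated above.

```python
import torch

def build_optimizer(encoder, decoder, warmup_iters=1500):
    """Sketch of the training setup: AdamW with 6e-5 / 6e-4 learning rates for
    encoder / decoder, weight decay 0.01, and linear warmup over 1.5k iterations."""
    opt = torch.optim.AdamW(
        [{"params": encoder.parameters(), "lr": 6e-5},
         {"params": decoder.parameters(), "lr": 6e-4}],
        weight_decay=0.01,
    )
    sched = torch.optim.lr_scheduler.LambdaLR(
        opt, lr_lambda=lambda it: min(1.0, (it + 1) / warmup_iters)  # linear warmup
    )
    return opt, sched
```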

4.2. Variation in PSA Layout Combinations

In Table 2, we evaluate the influence of various PSA block configurations in the PAN architecture on the performance of UDA models. Six distinct configurations were explored, including single-PSA block models such as PANDA(P) and PANDA(S), as well as dual-PSA block models such as PANDA(P-P), PANDA(S-S), PANDA(S-P), and PANDA(P-S). Here, “P” represents a parallel layout, while “S” denotes a sequential layout.
The experimental results show that all PSA configurations outperform the baseline MIC method [20] in both GTA→Cityscapes and SYNTHIA→Cityscapes scenarios. Notably, the PANDA(S) model achieves the highest mIoU of 76.2% on the GTA→Cityscapes task, indicating that the sequential layout may allow for more refined stage-wise feature extraction. This suggests that sequential processing within a single PSA block is more effective than parallel processing for segmentation tasks, likely due to its ability to capture and refine feature information in multiple steps.
Moreover, dual-block models consistently outperform single-block configurations. Among the dual-block models, PANDA(P-S), which first implements a parallel layout followed by a sequential layout, demonstrates the best overall performance. It achieves an mIoU of 68.7% on the SYNTHIA→Cityscapes task and maintains competitive results on the GTA→Cityscapes task. The success of PANDA(P-S) lies in its ability to combine the broad feature extraction capabilities of the parallel block with the focused refinement afforded by the sequential block. This balance proves particularly advantageous, allowing the model to excel across diverse tasks.
Based on these observations, the PANDA(P-S) configuration, in particular, offers an optimal compromise by leveraging the strengths of both parallel and sequential attention, making it the preferred setup for further experiments and analysis.

4.3. Comparison of Different Convolution Modules

In Table 3, we compare the standard convolution module inside PANDA with the point-depth convolution used in [33]. The standard convolution module employs complete convolutional kernels to process all input channels, enabling it to capture richer features and contextual information. In contrast, point-depth convolution decomposes the convolution operation into pointwise and depthwise convolutions, reducing computational costs but potentially losing detail in feature extraction. On GTA→Cityscapes, PANDA with standard convolution achieves an mIoU of 76.1%, while the variant using point-depth convolution remains at 75.9%, the same as MIC [20]. Similarly, on SYNTHIA→Cityscapes, the improvement achieved with the standard convolution module significantly outperforms point-depth convolution: the mIoU improves by 1.4%, reaching 68.7%, compared to a 0.3% increase with point-depth convolution. These results indicate that the standard convolution module is more effective at capturing both detailed and global features in complex scenes for high-precision segmentation tasks. This might be because the standard convolution module considers interactions between neighboring pixels during convolution, thereby preserving more channel and spatial information to enhance detailed feature extraction.

4.4. Comparison with State-of-the-Art Methods

We evaluate the performance of our proposed method, PANDA, and compare it with various UDA methods. As detailed in Table 4, PANDA demonstrates notable improvements, achieving a +0.2% mIoU gain over MIC [20] on the GTA→Cityscapes task and a +1.4% mIoU gain on the SYNTHIA→Cityscapes task, reaching 76.1% and 68.7% mIoU, respectively. In a detailed class-wise IoU comparison, PANDA outperforms MIC [20] in 10 out of 19 categories for the GTA→Cityscapes task and in 9 out of 16 categories for the SYNTHIA→Cityscapes task. The most significant improvements are observed in categories such as wall, pole, and traffic sign on GTA→Cityscapes (see Figure 4); and road, sidewalk, and bus on SYNTHIA→Cityscapes (see Figure 5). The white dashed boxes highlight areas with significant improvements compared to other methods.
PANDA’s superior performance can be attributed to two key factors. Firstly, its enhanced feature extraction capabilities allow it to capture finer details and more intricate patterns in specific categories, such as pole and traffic sign, which are often challenging due to their complex and subtle shapes. PANDA excels in learning and differentiating these variations. Secondly, PANDA leverages context-aware mechanisms that utilize surrounding environmental information to improve object identification and classification. This is particularly beneficial for categories like road and sidewalk, where understanding the broader scene context is crucial for accurate segmentation. Together, these improvements allow PANDA to handle categories that were traditionally challenging due to variability and contextual dependencies.
Additionally, class imbalance in the datasets presents a challenge during model training, as it can cause the model to develop a bias toward high-frequency categories, thereby reducing its ability to recognize rare categories. To address this, PANDA not only improves the overall mIoU but also maintains a strong capacity to recognize rare categories. Specifically, when rigorously considering PANDA’s performance in surpassing both MIC [20] and HRDA [19], excluding high-frequency categories (such as road, building, sidewalk, sky, and vegetation), PANDA improved the recognition rate for 70% of the categories in the GTA→Cityscapes task and for 55% of the categories in the SYNTHIA→Cityscapes task.
Further analysis, considering cases where PANDA outperforms either MIC [20] or HRDA [19], shows that the recognition rate increases to 95% for the GTA→Cityscapes task and to 88% for the SYNTHIA→Cityscapes task across all categories. These results highlight PANDA’s significant advantage in addressing class imbalance, demonstrating strong generalization capabilities, particularly for challenging and rare categories. It is also important to note that while the performance on certain categories, such as wall, is slightly lower compared to the baseline, this is primarily due to the lower occurrence frequency of this category in the SYNTHIA dataset [40], which may have resulted in insufficient training data for the model. Overall, PANDA exhibits excellent performance across the majority of categories, particularly with significant improvements in the recognition of rare categories.

4.5. Ablation Study

In this section, we investigate the effects of various attention mechanisms on model performance and compare their effectiveness with that of PANDA. Specifically, we evaluate the performance of PAN when substituted with several attention mechanisms, including EA [29], SENet [30], GCNet [31], and CBAM [32]. To enhance understanding, we begin by clarifying the columns presented in Table 5, which provides an ablation study on the effects of these attention mechanisms. The columns represent key performance metrics, such as mIoU, as well as computational costs, including the number of parameters, memory usage, and throughput associated with each method. The channel resolution and spatial resolution columns indicate the dimensions of the feature maps processed by each attention mechanism, where a larger resolution typically allows for more detailed information. The non-linearity column specifies the activation functions used, which can impact the model’s ability to capture complex patterns. The GTA→CS (mIoU) and SYN→CS (mIoU) columns report mIoU performance on the respective tasks, serving as a measure of segmentation accuracy. The param (M) column denotes the number of parameters in millions, providing insight into the model’s complexity. The memory (GB) column indicates the amount of GPU memory consumed during inference, which is crucial for understanding resource requirements. Finally, the throughput (img/s) column reflects the model’s processing speed, representing how many images can be processed per second.
According to Table 5, none of these four attention mechanisms yield significant performance improvements on the GTA→Cityscapes task. The observed enhancement in PANDA’s performance can be attributed to PSA [2], which retains a larger number of channels (C/2) and the full spatial dimensions ([W, H]). Furthermore, unlike the other methods, PSA utilizes a combination of Softmax and sigmoid as non-linear functions, which helps approximate more realistic and refined output results.
For the SYNTHIA→Cityscapes task, while the alternative attention mechanisms yield slight improvements, PANDA achieves a more substantial enhancement in segmentation performance than these mechanisms. Notably, PANDA increases the total number of parameters by only 1.69% compared to traditional attention mechanisms, indicating its efficiency in maintaining model complexity. Additionally, PANDA leads to a 2.41% reduction in GPU memory consumption, demonstrating its advantage in resource usage over conventional attention methods. Moreover, PANDA incurs only a 7.25% decrease in training throughput relative to these approaches, a modest cost given the accompanying accuracy gains. Overall, these experiments highlight that PANDA is more suitable for UDA semantic segmentation tasks than previous attention mechanisms, balancing improved performance with acceptable computational overhead.

4.6. Failure Case Analysis

In this section, we provide three representative failure cases of PANDA. As presented in Figure 6, PANDA confuses semantically similar objects such as train vs. bus and terrain vs. vegetation. In addition, it is extremely challenging to distinguish partially occluded objects (e.g., differentiating between person and rider, or identifying sidewalk behind car). In the future, we will therefore focus on addressing these issues.

5. Conclusions

In this study, we introduced PANDA, a novel model for unsupervised domain adaptation (UDA) in semantic segmentation that effectively integrates advanced design elements to enhance feature extraction capabilities. Our experiments revealed that PANDA significantly outperforms existing state-of-the-art methods, including MIC, achieving mIoU scores of 76.1% on GTA→Cityscapes and 68.7% on SYNTHIA→Cityscapes, reflecting respective gains of +0.2% and +1.4%.
The detailed analysis of various attention mechanisms revealed that traditional modules, such as EA, SENet, GCNet, and CBAM, failed to achieve significant improvements on the GTA→Cityscapes task. In contrast, PANDA employs Polarized Self-Attention, which effectively preserves a greater number of both channel and spatial dimensions that are essential for capturing intricate and contextually relevant features. Furthermore, the unique combination of Softmax and sigmoid non-linear functions within the PSA architecture optimizes feature selection, leading to more refined output predictions.
Additionally, when evaluating model complexity and computational efficiency, PANDA shows a minor increase in total parameter count. However, it offsets this with reduced GPU memory consumption relative to comparable attention variants and only a modest training-speed overhead, making it accurate while remaining practical to deploy. This balance is particularly advantageous in applications requiring both high performance and constrained resources. While acknowledging the model’s limitations in handling rare classes in certain instances, PANDA demonstrates considerable success in addressing class imbalance and boosting recognition across the majority of categories. This indicates that the model’s design choices, such as PSA and advanced feature fusion, significantly contribute to its robustness. Looking ahead, our future work will focus on refining PANDA’s ability to handle rare categories more effectively, thereby improving its applicability to highly imbalanced datasets.
In summary, PANDA offers a balanced and effective solution for addressing the challenges of cross-domain learning tasks, particularly in handling class imbalances and enhancing the recognition of rare or challenging categories. Its performance demonstrates resilience and adaptability, reinforcing its potential as a viable unsupervised domain adaptation model for both academic research and practical applications in semantic segmentation.

Author Contributions

Conceptualization, C.-W.K., W.-L.C. and C.-C.L.; methodology, W.-L.C. and C.-W.K.; software, W.-L.C.; validation, W.-L.C. and C.-W.K.; formal analysis, W.-L.C. and C.-W.K.; investigation, W.-L.C.; resources, C.-W.K., C.-C.L. and K.-C.F.; data curation, W.-L.C.; writing—original draft preparation, W.-L.C. and C.-W.K.; writing—review and editing, C.-W.K. and W.-L.C.; visualization, W.-L.C.; supervision, C.-W.K. and C.-C.L.; project administration, C.-W.K. and C.-C.L.; funding acquisition, K.-C.F. and C.-W.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Science and Technology Council, Taiwan, R.O.C., grant numbers NSTC 113-2221-E-008-102-MY3 and NSTC 112-2222-E-130-001.

Data Availability Statement

The original data presented in the study are openly available in [7,39,40].

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
UDA: Unsupervised domain adaptation
PANDA: Polarized Attention Network Domain Adaptation
PSA: Polarized Self-Attention
mIoU: Mean intersection over union
GTA: Grand Theft Auto
GAN: Generative adversarial network
HRDA: High-Resolution Domain-Adaptive
MIC: Masked Image Consistency
EA: Efficient attention
SENet: Squeeze-and-excitation network
GCNet: Global Context Network
CBAM: Convolutional Block Attention Module
EMA: Exponential moving average
PAN: Polarized Attention Network
MiT: Mix Transformer
RCS: Rare Class Sampling
FD: Feature distance
GAP: Global average pooling
DACS: Domain Adaptation via Cross-domain Mixed Sampling
ASPP: Atrous Spatial Pyramid Pooling
CSA: Channel-only Self-attention
SSA: Spatial-only Self-attention
Conv: Convolution
SW: Sidewalk
Build.: Building
TL: Traffic light
TS: Traffic sign
Veg.: Vegetation
M.Bike: Motorcycle
SYN: SYNTHIA
CS: Cityscapes

References

  1. Toldo, M.; Maracani, A.; Michieli, U.; Zanuttigh, P. Unsupervised domain adaptation in semantic segmentation: A review. Technologies 2020, 8, 35. [Google Scholar] [CrossRef]
  2. Liu, H.; Liu, F.; Fan, X.; Huang, D. Polarized self-attention: Towards high-quality pixel-wise regression. arXiv 2021, arXiv:2107.00782. [Google Scholar]
  3. Zhou, T.; Zhang, F.; Chang, B.; Wang, W.; Yuan, Y.; Konukoglu, E.; Cremers, D. Image Segmentation in Foundation Model Era: A Survey. arXiv 2024, arXiv:2408.12957. [Google Scholar]
  4. Wang, K.; Liew, J.H.; Zou, Y.; Zhou, D.; Feng, J. Panet: Few-shot image semantic segmentation with prototype alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9197–9206. [Google Scholar]
  5. Zhou, T.; Wang, W.; Konukoglu, E.; Van Gool, L. Rethinking semantic segmentation: A prototype view. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 2582–2593. [Google Scholar]
  6. Zhou, T.; Wang, W. Cross-image pixel contrasting for semantic segmentation. IEEE Trans. Pattern. Anal. Mach. Intell. 2024, 46, 5398–5412. [Google Scholar] [CrossRef] [PubMed]
  7. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 27–30 June 2016; pp. 3213–3223. [Google Scholar]
  8. Sakaridis, C.; Dai, D.; Van Gool, L. ACDC: The adverse conditions dataset with correspondences for semantic driving scene understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 11–17 October 2021; pp. 10765–10775. [Google Scholar]
  9. Chen, L.; Chen, H.; Wei, Z.; Jin, X.; Tan, X.; Jin, Y.; Chen, E. Reusing the task-specific classifier as a discriminator: Discriminator-free adversarial domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 7181–7190. [Google Scholar]
  10. Wang, Q.; Meng, F.; Breckon, T.P. Data augmentation with norm-AE and selective pseudo-labelling for unsupervised domain adaptation. Neural Netw. 2023, 161, 614–625. [Google Scholar] [CrossRef] [PubMed]
  11. Zhu, J.; Bai, H.; Wang, L. Patch-Mix Transformer for Unsupervised Domain Adaptation: A Game Perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 3561–3571. [Google Scholar]
  12. Singh, I.P.; Ghorbel, E.; Kacem, A.; Rathinam, A.; Aouada, D. Discriminator-free unsupervised domain adaptation for multi-label image classification. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 3936–3945. [Google Scholar]
  13. Mattolin, G.; Zanella, L.; Ricci, E.; Wang, Y. ConfMix: Unsupervised Domain Adaptation for Object Detection via Confidence-based Mixing. In Proceedings of the 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 423–433. [Google Scholar]
  14. Kennerley, M.; Wang, J.G.; Veeravalli, B.; Tan, R.T. 2pcnet: Two-phase consistency training for day-to-night unsupervised domain adaptive object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–23 June 2023; pp. 11484–11493. [Google Scholar]
  15. VS, V.; Oza, P.; Patel, V.M. Towards online domain adaptive object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 478–488. [Google Scholar]
  16. VS, V.; Oza, P.; Patel, V.M. Instance Relation Graph Guided Source-Free Domain Adaptive Object Detection. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 3520–3530. [Google Scholar]
  17. Pu, B.; Wang, L.; Yang, J.; He, G.; Dong, X.; Li, S.; Tan, Y.; Chen, M.; Jin, Z.; Li, K.; et al. M3-UDA: A New Benchmark for Unsupervised Domain Adaptive Fetal Cardiac Structure Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle WA, USA, 17–21 June 2024; pp. 11621–11630. [Google Scholar]
  18. Hoyer, L.; Dai, D.; Van Gool, L. DAFormer: Improving network architectures and training strategies for domain-adaptive semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 9924–9935. [Google Scholar]
  19. Hoyer, L.; Dai, D.; Van Gool, L. HRDA: Context-aware high-resolution domain-adaptive semantic segmentation. In Proceedings of the Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; pp. 372–391. [Google Scholar]
  20. Hoyer, L.; Dai, D.; Wang, H.; Van Gool, L. MIC: Masked image consistency for context-enhanced domain adaptation. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 11721–11732. [Google Scholar]
  21. Chen, M.; Zheng, Z.; Yang, Y.; Chua, T.S. Pipa: Pixel-and patch-wise self-supervised learning for domain adaptative semantic segmentation. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 1905–1914. [Google Scholar]
  22. Zhao, X.; Mithun, N.C.; Rajvanshi, A.; Chiu, H.P.; Samarasekera, S. Unsupervised domain adaptation for semantic segmentation with pseudo label self-refinement. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 2399–2409. [Google Scholar]
  23. Long, M.; Zhu, H.; Wang, J.; Jordan, M.I. Deep transfer learning with joint adaptation networks. In Proceedings of the International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017; pp. 2208–2217. [Google Scholar]
  24. Sun, B.; Saenko, K. Deep coral: Correlation alignment for deep domain adaptation. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 443–450. [Google Scholar]
  25. Hoffman, J.; Tzeng, E.; Park, T.; Zhu, J.Y.; Isola, P.; Saenko, K.; Efros, A.; Darrell, T. Cycada: Cycle-consistent adversarial domain adaptation. Proc. Int. Conf. Mach. 2018, 80, 1989–1998. [Google Scholar]
  26. Vu, T.H.; Jain, H.; Bucher, M.; Cord, M.; Pérez, P. Advent: Adversarial entropy minimization for domain adaptation in semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2517–2526. [Google Scholar]
  27. Zou, Y.; Yu, Z.; Kumar, B.; Wang, J. Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 289–305. [Google Scholar]
  28. Zhang, P.; Zhang, B.; Zhang, T.; Chen, D.; Wang, Y.; Wen, F. Prototypical pseudo label denoising and target structure learning for domain adaptive semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12414–12424. [Google Scholar]
  29. Shen, Z.; Zhang, M.; Zhao, H.; Yi, S.; Li, H. Efficient attention: Attention with linear complexities. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2021; pp. 3531–3539. [Google Scholar]
  30. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  31. Cao, Y.; Xu, J.; Lin, S.; Wei, F.; Hu, H. GCNet: Non-Local Networks Meet Squeeze-Excitation Networks and Beyond. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop, Seoul, Republic of Korea, 27–28 October 2019; pp. 1971–1980. [Google Scholar]
  32. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  33. Yu, Q.; Wei, W.; Pan, Z.; He, J.; Wang, S.; Hong, D. GPF-Net: Graph-polarized fusion network for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–22. [Google Scholar] [CrossRef]
  34. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
  35. Olsson, V.; Tranheden, W.; Pinto, J.; Svensson, L. Classmix: Segmentation-based data augmentation for semi-supervised learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 4–8 January 2021; pp. 1369–1378. [Google Scholar]
  36. Tranheden, W.; Olsson, V.; Pinto, J.; Svensson, L. DACS: Domain adaptation via cross-domain mixed sampling. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2021; pp. 1379–1389. [Google Scholar]
  37. Tarvainen, A.; Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17), Long Beach, CA, USA, 4–9 December 2017; pp. 1195–1204. [Google Scholar]
  38. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  39. Richter, S.R.; Vineet, V.; Roth, S.; Koltun, V. Playing for data: Ground truth from computer games. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 102–118. [Google Scholar]
  40. Ros, G.; Sellart, L.; Materzynska, J.; Vazquez, D.; Lopez, A.M. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 3234–3243. [Google Scholar]
  41. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
Figure 1. Overview of the proposed PANDA architecture.
Figure 2. The structure of PSA. Channel-only self-attention (CSA) is in the left half, and spatial-only self-attention (SSA) is in the right half. LN is layer normalization.
Figure 3. Illustration of the PSA block under different connection schemes: (a) parallel layout and (b) sequential layout.
Figure 4. Qualitative comparison of PANDA with previous methods on GTA→Cityscapes.
Figure 5. Qualitative comparison of PANDA with previous methods on SYNTHIA→Cityscapes.
Figure 6. Failure cases of segmentation results on GTA→Cityscapes (rows 1 and 2) and SYNTHIA→Cityscapes (rows 3 and 4).
Table 1. Comparison of sequential layout configurations. The best performance in each column is shown in bold.

Method         | GTA→Cityscapes | SYNTHIA→Cityscapes
MIC (baseline) | 75.9           | 67.3
SSA→CSA        | 76.0           | 67.1
CSA→SSA        | 76.2           | 68.1
Table 2. Performance comparison of different PSA arrangements. The best performance in each column is shown in bold.

Method      | GTA→Cityscapes | SYNTHIA→Cityscapes
MIC [20]    | 75.9           | 67.3
PANDA(P)    | 76.1           | 67.8
PANDA(S)    | 76.2           | 68.1
PANDA(P-P)  | 76.0           | 68.5
PANDA(S-S)  | 76.0           | 67.7
PANDA(S-P)  | 76.1           | 68.0
PANDA(P-S)  | 76.1           | 68.7
Table 3. Performance comparison of internal convolution modules.

Method   | Point-Depth Conv | Standard Conv | GTA→Cityscapes | SYNTHIA→Cityscapes
MIC [20] | -                | -             | 75.9           | 67.3
PANDA    | ✓                | -             | 75.9           | 67.6
PANDA    | -                | ✓             | 76.1           | 68.7
Table 4. Comparison with state-of-the-art methods for UDA. We present per-class IoU and mIoU. The best performance in every column is shown in bold.

GTA→Cityscapes

Method | Road | SW | Build. | Wall | Fence | Pole | TL | TS | Veg. | Terrain | Sky | Person | Rider | Car | Truck | Bus | Train | M.Bike | Bike | mIoU
ADVENT [26] | 89.4 | 33.1 | 81.0 | 26.6 | 26.8 | 27.2 | 33.5 | 24.7 | 83.9 | 36.7 | 78.8 | 58.7 | 30.5 | 84.8 | 38.5 | 44.5 | 1.7 | 31.6 | 32.4 | 45.5
DACS [36] | 89.9 | 39.7 | 87.9 | 30.7 | 39.5 | 38.5 | 46.4 | 52.8 | 88.0 | 44.0 | 88.8 | 67.2 | 35.8 | 84.5 | 45.7 | 50.2 | 0.0 | 27.3 | 34.0 | 52.1
ProDA [28] | 87.8 | 56.0 | 79.7 | 46.3 | 44.8 | 45.6 | 53.5 | 53.5 | 88.6 | 45.2 | 82.1 | 70.7 | 39.2 | 88.8 | 45.5 | 59.4 | 1.0 | 48.9 | 56.4 | 57.5
DAFormer [18] | 95.7 | 70.2 | 89.4 | 53.5 | 48.1 | 49.6 | 55.8 | 59.4 | 89.9 | 47.9 | 92.5 | 72.2 | 44.7 | 92.3 | 74.5 | 78.2 | 65.1 | 55.9 | 61.8 | 68.3
HRDA [19] | 96.4 | 74.4 | 91.0 | 61.6 | 51.5 | 57.1 | 63.9 | 69.3 | 91.3 | 48.4 | 94.2 | 79.0 | 52.9 | 93.9 | 84.1 | 85.7 | 75.9 | 63.9 | 67.5 | 73.8
MIC [20] | 97.4 | 80.1 | 91.7 | 61.2 | 56.9 | 59.7 | 66.0 | 71.3 | 91.7 | 51.4 | 94.3 | 79.8 | 56.1 | 94.6 | 85.4 | 90.3 | 80.4 | 64.5 | 68.5 | 75.9
PANDA | 97.3 | 79.0 | 91.8 | 63.4 | 58.4 | 62.0 | 66.6 | 73.4 | 91.4 | 52.7 | 93.3 | 80.8 | 58.0 | 94.4 | 85.7 | 86.6 | 80.2 | 64.3 | 66.9 | 76.1

SYNTHIA→Cityscapes

Method | Road | SW | Build. | Wall | Fence | Pole | TL | TS | Veg. | Terrain | Sky | Person | Rider | Car | Truck | Bus | Train | M.Bike | Bike | mIoU
ADVENT [26] | 85.6 | 42.2 | 79.7 | 8.7 | 0.4 | 25.9 | 5.4 | 8.1 | 80.4 | - | 84.1 | 57.9 | 23.8 | 73.3 | - | 36.4 | - | 14.2 | 33.0 | 41.2
DACS [36] | 80.6 | 25.1 | 81.9 | 21.5 | 2.9 | 37.2 | 22.7 | 24.0 | 83.7 | - | 90.8 | 67.6 | 38.3 | 82.9 | - | 38.9 | - | 28.5 | 47.6 | 48.3
ProDA [28] | 87.8 | 45.7 | 84.6 | 37.1 | 0.6 | 44.0 | 54.6 | 37.0 | 88.1 | - | 84.4 | 74.2 | 24.3 | 88.2 | - | 51.1 | - | 40.5 | 45.6 | 55.5
DAFormer [18] | 84.5 | 40.7 | 88.4 | 41.5 | 6.5 | 50.0 | 55.0 | 54.6 | 86.0 | - | 89.8 | 73.2 | 48.2 | 87.2 | - | 53.2 | - | 53.9 | 61.7 | 60.9
HRDA [19] | 85.2 | 47.7 | 88.8 | 49.5 | 4.8 | 57.2 | 65.7 | 60.9 | 85.3 | - | 92.9 | 79.4 | 52.8 | 89.0 | - | 64.7 | - | 63.9 | 64.9 | 65.8
MIC [20] | 86.6 | 50.5 | 89.3 | 47.9 | 7.8 | 59.4 | 66.7 | 63.4 | 87.1 | - | 94.6 | 81.0 | 58.9 | 90.1 | - | 61.9 | - | 67.1 | 64.3 | 67.3
PANDA | 91.7 | 59.8 | 89.1 | 45.0 | 9.4 | 61.8 | 68.1 | 62.3 | 88.5 | - | 94.4 | 80.9 | 58.5 | 90.2 | - | 67.3 | - | 67.8 | 63.8 | 68.7
Table 5. Ablation study on the effects of PSA and different attention blocks.

Method | Channel Resolution | Spatial Resolution | Non-Linearity | GTA→CS (mIoU) | SYN→CS (mIoU) | Param (M) | Memory (GB) | Throughput (img/s)
MIC [20] | - | - | - | 75.9 | 67.3 | 85.69 | 22.60 | 0.98
+EA [29] | ≪C | ≪min(W, H) | Softmax | - | - | 103.53 | 30.88 | -
+SENet [30] | C/4 | - | ReLU + Sigmoid | 75.6 | 68.1 | 95.65 | 25.51 | 0.65
+GCNet [31] | C/4 | - | ReLU + Softmax | 74.8 | 67.9 | 96.18 | 25.55 | 0.73
+CBAM [32] | C/16 | [W, H] | Sigmoid | 75.6 | 68.4 | 95.39 | 26.02 | 0.70
PANDA | C/2 | [W, H] | Sigmoid + Softmax | 76.1 | 68.7 | 99.33 | 26.34 | 0.64
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Kao, C.-W.; Chang, W.-L.; Lee, C.-C.; Fan, K.-C. PANDA: A Polarized Attention Network for Enhanced Unsupervised Domain Adaptation in Semantic Segmentation. Electronics 2024, 13, 4302. https://doi.org/10.3390/electronics13214302

