1. Introduction
Deep neural networks have achieved impressive success in a wide range of computer vision applications [1], but this success usually demands massive quantities of labeled data to learn good representations, under the assumption that the training and testing sets are drawn from the same data distribution. Nevertheless, this assumption does not always hold in practice. One way out is unsupervised domain adaptation (UDA), which trains deep neural network models on richly labeled data from a related source domain. However, such supervised learning suffers from the domain shift issue, resulting in poor generalization on new target domains. To address this issue, considerable research effort has been devoted to UDA tasks [2,3,4,5], e.g., by bridging the distribution discrepancy, minimizing distance metrics, or adversarial learning. Most existing approaches build on convolutional neural network (CNN) frameworks to learn domain-invariant feature representations; such features are derived from local receptive fields.
With the success of transformers in various visual tasks, recent UDA methods focus more on global features learned by encoder–decoder architectures, in contrast to the local features learned by CNN frameworks. The most advanced domain adaptation methods extract global image features by using a transformer architecture as the backbone network, and recent studies show that transformer-based models clearly outperform purely convolutional ones. For example, the transferable vision transformer (TVT) [5] equips the vision transformer (ViT) [6] with a transferability adaptation module for domain adaptation, and the cross-domain transformer (CDTrans) [4] exploits the robustness of cross-attention to build a three-branch transformer model for UDA tasks. To take full advantage of both transformer and CNN architectures, a natural idea is to combine them. However, CDTrans adopts a two-stage training scheme, which is time-consuming and hinders rapid transfer of the model. The challenge for hybrid models is therefore to retain the robustness of cross-attention while remaining efficient.
On the one hand, we introduce a Gaussian attended MLP module to further strengthen the robustness of the transformer encoder by directing more attention to the important channel dimensions of the features, thus improving feature quality. As shown in Figure 1, Gaussian attention attends to more important visual clues than the baseline, because the Gaussian distribution smooths the attention weights and thereby filters out noisy values. Moreover, since it only involves the channel mean and deviation, computing the attention is extremely fast and the extra overhead is negligible. On the other hand, the context information of features can enhance the spatial semantics of the class token (CLS-Token) features. Inspired by ConvNeXt [7], which reparameterizes the transformer architecture into a fully convolutional model for efficiency, we design an efficient dynamic convolution module that injects context information, using the Gaussian error linear unit (GELU) activation function and layer normalization. This module is also lightweight.
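The precise definitions of both modules are given in Section 3. As a rough illustration only, the following PyTorch sketch shows one plausible reading of the two ideas: channel weights formed from per-channel Gaussian statistics (mean and deviation) inside the MLP, and a depthwise convolution block with layer normalization and GELU that injects spatial context into the patch tokens. All class names, parameter names, and the exact weighting scheme are our own illustrative assumptions, and the kernel-generation details of the actual dynamic convolution module are not reproduced here.

```python
import torch
import torch.nn as nn

class GaussianAttendedMLP(nn.Module):
    """Sketch: MLP whose hidden channels are re-weighted by Gaussian statistics.

    The weights depend only on the per-channel mean and standard deviation of
    the token features, so the extra cost is negligible (illustrative reading,
    not the paper's exact formulation).
    """
    def __init__(self, dim, hidden_ratio=4):
        super().__init__()
        hidden = dim * hidden_ratio
        self.fc1 = nn.Linear(dim, hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x):                          # x: (B, N, C) token features
        h = self.act(self.fc1(x))
        mu = h.mean(dim=1, keepdim=True)           # (B, 1, H) channel means
        sigma = h.std(dim=1, keepdim=True) + 1e-6  # (B, 1, H) channel deviations
        # Gaussian density of each activation under its channel statistics;
        # smooth weights suppress noisy channel responses.
        w = torch.exp(-0.5 * ((h - mu) / sigma) ** 2)
        return self.fc2(h * w)

class ContextConvBlock(nn.Module):
    """Sketch: ConvNeXt-style context path (depthwise conv + LN + GELU) applied
    to patch tokens; the dynamic kernel generation of the real module is omitted."""
    def __init__(self, dim, kernel_size=7):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.pwconv = nn.Sequential(nn.Linear(dim, dim), nn.GELU())

    def forward(self, x, hw):                      # x: (B, N, C) patch tokens
        B, N, C = x.shape
        H, W = hw
        feat = x.transpose(1, 2).reshape(B, C, H, W)
        feat = self.dwconv(feat).flatten(2).transpose(1, 2)  # back to (B, N, C)
        return x + self.pwconv(self.norm(feat))              # residual context
```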
The contributions of this paper are summarized as follows:
We propose a novel hybrid model of transformers and convolutional networks, termed TransConv. It improves the robustness of cross-attention with a Gaussian attended MLP module and meanwhile captures richer semantics via a context-aware dynamic convolution module.
TransConv achieves a better trade-off between model performance and efficiency than state-of-the-art methods, as demonstrated on five datasets.
The rest of this paper is organized as follows: Section 2 reviews the related work. Section 3 introduces the overall architecture of the proposed TransConv model and describes each improved module in detail. Section 4 reports the experimental results on five commonly used datasets together with ablation studies. Finally, Section 5 concludes the paper and discusses future work.
4. Experiments
To verify the effectiveness of our model, we evaluate the proposed method on four widely used object recognition datasets, namely Office-31, Office-Home, ImageCLEF-DA, and VisDA-2017, and on MNIST, USPS, and SVHN for digit classification, and we compare it with state-of-the-art UDA methods.
Digit classification is a standard UDA benchmark consisting of MNIST [44], USPS, and Street View House Numbers (SVHN) [45]. We use the same settings as previous work to train our model, i.e., the training phase uses the training sets of each pair of source and target domains, and the testing phase evaluates on the test set of the target domain.
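As a minimal sketch of this evaluation protocol, assuming a trained model and a target-domain test split available as generic PyTorch objects (all names below are placeholders), target accuracy is computed as follows.

```python
import torch
from torch.utils.data import DataLoader

@torch.no_grad()
def evaluate_target(model, target_test_set, device="cuda", batch_size=64):
    """Sketch: accuracy on the held-out target-domain test split."""
    loader = DataLoader(target_test_set, batch_size=batch_size)
    model.eval()
    correct = total = 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        preds = model(images).argmax(dim=1)   # predicted class per image
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total
```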
The Office-31 dataset [46] contains 4652 images in 31 categories and consists of three domains: Amazon (A), DSLR (D), and Webcam (W). Amazon (A) contains 2817 images downloaded from www.amazon.com; the 498 images in DSLR (D) and the 795 images in Webcam (W) were captured in an office environment by digital SLR and web cameras, respectively.
The Office-Home dataset [47] consists of 15,588 images in 65 object categories. It contains images from four different domains: artistic images (A), clip art (C), product images (P), and real-world images (R). Images in each domain are collected in office and home environments. There are 2427 images in (A), 4365 images in (C), 4439 images in (P), and 4357 images in (R).
The ImageCLEF-DA dataset contains 1800 images in 12 categories. It consists of three domains: Caltech-256 (C), ImageNet ILSVRC 2012 (I), and Pascal VOC 2012 (P). There are 600 images in each domain and 50 images per category.
The VisDA-2017 dataset [48] contains about 280k images in 12 categories across three domains: training, validation, and test. It is a simulation-to-real dataset: the training set has 152,397 synthetic images generated by rendering the same objects under different circumstances, whereas the 55,388 images in the validation set and the 72,372 images in the test set are real-world images.
Baseline Methods. For the digit datasets, we compare TransConv with DANN [49], ADDA [12], SHOT [50], DSAN [41], CDAN [11], MCD [51], and TVT [5]. For the Office-31 dataset, we compare TransConv with ResNet-50 [1], DANN, CDAN+E [11], SHOT, ALDA [52], DSAN, ALSDA [53], PICSCS [54], TVT, and CDTrans [4]. For the Office-Home dataset, we compare TransConv with ResNet-50, SHOT, ALDA, CDAN+E, DSAN, ALSDA, PICSCS, TVT, and CDTrans. For the ImageCLEF-DA dataset, we compare TransConv with ResNet-50, DANN, CDAN+E, DSAN, PICSCS, DALN [55], and MCC+NWD [55]. For the VisDA-2017 dataset, we compare TransConv with ResNet-50, DANN, CDAN+E, SHOT, DSAN, ALDA, ALSDA, TVT, and CDTrans. The results of most baselines are taken from [5,41]; for the rest, we refer to the results reported in their original articles.
Implementation Details. The ViT-B/16 model pretrained on ImageNet-21k is used as the backbone network to extract image features. The input image size in our experiments is 256 × 256, and the size of each patch is 16 × 16. The transformer encoder of ViT-B/16 consists of 12 transformer encoder layers. We train the model using a mini-batch stochastic gradient descent (SGD) optimizer with a momentum of 0.9; the learning rate is initialized to 0, linearly increased to 3 over the first 500 training steps, and then decreased following a cosine decay strategy. Experiments are conducted on a single NVIDIA GeForce RTX 2080 Ti GPU with 11 GB of memory, and the batch size is set to 16.
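For concreteness, a schedule of this form can be sketched in PyTorch as below; base_lr and total_steps are placeholders rather than values taken from the paper, and the peak learning rate should be set to whatever the configuration specifies.

```python
import math
import torch

def make_optimizer_and_scheduler(model, base_lr, total_steps, warmup_steps=500):
    """Sketch: SGD with momentum 0.9, linear warm-up from 0 over the first
    `warmup_steps` steps, then cosine decay to 0 over the remaining steps."""
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr, momentum=0.9)

    def lr_lambda(step):
        if step < warmup_steps:                     # linear warm-up from 0
            return step / warmup_steps
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```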
Results of Digit Recognition. The classification results of the three digit recognition tasks are shown in Table 1. Since the compared methods only evaluate three transfer cases (i.e., SVHN→MNIST, USPS→MNIST, and MNIST→USPS) and report no results for the remaining three cases, we adopt the same setting as previous studies. TransConv matches the best accuracy of TVT on the MNIST→USPS task, and its average classification accuracy is 0.2% lower than the best. These results demonstrate the effectiveness of the TransConv model in alleviating the domain shift problem.
Results of Object Recognition. We evaluate four datasets for object recognition, namely Office-31, ImageCLEF-DA, Office-Home, and VisDA-2017; the results are shown in Table 2, Table 3, Table 4, and Table 5. In Table 3, TransConv achieves the best average classification accuracy on ImageCLEF-DA, with a clear improvement over the best prior UDA method (92.3% vs. 91.3%). However, TransConv falls short of the best prior UDA methods on Office-31, Office-Home, and VisDA-2017: in Table 2 and Table 4, it is lower than TVT on Office-31 (92.8% vs. 93.9%) and Office-Home (82.9% vs. 83.6%), and in Table 5, it is lower than CDTrans on VisDA-2017 (80.9% vs. 88.4%). From the numbers of samples and the results on the three domains (Amazon, DSLR, and Webcam) of the Office-31 dataset in Table 2, it can be seen that the larger the source domain, the higher the corresponding performance. Moreover, as shown in Table 6, TransConv surpasses the baseline (92.8% vs. 91.7%). This is also evidenced by the t-SNE visualization of the learned features in Figure 6, where we visualize the network activations of the baseline and TransConv for the A→W task of the Office-31 dataset; red points are source samples and blue points are target samples. Figure 6a shows the result for the baseline: the source and target domains are not aligned very well and some points are hard to classify. In contrast, Figure 6b shows the result for our TransConv, where the source and target domains are aligned well. These experimental results show that a hybrid model equipped with the denoising Gaussian attended MLP module and the highly efficient dynamic convolution module can alleviate the domain adaptation problem to some extent. TransConv is thus an effective attempt at a hybrid of transformers and convolutional neural networks.
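A visualization like Figure 6 can be produced with an off-the-shelf t-SNE implementation; the sketch below, using scikit-learn and matplotlib, assumes the pooled source and target features have already been extracted as NumPy arrays (function and variable names are our own placeholders).

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(source_feats, target_feats, out_path="tsne_a2w.png"):
    """Sketch: project source (red) and target (blue) features into 2-D with t-SNE,
    in the style of Figure 6. Inputs are (N, D) NumPy feature arrays."""
    feats = np.concatenate([source_feats, target_feats], axis=0)
    emb = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(feats)
    n_src = len(source_feats)
    plt.scatter(emb[:n_src, 0], emb[:n_src, 1], c="red", s=5, label="source")
    plt.scatter(emb[n_src:, 0], emb[n_src:, 1], c="blue", s=5, label="target")
    plt.legend()
    plt.savefig(out_path, dpi=200)
```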
Ablation Study. To assess the individual contributions of the Gaussian attended MLP, the dynamic convolution, and the context information in improving the knowledge transferability of ViT, we conduct an ablation study, as shown in Table 6, removing each component from the full TransConv model in turn. Without the Gaussian attended MLP, the average classification accuracy drops by 0.2%; without the dynamic convolution, it drops by 0.5%; and without the context information, it drops by 0.6%. The baseline removes the Gaussian attended MLP, the dynamic convolution, and the context information simultaneously, which reduces the average classification accuracy by 1.1%, indicating that all three improvements contribute to the model performance. To understand the effect of each design choice within the dynamic convolution, we conduct a further ablation, as shown in Table 7: compared to the full TransConv model, removing LN reduces the average classification accuracy by 0.4%, and removing GELU reduces it by 0.1%. Removing both LN and GELU at the same time (the baseline) reduces the average classification accuracy by 0.2%, indicating that both components play a role in the model performance.
Parameter Sensitivity and Robustness. To better understand the effect of the hyperparameter in our model, we report its sensitivity in Figure 7a. It can be seen that TransConv achieves the best results when the hyperparameter is set to 0.1, so we fix it to 0.1 in this article. We also examine the robustness of our model by changing the distributions of the source and target domains. In Figure 7b, d denotes the inter-domain distance, '-' denotes reducing the inter-domain distance, and '+' denotes increasing it. It can be seen that when the increase or decrease in distance is smaller than d, the model performance decreases; when the distance is increased beyond d, the larger the distance, the more the model performance degrades.