1. Introduction
Supervised learning techniques work under the assumption that the training and testing datasets are drawn from the same distribution. Thus, traditional techniques require at least some labeled data for the problem at hand in order to obtain a useful model upon training, which, in many cases, is not possible. Domain adaptation (DA) [1] is a technique for generalizing classifiers to different domains. More specifically, given two datasets with the same label space, one of which is labeled (the source), a DA algorithm can retrieve labels for the other (the target). Note that this makes no assumption on the underlying distributions of the data, yet it is still quite restrictive, since it requires the two datasets to share the same label space.
In practice, we are typically interested in extracting labels from datasets with much larger label spaces than any specific task dataset [2]. To this end, the partial domain adaptation (PDA) [2,3] framework was proposed, in which the assumption is that the labels of the target data are contained within the larger label space of the source data. Of course, the creation of such general-purpose source datasets, which represent the entire label space of any given specific task well, is unrealistic. As an example, some of the largest image datasets currently available [4,5] comprise millions of images, but many classes are under-represented.
To address this issue, open-set domain adaptation (OSDA) was proposed [6], essentially combining open-set recognition (OSR) [7] and DA. The aim in OSDA is to automatically extract the information from the source that is relevant to the target and to identify the target instances that are relevant to the source. That is, a non-empty intersection between the target and source label spaces is assumed, and the goal is to identify source and target instances with labels in the common label subspace.
Note that we cannot hope to predict the label of a target instance belonging to a class not represented in the source label space; however, the advantage of identifying which instances belong to such classes is two-fold. Firstly, it allows us to minimize negative transfer (i.e., the inclusion of source-specific features in the adapted representation) during the adaptation procedure. Secondly, it allows us to identify the subset of target instances that can be reliably labeled using this source.
In this paper, we propose an algorithm for tackling OSDA, which we term the Doubly Importance Weighted Adversarial Network (DIWAN); it was inspired by the distribution reweighting techniques introduced in [8]. Our algorithm bridges the gap between the adversarial neural network (ANN) techniques used in PDA and OSDA by taking outlier target instances into account during adaptation. We prove that our algorithm constructs a representation in which the source and target distributions of transfer-relevant instances (TRIs) are aligned. Moreover, we empirically demonstrate that, in the open-set setting, DIWAN outperforms non-adapted versions of the DA and PDA algorithms.
The rest of this paper is organized as follows: In Section 2, we present related work and the contributions of DIWAN. In Section 3, we provide the theoretical background and the methodology used by DIWAN. The theoretical support for this work is presented in Section 4. An experimental evaluation of DIWAN is presented and discussed in Section 5, and conclusions are drawn in Section 6.
2. Related Work and Contributions
Our algorithm relies on ANNs for DA, as they have been successful in both classical DA [9,10] and PDA [2,3,8,11,12]. Such adversarial schemes for DA typically include a source model, which is comprised of a classifier and a representer network, a target representer network, and a domain discriminator. The source and target representer networks are embeddings of the source and target data, respectively, into some latent space. During adaptation, the target representer and the domain discriminator antagonize each other, while the source representer and classifier are either fixed with pre-trained weights or trained in a supervised fashion.
At each iteration, the discriminator is presented with source and target data embedded in the latent space and is trained to discriminate between the two. The target representer is then trained using reversed discriminator gradients. This process can be shown to minimize the Jensen–Shannon divergence [13] between the distributions of the source and target data in the latent space, mitigating the covariate-shift problem. This standard scheme is typically augmented with other networks to deal with PDA problems. For example, in [2,3], a collection of domain discriminators was used to reweight source instances so as to ignore outlier classes in the source domain. In [8], a second domain discriminator was used to reweight source instances in a similar fashion. Moreover, in [11], both a domain discriminator and classifier information were used to separate outlier source classes. Finally, in [12], a shared-label classifier, trained using information from the source classifier, was used to reweight instances.
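To make the standard adversarial scheme concrete, the following is a minimal, self-contained sketch of adversarial alignment in one dimension (our own toy illustration, not code from any of the cited works): the target representer is a learnable offset b, the discriminator is a single logistic unit, and the representer is trained with inverted labels, which is equivalent to reversing the discriminator gradients.

```python
# Toy sketch of adversarial domain alignment (illustration only): a 1-D
# "latent space", a logistic domain discriminator, and a target representer
# R_T(x) = x + b updated to fool the discriminator.
import numpy as np

rng = np.random.default_rng(0)
z_src = rng.normal(0.0, 1.0, 200)   # source embeddings (fixed)
x_tgt = rng.normal(3.0, 1.0, 200)   # raw target data (shifted domain)

a, c, b = 0.1, 0.0, 0.0             # discriminator params (a, c), representer offset b
sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))
lr = 0.1

for _ in range(1000):
    z_tgt = x_tgt + b
    # --- discriminator step: push source -> 1, target -> 0 ---
    ps, pt = sigmoid(a * z_src + c), sigmoid(a * z_tgt + c)
    grad_a = -np.mean((1 - ps) * z_src) + np.mean(pt * z_tgt)
    grad_c = -np.mean(1 - ps) + np.mean(pt)
    a, c = a - lr * grad_a, c - lr * grad_c
    # --- representer step: fool the discriminator (target -> 1) ---
    pt = sigmoid(a * (x_tgt + b) + c)
    grad_b = -np.mean((1 - pt) * a)
    b -= lr * grad_b

# After training, the target embeddings drift toward the source distribution.
gap = abs(np.mean(x_tgt + b) - np.mean(z_src))
```

The initial gap between the domain means is roughly 3; after alternating updates, the representer offset largely closes it, which is the alignment behavior the PDA/OSDA variants build on.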
Contrary to previous works, we reweight both source and target instances to correct for target outliers. As in [8,11], we reweight using a second domain discriminator, but our scheme may be modified to use any heuristic for TRIs. Moreover, we use the obtained weights to identify target outlier instances by setting a threshold on the target instance weights.
In [6], OSDA was introduced, and an algorithm based on a constrained integer programming model was proposed for tackling OSDA problems. A rejection parameter was used to tune the open-space risk tolerance. In [14], the authors introduced a method that relied on classifying outlier target instances as “unknown”, removing the need for a rejection parameter but making classification harder. The model training utilized techniques introduced in [15] for OSR that generated “unknown” target instances.
The main contribution of this work is the adaptation of existing algorithms for PDA to achieve improved performance in the OSDA setting. Our method mitigates the covariate-shift problem only for source and target instances that have labels in the common label subspace. We further show that our algorithm gives rise to a natural heuristic for identifying target instances that are probably transfer relevant. We perform experiments and empirically validate our approach. In Figure 1, we illustrate the effect of domain adaptation in the presence of source and target outliers.
3. Theory and Methodology
The goal of OSDA is to develop methods for partially labeling an unlabeled dataset by using a model trained on a related dataset. Unlike typical DA, we assume that there exist target instances with labels that are not contained in the source label space, i.e., outlier instances. We are interested in identifying those instances so as to ignore them during the adaptation procedure. The rest of the instances, i.e., the transfer-relevant ones, can potentially be reliably labeled through the adaptation procedure.
As in typical DA works, we assume a machine learning problem described by a domain $\mathcal{D} = (\mathcal{X}, P(X))$ and a task $\mathcal{T} = (\mathcal{Y}, P(Y \mid X))$, where $X$ is a random variable (the covariates) taking values in $\mathcal{X}$ and $Y$ is a random variable (the labels) taking values in $\mathcal{Y}$. We assume the existence of two problems $(\mathcal{D}_S, \mathcal{T}_S)$ (the source) and $(\mathcal{D}_T, \mathcal{T}_T)$ (the target), where (a) $\mathcal{X}_S = \mathcal{X}_T$ and (b) $P_S(X) \neq P_T(X)$. In the OSDA setting, we assume that $\mathcal{Y}_S \cap \mathcal{Y}_T \neq \emptyset$ and $\mathcal{Y}_T \not\subseteq \mathcal{Y}_S$. Our aim is to align the distributions of instances in the source and target domains with labels in $\mathcal{Y}_S \cap \mathcal{Y}_T$. Moreover, we want to identify instances that are probably transfer relevant with respect to some confidence measure or heuristic.
In the following, we describe DIWAN. Specifically, five neural networks are used: the source and target representer networks $R_S$ and $R_T$, a re-weighting network $D'$, a constrained domain discriminator $D$, and a source classifier $C$. $R_S$ and $C$ are pre-trained in a standard supervised way on the source dataset. The target representer is typically initialized as $R_T = R_S$. During training, weights are assigned to each instance $x$. These are
$$w_S(x) = \frac{1 - D'(R_S(x))}{\mathbb{E}_{x \sim P_S}\!\left[1 - D'(R_S(x))\right]}, \qquad w_T(x) = \frac{D'(R_T(x))}{\mathbb{E}_{x \sim P_T}\!\left[D'(R_T(x))\right]} \tag{2}$$
for the source and target, respectively. These choices are explained in the next section.
We let $w^{(k)}$ denote the collection of weights for a fixed $R_T$ and $D'$ at step $k$. The expectation above is estimated by averages over mini-batches. The objective function used is
$$\min_{R_T} \max_{D} \; \mathbb{E}_{x \sim P_S}\!\left[w_S(x) \log D(R_S(x))\right] + \mathbb{E}_{x \sim P_T}\!\left[w_T(x) \log\left(1 - D(R_T(x))\right)\right].$$
For $D$ and for $R_T$, we have, given $w^{(k)}$,
$$\mathcal{L}_D = -\mathbb{E}_{x \sim P_S}\!\left[w_S^{(k)}(x) \log D(R_S(x))\right] - \mathbb{E}_{x \sim P_T}\!\left[w_T^{(k)}(x) \log\left(1 - D(R_T(x))\right)\right],$$
$$\mathcal{L}_{R_T} = -\mathbb{E}_{x \sim P_T}\!\left[w_T^{(k)}(x) \log D(R_T(x))\right].$$
Here, the superscript $(k)$ is used to emphasize that the weights are computed for all instances before $R_T$ is updated. The pseudocode for DIWAN is provided in Algorithm 1. A visual overview of the five neural networks and the training procedure is illustrated in Figure 2.
Algorithm 1: Doubly Importance Weighted Adversarial Network (DIWAN).
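To fix ideas, the reweighting computations of one DIWAN-style iteration can be sketched as follows (our reading of the IWAN-style heuristic of [8]; the array names and the interpretation of the auxiliary discriminator output as the probability of the source domain are our assumptions, not the authors' code):

```python
# Sketch of one DIWAN-style reweighting step (illustration only).
# d_aux_* hold auxiliary-discriminator outputs D'(x) in (0, 1), read as the
# probability that x originates from the source domain.
import numpy as np

def instance_weights(d_aux_src, d_aux_tgt):
    """Down-weight probable outliers and normalize each batch to mean 1."""
    w_src = 1.0 - d_aux_src   # confidently "source" -> probable source outlier
    w_tgt = d_aux_tgt         # confidently "target" -> probable target outlier
    return w_src / w_src.mean(), w_tgt / w_tgt.mean()

def weighted_discriminator_loss(d_src, d_tgt, w_src, w_tgt, eps=1e-7):
    """Importance-weighted cross-entropy for the constrained discriminator."""
    return (-np.mean(w_src * np.log(d_src + eps))
            - np.mean(w_tgt * np.log(1.0 - d_tgt + eps)))

rng = np.random.default_rng(1)
d_aux_src = rng.uniform(0.05, 0.95, 64)
d_aux_tgt = rng.uniform(0.05, 0.95, 64)
w_src, w_tgt = instance_weights(d_aux_src, d_aux_tgt)
```

The mean-1 normalization plays the role of the batch-level expectation estimate, so the weighted losses stay on the same scale as their unweighted counterparts.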
4. Analysis
We now provide theoretical support regarding the design of the proposed algorithm.
Proposition 1. Let $p$ and $q$ be distributions over some bounded space $\mathcal{X}$ and let $f : \mathcal{X} \to (0, 1)$ be some real-valued function. We define
$$V(f) = \mathbb{E}_{x \sim p}\!\left[\log f(x)\right] + \mathbb{E}_{x \sim q}\!\left[\log\left(1 - f(x)\right)\right].$$
Then, $V$ is maximized by $f^*(x) = \frac{p(x)}{p(x) + q(x)}$.
Proof. We will use a variational calculus argument. We denote $L(x, f) = p(x) \log f(x) + q(x) \log(1 - f(x))$, define the functional
$$V(f) = \int_{\mathcal{X}} L(x, f(x)) \, dx,$$
and note that $L$ does not depend on the derivatives of $f$. We solve the Euler–Lagrange equations, yielding
$$\frac{\partial L}{\partial f} = \frac{p(x)}{f(x)} - \frac{q(x)}{1 - f(x)} = 0 \;\;\Longrightarrow\;\; f^*(x) = \frac{p(x)}{p(x) + q(x)}. \qquad \square$$
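The optimality condition can be checked numerically on a discrete example (our own sanity check; the grid densities below are arbitrary and the functional is the standard GAN-style value):

```python
# Numerical sanity check that f* = p / (p + q) maximizes the value V on a
# discrete grid (illustration only; p and q are arbitrary discrete densities).
import numpy as np

p = np.array([0.1, 0.2, 0.4, 0.3])
q = np.array([0.3, 0.3, 0.2, 0.2])

def V(f):
    # Discrete analogue of E_p[log f] + E_q[log(1 - f)].
    return np.sum(p * np.log(f)) + np.sum(q * np.log(1.0 - f))

f_star = p / (p + q)

# Any perturbation of f* (kept inside (0, 1)) should not increase V.
rng = np.random.default_rng(2)
perturbed = [np.clip(f_star + rng.normal(0, 0.05, 4), 1e-3, 1 - 1e-3)
             for _ in range(100)]
assert all(V(f) <= V(f_star) + 1e-12 for f in perturbed)
```

Because the integrand is maximized pointwise at $f^* = p/(p+q)$, no perturbation can improve the value, which is exactly what the check confirms.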
Corollary 1. Let $p$, $q$ have bounded support. Then,
$$\max_{f} V(f) = 2 \, \mathrm{JSD}(p \,\|\, q) - \log 4.$$
Lemma 1. Let $P_S$, $P_T$ have bounded support. Let $w_S$, $w_T$ be non-negative weight functions that satisfy $\mathbb{E}_{P_S}[w_S] = \mathbb{E}_{P_T}[w_T] = 1$. Then, $\mathrm{JSD}(w_S P_S \,\|\, w_T P_T)$ is well defined.
Proof. Let $\tilde{P}_S = w_S P_S$ and $\tilde{P}_T = w_T P_T$. Since $w_S, w_T \geq 0$ and $\mathbb{E}_{P_S}[w_S] = \mathbb{E}_{P_T}[w_T] = 1$, these are valid probability densities. Furthermore, for $i \in \{S, T\}$, the support of $\tilde{P}_i$ is contained in the support of $P_i$. □
Corollary 2. Let $P_S$, $P_T$ have bounded support. Let $w_S$, $w_T$ be as in (2) and let $D^*$ be the minimizer of $\mathcal{L}_D$ for a fixed $R_T$. Then, training $R_T$ against $D^*$ minimizes $2 \, \mathrm{JSD}(w_S P_S \,\|\, w_T P_T) - \log 4$.
Essentially, we see that OSDA can be cast as an alternating optimization problem. Moreover, we see that a heuristic for identifying probable TRIs can define an algorithm for OSDA by simply normalizing its heuristic scores and using them as weights in $\mathcal{L}_D$. Intuitively, we aim to down-weight the outliers in the source and target domains. We use an adapted heuristic similar to the one used in [8,11] for reweighting. The idea behind it is that TRIs will lie near the decision boundary of a “good” domain discriminator, while outlier classes will lie far from it; a good domain discriminator will easily distinguish that an outlier source (target) instance belongs to the source (target) domain. In particular, source instances $x$ for which $D'(R_S(x)) \approx 1$ are likely to be outliers, since $D'$ can easily tell that they originate from the source domain. Thus, they should be down-weighted. Similarly, target instances $x$ for which $D'(R_T(x)) \approx 0$ are likely to be outliers and should be down-weighted.
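This intuition can be illustrated with a Bayes-optimal auxiliary discriminator on one-dimensional Gaussian mixtures (our own toy example; the mixture components and the closed form of the discriminator are assumptions made for illustration):

```python
# Illustration of the down-weighting heuristic with a Bayes-optimal auxiliary
# discriminator on 1-D Gaussian mixtures (toy example only).
import numpy as np

def gauss(x, mu, s=0.5):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

# Shared class centered at 0; a source-only outlier class at +4 and a
# target-only outlier class at -4.
p_src = lambda x: 0.5 * gauss(x, 0.0) + 0.5 * gauss(x, 4.0)
p_tgt = lambda x: 0.5 * gauss(x, 0.0) + 0.5 * gauss(x, -4.0)
d_aux = lambda x: p_src(x) / (p_src(x) + p_tgt(x))   # P(source | x)

w_src = lambda x: 1.0 - d_aux(x)   # un-normalized source weight
w_tgt = lambda x: d_aux(x)         # un-normalized target weight
```

At the shared class (x = 0) both weights are 0.5, while at the outlier classes the corresponding weight collapses toward zero, which is exactly the suppression the heuristic is designed to achieve.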
5. Experiments
For the empirical verification of our approach, we performed three different experiments. In the first experiment, our aim was to show that the adversarial neural network techniques commonly used in DA are prone to negative transfer when applied in settings with target outlier classes. In particular, we demonstrated that if adaptation is performed using the ADDA algorithm [9], the accuracy on transfer-relevant instances will be lower when target outliers are present.
In the second experiment, our goal was to demonstrate that DIWAN mitigates this negative transfer and to compare the results obtained with ADDA and IWPDA [8]. To this end, we used only transfer-relevant instances to evaluate the methods. Clearly, target outlier instances were misclassified by all three methods and were thus ignored. However, DIWAN offers a heuristic for identifying transfer-relevant instances, and this was evaluated in the final experiment.
Lastly, we demonstrated that we could select a threshold for the assigned weights of target instances after DIWAN was run, such that target instances with weights above this threshold were very likely to be transfer relevant. In particular, we calculated the accuracy of the target model obtained through DIWAN over all data instances with weights above certain thresholds, and we showed that there were threshold choices that could be selected through cross-validation, yielding high-quality classification.
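A minimal sketch of this thresholding step follows (our own illustration with synthetic weights; the helper name and the synthetic weight distributions are hypothetical):

```python
# Sketch of threshold selection on DIWAN target weights (illustration only):
# keep target instances whose weight exceeds a cut-off tau and report how much
# of the transfer-relevant set is retained and how pure the retained set is.
import numpy as np

def threshold_report(weights, is_tri, tau):
    kept = weights > tau
    retained_tri = np.sum(kept & is_tri) / max(np.sum(is_tri), 1)  # TRI recall
    purity = np.sum(kept & is_tri) / max(np.sum(kept), 1)          # TRI precision
    return retained_tri, purity

# Synthetic weights: TRIs tend to receive larger weights than outliers.
rng = np.random.default_rng(3)
is_tri = np.arange(200) < 120
weights = np.where(is_tri,
                   rng.uniform(0.5, 2.0, 200),   # transfer-relevant instances
                   rng.uniform(0.0, 0.8, 200))   # target outliers
recall, purity = threshold_report(weights, is_tri, tau=0.8)
```

Sweeping tau over a validation split trades retained coverage (recall) against the reliability of the kept subset (purity), which is the cross-validated threshold choice described above.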
5.1. Dataset and Task Description
Our experiments were conducted on four different DA tasks for images: image obstruction, rotation, displacement, and re-scaling. The samples used within these tasks are depicted in Figure 3. Our dataset was comprised of the MNIST [16] and USPS [17] handwritten digit datasets, as in [8,9]. The former contains 28 × 28 pixel gray-scale images, while the latter contains 16 × 16 pixel gray-scale images. Both datasets contain 10 classes corresponding to each of the 10 digits. The USPS images were padded to obtain a homogeneous DA problem. That is, we appended zeros on all four edges of each USPS image so that it became a 28 × 28 pixel image, where the original image occupied the central pixels. In all of our experiments, MNIST was used as the source dataset, and USPS was used to generate a synthetic dataset for each of the tasks.
For the obstruction task, we set a 14-pixel area starting from the top-left corner of each image in the USPS dataset to an intensity of 1. Similarly, for the rotation task, each image in USPS was rotated counter-clockwise by a fixed angle. For the re-scaling task, each image was shrunk by a constant factor on both the vertical and horizontal axes and then re-centered. Finally, for the displacement task, each image was displaced by five pixels to the left and three pixels upwards. The covariate shift introduced in all four tasks was systematic; the same transformation was applied to all USPS images.
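The preprocessing above can be sketched as follows (our reconstruction from the text: we take the obstruction to be a 14 × 14 square, which is an assumption, and omit the rotation and re-scaling transforms, which would require, e.g., scipy.ndimage):

```python
# Sketch of the synthetic-task preprocessing (illustration only; assumes a
# 16 x 16 USPS-like image and a 14 x 14 obstruction square).
import numpy as np

def pad_usps(img16):
    """Zero-pad a 16 x 16 image to 28 x 28, keeping it centered."""
    return np.pad(img16, 6)   # 6 zeros on each edge: 16 + 2 * 6 = 28

def obstruct(img, size=14):
    """Set a size x size area from the top-left corner to intensity 1."""
    out = img.copy()
    out[:size, :size] = 1.0
    return out

def displace(img, left=5, up=3):
    """Shift the image left/up, filling the vacated pixels with zeros."""
    out = np.zeros_like(img)
    out[:img.shape[0] - up, :img.shape[1] - left] = img[up:, left:]
    return out

img = np.random.default_rng(4).uniform(0, 1, (16, 16))
padded = pad_usps(img)
```

Applying the same deterministic transform to every USPS image is what makes the induced covariate shift systematic.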
5.2. Experiments and Methodology
We started by running experiments to test whether outlier target labels were a source of negative transfer in DA. ADDA was run for each task for 50 trials on a full DA scenario with five source labels. The labels were chosen randomly for each task and trial. We then repeated the experiment for a scenario in which there were 10 target labels and five source labels, and the accuracy was measured only on transfer-relevant instances. The source network was kept the same for each scenario, and the hyperparameters were tuned to optimize the performance for each task. A convolutional architecture was used for the source representer network. We performed this experiment for all four tasks and plotted average accuracies against the number of iterations.
For the second experiment, for each task, we varied the number of outlier instances in both the source and target domains, and for each combination, we reported the mean results of four algorithms on ten trials. The algorithms that we used were ADDA, IWPDA, DIWAN, and the source model (without any adaptations). For each trial, the target and source labels were selected randomly. Note also that the number of source labels was always equal to 5.
For our final experiment, recall that once training was finished, we could use $D'$ to calculate (2), where the expectation was replaced by the empirical average over the entire target dataset. We present histograms of the weights for each task, where representative setups of labels are chosen. Our aim was to illustrate that the weights tended to be larger for transfer-relevant instances. We then went on to compute the adapted model accuracies after a threshold was imposed. We kept the models from experiment 2 for the case in which there were 10 target labels. The average over the ten trials is presented for each threshold.
For each task, we used four different thresholds, and for each, we computed the accuracy of our model on the subset of the target dataset that we obtained after the cut-off procedure. We further gave the percentage of total transfer-relevant target instances that remained in the dataset after applying the threshold and the percentage of transfer-relevant instances in the remaining dataset.