2.2.1. Mapping Net
Human beings can summarize the attributes of observed objects from the visual features seen by the naked eye and infer the categories of those objects from the attributes. For example, if a child learns from watching a horse, a panda, and a tiger that they are "horse-like", "black-and-white", and "striped", respectively, he or she can easily pick out a zebra from a variety of animals after being told that a zebra is a horse with black and white stripes [30]. This ability to recognize objects from prior knowledge alone, without any visual samples, is zero-shot learning. Equipping machines with this ability is highly desirable: in real life, the object categories to be recognized usually follow a long-tail distribution, in which some categories have rich training samples while others have few or none. Zero-shot learning not only removes the dependence on large numbers of manually labeled samples but also has high commercial value in applications that lack labeled samples. To give machines this capability, ref. [9] introduced a manually defined attribute layer for the first time. Through this attribute layer, a classifier based on low-dimensional image features is transformed into a classifier based on high-dimensional semantic features (the attribute layer), so that the trained classifier has broader classification ability and can break through category boundaries. For example, in an animal identification problem, attributes can be body color (e.g., "gray", "brown", "yellow") or habitat (e.g., "coastal", "desert", "forest"). These attributes are then used to construct semantic spaces.
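To make the idea concrete, here is a toy sketch of attribute-based classification in the spirit of the zebra example; the class names, the three binary attributes, and all numeric values are hypothetical and purely illustrative.

```python
import torch

# Toy class-attribute matrix: rows = classes, columns = attributes
# [horse-like, black-and-white, striped]. Values are hypothetical.
classes = ["horse", "panda", "tiger", "zebra"]
attribute_matrix = torch.tensor([
    [1., 0., 0.],   # horse
    [0., 1., 0.],   # panda
    [0., 0., 1.],   # tiger
    [1., 1., 1.],   # zebra: horse-like, black-and-white, striped
])

# Suppose an attribute classifier predicts these scores for a test image.
predicted_attributes = torch.tensor([0.9, 0.8, 0.85])

# Classify by matching the prediction to the closest class prototype
# in attribute space (dot-product similarity).
scores = attribute_matrix @ predicted_attributes
print(classes[scores.argmax().item()])  # -> "zebra"
```

Note that "zebra" needs no training images here: its attribute prototype alone is enough to classify it.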
Semantic embedding (SE) in conventional ZSL aims to learn an embedding function $E$ that maps a visual feature $x$ into the semantic embedding space, denoted as $\mathcal{S}$. The embedding function $E$ is usually a linear transformation consisting of two linear layers, whose input dimension is set to the dimension of the visual feature and whose output dimension is set to the dimension of the semantic feature. At the same time, $\mathcal{S}$ is also called the linear semantic space because it is composed of fully connected layers. These commonly used semantic embedding methods rely on a structured loss function proposed in [15]. Under the dot-product similarity in the embedding space, the structured loss requires that the embedding of $x$ be closer to the semantic embedding $a$ of its ground-truth class than to the other class embeddings. Specifically, the structured loss is as follows:

$$\mathcal{L}_{SE} = \mathbb{E}_{(x,a)\sim p_s}\big[\max\big(0,\ \Delta - E(x)^{\top} a + E(x)^{\top} a'\big)\big], \qquad (1)$$

where $p_s$ is the empirical distribution of the training samples of seen classes, $a'$ is a randomly selected semantic descriptor of a category other than $a$, and $\Delta$ is a margin parameter that makes $E$ more robust.
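For illustration, the following is a minimal PyTorch sketch of an embedding function $E$ built from two linear layers and of the structured loss in Equation (1); the layer widths and the 2048-d/312-d dimensions are assumptions (typical ResNet features and CUB attributes), not values fixed by this section.

```python
import torch
import torch.nn as nn

class SemanticEmbedding(nn.Module):
    """Two-layer linear mapping E from visual features to semantic space."""
    def __init__(self, visual_dim=2048, semantic_dim=312, hidden_dim=1024):
        super().__init__()
        # No activation between the layers, matching the text's description
        # of E as a linear transformation built from two linear layers.
        self.net = nn.Sequential(
            nn.Linear(visual_dim, hidden_dim),
            nn.Linear(hidden_dim, semantic_dim),
        )

    def forward(self, x):
        return self.net(x)

def structured_loss(ex, a_true, a_other, margin=1.0):
    """Hinge ranking loss of Eq. (1): E(x) should score higher with its own
    class descriptor a_true than with a random other-class descriptor."""
    pos = (ex * a_true).sum(dim=1)   # dot-product similarity to ground truth
    neg = (ex * a_other).sum(dim=1)  # similarity to a sampled other class
    return torch.clamp(margin - pos + neg, min=0).mean()
```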
On the basis of the traditional embedding function, Chen et al. [31] found that adding a non-linear projection head $H$ on top of the embedding space, as $h = H(E(x))$, can better constrain the original linear embedding space $\mathcal{S}$, because they showed experimentally that more information can be formed and maintained in $h$ through this non-linear projection. In the same way that $\mathcal{S}$ is called the linear space because $E$ is composed of fully connected layers, we call $\mathcal{H}$ a non-linear space because the projection $H$ contains a ReLU non-linearity. We follow Chen's strategy in our model; the difference is that Chen set $H$ and $E$ to the same output dimensionality (e.g., 2048-d), while we change the output dimension of $E$ to the dimension of the semantic descriptor of the dataset (e.g., 312-d for the CUB dataset); then, the linear space can be limited to the semantic embedding space.
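A minimal sketch of such a non-linear projection head $H$, assuming a two-layer MLP with a ReLU between the layers (the SimCLR-style design used in [31]); the dimensions are again assumptions, with the input matching the 312-d semantic output of $E$.

```python
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Non-linear head H applied on top of E, giving h = H(E(x))."""
    def __init__(self, semantic_dim=312, out_dim=312):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(semantic_dim, semantic_dim),
            nn.ReLU(inplace=True),   # the ReLU makes the head non-linear
            nn.Linear(semantic_dim, out_dim),
        )

    def forward(self, e):
        return self.net(e)
```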
For the non-linear space $\mathcal{H}$, we follow the strategy in [32] to perform a $(K+1)$-way classification on $\mathcal{H}$ to learn the embedding $h = H(E(x))$, where $K$ is the number of negative examples $h_k^{-}$, i.e., samples whose class label differs from that of $x$, while the only positive example is $h^{+}$. Concretely, the cross-entropy loss of this $(K+1)$-way classification problem is calculated as follows:

$$\mathcal{L}_{NCE} = -\log \frac{\exp\left(h^{\top} h^{+} / \tau\right)}{\exp\left(h^{\top} h^{+} / \tau\right) + \sum_{k=1}^{K} \exp\left(h^{\top} h_k^{-} / \tau\right)}, \qquad (2)$$

where $\tau$ is a constant called the temperature parameter, which is set manually to adjust the degree of attention paid to negative samples: the smaller the temperature, the more the loss focuses on separating a sample from the other samples most similar to it.
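The $(K+1)$-way cross-entropy of Equation (2) can be sketched as follows; the function name, its signature, and the temperature value 0.1 are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(h, h_pos, h_negs, tau=0.1):
    """(K+1)-way cross-entropy of Eq. (2).
    h: (D,) anchor embedding; h_pos: (D,) the single positive;
    h_negs: (K, D) negatives; tau: temperature (assumed value)."""
    pos = (h * h_pos).sum() / tau                 # similarity to the positive
    negs = h_negs @ h / tau                       # similarities to K negatives
    logits = torch.cat([pos.unsqueeze(0), negs])  # (K+1,) classification logits
    # The correct "class" is index 0, i.e., the positive example.
    return F.cross_entropy(logits.unsqueeze(0),
                           torch.zeros(1, dtype=torch.long))
```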
2.2.2. Feature Generation Nets
The main disadvantage of embedding-based methods is that they suffer from the bias problem: since the projection function is learned using only seen classes during training, it is biased toward predicting seen class labels as output. This bias is caused by a serious data imbalance between seen and unseen class data.
In supervised learning, data imbalance refers to a huge difference in the number of samples across the categories of a dataset. Take a binary classification problem as an example: if the number of positive samples is much larger than the number of negative samples, the data are called unbalanced. In zero-shot learning, this problem is even more extreme: the samples of the unseen classes are completely missing and cannot participate in the model training process at all. Therefore, the supervised-learning remedies of repeatedly sampling categories with fewer samples (over-sampling) or reducing the sampling of categories with more samples (under-sampling) to achieve data balance are not applicable to zero-shot learning; after all, no samples can be collected from unseen classes. Unseen class data generation has therefore become a hot research topic: by generating pseudo-samples for unseen classes, both seen and unseen classes obtain training samples, and the problem is transformed into supervised learning. Generative Adversarial Networks [23] are particularly appealing, as they can generate realistic and sharp images conditioned, for instance, on object categories. However, in previous work on generation-based methods, which learn a generation network to produce unseen samples, the synthesized instances are usually assumed to follow some fixed distribution (usually a Gaussian) [13], which leads to a large deviation between the generated samples and the real samples and cannot truly represent the real data of the unseen classes. The idea of stacking multilevel generation networks has been proven effective in improving generation quality, but it has not been used in the ZSL field. In this paper, two conditional GANs ($\mathrm{GAN}_v$ and $\mathrm{GAN}_s$) are stacked to solve the problem of data imbalance from different aspects.
$\mathrm{GAN}_v$, the GAN for generating visual features: A network based on the traditional GAN takes random noise as its prior-information input, and the inherent randomness of the deep neural network makes the quality of the generated images unstable. To solve this problem, the conditional GAN was proposed: by adding conditional information to the network model, it guides the model to generate pseudo-samples matching the conditions. We extend the GAN to a conditional GAN by integrating the class embedding into both the generator $G$ and the discriminator $D$. Given the training data of seen classes, $G$ takes random Gaussian noise $z$ and a semantic embedding $a$ as its inputs and outputs a CNN image feature $\tilde{x}$ of class $y$. Once the generator $G$ learns to generate CNN features of seen class images, i.e., $x$, conditioned on the seen class embedding $a$, it can also generate features $\tilde{x}^{u}$ of any unseen class via its class embedding $a^{u}$. The objective function can be expressed as:

$$\min_{G} \max_{D} \; \mathbb{E}_{x \sim p_{data}}\left[\log D(x, a)\right] + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z, a), a)\right)\right]. \qquad (3)$$
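A minimal PyTorch sketch of what such a conditional generator and discriminator could look like; the hidden width, activations, and dimensions are assumptions in the common style of feature-generating GANs, not the exact architecture of this paper.

```python
import torch
import torch.nn as nn

class CondGenerator(nn.Module):
    """G(z, a): noise + class embedding -> synthetic CNN feature."""
    def __init__(self, noise_dim=312, attr_dim=312, feat_dim=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim + attr_dim, 4096),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(4096, feat_dim),
            nn.ReLU(inplace=True),  # CNN features are non-negative after ReLU
        )

    def forward(self, z, a):
        return self.net(torch.cat([z, a], dim=1))

class CondDiscriminator(nn.Module):
    """D(x, a): feature + class embedding -> real-valued critic score."""
    def __init__(self, feat_dim=2048, attr_dim=312):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + attr_dim, 4096),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(4096, 1),  # no sigmoid: used as a WGAN critic below
        )

    def forward(self, x, a):
        return self.net(torch.cat([x, a], dim=1))
```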
However, the adversarial nature of GANs makes them notoriously difficult to train, and the Jensen–Shannon divergence optimized by the original GAN leads to instability issues. To cure the unstable training of GANs, Wasserstein-GAN (WGAN) [33] was proposed, which optimizes an efficient approximation of the Wasserstein distance [25]. While WGAN attains better theoretical properties than the original GAN, it still suffers from vanishing and exploding gradients due to the weight clipping used to enforce the 1-Lipschitz constraint on the discriminator. So, we use the improved variant of WGAN, namely WGAN-GP [34], which enforces the Lipschitz constraint through a gradient penalty. We extend the original WGAN-GP to a conditional WGAN-GP by integrating the class embedding $a$ into both the generator and the discriminator. The loss is

$$\mathcal{L}_{\mathrm{WGAN}_v} = \mathbb{E}\left[D(x, a)\right] - \mathbb{E}\left[D(\tilde{x}, a)\right] - \lambda\, \mathbb{E}\left[\left(\lVert \nabla_{\hat{x}} D(\hat{x}, a) \rVert_2 - 1\right)^2\right], \qquad (4)$$

where $\tilde{x} = G(z, a)$, $\hat{x} = \alpha x + (1 - \alpha)\tilde{x}$ with $\alpha \sim U(0, 1)$, and $\lambda$ is the penalty coefficient. In contrast to the traditional GAN, the discriminative network here eliminates the sigmoid layer and outputs a real value. Instead of optimizing the log-likelihood in Equation (3), the first two terms in Equation (4) approximate the Wasserstein distance, and the third term is the gradient penalty, which enforces the gradient of $D$ to have a unit norm along the straight line between pairs of real and generated points.
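The gradient penalty term of Equation (4) can be implemented as follows; this is a generic WGAN-GP sketch, with the penalty coefficient $\lambda = 10$ in the usage comment being an assumed, commonly used value.

```python
import torch

def gradient_penalty(D, x_real, x_fake, a, device="cpu"):
    """Gradient penalty of Eq. (4): unit-gradient-norm constraint enforced
    on random interpolates between real and generated features."""
    alpha = torch.rand(x_real.size(0), 1, device=device)  # alpha ~ U(0,1)
    x_hat = alpha * x_real + (1 - alpha) * x_fake         # interpolated points
    x_hat.requires_grad_(True)
    d_out = D(x_hat, a)
    grads = torch.autograd.grad(
        outputs=d_out, inputs=x_hat,
        grad_outputs=torch.ones_like(d_out),
        create_graph=True, retain_graph=True,
    )[0]
    return ((grads.norm(2, dim=1) - 1) ** 2).mean()

# Critic loss (minimizing the negation of Eq. (4)), with lambda = 10 assumed:
# d_loss = -(D(x, a).mean() - D(x_fake, a).mean()) \
#          + 10.0 * gradient_penalty(D, x, x_fake, a)
```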
$\mathrm{GAN}_s$, the GAN for generating semantic embeddings: The embedding-based method obtains labeled instances of unseen classes by mapping instances in the feature space and attribute prototypes in the semantic space into the same space. The feature space contains labeled training instances of seen classes, and the semantic space contains attribute prototypes of both seen and unseen classes. Both spaces are real-valued spaces in which instances and attribute prototypes are vectors, respectively. By projecting the vectors from these two spaces into a common space, we can obtain labeled instances of unseen classes and classify them in the mapping space. However, in the embedding-based method, every unseen class has no labeled instance in the feature space; thus, its attribute prototype in the semantic space is the only labeled instance belonging to that class. That is, only one labeled instance is available for each unseen class. Since labeled instances of unseen classes are so scarce, feature generation methods were proposed to ease the data imbalance by generating visual features for unseen classes in the feature space. However, in the semantic space, labeled instances of the unseen classes are still scarce; especially in the GZSL setting, the mapping results remain biased toward the seen classes. Therefore, appropriately adding semantic vectors of unseen classes to the semantic space can alleviate the bias of the mapping results.
Active learning is similar to zero-shot learning to some extent. Both are designed to reduce the dependence on large-scale labeled data and target scenarios where labeled data are rare or the "cost" of labeling is high. The difference is that zero-shot learning aims to realize knowledge transfer in the absence of labeled samples, while active learning aims to maximize model performance by actively selecting the most valuable samples for labeling. Therefore, some techniques from active learning can enlighten us. In active learning, Parvaneh et al. proposed the feature mixing method: compute the average visual representation $\bar{x}_i$ of the labeled samples per class and call it an anchor; the anchors for all classes form the anchor set and serve as representatives of the labeled instances [17]. Inspired by their method, we take the average visual feature as the representation of a class and generate the semantic embeddings $\tilde{a}_i$ to which this class might be mapped, as shown in Figure 2. The generated semantic embeddings should have the following two characteristics. First, by generating the different semantic embeddings that may be mapped from the same class, we extend the original unique semantic descriptor of each category in the semantic space into a semantic descriptor set $A_i = \{a_i^{0}, a_i^{1}, \ldots, a_i^{m}\}$, where $a_i^{0}$ is the real semantic descriptor of category $i$ provided by the dataset, while $a_i^{1}$ to $a_i^{m}$ are the synthetic pseudo-semantic-descriptors, just like extending a unique evaluation criterion into a qualifying interval. Second, the generated semantic embeddings should be representative and authentic, i.e., similar to the real semantic embeddings mapped from visual features, so that they truly simulate the possible mapping situations without deviating from reality. We formulate the class representation used by our pseudo-semantic-embedding generation method as follows:

$$\bar{x}_i = \frac{1}{n} \sum_{x \in X_i} x, \qquad (5)$$

where $n$ is the number of visual feature instances that class $i$ contains, and $X_i$ is the set of visual features contained in class $i$.
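A small sketch of computing these per-class anchors as in Equation (5); the function name and signature are illustrative.

```python
import torch

def class_anchors(features, labels, num_classes):
    """Equation (5): per-class averages of visual features ("anchors").
    features: (N, D) tensor; labels: (N,) long tensor of class ids."""
    anchors = torch.zeros(num_classes, features.size(1))
    for i in range(num_classes):
        mask = labels == i
        if mask.any():
            anchors[i] = features[mask].mean(dim=0)  # average over class i
    return anchors
```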
We again select the conditional WGAN, integrating the visual feature average $\bar{x}_i$ into both the generator and the discriminator. The loss is

$$\mathcal{L}_{\mathrm{WGAN}_s} = \mathbb{E}\left[D_s(\bar{a}_i, \bar{x}_i)\right] - \mathbb{E}\left[D_s(\tilde{a}_i, \bar{x}_i)\right] - \lambda\, \mathbb{E}\left[\left(\lVert \nabla_{\hat{a}} D_s(\hat{a}, \bar{x}_i) \rVert_2 - 1\right)^2\right], \qquad (6)$$

where $\bar{a}_i = E(\bar{x}_i)$ is the real semantic embedding corresponding to the synthetic semantic embedding $\tilde{a}_i$, obtained by inputting the average visual features $\bar{x}_i$ of category $i$ into the mapping net, and $\hat{a}$ is an interpolate between $\bar{a}_i$ and $\tilde{a}_i$, as in Equation (4).