1. Introduction
Deep metric learning aims to learn a similarity metric superior to the traditional Euclidean distance by deriving high-dimensional data representations [1]. Contemporary research underscores the importance of designing batch samplers [2,3,4] and employing online triplet mining [5,6,7,8], with hard negative mining schemes [9,10,11] traditionally serving as the preferred choice; both have proven crucial for refining the similarity metric in deep metric learning. For the key concepts related to hard example mining and batch samplers, see Section 3. These techniques have been applied to a variety of tasks, including face recognition [12,13,14,15,16] and person re-identification (ReID) [2] in Computer Vision, as well as Natural Language Processing tasks such as few-shot relation extraction [17], few-shot text classification [18], text clustering [19], and sentence embedding [20], consistently boosting performance. Although certain methods, including batch samplers and online triplet mining, have demonstrated admirable performance, two significant challenges persist that invite further exploration.
Challenge 1: Traditional batch samplers struggle to identify hard examples situated at cluster intersections. To the best of our knowledge, existing batch samplers fall into two types: the standard PK sampler (PK) [21] and the samplers associated with hard negative mining (HNM) schemes. The PK sampler randomly selects P classes and K examples from each class, and this randomness compromises its efficiency in sampling hard examples. To circumvent this limitation, certain HNM schemes construct mini-batches for each epoch by mining hard examples from the entire training set, achieving significant results [2,3,22,23]. These schemes primarily identify hard classes by creating Nearest Neighbor Graphs (NNGs) at the instance-to-instance [3] or instance-to-class [22,23] level (e.g., the state-of-the-art Graph Sampling (GS) [2]). Examples (e.g., those in the red rectangular box in Figure 1) are then randomly chosen from these hard classes to compose the mini-batch [2]. It is worth noting that, because of this prevailing random sampling from hard classes, much of the existing research overlooks the direct mining of hard examples such as those enclosed by the rectangle in Figure 1, which are situated at the intersections of clusters (i.e., groups of data objects belonging to the same class, represented by green circles in Figure 1). Importantly, these examples pose significant classification challenges for the model. Although they may seem analogous to the ‘support vectors’ in SVM [24] because of their positions and their influence on the model, they play a different role in our context owing to their distinct definitions and usages. We refer to these hard examples as ‘support examples’ and their corresponding classes as ‘support classes’. This motivates the study of prototype-based NNG construction, which aims to identify the support classes and to select support examples located at the intersections of support class clusters. Details on prototype acquisition are discussed in Section 4. As shown in Table 1, we compare the PK sampler, GS [2], and our proposed Prototype-based Support Example Miner (pSEM), detailed subsequently. Since PK and GS randomly sample hard examples, they are incapable of mining support examples. In contrast, our non-random pSEM focuses on the intersections between clusters and directly mines support examples, as illustrated in Figure 1.
Challenge 2: Online Triplet Mining overlooks the issue of excessively loose clusters caused by outliers serving as anchors. The Online Triplet Mining (OTM) scheme, introduced together with Triplet Loss [5], operates on triplets of examples (anchor, positive, negative) and generally aims to minimize the distance between the anchor and the positive example while maximizing the distance between the anchor and the negative example within the mini-batch [5,7,8,25]. Nonetheless, when the anchor is an outlier, this can produce overly loose clusters, which are undesirable because they impair the model's ability to classify. Specifically, as the distance between the outlier anchor and the positive example is minimized, the positive example is pulled away from its class cluster, as illustrated in Figure 2(1). This may result in an excessively loose cluster for the class to which the positive example belongs. The same adverse outcome arises when the distance between the outlier anchor and the negative example is maximized. Inspired by stochastic optimization strategies, our approach integrates the concept of historical gradients, a technique that uses previously computed gradients to curb oscillation and stabilize the learning process [26], to address the challenges associated with outlier anchors and excessively loose clusters. To tackle these issues, we introduce the notion of ‘historical normal anchors’, which we define as the normal anchors derived from previous mini-batches during training. We merge these historical normal anchors with outlier anchors using the exponentially weighted average formula [26] to create a corrected, normal anchor. Through this merging, the outlier anchors are infused with the intrinsic structure characteristic of historical normal anchors, a structure embodying the inherent patterns, similarities, and clusters within the data. This not only brings outlier anchors closer to both historical normal anchors and cluster centers but also aligns them better with the underlying structure of the data, thereby markedly enhancing the representational quality of outlier anchors. As depicted in Figure 2(2), the corrected outlier anchor pulls the positive example back toward the cluster, alleviating the aforementioned adverse outcome.
To address the challenges encountered in these two stages of deep metric learning, we propose the Prototype-based Support Example Miner and the Prototype-based Triplet Loss. First, the Prototype-based Support Example Miner (pSEM) identifies support classes by building NNGs from the prototypes and mines support examples at the cluster intersections of the support classes. Second, to alleviate the problem of excessively loose clusters caused by outlier anchors, we design the Prototype-based Triplet Loss (pTriplet Loss), which detects the outlier anchor and corrects it by fusing it with the historical normal anchor. We showcase the effectiveness and generalizability of our schemes through experiments on Person Re-identification (ReID) and Few-shot Relation Extraction (FSRE), typical tasks in Computer Vision (CV) and Natural Language Processing (NLP), respectively.
2. Related Work
In the domain of deep metric learning, several studies focus on the design of batch samplers and OTM to learn a similarity metric function. These methods are specifically engineered to mine hard examples from the training set, thereby enabling the model to concentrate on learning these examples during the training process [1].
Batch sampler designing. The standard methodologies include typical PK samplers and HNM schemes. The PK sampler fails to mine hard examples, as elaborated in Section 1. HNM schemes, on the other hand, stabilize the deep neural network after a few iterations or after each epoch and then leverage the updated model to detect hard examples across the entire training set; the detected hard examples are used to construct the mini-batch for training at each step. Bootstrapping [27] employs the model updated every few iterations to find hard examples within the training set while training deep neural networks, although this dramatically decelerates training. Recent work aims to expedite training by adapting training strategies. SmartMining [3], for instance, establishes approximate Nearest Neighbor Graphs (NNGs) grounded in instance-to-instance distance at the outset of each training epoch. However, because all examples are considered, constructing instance-to-instance level NNGs is resource-intensive. Wang et al. [22] suggest randomly sampling hard examples from hard classes determined by K-means clustering. Nonetheless, K-means can easily converge to a local optimum [28], implying that the hard classes obtained might not necessarily be the support classes. Suh et al. [23] procure hard classes from NNGs constructed on instance-to-class distances and then obtain hard examples through random sampling from these hard classes. As this operation is conducted at every iteration, it incurs enormous computational costs. To mitigate these shortcomings, GS [2] randomly samples a single example per class for NNG construction based on instance-to-instance distance. Despite its effectiveness, it is not without flaws. For instance, when a class encompasses tens of thousands of training examples, a single randomly selected example may represent the class poorly. Moreover, the selected hard examples, randomly sampled from the hard classes, may not be situated at the intersections between clusters, as discussed in Section 1. Hence, the hard examples obtained by such methods are non-support examples. Distinct from the aforementioned methods, our proposed Prototype-based Support Example Miner (pSEM) builds NNGs predicated on the prototypes (i.e., class-to-class distance). It identifies the support classes across the entire training set and then selects examples at the intersections between clusters, thus effectively mining support examples.
Online Triplet Mining. Schroff et al. [5] introduced OTM, which mines hard positives and hard negatives in each mini-batch based on the loss value to form triplets (anchor, hard positive, hard negative). These triplets are then utilized to update the model, while a substantial number of the remaining triplets do not participate in the update, significantly accelerating the model's convergence. Because Triplet Loss mines hard samples within a mini-batch rather than across the entire training set, truly hard samples may be missed or their hardness understated, a concern often described as falling into ‘local optimal solutions’. To address this, Ge [7] proposed constructing a hierarchical class tree encompassing all classes in the training set, thereby expanding the mining scope for more accurate hard sample identification. Cai et al. [25] proposed the Hard Exemplar Reweighting Triplet Loss, which assigns weights to triplets according to their difficulty. The t-Distributed Stochastic Triplet Embedding (t-STE) [29] algorithm refines data embeddings by enforcing similarity triplet constraints, drawing similar instances together and pushing dissimilar ones apart, thereby uncovering the genuine structure of the underlying data. Vasudeva et al. [8] presented LoOp, which can be employed to alleviate the biased embeddings caused by Triplet Loss. To the best of our knowledge, existing research has not addressed the problem of excessively loose clusters resulting from anchors being outliers. In response, we propose pTriplet Loss, which amalgamates outlier anchors and prototypes to generate a typical anchor near the cluster center, thereby alleviating this issue.
4. Methodology
Figure 3 illustrates the architecture of our proposed framework, comprising primarily the Prototype-based Support Example Miner (pSEM) and the Prototype-based Triplet Loss (pTriplet Loss). In Step 1, we utilize pSEM to mine support examples, which, in the context of pedestrian re-identification, refer to pedestrians with similar appearances. Initially, we derive a prototype for each class within the training set using the model updated at the end of the previous epoch. Subsequently, we identify support classes by constructing Nearest Neighbor Graphs (NNGs) based on these prototypes. Finally, we mine support examples from the intersections between clusters of support classes, which are then used to assemble the mini-batch. In Step 2, we refine the anchor using pTriplet Loss. For the current epoch, the input mini-batch is processed through the backbone network to obtain the feature vectors of the examples. pTriplet Loss detects outlier anchors and amends each one by fusing it with the historical normal anchor to generate a new, standard anchor. Ultimately, pTriplet Loss is combined with other loss functions (e.g., the cross-entropy loss) for back-propagation. Upon completion of the current epoch's training, this process is repeated until model convergence.
Formally, given a training set $\mathcal{T}$ consisting of $C$ classes, we extract the support example set $X_{sup}$ from $\mathcal{T}$. The pSEM identifies $X_{sup}$ on the training set prior to the commencement of each epoch and utilizes these examples to construct the mini-batch for subsequent model training. The mini-batch encompasses $M$ classes, with $N$ examples from each class. Additionally, pTriplet Loss combines the outlier anchor and the prototype to produce a new, standard anchor $\hat{a}$.
4.1. Prototype-Based Support Example Miner
The pSEM mines support examples at the intersections of clusters belonging to support classes on NNGs to construct mini-batches for model training.
Formulation of Nearest Neighbor Graphs (NNGs) for Support Class Determination. NNGs, which are prevalent graph structures, are formulated for all classes within the training set using their respective prototypes. Each prototype is depicted as a node, with edges connecting it to its ‘nearest neighbors’, which are ascertained via a distance metric such as cosine similarity. To elaborate, the cosine similarity between classes is computed using the class-specific prototype, conceived as the mean of the feature vectors of each class's instances:

$$ p_i = \frac{1}{k}\sum_{j=1}^{k} f(x_{i,j}), $$

where $f$ denotes the backbone, $i$ signifies the $i$-th class, and $k$ represents the number of examples in the $i$-th class. Moreover, $f(x_{i,j})$ is the feature vector of the $j$-th example in the $i$-th class, and $p_i$ represents the prototype of the $i$-th class.
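For illustration, the prototype computation can be sketched as follows (a minimal PyTorch-style sketch; the function name compute_prototypes and the tensor layout are assumptions made for this example rather than part of the method's specification):

```python
import torch

def compute_prototypes(features: torch.Tensor, labels: torch.Tensor) -> dict:
    """Compute one prototype per class as the mean of its feature vectors.

    features: (num_examples, dim) backbone outputs f(x_{i,j}).
    labels:   (num_examples,) integer class ids.
    Returns a dict mapping class id i -> (dim,) prototype tensor p_i.
    """
    prototypes = {}
    for cls in labels.unique():
        mask = labels == cls                                  # examples of class i
        prototypes[int(cls)] = features[mask].mean(dim=0)     # p_i = mean of f(x_{i,j})
    return prototypes
```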
For any class in the training set, we identify its nearest-neighbor classes by comparing the cosine distance between the prototype of this class and the prototypes of all other classes. Each class and its nearest neighbors, being the closest in the feature vector space and therefore challenging for the model to distinguish, are considered support classes and grouped into a set of support classes:

$$ S_i = \{ s_1, s_2, \ldots, s_l \}, $$

where $S_i$ represents the set of support classes consisting of the $i$-th class and its neighbors, $l$ is the size of $S_i$, and $s_i$ denotes the $i$-th support class. Thus, the NNGs are constructed as

$$ \mathcal{G} = (\mathcal{V}, \mathcal{E}). $$

In this case, $\mathcal{V}$ refers to all classes in the training set, and $\mathcal{E}$ refers to the set of edges. The neighborhood of class $i$ in $\mathcal{G}$, i.e., $S_i$, represents the support classes of class $i$.
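A minimal sketch of the support class determination, assuming the prototype dictionary produced by compute_prototypes above; the number of neighbors and the helper name build_support_classes are illustrative choices:

```python
import torch
import torch.nn.functional as F

def build_support_classes(prototypes: dict, num_neighbors: int = 2) -> dict:
    """For each class, find its nearest-neighbor classes by prototype cosine similarity.

    prototypes: dict mapping class id -> (dim,) prototype tensor.
    Returns a dict mapping class id i -> list of support-class ids S_i
    (the class itself plus its nearest neighbors), i.e., the edges of the NNG.
    """
    class_ids = sorted(prototypes.keys())
    proto_mat = torch.stack([prototypes[c] for c in class_ids])   # (C, dim)
    proto_mat = F.normalize(proto_mat, dim=1)
    sim = proto_mat @ proto_mat.t()                               # pairwise cosine similarity
    sim.fill_diagonal_(float('-inf'))                             # exclude self-matches

    support = {}
    for row, cls in enumerate(class_ids):
        neighbors = sim[row].topk(num_neighbors).indices.tolist()
        support[cls] = [cls] + [class_ids[n] for n in neighbors]  # S_i = {i} plus neighbors
    return support
```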
Mining Support Examples. To identify support examples, we explore the intersections between clusters of support classes on $\mathcal{G}$. Specifically, for any two support classes that are nearest neighbors, we compute the midpoint vector $v_{ab}$ of their prototypes:

$$ v_{ab} = \frac{p_a + p_b}{2}, $$

where $p_a$ and $p_b$ denote the prototypes of support class $a$ and support class $b$, respectively.

We consider the sphere centered at $v_{ab}$ with radius $r$ as the region encompassing the intersection between the clusters of the two support classes. Subsequently, we calculate the cosine distance between $v_{ab}$ and each example within the support classes. Ultimately, every example with a distance less than the threshold $r$ is selected as a support example:

$$ X_{sup} = \bigl\{ x \mid x \in X_a \cup X_b,\; 1 - \cos\bigl(v_{ab}, f(x)\bigr) < r \bigr\}, $$

where $X_i$ represents the set of examples corresponding to the $i$-th class and $X_{sup}$ represents the set of support examples. By adjusting the value of $r$, we can effectively select support examples of varying degrees of difficulty; refer to the hyper-parameter sensitivity analysis in Section 5.8.3 for further details.
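The selection of examples around a cluster intersection can be sketched as below, assuming per-class feature tensors and prototypes in the notation above; the default threshold value and the function name mine_support_examples are illustrative:

```python
import torch
import torch.nn.functional as F

def mine_support_examples(feats_a: torch.Tensor, feats_b: torch.Tensor,
                          proto_a: torch.Tensor, proto_b: torch.Tensor,
                          r: float = 0.3):
    """Select examples of two neighboring support classes that fall near the
    midpoint of their prototypes (the cluster intersection).

    feats_a / feats_b: (n_a, dim) / (n_b, dim) feature vectors of the two classes.
    proto_a / proto_b: (dim,) prototypes p_a and p_b.
    Returns boolean masks over feats_a and feats_b marking support examples.
    """
    midpoint = (proto_a + proto_b) / 2                         # v_ab

    def within_sphere(feats):
        cos = F.cosine_similarity(feats, midpoint.unsqueeze(0), dim=1)
        return (1.0 - cos) < r                                 # cosine distance below threshold r

    return within_sphere(feats_a), within_sphere(feats_b)
```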
4.2. Prototype-Based Triplet Loss
The aim of pTriplet Loss is to enhance the efficacy of Triplet Loss by identifying and rectifying outlier anchors.
Identification of Outlier Anchors. We compute the cosine distance $d$ between each example in a cluster and its prototype, defining the examples whose distance $d$ exceeds the threshold $\epsilon$ as outlier anchors, while the remaining examples are treated as normal anchors:

$$ \mathcal{A}_{out} = \{ x \mid d(f(x), p_i) > \epsilon \}, \qquad \mathcal{A}_{norm} = \{ x \mid d(f(x), p_i) \le \epsilon \}. $$

In this case, $\mathcal{A}_{out}$ and $\mathcal{A}_{norm}$ correspond to the outlier anchor set and the set of normal anchors, respectively.
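A minimal sketch of how outlier and normal anchors could be separated within a mini-batch, using cosine distance to the class prototype; the threshold argument eps stands in for the unspecified threshold and is illustrative:

```python
import torch
import torch.nn.functional as F

def split_anchors(features: torch.Tensor, prototype: torch.Tensor, eps: float = 0.5):
    """Split the examples of one class into outlier and normal anchors.

    features:  (n, dim) feature vectors of the class within the mini-batch.
    prototype: (dim,) current prototype p_i of the class.
    Returns (outlier_mask, normal_mask): cosine distance above eps -> outlier anchor.
    """
    d = 1.0 - F.cosine_similarity(features, prototype.unsqueeze(0), dim=1)
    outlier_mask = d > eps          # A_out
    normal_mask = ~outlier_mask     # A_norm
    return outlier_mask, normal_mask
```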
Prototype Updating. In each step of the model learning process, we employ an exponentially weighted average, a weighted averaging strategy [26]. This involves initializing an average value and a decay factor $\beta$; at each step, a new value is obtained and combined with the previous average using the decay factor. The exponentially weighted average assigns decreasing weights to values over time, giving more importance to recent values. Because the neural network improves as training proceeds, the normal anchors obtained in the current step provide increasingly accurate representations; we therefore fuse the normal anchors and the prototype within the mini-batch to update the prototype vector and capture more precise representations:

$$ p_i^{t} = \beta \, p_i^{t-1} + (1 - \beta)\, f(x_{i,j}), \quad x_{i,j} \in \mathcal{A}_{norm}, $$

where $f$ represents the backbone, $i$ refers to the $i$-th class, $j$ signifies the $j$-th example, and $f(x_{i,j})$ is the feature vector of the normal anchor $x_{i,j}$. The term $p_i^{t}$ is the prototype of the $i$-th class after $t$ iterations, while $\beta$ is the adjustment factor controlling the proportion of $p_i^{t-1}$ in the updated prototype $p_i^{t}$. This prevents the proportion of the prototype from becoming too large, which could result in over-compaction of clusters and over-fitting. The prototype derived via pSEM prior to each epoch serves as the initial value $p_i^{0}$ for the current epoch.
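The per-step prototype update can be sketched as follows; averaging the batch's normal anchors before mixing is an illustrative simplification of the per-example update, and beta plays the role of the adjustment factor:

```python
import torch

def update_prototype(prototype: torch.Tensor, normal_feats: torch.Tensor,
                     beta: float = 0.9) -> torch.Tensor:
    """Exponentially weighted average update of a class prototype.

    prototype:    (dim,) prototype p_i^{t-1} from the previous step.
    normal_feats: (n, dim) feature vectors of the normal anchors of this class
                  in the current mini-batch.
    beta:         adjustment factor controlling the proportion of the old prototype.
    Returns the updated prototype p_i^t.
    """
    if normal_feats.numel() == 0:           # no normal anchors for this class in the batch
        return prototype
    new_value = normal_feats.mean(dim=0)    # current-step estimate from the normal anchors
    return beta * prototype + (1.0 - beta) * new_value
```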
Outlier Anchor Correction. By employing the exponentially weighted average formula, we generate a corrected normal anchor, denoted as $\hat{a}$, through the weighted averaging of the updated prototype $p_i^{t}$ and a detected outlier anchor $a \in \mathcal{A}_{out}$, where the prototypes are derived from historical normal anchors. The correction process imbues the outlier anchors with the intrinsic structure characteristic of normal anchors, a structure that encompasses inherent patterns, similarities, and clusters within the data. Consequently, this process moves outlier anchors closer to both normal anchors and cluster centers, enabling better alignment with the inherent structure of the data. As a result, this methodology markedly enhances the representation quality of outlier anchors within the intrinsic structure.
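The correction step can be sketched in the same style, assuming the updated prototype and the outlier anchors detected above; reusing the same mixing factor beta is an illustrative choice:

```python
import torch

def correct_outlier_anchors(outlier_feats: torch.Tensor, prototype: torch.Tensor,
                            beta: float = 0.9) -> torch.Tensor:
    """Pull outlier anchors toward the (historically updated) prototype.

    outlier_feats: (n_out, dim) feature vectors of the detected outlier anchors.
    prototype:     (dim,) updated prototype p_i^t of the class.
    Returns corrected anchors that replace the outliers in the triplet loss.
    """
    # Weighted average of the prototype and each outlier anchor.
    return beta * prototype.unsqueeze(0) + (1.0 - beta) * outlier_feats
```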
Formulation of pTriplet Loss. Given the above discussion, the Prototype-based Triplet Loss can be formalized as follows:

$$ \mathcal{L}_{\mathrm{pTriplet}} = \frac{1}{B} \sum_{i=1}^{B} \Bigl[ D\bigl(a_i, f_i^{p}\bigr) - D\bigl(a_i, f_i^{n}\bigr) + m \Bigr]_{+}, $$

where $D$ represents the Euclidean distance, and $f_i^{p}$ and $f_i^{n}$ are the features of the positive and negative examples, respectively. The anchor $a_i$ is either the corrected outlier anchor $\hat{a}$ or a normal anchor from $\mathcal{A}_{norm}$, and $m$ represents the margin between positive and negative pairs. The batch size is $B = M \times N$. pTriplet Loss seeks to minimize the distance between the anchor and the positive example and to maximize the distance between the anchor and the negative example during training within the mini-batch.
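Putting the pieces together, the following is a minimal sketch of a batch-hard style pTriplet Loss in PyTorch, assuming that outlier anchors have already been replaced by their corrected versions; the batch-hard positive/negative selection is one common instantiation of online triplet mining and is not necessarily the exact variant used here:

```python
import torch

def ptriplet_loss(anchors: torch.Tensor, features: torch.Tensor,
                  labels: torch.Tensor, margin: float = 0.3) -> torch.Tensor:
    """Triplet loss with corrected anchors and batch-hard mining.

    anchors:  (B, dim) anchor features; outlier anchors are assumed to have been
              replaced by their corrected versions (see correct_outlier_anchors).
    features: (B, dim) features serving as positive/negative candidates.
    labels:   (B,) class ids.
    """
    dist = torch.cdist(anchors, features)                      # (B, B) Euclidean distances D
    same = labels.unsqueeze(0) == labels.unsqueeze(1)          # positive-pair mask

    # Hardest positive: farthest example of the same class.
    d_pos = dist.masked_fill(~same, float('-inf')).max(dim=1).values
    # Hardest negative: closest example of a different class.
    d_neg = dist.masked_fill(same, float('inf')).min(dim=1).values

    return torch.relu(d_pos - d_neg + margin).mean()
```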