1. Introduction
The rapid progress of deep learning in computer vision, natural-language processing, and various other fields is primarily attributed to advancements in hardware computing capabilities and the evolution of big-data technology. Presently, cloud-based services [1,2,3], which harness high computing power and big data, represent the primary form of artificial intelligence. Nevertheless, in the era of the Internet of Everything, smart terminal applications, which must adapt to complex and dynamic real-world environments [4,5], often rely on limited power and a scarcity of training samples [6]. Maintaining high performance in Few-Shot Learning scenarios is a crucial aspect of deep-learning technology on its path toward Artificial General Intelligence (AGI).
Classical Few-Shot Learning (FSL) models involve pretraining on a base set and subsequent evaluation on a novel set. Conventionally, these two sets are subsets derived from the same dataset, with non-overlapping categories [7,8,9,10,11]. This results in similar image styles, minimal domain distances, and comparable classification difficulties. In practical scenarios, however, the data distribution of downstream tasks and the difficulty of classification can be highly unpredictable. As a result, models may encounter significant performance degradation when confronted with cross-domain data [11,12]. Studies [11,13] have highlighted that the performance of FSL methods [7,8,14] is often inferior to traditional batch-trained models in cross-domain scenarios. Therefore, investigating Cross-Domain FSL (CD-FSL) not only enhances our understanding of the fundamentals of deep model generalization but also contributes to advancing the practical applications of Few-Shot Learning [15].
Whether a model can generalize to target data and effectively solve target tasks is determined by the distance between the target domain and the source domain [16,17]. A smaller distance indicates a better representation of features for the target task, resulting in good classification performance. Conversely, a larger distance leads to poorer feature representation, making it challenging to directly apply the model to target tasks. This phenomenon arises because models are designed to maximize the discrimination between source categories. Consequently, these models are highly sensitive to the features of classes within the source domain [18], which we term "source class features". This sensitivity causes the feature vectors to exhibit significant magnitudes in source class-feature directions and minimal magnitudes in others when extracting target domain features. This results in a non-uniform feature space, leading to performance degradation in cross-domain scenarios.
Furthermore, the difficulty of target tasks plays a crucial role in determining the model's performance [16,17]. If the categories in the target dataset are linearly separable, an optimal decision surface exists. The degree of linear separability in the target data establishes the upper limit for the model's performance in FSL tasks. When the target data have poor linear separability, incorporating more training data is necessary to globally finetune the model to the target task.
Therefore, the success of FSL is primarily constrained by the domain distance from the source dataset and the difficulty of the target dataset. By measuring and partitioning target datasets based on these two dimensions and employing tailored strategies for different tasks in a divide-and-conquer manner, the model's ability can be enhanced. In the study of domain distance metrics, Xu et al. [12] introduced a definition based on images and their marginal distributions, laying the foundation for subsequent research. Zhang et al. [19] proposed two measurement methods: one based on the support set and another based on the mean vector of the dataset. However, these methods lacked comprehensive information on data distribution. The metric proposed by Oh et al. [17] utilized class labels that did not conform to the definitions put forward by Xu et al. [12]. Therefore, current research lacks reliable quantitative metrics to perform a Divide-and-Conquer Strategy. In this paper, we introduce novel quantitative metrics to assess domain distance and task difficulty. Domain distance is evaluated by computing the difference in covariance matrices between target and source data. Task difficulty is determined by fitting the target data with a linear classifier and calculating the global linear separability among different categories.
Figure 1 illustrates the domain distance and difficulty of 15 target datasets on the Meta-Dataset [20] and BSCD-FSL [16] benchmarks, using the base set of mini-ImageNet [7] as the source dataset. The horizontal axis represents domain distance, while the vertical axis indicates difficulty. The detailed methodology for these metrics is outlined in Section 3.
Based on the distribution of target datasets depicted in Figure 1, we categorize them into three groups and devise a strategy for each:
Near-Domain Tasks: For tasks in the near domain, the target-data distribution closely resembles the source distribution. The class features learned from the source domain remain crucial in the target tasks. In such cases, the classical FSL algorithm can be directly applied, as it effectively leverages these shared features.
Far-Domain Low-Difficulty Tasks: In far-domain tasks, the class features learned from the source domain are significantly attenuated for the target tasks. It becomes essential to mine information embedded in non-class features. To address this challenge, we propose a whitened-PCA method aimed at reconstructing an isotropic feature space. This approach normalizes the magnitude of all principal components in the high-dimensional space, balancing the contribution of each component. This method enhances performance for all far-domain tasks without any additional cost, as it neither introduces new data nor requires additional parameters.
Far-Domain High-Difficulty Tasks: For tasks in the far domain with high difficulty, the model no longer achieves linear separability with respect to the target dataset. To adapt the model to these complex tasks, we employ global finetuning techniques. This approach allows the model to learn task-specific features and adapt its decision boundaries to better fit the target-data distribution.
The proposed Divide-and-Conquer Strategy (DCS) outlined in this paper is visualized in Figure 2. This approach involves dividing various target tasks based on domain distance and task difficulty and formulating tailored feature optimization strategies to enhance the model's performance in FSL. Our primary contributions are three-fold:
(1) We introduce a quantitative metric for assessing domain distance and task difficulty within target datasets. This metric facilitates the categorization of target tasks into three distinct groups: near-domain tasks, far-domain low-difficulty tasks, and far-domain high-difficulty tasks.
(2) We present the DCS framework, which is tailored to address the challenges associated with different target task categories. For near-domain tasks, we employ classical FSL algorithms. For far-domain tasks, we introduce a whitened-PCA method to enhance the feature representation of data from distant domains. Furthermore, for far-domain tasks with high difficulty, we leverage a limited amount of labeled data to globally finetune the model, thereby improving its alignment with the target tasks.
(3) Our experimental results, conducted on 15 target datasets, demonstrate the compatibility and efficacy of DCS when combined with classical FSL algorithms. These comprehensive outcomes confirm the performance improvements and offer a novel approach for future research in few-shot learning.
The remainder of this paper is organized as follows. Section 2 briefly reviews the related works on CD-FSL, feature post-processing, and domain metrics, highlighting our innovations and contributions. Section 3 presents the problem definition and introduces methods for measuring domain distance and difficulty; it also delineates detailed learning strategies tailored for near-domain, far-domain low-difficulty, and far-domain high-difficulty tasks based on these metrics. Section 4 conducts measurements of domain distance and difficulty across 15 target datasets, comprehensively evaluates the effectiveness of the DCS through experiments, and provides an analysis and discussion of the experimental results. Finally, Section 5 draws the main conclusions.
2. Related Works
In this paper, we propose metrics for domain distance and task difficulty, enabling the categorization of target datasets from different domains. Based on the categorization results, we have designed a Divide-and-Conquer Strategy for Few-Shot Learning to address tasks with varying domain distances and difficulties. Additionally, for far-domain tasks, our proposed whitened PCA serves as a feature post-processing method, constructing an isotropic feature space without increasing training data or computational costs. The topics covered in this work include CD-FSL, feature post-processing, and domain metrics, which will be introduced separately below.
FSL & CD-FSL: Few-shot learning (FSL) aims to rapidly acquire the ability to recognize novel categories with minimal training samples, focusing on the recognition problem across different classes within the same domain. It primarily encompasses three approaches: optimization-based methods [21,22,23], metric-based methods [8,9,10,24,25,26], and transfer learning-based methods [11,18,27,28]. Among these, optimization-based and metric-based methods introduce meta-training or episode-training to maintain a consistent paradigm between training and evaluation. Despite the remarkable progress and achievements made by FSL, it often encounters a decline in effectiveness when confronted with cross-domain tasks [12]. This challenge has spurred the emergence of Cross-Domain FSL (CD-FSL) methodologies, encompassing three primary approaches: instance-guided approaches, parameter-based approaches, and feature post-processing approaches. Instance-guided approaches [17,29,30,31] aim to guide the model towards acquiring cross-domain generalization abilities by incorporating a diverse array of target domain samples. These methods typically necessitate extensive training with auxiliary unlabeled samples from the target domain. Parameter-based approaches [13,32,33] focus on optimizing the model's parameters to reduce the hypothesis space, thereby minimizing the number of training samples required. Feature post-processing approaches [19,34,35,36,37,38,39] endeavor to derive a feature mapping function capable of transferring features from the source domain to the target domain. This facilitates rapid adaptation of features to the task, resulting in enhanced feature representation. The far-domain strategy proposed in this paper falls under the category of feature post-processing methods.
Feature post-processing approaches: Shallow features often possess stronger migration capabilities compared to deep features. As a result, the fusion of features extracted from various layers is typically advantageous in boosting the model's generalization ability while preserving high-level semantic information. CHEF [35] achieved this by unifying multiple abstraction layers of a neural network into a cohesive feature representation. Zou et al. [38] took a different approach, learning distinctive information unique to each sample by merging intermediate layers of features. Du et al. [34] designed a hierarchical prototype network, integrating information from each layer into the final prototype features. Furthermore, techniques that involve reweighting the feature vector output from the network can enhance feature representation in the target domain. MemREIN [36], for instance, proposed a method that combines memorization, restitution, and instance regularization to mitigate cross-domain incompatibilities of features. Li et al. [37] introduced a nonlinear subspace and hyperbolic tangent transformation technique to minimize task-irrelevant features while preserving migratory dissimilarity features. Song et al. [18] designed a class-feature subspace-mapping method based on the centroid vectors of base classes, which effectively enhanced the model's performance on near-domain data. CIM [39] offers a straightforward feature transformation function, aiming to compress feature components with large amplitudes and expand those with smaller amplitudes, facilitating global adaptation to all target domains. The far-domain strategy presented in this paper aligns with the feature reweighting approach. However, our method distinguishes itself by utilizing Principal Component Analysis (PCA) in the source domain to derive weighting coefficients for different directions. This allows us to normalize the magnitude of each component, thereby uncovering cross-domain information implicitly embedded within the features.
Domain Metric: Guo et al. [16] introduced perspective distortion, semantic contents, and color depth as criteria for distinguishing domain differences. These criteria, however, heavily rely on subjective researcher perceptions and lack objective, quantitative measurements. Zhang et al. [19] proposed two methods for measuring domain distance: Wasserstein Distance for Measuring Domain Shift (WDMDS) and Maximum Mean Discrepancy for Measuring Domain Shift (MMDMDS). WDMDS is computationally intensive and limited to small sample sizes, making it unsuitable for evaluating entire datasets. MMDMDS assesses differences between the mean features of two domains, overlooking distributional information within each dataset. Oh et al. [17] evaluated target datasets based on domain similarity and FSL difficulty. They employed the Earth Mover's Distance to measure distributional disparities between source and target domain prototypes. This approach, however, fails to capture marginal distributions due to its reliance on class labels. Furthermore, its definition of difficulty is solely based on the target dataset, overlooking the importance of the source domain perspective.
3. Main Approaches
3.1. Problem Definition
Before delving into the specifics of the Divide-and-Conquer Strategy, it is essential to first establish the task construction and symbolic representation of CD-FSL. Following the conventions in FSL literature [8,9,10,18,24,39,40], the model undergoes pretraining on a large-scale source dataset $\mathcal{D}_S$ to accumulate prior knowledge. Once this pretraining is complete, the model's performance is evaluated through a series of FSL tasks sampled from the target dataset $\mathcal{D}_T$.
The FSL task construction involves randomly selecting N classes from $\mathcal{D}_T$, with each class containing K labeled training samples and q test samples. This forms a support set $\mathcal{S}$ and a query set $\mathcal{Q}$, which constitutes an N-way K-shot task. It is important to note that the source dataset $\mathcal{D}_S$ and the target dataset $\mathcal{D}_T$ have non-overlapping classes and exhibit distinct data distributions.
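To make the task construction concrete, the following minimal sketch samples one N-way K-shot task from a pool of pre-extracted features; the array-based dataset layout and the function name are illustrative, not part of the original implementation:

```python
import numpy as np

def sample_episode(features, labels, n_way=5, k_shot=1, q_query=15, rng=None):
    """Sample one N-way K-shot task (support/query split) from a target dataset.

    features: (num_samples, d) array of extracted feature vectors.
    labels:   (num_samples,) array of integer class labels.
    """
    rng = rng or np.random.default_rng()
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    support_x, support_y, query_x, query_y = [], [], [], []
    for new_label, c in enumerate(classes):
        # Draw K support and q query samples per selected class.
        idx = rng.permutation(np.flatnonzero(labels == c))[: k_shot + q_query]
        support_x.append(features[idx[:k_shot]])
        query_x.append(features[idx[k_shot:]])
        support_y += [new_label] * k_shot
        query_y += [new_label] * q_query
    return (np.concatenate(support_x), np.array(support_y),
            np.concatenate(query_x), np.array(query_y))
```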
3.2. Metric Approaches
The challenge in CD-FSL stems from the diversity of the target-data distribution, which involves varying degrees of domain distances and task difficulties. Historically, researchers' assessments of domain distance and task difficulty have been largely influenced by personal subjective perceptions of the data, lacking objective and quantitative metrics [19]. Most FSL methods demonstrate effectiveness primarily on near-domain or partially far-domain data. Consequently, there is a pressing need to develop a metric that can objectively quantify the domain distance and task difficulty of the target dataset relative to the source data. Such a metric would enable a more precise understanding of the distributional differences between the source and target data, thereby informing the design of more effective CD-FSL strategies.
According to the definition provided by Xu et al. [12], the domain of a dataset is solely determined by the image samples and their marginal distributions, independent of the class labels. Conversely, task difficulty is dictated by the class labels and their conditional distributions, which correspond to the degree of linear separability of the samples in the feature space. This definition effectively decouples the target dataset into two dimensions: domain distance and task difficulty.
3.2.1. Domain Distance
The discrepancy in data distribution between the target and source domains is a crucial factor that can significantly impact model performance, commonly referred to as domain distance. A smaller domain distance indicates a higher similarity in data distributions, implying that source class features remain relevant in the target domain. Conversely, a larger domain distance signifies greater variability in data distributions, diminishing the utility of source class features and potentially turning them into interference. Quantifying this difference in data distribution is instrumental in assessing the significance of source class features in the target domain, facilitating feature optimization.
The objective of the domain distance metric is to capture the direction and shape of the distributions of target datasets, thereby facilitating feature transformation within the Divide-and-Conquer Strategy. This PCA-based transformation is highly sensitive to the directions of data distributions in the feature space. Identifying the disparities in distribution directions between the target and source datasets is crucial for selecting an appropriate feature mapping strategy. Since the covariance matrix serves as an effective representation of the direction and shape of data distributions, we define domain distance by calculating the difference between the covariance matrices of the data from the two domains, computed as follows:

$$\mathrm{dist}(\mathcal{D}_S, \mathcal{D}_T) = \frac{\lVert \Sigma_T - \Sigma_S \rVert_F}{d}, \tag{1}$$

where $\Sigma_S$ and $\Sigma_T$ are the covariance matrices of $\mathcal{D}_S$ and $\mathcal{D}_T$, $d$ is the dimension of the feature space, and $\lVert \cdot \rVert_F$ is the Frobenius norm of the matrix. The numerator characterizes the difference between the two distributions. The denominator, with $d$ as a scaling factor, achieves normalization across spatial dimensions. In this formula, only the distribution of image features is considered, regardless of their class labels.
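Assuming features have already been extracted by the pretrained backbone, Equation (1) reduces to a few lines of NumPy; the sketch below is a direct transcription of the formula:

```python
import numpy as np

def domain_distance(feat_src, feat_tgt):
    """Equation (1): Frobenius distance between the feature covariance
    matrices of the two domains, normalized by the feature dimension d.
    Class labels are not used."""
    d = feat_src.shape[1]
    cov_src = np.cov(feat_src, rowvar=False)  # (d, d) source covariance
    cov_tgt = np.cov(feat_tgt, rowvar=False)  # (d, d) target covariance
    return np.linalg.norm(cov_tgt - cov_src, ord="fro") / d
```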
It is important to highlight that the proposed metric reflects the perspective of the source domain for a given model. It cannot directly capture the distance between any two arbitrary target domains. For instance, even if two separate target domains, $\mathcal{D}_{T_1}$ and $\mathcal{D}_{T_2}$, exhibit the same distance to $\mathcal{D}_S$, it does not imply that $\mathcal{D}_{T_1}$ and $\mathcal{D}_{T_2}$ are proximate to each other; in fact, their actual distance can be significant. Furthermore, variations in network architectures can lead to disparities in generalization abilities, ultimately resulting in diverse metric outcomes.
3.2.2. Difficulty
Different datasets often have varying classification criteria and granularity, leading to diverse challenges for models when applied to the target domains. In few-shot scenarios, finetuning the model can often result in significant overfitting. Hence, it is essential for the model to possess a fundamental capability for linear separability regarding target tasks. The level of linear separability exhibited by the model on the target dataset dictates the upper limit of the few-shot classification ability. A lower degree of linear separability corresponds to increased difficulty. To quantify the difficulty of $\mathcal{D}_T$, we append a learnable linear layer to the end of the parameter-frozen feature extractor and use it to fit the entire $\mathcal{D}_T$ via gradient descent. Subsequently, we perform inference on all samples in $\mathcal{D}_T$ to obtain the classification accuracy $\mathrm{Acc}(\mathcal{D}_T)$. The task difficulty of $\mathcal{D}_T$ is then defined by the following formula:

$$\mathrm{diff}(\mathcal{D}_T) = \frac{-\log \mathrm{Acc}(\mathcal{D}_T)}{\log C}, \tag{2}$$

where $\mathrm{Acc}(\cdot)$ is the operator to calculate the classification accuracy and $C$ denotes the number of classes in $\mathcal{D}_T$. The denominator represents the total information content determined by the number of classes, while the numerator signifies the loss of information content resulting from misclassification. The ratio between these two quantities reflects the proportion of information loss incurred by the model in recognizing target data. Evidently, a higher ratio of information loss indicates greater difficulty.
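A minimal sketch of this metric is shown below. For brevity, it substitutes scikit-learn's logistic regression for the gradient-descent-trained linear layer described above; this substitution and the `max_iter` setting are implementation assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def task_difficulty(feat_tgt, labels_tgt):
    """Equation (2): fit a linear probe on the frozen features of the whole
    target dataset, then measure accuracy on the same samples.
    difficulty = -log(acc) / log(C); 0 when perfectly separable,
    1 at chance-level accuracy."""
    clf = LogisticRegression(max_iter=1000).fit(feat_tgt, labels_tgt)
    acc = clf.score(feat_tgt, labels_tgt)
    num_classes = len(np.unique(labels_tgt))
    return -np.log(acc) / np.log(num_classes)
```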
The difficulty of a target task is influenced by the model’s capability for linear division of target data. In cases of low-difficulty tasks, linear optimization techniques can often refine the feature representation and bolster the efficacy of few-shot classification. Conversely, for high-difficulty tasks, linear methods fall short in addressing the problem fundamentally, which is attributed to the model’s inadequate representation of target data. Global finetuning serves to augment the model’s nonlinear representation abilities and enhance its suitability for target tasks.
The backbone and source data play a pivotal role in determining the metric values (as detailed in Section 4.4). To ensure meaningful results, it is crucial to use the same source data and backbone for training, metric computation, and few-shot classification when implementing our DCS. Typically, the initial step involves selecting appropriate source data and pretraining the network on them. Following this, $\mathcal{D}_T$ is evaluated in terms of domain distance and task difficulty using the pretrained model. Finally, the strategy outlined below is implemented based on the results obtained from the metrics.
3.3. Near-Domain Strategy
The distribution of near-domain data closely resembles that of the source data, with source class features exerting a positive influence on the target data. Therefore, traditional FSL algorithms, such as ProtoNet [8], DeepBDC [9], and FRN [10], inherently serve as excellent models for implementing the near-domain strategy. However, these methods either employ episodic training or introduce multiple iterations of self-distillation, rendering the training process complex and time-consuming. In this paper, we introduce a simple class-feature subspace (CFS-space) [18] mapping technique for batch-trained models (e.g., RFS), serving as an efficient implementation of the near-domain strategy.
Models that are pretrained on the source dataset exhibit an inherent sensitivity towards source class features. This sensitivity manifests in the directional nature of the source data distribution within the feature space. When Principal Component Analysis (PCA) is applied to this distribution, a notable trend emerges: the variance ratio decreases at a rapid pace as the principal component order increases. For instance, within the base set of mini-ImageNet, the first 61 principal components account for 95% of the variance ratio, which closely corresponds to the number of base classes. This observation implies that the source data are primarily distributed within a low-dimensional subspace. By mapping near-domain data into this subspace, it becomes possible to eliminate non-class principal components.
Suppose the feature matrix of the source features is denoted as $F_S \in \mathbb{R}^{m \times d}$, and that of the target features is denoted as $F_T \in \mathbb{R}^{n \times d}$. Here, m and n represent the number of feature vectors in the corresponding domains, respectively, while d denotes the dimension of the feature space, which satisfies the condition $d \ll m$. The Principal Component Analysis (PCA) of the source data can be computed using the following formulas:

$$\bar{F}_S = F_S - \mathbf{1}_m \mu^\top, \tag{3}$$

$$U, \Sigma, V = \mathrm{SVD}\big(\bar{F}_S^\top\big), \tag{4}$$

where $\mu$ is the average vector of $F_S$, $\mathrm{SVD}(\cdot)$ is the operation of singular value decomposition, and $U$, $\Sigma$, and $V$ denote the singular decomposition of $\bar{F}_S^\top$, i.e., $\bar{F}_S^\top = U \Sigma V^\top$. The diagonal elements of $\Sigma$ represent the singular values of $\bar{F}_S$, which correspond to the standard deviations of the principal components. Subsequently, we select the first k principal directions, i.e., the first k columns of $U$, denoted as $U_k \in \mathbb{R}^{d \times k}$, and employ the following formulas to execute the feature mapping of both the source and target features:

$$Z_S = \big(F_S - \mathbf{1}_m \mu^\top\big) U_k, \tag{5}$$

$$Z_T = \big(F_T - \mathbf{1}_n \mu^\top\big) U_k, \tag{6}$$

where $Z_S \in \mathbb{R}^{m \times k}$ and $Z_T \in \mathbb{R}^{n \times k}$ represent the projections of the source and target samples in the CFS-space, and m and n denote the number of source and target samples, respectively.
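The following sketch implements Equations (3)-(6) with NumPy's SVD; the economy-size decomposition and the helper names are implementation choices, not part of the original code:

```python
import numpy as np

def fit_pca(feat_src):
    """Equations (3) and (4): center the source features (m x d) and take
    the SVD of the transposed matrix; columns of U are principal directions."""
    mu = feat_src.mean(axis=0)                  # average vector of F_S
    U, s, _ = np.linalg.svd((feat_src - mu).T,  # (d x m) centered matrix
                            full_matrices=False)
    return mu, U, s                             # U: (d, d), s: (d,)

def map_to_cfs(feat, mu, U, k):
    """Equations (5)/(6): project centered features onto the first k
    principal directions, i.e., into the class-feature subspace."""
    return (feat - mu) @ U[:, :k]
```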
For the mapped features $Z_T$ in multi-shot tasks, we adopt a simple Logistic Regression (LR) classifier to adapt to the support set and directly apply it for predicting the query samples. However, for 1-shot tasks, the representativeness of features can be further enhanced by incorporating appearance-similar samples from the source dataset and performing feature fusion on the prototypes [18,27]. This is attributed to the fact that a single training sample is prone to interference from intra-class variations and has a high likelihood of deviating from the true class center. Mapping features to CFS-space not only mitigates randomness in the direction of non-class features but also, in conjunction with feature fusion, facilitates feature approximation toward the class center. Specifically, we search for appearance-similar samples in the source domain in order to calibrate the prototypes. To this end, for the n-th class in the given task, we search for the p nearest neighbors of the prototype $c_n$ within $Z_S$, denoted $\{z_j\}_{j=1}^{p}$. Subsequently, we merge these features using the formula below:

$$\hat{c}_n = \frac{1}{p+1} \Big( c_n + \sum_{j=1}^{p} z_j \Big), \tag{7}$$

where $c_n$ is the prototype of class n, $z_j$ is one of the p nearest neighbors of $c_n$, and $\hat{c}_n$ is the calibrated prototype, which facilitates the model to fit and acquire a more optimal decision boundary.
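A sketch of this calibration, consistent with Equation (7) as reconstructed above; the default value of p and the brute-force nearest-neighbor search are illustrative:

```python
import numpy as np

def calibrate_prototype(proto, z_src, p=2):
    """Average a 1-shot prototype with its p nearest source neighbors in
    CFS-space (Equation (7)). proto: (k,); z_src: (m, k) mapped source set."""
    dists = np.linalg.norm(z_src - proto, axis=1)
    neighbors = z_src[np.argsort(dists)[:p]]  # p appearance-similar samples
    return (proto + neighbors.sum(axis=0)) / (p + 1)
```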
Our near-domain strategy can be directly applied to pretrained models downloaded from the community, facilitating researchers and engineers to implement few-shot classification with zero training.
3.4. Far-Domain Strategy
The sensitivity of the model to source domain class features is primarily characterized by the principal components of the source data. Generally, the first k components account for a significant portion of the variance ratio. As the component order increases, the variance ratio decreases rapidly. The fact that the first k principal components effectively represent the class attribution of the source data leads the model to exhibit a preference toward certain feature directions. When extracting features, the model automatically amplifies the magnitude in these preferred directions while compressing it in others, thereby increasing the inter-class variance of the source data. However, for far-domain datasets, the data distributions differ significantly. By treating directions differently, the model not only fails to extract valuable information but also introduces source domain bias into target tasks. Therefore, optimizing the representation of target data necessitates mitigating the impact of source domain bias.
In Section 3.3, Equations (3) and (4) present the method for performing PCA using samples from the source domain. For far-domain tasks, to eliminate the feature bias of the source domain, the principal components are normalized by dividing them by their corresponding standard deviations, thereby ensuring that the feature space becomes isotropic. To achieve this, we take the first d singular values of $\Sigma$ to construct a diagonal square matrix $\Sigma_d \in \mathbb{R}^{d \times d}$. Subsequently, we apply the following transformation to the target domain matrix:

$$Z_T^{w} = \big(F_T - \mathbf{1}_n \mu^\top\big)\, U\, \Sigma_d^{-1}. \tag{8}$$

This approach, referred to as whitened PCA, effectively reduces the model's sensitivity to source class features. As a result, the previously suppressed principal components are amplified back, leading to an enhanced representation of far-domain data.
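Reusing `fit_pca` from the sketch in Section 3.3, the whitening transform of Equation (8) is one line; the small `eps` guard against near-zero singular values is an added safeguard, not part of the formula:

```python
import numpy as np

def whitened_pca(feat_tgt, mu, U, s, eps=1e-8):
    """Equation (8): project target features onto all d principal directions
    and rescale each component by the inverse singular value, producing an
    isotropic feature space."""
    return (feat_tgt - mu) @ U / (s + eps)  # 1/s broadcasts over columns
```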
In the computation of whitened PCA, the most time-consuming step is Equation (4), which has a computational complexity of $O(md^2)$. $\Sigma_d$ is a diagonal matrix, so the complexity of its inversion is negligible. Equation (8) involves two matrix multiplications, with computational complexities of $O(nd^2)$ and $O(nd)$, respectively. Given that $m, n \gg d$ (i.e., the number of samples significantly exceeds the dimensionality), the overall computational complexity of whitened PCA is primarily determined by $O(md^2)$.
3.5. Finetuning Strategy for Difficult Tasks
Certain target datasets, such as ChestX [41], present significant challenges. The poor linear separability on this dataset greatly limits the few-shot classification performance of the model. This can be attributed to two primary factors: (1) the lack of distinctive features in the source data that adequately represent the target classes; (2) the absence of mutual information between the source and target classes [42], leading the model to disregard these features entirely. To address this, finetuning the model with a limited amount of training data can enhance its nonlinear representation capabilities. Although there is a risk of overfitting, this is not a primary concern due to the inherently low classification accuracies of the model on such challenging tasks. In fact, finetuning is often beneficial when dealing with difficult tasks [43].
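A minimal PyTorch sketch of the global finetuning loop follows; the optimizer settings, epoch count, and the 640-dimensional head (matching the ResNet-12 features of Section 4.1) are illustrative assumptions rather than the paper's exact recipe:

```python
import torch
import torch.nn as nn

def global_finetune(backbone, support_loader, num_classes,
                    epochs=50, lr=1e-3, device="cuda"):
    """Finetune the entire backbone plus a fresh linear head on the support
    samples of a difficult far-domain task; no parameters are frozen."""
    model = nn.Sequential(backbone, nn.Linear(640, num_classes)).to(device)
    model.train()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in support_loader:
            x, y = x.to(device), y.to(device)
            loss = criterion(model(x), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```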
In summary, this section comprehensively elucidates the methodology and details of the Divide-and-Conquer Strategy. Figure 3 presents a flowchart of this strategy, serving as a guide for implementation.
4. Experiment
4.1. Implementation Details
Network: ResNet-12 is the most widely utilized backbone in the field of Few-Shot Learning. To achieve compatibility with a diverse range of models, we selected this network as the backbone. In our experiments, we removed the classification head from ResNet-12 and retained only the feature extraction portion. The input size was consistently scaled to 84 × 84 × 3, resulting in a final 640-dimensional feature vector.
Training: To provide a concise and consistent comparison of the effects of DCS, we re-implemented five models: ProtoNet [8], Baseline [11], RFS [28], DeepBDC [9], and FRN [10]. They are classical high-performance FSL models that are frequently selected for comparison in the literature. Among them, ProtoNet and FRN are representative metric-based approaches, whereas Baseline, RFS, and DeepBDC are representative methods grounded in transfer learning. We used the repository released by Xie et al. [9] as the training codebase. Baseline, RFS, and DeepBDC followed the standard batch-training paradigm, employing the Cross-Entropy (CE) loss. An SGD optimizer with a momentum of 0.9 and a weight decay of 0.0001 was used. The initial learning rate was set to 0.05 and decayed by a factor of 10 at epochs 100 and 150; a total of 170 epochs were trained. Additionally, both RFS and DeepBDC performed three iterations of self-distillation after pretraining [9,28]. For ProtoNet and FRN, we utilized the meta-training paradigm with the same training settings as mentioned above. The base set of mini-ImageNet was used as the pretraining set, and standard data augmentations such as random cropping, color jitter, and random horizontal flipping were applied.
Evaluation: 5-way 1-shot, 5-shot, and 20-shot were chosen as representative tasks. The top-1 classification accuracy was calculated over 600 rounds of randomized tasks.
4.2. Datasets
Meta-Dataset [20] and BSCD-FSL [16] are commonly used benchmarks in CD-FSL research. Meta-Dataset encompasses data from 10 diverse domains, including ImageNet [44], Omniglot [45], Aircraft [46], CUB [47], Textures [48], Quick Draw [49], Fungi [50], VGG Flower [51], Traffic Signs [52], and MSCOCO [53]. BSCD-FSL, on the other hand, spans five datasets: ImageNet [44], Crop Disease [54], ISIC [55], EuroSAT [56], and ChestX [41].
To facilitate a comparison with other FSL algorithms, we substituted ImageNet with mini-ImageNet. Specifically, we utilized the base set (miniIN-B) as the source dataset and employed both the validation set (miniIN-V) and the novel set (miniIN-N) as the target dataset. This approach allows us to assess both classical FSL and CD-FSL performance in a unified framework.
A brief introduction to these datasets is provided below, with representative images from each dataset extracted and presented in Figure 4.
mini-ImageNet [7] is a subset of ILSVRC-12, comprising a total of 100 classes with 600 labeled samples per class and an image size of 84 × 84 × 3. Following the division suggested by Ravi et al., the dataset is segmented into 64 base classes (miniIN-B), 16 validation classes (miniIN-V), and 20 novel classes (miniIN-N). For model pretraining, miniIN-B serves as the source dataset, while miniIN-V and miniIN-N are employed as near-domain target datasets to assess FSL performance.
Omniglot [45] is a dataset that comprises handwritten characters from 50 distinct alphabets, totaling 1623 unique characters. Each character is represented by 20 handwriting samples.
Aircraft [46] is a dataset that includes a diverse collection of aircraft images. It comprises 102 distinct classes or variants, with each class containing 100 samples. For this paper, the original images were utilized for feature extraction without any cropping of the target region based on the bounding box information provided in the dataset.
CUB [47] is a fine-grained dataset specializing in bird categories. It encompasses 200 distinct bird species and includes a total of 11,788 image samples. In our study, we utilized the original images directly for feature extraction, without resorting to cropping the target region based on the provided bounding box information.
Textures [48] is a dataset that comprises various texture patterns. It includes 47 distinct categories and a total of 5640 samples.
Quick Draw [49] is a dataset that consists of black-and-white sketches created by players of the "Quick, Draw!" game. This dataset encompasses 345 different categories and boasts a total of 50 million samples.
Fungi [50] is a comprehensive dataset that comprises a wide array of mushroom images. It includes 1394 distinct mushroom categories and boasts a total of 89,761 image samples, making it a rich resource for fungal research and classification.
VGG Flower [51] comprises 8189 images of flowers distributed across 102 classes.
Traffic Signs [52] is a dataset that consists of 50,000 samples of German road signs, which are divided into 43 distinct categories.
MSCOCO (Microsoft Common Objects in Context) [53] is a large-scale dataset that comprises 1.5 million target instances extracted from Flickr. It includes 80 diverse categories, with each instance labeled by a bounding box. In this paper, we selected the Val2017 subset as our target dataset. Notably, we extracted features directly from the original images without cropping the target region based on the provided bounding box information.
Crop Disease [54] is a dataset that comprises leaf images of diseased plants. It includes 38 distinct categories of crop diseases and boasts a total of 108,610 samples, making it a valuable resource for research in plant pathology and crop health management.
EuroSAT [56] is a dataset that comprises remote sensing images captured by satellites. It includes 10 distinct categories and boasts a total of 27,000 image samples, making it a valuable resource for various applications in the field of remote sensing and Earth observation.
ISIC (International Skin Imaging Collaboration) [55] is a large-scale dermatologic image classification dataset published by the eponymous collaboration. It comprises 33,126 images of both benign and malignant skin lesions sourced from 2056 patients. These images are categorized into seven dermatologic categories, providing a comprehensive resource for research and development in the field of dermatology.
ChestX [41] is a medical dataset that comprises human lung X-ray images. It is a multi-category dataset that includes seven distinct disease categories. For Few-Shot Learning, we have collected the single-category images from this dataset, which constitute the target dataset for our study.
Among the above datasets, mini-ImageNet stands out as the most extensively employed benchmark in conventional FSL research, encompassing a diverse range of common categories. Utilizing miniIN-B as the base set enables the model to acquire rich prior knowledge. The remaining datasets broadly cover specialized domains such as characters, textures, stick figures, remote sensing, and medicine, encompassing both coarse-grained and fine-grained classification images. The image types are varied, including high-resolution and low-resolution images, grayscale images, and color images. This diversity in target domains and richness in data types allow for an effective evaluation of cross-domain strategies under various conditions.
4.3. Metric Results
Using the domain distance and difficulty formulas provided in Section 3.2, we extracted features for the source dataset and the 15 target datasets using the RFS model. Table 1 shows the statistical results of the domain distance and difficulty. As expected, miniIN-B has a domain distance of zero from itself and exhibits the lowest difficulty level. Both miniIN-V and miniIN-N, which are derived from the same dataset, have relatively small domain distances from miniIN-B. The domain distances for the other datasets gradually increase as the discrepancy in data distribution widens.
Based on the metric results, the target datasets are arranged in ascending order of domain distance as follows: miniIN-N < miniIN-V < MSCOCO < Textures < Fungi < CUB < VGG Flower < Crop Disease < Traffic Signs < EuroSAT < Aircraft < Omniglot < ISIC < QuickDraw < ChestX. We use the midpoint of the distance range as the dividing line between near-domain and far-domain datasets, calculated as follows:

$$\tau_{\mathrm{dist}} = \frac{\mathrm{dist}_{\min} + \mathrm{dist}_{\max}}{2}, \tag{9}$$

where $\mathrm{dist}_{\min}$ represents the smallest domain distance to miniIN-B, and $\mathrm{dist}_{\max}$ represents the largest domain distance to miniIN-B. This calculation involves only the target datasets, excluding miniIN-B. Using 0.7775 as the dividing line, datasets with distances less than 0.7775 are categorized as near-domain datasets, while those with distances greater than 0.7775 are considered far-domain datasets. Consequently, for tasks involving miniIN-N, miniIN-V, and MSCOCO, the near-domain strategy outlined in Section 3.3 is applied. For all other tasks, the far-domain strategy detailed in Section 3.4 is employed.
The target datasets are ranked in ascending order of difficulty as follows: Crop Disease < VGG Flower < EuroSAT < Traffic Signs < miniIN-V < miniIN-N < Omniglot < Textures < Aircraft < ISIC < CUB < QuickDraw < MSCOCO < Fungi < ChestX. We employ the midpoint of the difficulty range as the dividing line between low-difficulty and high-difficulty datasets, which is calculated as follows:

$$\tau_{\mathrm{diff}} = \frac{\mathrm{diff}_{\min} + \mathrm{diff}_{\max}}{2}, \tag{10}$$

where $\mathrm{diff}_{\min}$ is the minimum value of difficulty and $\mathrm{diff}_{\max}$ is the maximum value of difficulty. This calculation involves all datasets except miniIN-B. Using 0.3179 as the dividing line, datasets with difficulty less than 0.3179 are categorized as low-difficulty datasets, while those with values greater than 0.3179 are considered high-difficulty datasets. Notably, only ChestX stands out as a far-domain dataset with high difficulty. To enhance the model's adaptability to the target tasks, we finetune it using the strategy outlined in Section 3.5.
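Putting the two thresholds together, DCS routing for a measured target dataset reduces to a simple dispatch; the function name below is illustrative:

```python
def choose_strategy(distance, difficulty,
                    dist_threshold=0.7775, diff_threshold=0.3179):
    """Route a target dataset to a DCS branch using the midpoint thresholds
    computed in this section (miniIN-B source, RFS backbone)."""
    if distance < dist_threshold:
        return "near-domain: classical FSL / CFS-space mapping"
    if difficulty < diff_threshold:
        return "far-domain, low difficulty: whitened PCA"
    return "far-domain, high difficulty: global finetuning"
```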
Figure 1 illustrates the distribution of the target datasets, with domain distance plotted on the horizontal axis and difficulty on the vertical axis. These findings are based on miniIN-B with the RFS model. It is important to note that altering the source dataset or model would yield different outcomes. The distribution results obtained using various source datasets and networks can be found in Section 4.4.
4.4. Influence of Source Data and Backbone
The metrics for domain distance and difficulty, outlined in Section 3.2, rely on the source data and the network. The source data provide an empirical point of view, while the network serves as a measurement tool. These metrics are interpreted from the viewpoint of the pretrained model, implying that variations in the source data or modifications to the network structure would yield distinct metric results.
To validate the above perspective, we altered the neural network to Conv-4 and ResNet-50 while keeping miniIN-B as the source dataset. Following the pretraining procedure outlined in Section 4.1, we re-assessed the domain distance and difficulty for the 15 target datasets. The results are presented in Figure 5 and Figure 6. Notably, due to disparities in the dimensions of the feature vectors extracted by the two networks, there are slight variations in the metric values. The relative positions of most target datasets remain largely unchanged, reflecting the inherent relationship of the target domains relative to the source domain. However, a few datasets exhibited shifts in their positions, indicating that networks with distinct architectures possess varying degrees of generalization capabilities.
If the source dataset is substituted with CUB while maintaining ResNet-12 as the neural network, the resulting distribution of domain distance and difficulty for the target datasets is illustrated in Figure 7. It is evident that with CUB serving as the empirical origin, the distribution of the target datasets undergoes significant alterations. Datasets that were originally considered far-domain relative to miniIN-B, such as Fungi and Aircraft, transition to being classified as near-domain. Conversely, VGG Flower, which was moderately distant from miniIN-B, now exhibits the greatest distance from CUB. Additionally, the level of difficulty associated with the target datasets undergoes some degree of change.
In conclusion, both the model and the source data play a pivotal role in determining the metric results. When implementing DCS, it is imperative to ensure consistency in the model used for domain metrics and few-shot classification. Generally speaking, the source data should be carefully chosen, and model pretraining should be carried out using this dataset. Subsequently, the domain distance and difficulty should be measured for the target datasets using this pretrained model. Finally, DCS can be effectively applied based on these metric results.
4.5. Effectiveness of DCS
To assess the efficacy of DCS, we evaluated the performance of both the near-domain and far-domain strategies on the target datasets using RFS. The results are summarized in Table 2. The black values represent classification accuracies, while the values to their right indicate changes relative to Simple LR, with red signifying an increase and green a decrease. The datasets are organized in order of increasing domain distance.
The experimental results demonstrate that the near-domain strategy enhances performance on near-domain tasks but detracts from performance on far-domain tasks. Conversely, the far-domain strategy shows significant improvement on far-domain tasks but falls short on near-domain tasks. The complementary nature of these two strategies across different domains underscores the varying requirements for feature representation in different task domains. Specifically, source class features are instrumental in aiding near-domain data to effectively convey class information, whereas whitened-PCA features are more apt for representing far-domain data.
Although the far-domain strategy consistently enhances the model's performance on far-domain tasks, the classification accuracy remains significantly low for high-difficulty tasks, namely ChestX. This dataset encompasses X-ray medical images of human lungs depicting a variety of lung diseases, which exhibit substantial disparities compared to the distribution of conventional images. For non-medical experts, discerning differences among these images is typically arduous, posing significant classification challenges. Models trained on miniIN-B exhibit exceedingly low linear separability when applied to ChestX. As evident from the results in Table 1, even when utilizing the entire ChestX dataset, the global classification accuracy (across 7 classes) is merely 30.1%. Consequently, to achieve further improvements on ChestX, it is imperative to finetune the model in order to enhance its nonlinear representation capabilities.
For 5-way K-shot tasks on ChestX, we experimentally investigated global finetuning as a means of enhancing the model's representation capabilities. K was varied from 1 to 200 to validate the trend of our method as K changes. For comparison, with the parameters of the backbone frozen, we evaluated the few-shot classification accuracy of RFS [28], Baseline [11], ProtoNet [8], FRN [10], and DeepBDC [9] in the same way. The experimental results are shown in Figure 8. It can be seen that global finetuning significantly improves the model's ability to adapt to difficult tasks, with its accuracies on ChestX surpassing the second-best model (DeepBDC) by more than 40%. Furthermore, as K increases, the advantages of global finetuning become more pronounced, whereas the other models show obvious performance saturation. Evidently, for far-domain high-difficulty FSL tasks, global finetuning consistently leads to improved performance without concerns of overfitting, owing to the model's otherwise inadequate nonlinear representation.
4.6. Impact of Different Principal Components
The model, due to its sensitivity towards source class features, tends to prioritize the extraction of the source domain principal components while suppressing non-principal components, disregarding the target-data distribution. This trait poses no significant impediment when dealing with near-domain data. However, in the context of far-domain data, the principal components that are crucial for representing the target classes are assigned minimal significance, leading to a notable lack of saliency. By applying PCA on the source domain and executing feature mapping on the target domain, we can establish a correlation between classification accuracy and the principal components number
n.
Figure 9 illustrates the experimental findings. The solid lines represent results obtained from the far-domain datasets, and the dashed lines correspond to the near-domain datasets. Figure 9a presents the outcomes of conventional PCA. Classification accuracy improves with an increase in n until it reaches a saturation point. This saturation occurs because, as n grows larger, the variance ratio of each additional component diminishes, leading to a reduced impact on classification accuracy. When all principal components are normalized, as outlined in Section 3.4, the resulting whitened-PCA outcomes are depicted in Figure 9b. Notably, the near-domain performance, denoted by dashed lines, initially rises and then declines as n increases. Conversely, the far-domain datasets demonstrate consistent improvement with n and do not exhibit the saturation observed in Figure 9a.
These results illustrate the varying significance of principal components for different target domains. The foremost components predominantly represent source class features and hold a dominant influence on the classification of near-domain data. Conversely, the subsequent components act as interference for near-domain data but provide valuable information for far-domain data, aiding in decision-making.
4.7. Comprehensive Experiment
To thoroughly assess the compatibility of DCS with conventional FSL algorithms, we integrate DCS into RFS [
28], DeepBDC [
9], and FRN [
10] and compare them with both classical FSL methods [
8,
9,
10,
11,
28] and CD-FSL algorithms [
39]. The results for 5-way 1-shot, 5-shot, and 20-shot tasks are listed in
Table 3,
Table 4 and
Table 5. Due to the limited number of samples per class in Aircraft and Omniglot, these two datasets are excluded from
Table 5. Methods appended with the suffix ++ signify the results achieved by integrating DCS into classical FSL algorithms. Specifically, for RFS, we employed CFS-space mapping and feature fusion as the near-domain strategy and whitened PCA as the far-domain strategy, resulting in improved RFS++. As for DeepBDC and FRN, which are already excellent near-domain algorithms, we maintained the original algorithms for near-domain tasks while applying whitened PCA for far-domain tasks, yielding DeepBDC++ and FRN++.
The experimental results demonstrate the compatibility of DCS with classical FSL models. Compared to RFS, RFS++ exhibits improved performance in both near-domain and far-domain tasks. DeepBDC++ and FRN++, while maintaining the advantages of the original algorithms in near-domain tasks, significantly enhance their performance in far-domain tasks. Compared to ProtoNet, Baseline, and CIM, our strategy demonstrates superior performance across all datasets. Furthermore, although RFS++ slightly lags behind DeepBDC++ and FRN++ in terms of classification accuracy, it adopts the batch training paradigm, resulting in higher training efficiency. DeepBDC++ and FRN++ are comparable in performance in both near-domain tasks and far-domain tasks. Despite minor differences in the results of the three models, they all demonstrate broad adaptability to various domains overall.
In addition, the generalization ability of an FSL model appears to be positively correlated with the number of near-domain datasets it has. As evident in Table 3, Table 4 and Table 5, RFS has 3 near-domain datasets, DeepBDC has 5~6, and FRN has 8. In most cases, FRN exhibits the strongest cross-domain generalization capabilities.
4.8. Discussion
Rationality of Results: Guo et al. [16] provided a ranking of domain distances for the four datasets in the BSCD-FSL benchmark based on three criteria: perspective distortion, semantic content, and color depth. The order of proximity they obtained is Crop Disease < EuroSAT < ISIC < ChestX, which aligns with subjective human perception (as shown in Figure 4). In Section 4.3, we present experimental results that are consistent with the BSCD-FSL benchmark, highlighting the superiority of our proposed metric. Other studies have also measured domain distance. Oh et al. [17] reported metric results indicating that the domain distance of Crop Disease is greater than that of EuroSAT. Zhang et al. [19] proposed the WDMDS and MMDMDS metrics, which suggest that the domain distance of Crop Disease is greater than both EuroSAT and ISIC. These results contradict human empirical cognition.
Practicability: To compute the domain distance and difficulty using Equations (1) and (2), it is theoretically necessary to employ all images and labels from $\mathcal{D}_T$. However, this requirement often poses practical challenges, thereby somewhat limiting the direct application of our method in real-world scenarios. Nevertheless, as datasets continue to proliferate, the community can leverage Equations (1) and (2) to accumulate metrics and intuitive insights regarding domain distance and difficulty over time. With these accumulated insights, researchers can estimate the approximate range of the target domain based on support samples, thereby facilitating the selection of an appropriate Few-Shot Learning strategy. Furthermore, the difficulty metric can be both dataset-dependent and task-dependent. The difficulty of a specific few-shot task provides a more refined characterization. However, directly applying Equation (2) to the support set is infeasible, since the results would be severely distorted by overfitting. If we could predict the distribution of the target data and sample a sufficient number of high-quality virtual samples from it, it would become possible to utilize Equation (2) to quantify the difficulty of a specific task. This approach may constitute one feasible path for assessing task-dependent difficulty in the future.
Strategy Switching: The experimental findings reveal that tasks from different domains exhibit complementary preferences for features. Specifically, near-domain tasks tend to utilize source class features, whereas far-domain tasks show a stronger preference for whitened-PCA features. Conceivably, there should be intermediate tasks that fall between the near and far domains in terms of distance. For such tasks, a hard-switching strategy like DCS might not be optimal. We aim to explore a soft-switching strategy in the future, which could involve determining an optimal feature transformation for any given domain task based on continuous statistical measures.