1. Introduction
With the rapid advancement of artificial intelligence (AI) technology, deep learning techniques are continuously evolving and improving. Deep learning, a crucial component of AI, finds widespread application across various domains, including image recognition [
1,
2], video retrieval [
3,
4], and object detection [
5]. Developing a robust deep learning model typically necessitates a substantial volume of data to attain optimal performance. Insufficient data may lead to overfitting, compromising the model's generalization capability. To address this challenge and enable models to learn from limited examples, researchers have explored leveraging prior knowledge acquired from previous tasks and transferring this knowledge to new tasks. This concept has inspired the development of few-shot learning (FSL) [6], which aims to facilitate rapid learning with minimal data.
In recent years, numerous meta-learning methods have been proposed to address few-shot problems. Meta-learning, also referred to as “learn-to-learn”, seeks to facilitate rapid learning in new tasks for models. Among the various meta-learning algorithms, metric-based methods [
7,
8,
9,
10], initialization-based methods [
11,
12,
13], and transfer learning methods [
14,
15,
16,
17] are widely utilized. Metric-based methods focus on learning a well-defined embedding space and classifying samples by comparing feature representations within this space. Initialization-based methods seek to learn a model initialization that can be rapidly adapted to new few-shot tasks. Transfer learning methods aim to obtain a pre-trained model that generalizes well to new tasks, with or without fine-tuning. Chen et al. [
14] argued that initializing model parameters performs slightly worse than meta-learning algorithms, and they pre-trained the model with base class data and extracted novel class features by fine-tuning the pre-trained model. Tian et al. [
15] contended that learning a good feature embedding model is more efficient than complex meta-learning algorithms; they achieved the few-shot learning of novel classes by training a deep learning model on the base class data without fine-tuning and using the trained model as a base learner.
However, pre-trained models lack the ability to actively focus on the key information within the feature maps. To address this limitation, it has been proposed to incorporate attention mechanisms into deep learning models [
18,
19,
20,
21]. This enhancement enables models to selectively emphasize important information. Chen et al. [
18] proposed a mutual correlation network (MCNet) to explore the global correlation between feature maps using the self-attention module, which improves the FSL performance through the powerful global information capturing ability of the self-attention module. Zhao et al. [
19] proposed FSL-PRS based on prototype modification using a self-attention module to extract task-relevant features from a pre-trained network. Liu et al. [
20] leveraged an SE module to enhance feature extraction through channel correlations. Zhu et al. [
21] utilized a CBAM module to capture channel and spatial information, enabling the model to learn channel and spatial information autonomously.
All of the above methods use the attention mechanism to improve FSL performance. However, the self-attention module achieves high accuracy through complex attention computations, resulting in significant computational overhead that severely impacts model inference speed. The SE module focuses solely on channel information while disregarding spatial details, leading to suboptimal performance. The CBAM module addresses long-range spatial dependencies using large kernel convolutions, but this approach comes at the cost of extensive computational time. To effectively leverage channel and spatial information within feature maps without compromising accuracy and to alleviate computational burden, we propose a dimensionally enhanced attention (DEA) module. Initially, the feature map is decomposed into 1D tensors along different directions using strip pooling. Subsequently, 1D convolutions with adaptive kernel sizes are employed to collaboratively learn feature information across these directions. Finally, attention weights for different directions are derived using sigmoid functions to emphasize original information. Unlike common attention mechanisms, the DEA module achieves substantial improvements in model performance with minimal additional computational overhead.
In FSL, knowledge distillation has emerged as an effective approach for model compression [
15,
22,
23,
24,
25]. Tian et al. [
15] proposed a simple baseline algorithm that incorporates knowledge distillation to enhance model performance. Similarly, Rizve et al. [
22] introduced a novel training mechanism that enables the model to learn joint feature representations that are both invariant and equivariant. Their approach leverages knowledge distillation to further enhance training and achieves significant improvements in FSL performance.
However, while these knowledge distillation methods have shown promise in boosting FSL performance, they have certain limitations. Specifically, existing research has demonstrated that logit-based knowledge distillation methods often share the temperature between the teacher and student models during distillation. This shared temperature enforces an exact match between the student and teacher logits while disregarding their legitimate differences in scale, thereby limiting the potential of the student model. To address this issue, Sun et al. [
23] proposed an alternative approach. They suggest setting the temperature as a weighted standard deviation of the logits and employing logit standardization as a pre-processing step before applying the softmax function. Inspired by their work, a logit standardization self-distillation method for FSL is constructed. In this method, standardized pre-processing of the logit during self-distillation is first performed. Subsequently, the pre-processed logit is transformed into a probability vector using a softmax function with temperature. Finally, the distillation effect is enhanced by minimizing the Kullback–Leibler (KL) divergence, thereby improving the accuracy of few-shot image classification.
In summary, the contributions of our paper are as follows:
An efficient dimensionally enhanced attention (DEA) module for FSL is proposed, enabling the model to highlight and emphasize critical information. Meanwhile, a multi-dimensional collaborative learning strategy is proposed to characterize information of different dimensions by 1D convolutions with adaptive kernel sizes.
A logit standardization self-distillation method applicable to FSL is constructed. This method enhances distillation effects by standardizing logits during the self-distillation process, leading to a significant improvement in few-shot image classification accuracy.
Extensive experiments on several benchmark datasets demonstrate that the proposed method achieves competitive performance and effectively enhances FSL.
The remainder of this article is organized as follows.
Section 2 summarizes the related work.
Section 3 describes the proposed methodologies in detail.
Section 4 presents the details and results of the experimental and ablation studies.
Section 5 shows the t-SNE visualization results. Finally,
Section 6 summarizes our work.
3. Methodology
3.1. Problem Definition
FSL consists of two learning phases, where the dataset is partitioned into two disjoint class sets, the base classes $D_{base} = \{X_{base}, Y_{base}\}$ and the novel classes $D_{novel}$. In the first phase, the base classes $D_{base}$ are used to perform standard supervised learning to train the model, aiming to obtain a pre-trained model with good generalization performance that can be adapted to new tasks, where $X_{base}$ is the set of images and $Y_{base}$ is the corresponding set of labels. In the second phase, we follow the general setup of meta-learning methods, i.e., the $N$-way $K$-shot setup: we randomly sample from the novel classes $D_{novel}$ and divide the randomly selected samples into support and query sets, where the support set $\mathcal{S}$ consists of $N$ categories, each containing $K$ labeled samples; in general, $K$ is relatively small, e.g., 1 or 5. The query set $\mathcal{Q}$ contains $Q$ samples from the same $N$ categories. The second phase aims to correctly categorize the samples of the query set. We summarize the notations for convenience, as shown in
Table 1.
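To make the episode construction concrete, the following sketch shows one way the N-way K-shot sampling described above could be implemented. It is a minimal illustration only: the names (sample_episode, novel_images, q_query) are ours, and the number of query samples per class is an assumption rather than a value taken from the paper.

import random

def sample_episode(novel_images, n_way=5, k_shot=1, q_query=15):
    # novel_images: dict mapping each novel class label to a list of images (hypothetical format)
    classes = random.sample(list(novel_images.keys()), n_way)  # choose N novel categories
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        samples = random.sample(novel_images[cls], k_shot + q_query)
        support += [(img, episode_label) for img in samples[:k_shot]]  # K labeled samples per class
        query += [(img, episode_label) for img in samples[k_shot:]]    # Q query samples per class
    return support, query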
3.2. Method Pipeline
Our overall approach can be represented in three phases: pre-training on the base classes, self-distillation on the base classes, and testing on the novel classes.
Figure 1 illustrates the overall framework of the algorithm.
The classical ResNet12 [
1] is used as the feature extractor. ResNet12 consists of four consecutive residual blocks, each containing three convolutional layers with 3 × 3 kernels and ending with a 2 × 2 max pooling layer. After the four residual blocks, an adaptive average pooling layer generates the feature embeddings.
The feature extractor is trained using data from the base classes. Due to the small number of samples in the novel classes, it is challenging for the pre-trained model to learn a generalized feature representation; in addition, the pre-trained model struggles to focus on the salient features in the feature map, which can easily lead to a deterioration in the generalization ability for the novel classes. In order to give the model the ability to highlight salient features, a dimensionally enhanced attention (DEA) module is proposed as a plug-and-play module, which is embedded into each residual block of ResNet12 for model performance enhancement. The whole pre-training is shown in
Figure 1a, and the model is optimized using the cross-entropy loss function
$L_{CE} = -\sum_{x_i \in D_{base}} \sum_{c=1}^{C} y_{i,c} \log \hat{y}_{i,c},$    (1)
where $C$ denotes the number of categories, $x_i$ denotes the $i$-th image in the base classes, $y_i$ denotes the true (one-hot) label, and $\hat{y}_i$ denotes the predicted label distribution.
To further enhance model performance, we incorporate self-distillation for retraining, depicted in
Figure 1b. In this phase, a new technique is introduced, called logit standardization, to mitigate performance degradation caused by temperature sharing between the teacher and student models. Based on this work, a logit standardization self-distillation method for FSL is constructed. Specifically, the pre-trained model is used as the teacher model, and the knowledge is transferred from the teacher model to the structurally identical student model via self-distillation. During knowledge transfer, the temperature is set as a weighted standard deviation of logits, allowing the student model to focus more on the underlying logit relationships of the teacher model. During self-distillation, the model is trained jointly with the cross-entropy loss function and the self-distillation loss function. The entire process can be expressed as
$L = \alpha L_{CE} + \beta L_{KD},$    (2)
where $\alpha$ and $\beta$ are weighting factors, and we set $\alpha$ to 0.1 and $\beta$ to 0.9; $L_{CE}$ is the cross-entropy loss; and $L_{KD}$ is the knowledge distillation loss.
In the testing phase, we train a logistic regression (LR) classifier on the support set, which in turn predicts the categories of the query images and obtains the corresponding predicted labels, as illustrated in
Figure 1c.
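As a rough illustration of this testing procedure, the sketch below fits a logistic regression classifier on support-set embeddings and predicts the query labels. The use of scikit-learn and the feature flattening step are our assumptions rather than details reported in the paper.

import torch
from sklearn.linear_model import LogisticRegression

def evaluate_episode(encoder, support_x, support_y, query_x):
    # encoder: frozen pre-trained feature extractor (ResNet12 with DEA modules)
    encoder.eval()
    with torch.no_grad():
        s_feat = encoder(support_x).flatten(1).cpu().numpy()  # [N*K, D] support embeddings
        q_feat = encoder(query_x).flatten(1).cpu().numpy()    # [N*Q, D] query embeddings
    clf = LogisticRegression(max_iter=1000)
    clf.fit(s_feat, support_y.cpu().numpy())
    return clf.predict(q_feat)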
3.3. Dimensionally Enhanced Attention (DEA) Module
Before introducing the DEA module, we revisit the popular attention mechanisms. We find that the SE module [
29] focuses only on inter-channel feature information and fails to model spatial location information, leading to a catastrophic loss of information as shown in
Figure 2a. The CBAM module [
30] attempts to map spatial information using large kernel convolutions, but this introduces serious redundancy and additional computational complexity, as shown in
Figure 2c. The CA module [
31] employs strip pooling to encode feature maps along both horizontal and vertical directions, aiming to map spatial information to the channel dimension for modeling. However, the repeated use of 2D convolutions introduces substantial computational complexity, which hinders efficient model inference, as shown in
Figure 2b. Additionally, we note that SE [
29], CBAM [
30], and CA [
31] employ dimensionality reduction operations, which can result in serious information loss.
To address the above problems of the attention mechanisms, an efficient DEA module is designed, as shown in
Figure 2d. Specifically, the DEA module can be viewed as a computational unit that describes a feature transformation, converting an input feature tensor into an enhanced tensor representation. The input feature tensor is assumed to be $X \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$, and $W$ denote the channel, height, and width, respectively. The DEA module first uses strip pooling to decompose $X$ along the three directions: channel, height, and width. This decomposition allows the feature information to be learned independently across different dimensions, while also capturing the dependencies within each dimension and modeling feature information across dimensions. By dimensionally decomposing $X$, we obtain the channel descriptor $z^{c} \in \mathbb{R}^{C \times 1 \times 1}$, the height descriptor $z^{h} \in \mathbb{R}^{1 \times H \times 1}$, and the width descriptor $z^{w} \in \mathbb{R}^{1 \times 1 \times W}$. Mathematically, this process can be expressed as
$z^{c}(c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X(c, i, j),$
$z^{h}(h) = \frac{1}{C \times W} \sum_{c=1}^{C} \sum_{j=1}^{W} X(c, h, j),$
$z^{w}(w) = \frac{1}{C \times H} \sum_{c=1}^{C} \sum_{i=1}^{H} X(c, i, w).$
After obtaining feature descriptors for the channel, height, and width dimensions, a multi-dimensional collaborative learning strategy is proposed to fully utilize information from these dimensions. In this strategy, we employ a simple and efficient method for attention computation using 1D convolution instead of traditional 2D convolution to enhance feature representation. This choice is motivated by treating the feature descriptor obtained from the aforementioned decomposition as a 1D sequential signal, where 1D convolution excels in processing sequential signals compared to 2D convolution, while also being computationally lighter. Additionally, within this strategy, the size of the convolution kernel in 1D convolution is adaptive. It dynamically adjusts the kernel size based on the number of channels, allowing the model to flexibly determine the extent of neighborhood interaction. This adaptive approach effectively reduces computational complexity while minimizing redundant information.
Taking the channel and height dimensions as examples: in the channel dimension, we first reshape the obtained 1D channel descriptor into a tensor suitable for 1D convolution. We then consider the relationship between each channel and its neighbors, employing nonlinear local cross-channel interaction to learn correlations between channels. In the height dimension, the obtained 1D height descriptor is likewise reshaped, and, accounting for the information differences along the height of the feature map, we learn correlations between heights through cross-height interaction, which avoids interference from irrelevant height information and reduces redundancy. The width dimension is learned in the same way and in parallel with the channel and height dimensions. Unlike the SE, CBAM, and CA modules, the DEA module utilizes 1D convolution to facilitate information interaction across dimensions, which avoids the information loss caused by dimensionality reduction. Moreover, the DEA module involves only a minimal number of parameters, ensuring computational efficiency.
In the process of multi-dimensional collaborative learning, the size of the convolution kernel will adaptively change according to the number of channels. This adaptation allows layers with a larger number of channels to perform more cross-dimensional interactions, thereby enriching the feature information. The specific calculations are as follows:
$k = \left| \frac{\log_2 C - \lambda}{\gamma} \right|_{odd},$    (6)
where we set $\gamma$ to 1.5 and $\lambda$ to 1, and $|t|_{odd}$ denotes the nearest odd number less than or equal to $t$.
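As a worked example under the stated settings (our own arithmetic, with γ = 1.5 and λ = 1): for a layer with C = 64 channels, (log2(64) - 1)/1.5 ≈ 3.33, and the nearest odd number not exceeding this value gives k = 3; for C = 640, (log2(640) - 1)/1.5 ≈ 5.55, giving k = 5. Layers with more channels therefore receive a larger interaction range.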
Learning feature information in various dimensions through 1D convolutional collaboration with an adaptive convolutional kernel size avoids dimensionality reduction and effectively captures cross-dimensional information interactions. Simultaneously, the model refines the feature representation through calculating attention across different dimensions, thereby mitigating information redundancy. The interplay of information across channels directs the model’s focus towards objects within the feature map. Similarly, information exchange across heights and widths guides the model to emphasize object positional information, facilitating the establishment of robust spatial dependencies. During the 1D convolution learning process, the following representation is employed:
$\tilde{z}^{c} = \mathrm{Conv1D}_{k}(z^{c}), \quad \tilde{z}^{h} = \mathrm{Conv1D}_{k}(z^{h}), \quad \tilde{z}^{w} = \mathrm{Conv1D}_{k}(z^{w}),$
where $\mathrm{Conv1D}_{k}$ denotes the 1D convolution operation and $k$, calculated by Equation (6), represents the size of the adaptive convolution kernel.
Finally, the obtained 1D tensor is transformed into the tensor size after strip pooling. The weights for the different dimensions are then obtained using the sigmoid function, which are multiplied with the input tensor to produce an enhanced feature representation. The process can be represented as follows:
$Y = X \otimes \sigma(\tilde{z}^{c}) \otimes \sigma(\tilde{z}^{h}) \otimes \sigma(\tilde{z}^{w}),$
where $\sigma$ denotes the sigmoid activation function and $\otimes$ denotes element-wise multiplication with broadcasting.
We embed the DEA module into the residual block of ResNet12, as shown in
Figure 3b. Additionally, we provide the pseudo-code for the DEA module to facilitate easy reproduction, as shown in Algorithm 1.
Algorithm 1. PyTorch code for our proposed DEA module
import math
import torch.nn as nn

def DEA(x, channel, lambd=1, gamma=1.5):
    # x: input features with shape [N, C, H, W]
    # lambd, gamma: parameters of the mapping function in Equation (6)
    N, C, H, W = x.size()
    # adaptive kernel size: nearest odd number <= |(log2(channel) - lambd) / gamma|
    kernel = int(abs((math.log(channel, 2) - lambd) / gamma))
    kernel_size = kernel if kernel % 2 else kernel - 1
    kernel_size = max(kernel_size, 1)  # guard for very small channel counts
    # shared 1D convolution; in practice this is defined once as a module attribute
    conv = nn.Conv1d(1, 1, kernel_size=kernel_size, padding=(kernel_size - 1) // 2, bias=False)
    sigmoid = nn.Sigmoid()
    # channel attention: strip-pool over H and W, then 1D convolution across channels
    out_c = x.mean(dim=2, keepdim=True).mean(dim=3, keepdim=True)
    out_c = out_c.view(N, C, 1).permute(0, 2, 1)
    out_c = sigmoid(conv(out_c).permute(0, 2, 1).view(N, C, 1, 1))
    # height attention: strip-pool over C and W, then 1D convolution across heights
    out_h = x.mean(dim=1, keepdim=True).mean(dim=3, keepdim=True)
    out_h = out_h.view(N, 1, H)
    out_h = sigmoid(conv(out_h).view(N, 1, H, 1))
    # width attention: strip-pool over C and H, then 1D convolution across widths
    out_w = x.mean(dim=1, keepdim=True).mean(dim=2, keepdim=True)
    out_w = out_w.view(N, 1, W)
    out_w = sigmoid(conv(out_w).view(N, 1, 1, W))
    # enhance the input with the three directional attention weights
    return x * out_c * out_h * out_w
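For completeness, the snippet below sketches how the DEA call from Algorithm 1 might be attached to a ResNet12 residual block, following our reading of Figure 3b. The exact block composition (batch normalization, LeakyReLU, shortcut convolution) and the placement of DEA before the residual addition are assumptions and may differ from the actual implementation.

import torch
import torch.nn as nn

class ResidualBlockWithDEA(nn.Module):
    # Sketch only: three 3x3 convolutional layers as described in Section 3.2,
    # with the DEA call applied before the residual addition (our reading of Figure 3b).
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False), nn.BatchNorm2d(out_ch), nn.LeakyReLU(0.1),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False), nn.BatchNorm2d(out_ch), nn.LeakyReLU(0.1),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False), nn.BatchNorm2d(out_ch),
        )
        self.shortcut = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        out = self.convs(x)
        out = DEA(out, out.size(1))            # dimensionally enhanced attention (Algorithm 1)
        out = torch.relu(out + self.shortcut(x))
        return self.pool(out)

# Example usage on a random feature map
feat = torch.randn(2, 64, 42, 42)
block = ResidualBlockWithDEA(64, 160)
print(block(feat).shape)  # torch.Size([2, 160, 21, 21])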
3.4. Logit Standardization Self-Distillation
Conventional knowledge distillation involves training a teacher model with a large amount of data and then training a smaller student model under the guidance of the teacher model [
37]. However, this method requires a substantial volume of data for model training. In FSL environments, the scarcity of data poses challenges when using conventional knowledge distillation to train student models, which often leads to difficulties in adequately training models and effectively characterizing knowledge. Therefore, the conventional knowledge distillation method is not suitable for FSL.
To address these challenges, the student model is trained using a self-distillation method. In self-distillation, the teacher model and the student model share the same architecture, avoiding additional model complexity. This allows a relatively small teacher model to be used to obtain a student model with the same architecture, making the approach well suited to the data scarcity of few-shot environments.
In the self-distillation process, we assume that the teacher model is denoted by $f_{T}$ and the student model is denoted by $f_{S}$. The model is trained on the base classes $D_{base}$, where $x_i$ and $y_i$ denote an image and its corresponding label, respectively. Given the input $x_i$, the teacher $f_{T}$ and the student $f_{S}$ predict the logit vectors $v$ and $z$, respectively, and the process can be represented as
$v = f_{T}(x_i), \quad z = f_{S}(x_i),$
where $x_i$ is a base class image, $f_{T}$ is the teacher model, and $f_{S}$ is the student model.
Each logit vector is then converted into a probability vector using a softmax function with temperature $T$, where the $k$-th item is expressed as follows:
$q(v_k) = \frac{\exp(v_k / T)}{\sum_{j=1}^{C} \exp(v_j / T)}, \quad q(z_k) = \frac{\exp(z_k / T)}{\sum_{j=1}^{C} \exp(z_j / T)},$
where $v_k$ and $z_k$ denote the $k$-th item of $v$ and $z$, respectively. Finally, the student model is trained by minimizing the KL divergence, which can be expressed as
$L_{KD} = \mathrm{KL}\big(q(v) \,\|\, q(z)\big) = \sum_{k=1}^{C} q(v_k) \log \frac{q(v_k)}{q(z_k)},$
where $v$ and $z$ denote the prediction logit vectors of the teacher and student models, respectively.
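A minimal sketch of this conventional, temperature-shared distillation loss follows, assuming logits of shape [batch, C]; the function name and default temperature value are illustrative, and the T-squared gradient scaling used in some implementations is omitted to match the formulation above.

import torch.nn.functional as F

def kd_loss_shared_temperature(teacher_logits, student_logits, T=4.0):
    # Softmax with the same temperature T for teacher and student, then KL divergence
    q_teacher = F.softmax(teacher_logits / T, dim=1)
    log_q_student = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_q_student, q_teacher, reduction="batchmean")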
The temperature $T$ is shared between the teacher and student models in logit-based self-distillation, disregarding the fact that the appropriate temperature could vary between the teacher and the student, as well as across different samples. To tackle this problem, a weighted logit standard deviation is introduced as an adaptive temperature, and logit standardization is used as a pre-processing step to map an arbitrary range of logits into a bounded range [23]. This approach allows for arbitrary ranges and shared variances in student logits, effectively maintaining the inherent relationships with the teacher logits. Specifically, we compute the mean and standard deviation of a logit vector $v$, which can be expressed as
$\overline{v} = \frac{1}{C} \sum_{k=1}^{C} v_k,$    (16)
$\sigma(v) = \sqrt{\frac{1}{C} \sum_{k=1}^{C} (v_k - \overline{v})^2},$    (17)
where $v$ denotes the logit vector. By calculating the mean and standard deviation of any logit vector through Equations (16) and (17), the logit standardization can be expressed as
$\mathcal{Z}(v) = \frac{v - \overline{v}}{\sigma(v)}.$    (18)
The standardized logit is then converted into a probability vector using a softmax function with temperature, which can be expressed as
$\tilde{q}(v_k) = \frac{\exp\big(\mathcal{Z}(v)_k / T\big)}{\sum_{j=1}^{C} \exp\big(\mathcal{Z}(v)_j / T\big)},$
where $T$ is a fundamental temperature parameter and $\mathcal{Z}(v)$ denotes the standardized logit vector obtained from Equation (18); the student logits $z$ are treated in the same way to obtain $\tilde{q}(z_k)$.
Based on the above theoretical analysis, a logit standardization self-distillation method suitable for FSL is constructed.
Figure 4 illustrates this process: the teacher model is trained on the base class data with the cross-entropy loss function in the first phase, and the knowledge is then transferred from the teacher model to the student model through self-distillation. The student model undertakes two tasks in the logit standardization self-distillation process. The first task involves calculating the classification loss directly via the cross-entropy loss, which helps the model converge quickly. The student's prediction in this task is referred to as a hard prediction, since there is no warming of the softmax function, i.e., the temperature $T = 1$. In the second task, the logit vectors from both the teacher and student models are standardized. These standardized logits are then converted into probability vectors using the softmax function with temperature. The teacher's predictions are called soft labels, and the student's predictions are referred to as soft predictions. Finally, the KL divergence is minimized to compute the self-distillation loss, ensuring that the predictions of the student are closer to those of the teacher. Throughout the process, we jointly optimize the model with the cross-entropy loss from the first task and the self-distillation loss from the second task. These two losses are balanced using the hyper-parameters $\alpha$ and $\beta$, so Equation (2) can be rewritten as
$L = \alpha L_{CE}\big(q(z), y\big) + \beta \, \mathrm{KL}\big(\tilde{q}(v) \,\|\, \tilde{q}(z)\big),$
where $\alpha$ and $\beta$ are weighting factors, and we set $\alpha$ to 0.1 and $\beta$ to 0.9; $T$ is the fundamental temperature parameter used in $\tilde{q}(\cdot)$; $L_{CE}$ is the cross-entropy loss; and $\mathrm{KL}$ is the KL divergence.
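The sketch below summarizes the logit standardization self-distillation objective as we understand it from Equations (16)-(18) and the rewritten Equation (2); the helper names, the small epsilon for numerical stability, and the use of PyTorch's default (unbiased) standard deviation are our own choices and may differ from the actual implementation.

import torch.nn.functional as F

def standardize(logits, eps=1e-7):
    # Equation (18): z-score each logit vector with its own mean and standard deviation
    mean = logits.mean(dim=1, keepdim=True)
    std = logits.std(dim=1, keepdim=True)
    return (logits - mean) / (std + eps)

def logit_std_self_distillation_loss(teacher_logits, student_logits, labels,
                                     T=4.0, alpha=0.1, beta=0.9):
    # Task 1: hard prediction with no softmax warming (temperature 1)
    ce = F.cross_entropy(student_logits, labels)
    # Task 2: soft labels / soft predictions from standardized logits with base temperature T
    q_teacher = F.softmax(standardize(teacher_logits) / T, dim=1)
    log_q_student = F.log_softmax(standardize(student_logits) / T, dim=1)
    kd = F.kl_div(log_q_student, q_teacher, reduction="batchmean")
    return alpha * ce + beta * kd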
4. Experiments
4.1. Datasets
We conduct experiments on two benchmark datasets, miniImageNet [
39] and CIFAR-FS [
40].
miniImageNet. The miniImageNet dataset is derived from ImageNet and consists of 60,000 RGB images distributed across 100 categories. Each category contains 600 images, with each image exhibiting a resolution of 84 × 84 pixels. We follow [
41] to divide the dataset into 64 training classes, 16 validation classes, and 20 test classes.
CIFAR-FS. The CIFAR-FS dataset is derived from CIFAR-100 and comprises 100 categories, each with 600 images. The images in this dataset are smaller, with a resolution of 32 × 32 pixels, which presents a challenge for classification tasks. We randomly divide the dataset into 64 training classes, 16 validation classes, and 20 test classes.
4.2. Implementation Details
Our experimental platform runs on a computer with Windows 10 and an NVIDIA GeForce GTX 1080 Ti GPU (MSI, Taiwan, China). All code is implemented in PyTorch version 1.10.1. We use ResNet12 [
1] as the backbone network and optimize the model using the SGD optimizer with a momentum of 0.9 and a weight decay of $5 \times 10^{-4}$. During training on all datasets, we train the model for 100 epochs with a batch size of 64. The learning rate is initialized at 0.05 and decayed by a factor of 0.1 at the 60th and 80th epochs. In the self-distillation phase, we maintain the same experimental setup and set $\alpha$ to 0.1 and $\beta$ to 0.9. For the testing phase, we randomly sample 600 tasks and train logistic regression as the classifier. We use two popular FSL settings, 5-way 1-shot and 5-way 5-shot, and report classification results with 95% confidence intervals.
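A compact sketch of the optimization schedule described above (SGD, momentum 0.9, weight decay 5e-4, learning rate 0.05 decayed by 0.1 at epochs 60 and 80, 100 epochs, batch size 64). The backbone and data loader here are small random placeholders rather than the actual ResNet12 and base class data.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder backbone and data; in the paper these are ResNet12 (+DEA) and the base classes.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 84 * 84, 64))
train_loader = DataLoader(TensorDataset(torch.randn(256, 3, 84, 84),
                                        torch.randint(0, 64, (256,))),
                          batch_size=64, shuffle=True)

optimizer = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60, 80], gamma=0.1)
criterion = nn.CrossEntropyLoss()

for epoch in range(100):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()  # learning rate decays by 0.1 at epochs 60 and 80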
4.3. Comparison with State-of-the-Art Methods
Following the experimental setup, we conduct experiments on the miniImageNet and CIFAR-FS datasets. The results are shown in
Table 2. We selected PrototNet [
7], MAML [
11], Baseline++ [
14], MetaOptNet [
13], MTL [
16], Hyper ProtoNet [
26], RFS-simple [
15], RFS-distill [
15], DSN [
27], Meta-Baseline [
42], TAS-simple [
25], TAS-distill [
25], MIAN [
34], SENet [
43], KT [
44], and EFTS [
45] as comparison methods. Among them, PrototNet [
7], Hyper ProtoNet [
26], and DSN [
27] are the metric-based methods; Baseline++ [
14], MTL [
16], RFS-simple [
15], and Meta-Baseline [
42] are the commonly used baseline methods in FSL, similar to transfer learning principles; MIAN [
34] employs an attention mechanism to improve the model performance, similar to our approach; MetaOptNet [
13] and MAML [
11] are initialization-based methods; and SENet [
43], KT [
44], and EFTS [
45] are all state-of-the-art FSL methods from recent years. For fairness, the results of these comparison methods are derived from those reported in the literature. As in previous works, the average accuracy is adopted to assess the effectiveness of all the FSL methods in 5-way 1-shot and 5-way 5-shot settings. The results of the comparison experiments are shown in
Table 2.
First, compared to the four baseline methods—Baseline++ [
14], MTL [
16], RFS-simple [
15], and Meta-Baseline [
42]—our method improves performance by 13.35%, 6.12%, 5.3%, and 4.15%, respectively, in the 5-way 1-shot setting on miniImageNet. In the 5-way 5-shot setting, the improvements are 6.86%, 7.26%, 3.12%, and 3.5%, respectively. Similar performance gains are observed on CIFAR-FS. PrototNet [
7], Hyper ProtoNet [
26], and DSN [
27] aim to learn a distance metric suitable for FSL, and our method achieves the best average accuracy on miniImageNet and CIFAR-FS compared to these metric-based methods. MetaOptNet [
13] and MAML [
11] improve FSL performance by initializing task-dependent parameters, and our approach outperforms these initialization-based methods, achieving the best average accuracy on both datasets.
Second, our method achieves the best average accuracy on both datasets compared to RFS-distill [
15] and TAS-distill [
25], two methods using knowledge distillation. MIAN [
34] employs a self-attention mechanism to enhance FSL performance, whereas our DEA module is simple and effective compared to self-attention and surpasses MIAN in classification accuracy.
Finally, our method is superior to the state-of-the-art methods SENet [
43], KT [
44], and EFTS [
45]. SENet [
43] balances prototype and example representations through spectral filtering, KT [
44] proposes an effective data enhancement strategy for improving FSL performance, and EFTS [
45] introduces an episodic free task selection strategy for choosing tasks with the highest affinity scores for co-training a meta-learner. Compared to these state-of-the-art methods, our method remains highly competitive, achieving the best average accuracy on both datasets.
4.4. Ablation Studies
We conduct ablation studies to evaluate core components of our methods: the DEA module and logit standardization self-distillation.
Table 3 shows the experimental results of the ablation experiments on miniImageNet and CIFAR-FS under 5-way 1-shot and 5-way 5-shot settings.
Through these ablation experiments, we can find that the DEA module can effectively improve the model performance. In the 5-way 1-shot setting, average accuracies on miniImageNet and CIFAR-FS are improved by 2.47% and 1.60%, respectively. In the 5-way 5-shot setting, average accuracies are improved by 1.61% and 1.05%, respectively. The DEA module can significantly improve the FSL performance by focusing on learning channel and spatial information in the full dimension, thereby enhancing feature representation. Additionally, the logit standardization self-distillation method also improves model performance. In the 5-way 1-shot setting, the average accuracies on both datasets are improved by 3.03% and 2.12%, respectively. In the 5-way 5-shot setting, the average accuracies are improved by 1.93% and 1.38%, respectively. These ablation experiments demonstrate the effectiveness of the DEA module and logit standardization self-distillation.
Most importantly, combining the two methods further demonstrates that the DEA module does not conflict with logit standardization self-distillation. Instead, using both methods simultaneously achieves maximum performance gains. In the 5-way 1-shot setting, average accuracies on both datasets are improved by 4.01% and 3.71%, respectively. In the 5-way 5-shot setting, average accuracies are improved by 3.03% and 2.29%, respectively, which is a significant performance improvement.
4.5. Performance Analysis of DEA Module
This section consists of two subsections in which we compare the DEA module with state-of-the-art attention methods and validate the effect of different components of the DEA module on model performance.
4.5.1. Comparison with State-of-the-Art Attention Mechanisms
To validate the effectiveness of the proposed DEA module, we compared the DEA module with several state-of-the-art attention methods, including the SE [
29], CBAM [
30], CA [
31], and EMA [
32] modules. For experimental fairness, we used the same experimental environment and experimental setup for all experiments. To ensure experimental soundness, we set the reduction rates of the SE [
29], CBAM [
30], CA [
31] and EMA [
32] modules to 4, 32, 16, and 8, respectively, and use the same ResNet12 as the backbone network. We report the 1-shot accuracy, 5-shot accuracy, number of parameters, FLOPs, and inference speed. The experimental results are shown in
Table 4 and
Table 5. From the experimental results, it can be seen that adding an attention mechanism to the baseline improves model performance on both datasets, indicating that attention mechanisms are applicable to FSL and can enhance classification accuracy. Our proposed DEA module achieves the best results, with an overall performance improvement on both datasets. On miniImageNet, the DEA module improves 1-shot accuracy by approximately 1.13%, 0.89%, 1.69%, and 1.88% over the SE [
29], CBAM [
30], CA [
31], and EMA [
32] methods, respectively. In 5-shot accuracy, the DEA module improves by 0.84%, 0.65%, 0.71%, and 1.18%, respectively. Additionally, the DEA module hardly increases the number of parameters of the model and results in the least number of FLOPs, indicating that the DEA module has significant advantages.
It is important to note that the SE module [
29], lacking attention to spatial information, achieves faster inference but at the cost of lower accuracy. In contrast, the CBAM [
30], CA [
31], EMA [
32], and DEA modules all incorporate spatial information, which reduces inference speed. Among these, the DEA module demonstrates the fastest inference speed, suggesting that it is the most efficient in attention computation. The CBAM [
30], CA [
31] and EMA [
32] modules reduce channel dimensions through reduction rates during feature mapping, which can lead to information loss due to reduced information encoding. In contrast, the DEA module employs multi-dimensional collaborative learning, learning the information of each dimension separately, which effectively reduces redundancy and prevents the information loss caused by dimensionality reduction. Moreover, the DEA module utilizes 1D convolution, reducing computational complexity compared to the CBAM module [
30] and CA module [
31] which use 2D convolutions. Notably, 1D convolution in the DEA module allows adaptive convolution kernel sizes, accommodating various receptive field sizes during feature map compression. This capability helps model long-range spatial dependencies while reducing memory costs.
To visually demonstrate the DEA module’s advantages, we employ the Grad-CAM tool [
46] to generate gradient backpropagation heat maps.
Figure 5 shows the visualization results of different modules under the 5-way 1-shot setting. The visualization shows that the DEA module outperforms others by accurately focusing on regions of interest across different categories. It effectively highlights and emphasizes broader areas compared to other modules, which contributes to improving the model performance.
In summary, the proposed DEA module enhances feature representation across different dimensions efficiently and performs attention computation effectively. Importantly, it achieves this with a minimal increase in model parameters, effectively reducing model complexity.
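To reproduce the kind of parameter count and inference speed comparison reported above, a measurement along the following lines can be used; the warm-up and run counts are arbitrary choices, and FLOPs counting is left to an external profiler since it is not part of standard PyTorch.

import time
import torch

def profile_module(model, input_size=(1, 3, 84, 84), runs=100):
    # Count parameters and measure average CPU inference time for one forward pass
    params = sum(p.numel() for p in model.parameters())
    x = torch.randn(*input_size)
    model.eval()
    with torch.no_grad():
        for _ in range(10):  # warm-up iterations
            model(x)
        start = time.time()
        for _ in range(runs):
            model(x)
    latency = (time.time() - start) / runs
    return params, latency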
4.5.2. Ablation Analysis of Different Components in DEA Module
To verify the effectiveness of multi-dimensional collaborative learning in the DEA module, we conduct ablation experiments on its different components. We decompose the DEA module into two dimensions of attention: channel and spatial. The channel dimension attention consists of channel attention, while the spatial dimension attention includes attention in both the height and width directions. The results of these experiments on miniImageNet and CIFAR-FS are presented in
Table 6.
Our findings indicate that channel dimension attention improves model performance, albeit modestly. In the 5-way 1-shot setting, average accuracies on both datasets are improved by 0.18% and 0.46%, respectively. In the 5-way 5-shot setting, average accuracies are improved by 0.14% and 0.50%, respectively. Although these gains are not substantial, they suggest that channel attention has a positive impact. However, spatial dimension attention improves the model performance significantly. In the 5-way 1-shot setting, average accuracies on both datasets are improved by 0.78% and 1.05%, respectively. In the 5-way 5-shot setting, average accuracies are improved by 0.52% and 0.73%, respectively. These results underscore the critical importance of focusing on spatial information for improving model performance. Further, we synergize the channel dimension and spatial dimension information for learning and improve the average accuracy by 2.47% and 1.60% in both datasets in the 5-way 1-shot setting, and 1.61% and 1.05% in the 5-way 5-shot setting. These results clearly demonstrate that integrating information from both channel and spatial dimensions is essential and that a multi-dimensional collaborative learning strategy is highly effective.
To provide a more intuitive representation of the influence of different dimensions of information attention on model performance,
Figure 6 shows the heat map generated by the Grad-CAM tool [
46], visualizing different dimensions of attention under the 5-way 1-shot setting. The visualization indicates that a singular focus on one dimension of information leads the model to concentrate on non-comprehensive information, potentially resulting in information loss. Conversely, a comprehensive focus on all dimensions enriches the feature information, enabling the model to better highlight and focus on salient regions.
4.6. Logit Standardization Self-Distillation Experimental Analysis
This section comprises three subsections. First, we verify the effect of the temperature on self-distillation. Next, we explore the validity of the logit standardization. Finally, we investigate the impact of different values of the hyper-parameters $\alpha$ and $\beta$ in the loss on the experimental results.
4.6.1. Effect of the Temperature T on Self-Distillation
To analyze the effect of the temperature $T$ on the model performance during the logit standardization self-distillation process, we conduct ablation experiments on the miniImageNet and CIFAR-FS datasets. The experimental results are shown in
Figure 7. We observe that logit standardization self-distillation is insensitive to the temperature $T$. The best performance on miniImageNet is achieved when $T = 4$, while the optimal performance on CIFAR-FS occurs at $T = 6$. Based on these findings, we set $T$ to 4 for miniImageNet and 6 for CIFAR-FS, unless otherwise stated.
4.6.2. Effect of Logit Standardization on Self-Distillation
To validate the effect of logit standardization on the distillation effect during the self-distillation, we conduct experimental validation on miniImageNet and CIFAR-FS. The experimental results are shown in
Table 7. According to our findings, the inclusion of logit standardization leads to performance improvements on both datasets. On miniImageNet, the 1-shot accuracy increases by 0.76% and the 5-shot accuracy by 1.04%. On CIFAR-FS, the 1-shot and 5-shot accuracies improve by 0.96% and 0.89%, respectively. These results demonstrate that logit standardization significantly enhances the distillation effect and improves model performance. Moreover, self-distillation based on logit standardization is suitable for FSL and can substantially enhance few-shot image classification accuracy.
4.6.3. Validity Analysis of Hyper-Parameters
To investigate the impact of the hyper-parameters $\alpha$ and $\beta$ in the loss function $L$ of logit standardization self-distillation on model performance, we conduct experimental validation on miniImageNet and CIFAR-FS. As shown in
Figure 8, we report the results for different values of the hyper-parameter $\alpha$. Note that all experiments follow this setup with $\alpha + \beta = 1$. As the value of $\alpha$ increases, the value of $\beta$ decreases, which indicates that the weight factor controlling the cross-entropy loss gradually increases, while the weight factor controlling the self-distillation loss gradually decreases. We observe that the model's performance fluctuates during this process. However, overall, as $\alpha$ keeps increasing, the model's performance exhibits a decreasing trend. This suggests that in the logit standardization self-distillation process, the self-distillation loss has a greater impact on model performance than the cross-entropy loss. Eventually, we observe that for both datasets, the best results are obtained when $\alpha = 0.1$ and $\beta = 0.9$.
6. Conclusions
Few-shot learning (FSL) is a crucial approach for addressing data scarcity and expensive data labeling. FSL aims to learn quickly from limited labeled data, and transfer learning methods are commonly employed in FSL. To enhance the performance of pre-trained models, we propose an efficient dimensionally enhanced attention (DEA) module. The DEA module decomposes the feature map into tensors of different orientations through strip pooling, allowing the model to capture feature information efficiently and highlight salient features using a multi-dimensional collaborative learning strategy. This strategy learns cross-dimensional information interactions through 1D convolutions with adaptive kernel sizes. Furthermore, we construct a logit standardization self-distillation method applicable to FSL. By employing logit standardization, we avoid the exact logit matching caused by a shared temperature, thereby enhancing the self-distillation effect. We conduct comprehensive experiments on few-shot benchmark datasets to evaluate the feasibility of the proposed method, and the results demonstrate that it significantly improves FSL performance. However, the proposed method is implemented in three stages (pre-training on the base classes, self-distillation on the base classes, and testing on the novel classes), which can make the overall pipeline cumbersome; furthermore, our method has not been validated in other few-shot domains. In future work, we will optimize the proposed method. In addition, the proposed method may benefit other FSL tasks, and we plan to explore its feasibility in cross-domain few-shot image classification tasks.