1. Introduction
Rapid advancements in remote sensing technology have spurred the development of diverse algorithms designed to efficiently manage extensive Earth observation data. Scene classification, a core component of remote sensing image analysis, is crucial for accurately interpreting land use changes [1,2,3,4], optimizing agricultural practices [5,6,7,8], managing forest resources [9,10], and monitoring hydrological dynamics [11].
Traditional machine learning approaches based on handcrafted features such as SIFT [12,13], HOG [14], and LBP [15] have been employed to determine land cover types. While effective, these methods often require expert knowledge and manual feature extraction, making them costly and inefficient. In contrast, recent advances in deep neural networks have significantly enhanced the performance of tasks such as image classification [16,17,18], object recognition and tracking [19,20,21], and person re-identification [22,23,24]. In particular, deep learning has revolutionized remote sensing image (RSI) scene classification, yielding substantial performance improvements [25,26,27,28,29,30].
Despite their effectiveness, high-performance models often entail significant computational and storage demands, posing substantial challenges for deployment in real-world applications. To mitigate these issues, knowledge distillation has been widely adopted. This method transfers knowledge from a large, complex model (the teacher) to a smaller, more efficient model (the student). Through this technique, the student model can reach performance levels comparable to those of the teacher by minimizing the Kullback–Leibler divergence between their softened output distributions.
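For concreteness, a minimal PyTorch sketch of this standard logit-distillation objective is given below; the function name and the default temperature are illustrative and not taken from our implementation.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """Hinton-style KD: KL divergence between temperature-softened teacher and
    student distributions, scaled by T^2 to keep gradient magnitudes comparable."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

# Typical usage: total_loss = ce_weight * F.cross_entropy(student_logits, labels)
#                           + kd_weight * kd_loss(student_logits, teacher_logits)
```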
However, RSI scene classification is marked by high inter-class similarity and large intra-class variability, presenting significant challenges to existing knowledge distillation methods. In particular, samples with high inter-class similarity often yield soft labels with diminished confidence in the target class, potentially leading to misguidance of the student model.
Conversely, samples with low inter-class similarity may lead to overly sharp soft target distributions, which suppress the effectiveness of logit knowledge distillation [31,32] (the reason overly sharp soft target distributions can diminish the potential benefits of knowledge distillation is explained in Appendix A). Similar to one-hot labels, these soft labels exhibit low confidence for non-target classes and lack rich category knowledge. This underscores the importance of non-target logits, which may contain valuable “dark knowledge” essential for effective model training, as illustrated by the blue dashed box in Figure 1.
Additionally, as shown in the green dashed box within Figure 1, large intra-class variability can compel the student model to develop poor decision boundaries to accommodate diverse within-class samples, adversely affecting the model’s performance during testing.
To address these challenges, we propose a straightforward and efficient logits-based distillation technique termed instance-level scaling and dynamic margin-alignment knowledge distillation (ISDM), tailored for RSI scene classification. ISDM scales the target class of the teacher’s soft label at the instance level via an entropy regularization loss. As shown in the blue solid box within Figure 1, for hard samples with high inter-class similarity, instance-level scaling increases the teacher model’s confidence in the target class to prevent inadvertent misguidance of the student. For easy samples with low inter-class similarity, instance-level scaling reduces the excessive confidence in the target class, amplifying the differences among the non-target classes while preserving the distinctiveness of the target class. Moreover, a dynamic margin is adopted for alignment to accommodate the large intra-class variability, as shown in the green solid box within Figure 1; it dynamically yields more rational and stable decision boundaries based on sample differences.
It should be noted that this article is an expanded version of the paper “Instance-level Scaling and Dynamic Margin-alignment Knowledge Distillation” [33], accepted for presentation at the 7th Chinese Conference on Pattern Recognition and Computer Vision (PRCV 2024). This version analyzes the impact of RSI data on existing knowledge distillation methods, introduces a framework for RSI scene classification, and presents additional datasets and experimental validation to demonstrate the effectiveness of the proposed ISDM approach.
In summary, our contributions are as follows:
We scale the target class of the teacher’s soft label at the instance level via an entropy regularization loss to address the high inter-class similarity inherent in RSI.
We introduce dynamic margin alignment for the probabilistic prediction scores, allowing the student model to establish more logical and adaptive decision boundaries, effectively addressing large intra-class variability.
We propose an effective logits-based distillation method named ISDM, which achieves state-of-the-art performance across all datasets with minimal additional computational cost.
2. Related Work
2.1. Remote Sensing Image Scene Classification
The primary goal of scene classification is to automatically determine land cover types within remote sensing image patches. Scene classification using supervised learning algorithms can be divided into three categories: low-level, mid-level, and high-level methods. Low-level methods focus on extracting handcrafted features, such as SIFT (scale-invariant feature transform) [12,13], HOG (histogram of oriented gradients) [14], and LBP (local binary patterns) [15], and use classifiers like support vector machines (SVM) [34] or K-nearest neighbors (KNN) [35] for scene classification. Although these methods work well for specific structures and arrangements, they often fall short when dealing with the varied and intricate spatial distributions present in images. Mid-level methods create scene representations by encoding low-level local features. The Bag-of-Visual-Words (BoVW) [36] model is frequently utilized and is often augmented with different low-level descriptors [37], along with Gaussian mixture models (GMMs) [38] and pyramid-based approaches [39,40,41]. Moreover, topic models are used to integrate higher-order spatial relationships between local visual words [42,43,44]. High-level methods leverage deep learning models, which have set new benchmarks in image recognition, speech recognition, semantic segmentation, and remote sensing scene classification. Prominent deep learning models, such as VGG [45], ResNet [16], and WRN [46], have demonstrated superior performance in remote sensing scene classification, surpassing both shallow models and low-level techniques by extracting deep visual features from large-scale training datasets.
However, deep neural networks require substantial computational resources and storage due to their large number of parameters. This makes them impractical for resource-constrained environments like embedded systems or real-time processing. To address this, knowledge distillation techniques can be used to compress models, balancing efficiency with performance.
2.2. Knowledge Distillation
As an effective model compression technique, knowledge distillation was first proposed in [31] to transfer the knowledge of a teacher model to a student model. Based on the type of transferred knowledge, knowledge distillation methods are predominantly divided into two categories: (1) logits-based distillation methods and (2) feature-based distillation methods.
Logits-based methods employ only the teacher model’s logits to transfer knowledge. The original knowledge distillation [31] transfers knowledge by minimizing the Kullback–Leibler (KL) divergence between the probabilistic prediction scores of the teacher and the student, an approach adopted by many subsequent works due to its simplicity and efficiency. To exploit the teacher’s knowledge more fully, much research attention has been directed to transferring knowledge from deep intermediate layers [47,48,49,50]. Feature-based methods therefore usually perform well; however, they are often infeasible for practical problems. In some applications, the internal architecture and intermediate features of the teacher model may not be available due to safety and privacy concerns. Moreover, feature-based methods are computationally expensive.
Research on applying knowledge distillation to RSI scene classification is still limited. A key challenge is that existing KD methods struggle with the significant intra-class variability and inter-class similarity present in RSI data. Traditional distillation techniques often generate low-quality soft labels because they do not effectively manage these variations and similarities, which adversely affects the performance of the student model. Additionally, the limited capacity of student models can exacerbate this issue, resulting in overly complex decision boundaries and diminished accuracy. To address these challenges, we propose an effective approach that combines improved soft label techniques with dynamic-alignment distillation methods. This approach aims to enhance both the quality of the soft labels and the decision boundaries in RSI scene classification.
2.3. Optimization of Soft Labels
Soft labels are the predicted probability scores obtained by applying a temperature-scaled softmax to the logits, where the target class reflects the confidence in the correct category and the non-target classes reflect the confidence in the other categories. In logit-based methods, student models rely heavily on soft labels for training, making their quality crucial. Enhancing soft-label quality has thus become a key area of research.
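As a small numerical illustration (with made-up logits), the snippet below shows how the temperature controls how much non-target “dark knowledge” a soft label exposes:

```python
import torch
import torch.nn.functional as F

# Hypothetical teacher logits for one sample over five classes (class 0 is the target).
logits = torch.tensor([[8.0, 4.0, 3.5, 1.0, 0.5]])

for T in (1.0, 4.0):
    probs = F.softmax(logits / T, dim=1)
    print(f"T={T}: {[round(p, 3) for p in probs.squeeze().tolist()]}")

# At T=1 the distribution is nearly one-hot and hides the relative ordering of the
# non-target classes; at T=4 the non-target probabilities become clearly visible.
```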
When soft labels are overly confident in the target class, they degenerate into one-hot labels, preventing full utilization of the knowledge in the non-target classes. Many studies have addressed this issue [51,52,53,54,55,56,57]. To obtain smoother soft labels, ATS [51] and NTCE-KD [52] reduce the target-class values, while SFKD [53] uses attention mechanisms for smoothing. Research on label smoothing [54,55,56,57] shows that it provides benefits similar to knowledge distillation (KD) in optimizing soft labels; for instance, [57] applied label smoothing to obtain softer labels, improving student-model performance. On the other hand, insufficient confidence in the target class also hinders student training [58], as it misleads the student model into incorrect classifications. CKD [58] addresses this by replacing incorrect soft labels with hard labels, avoiding the transfer of erroneous knowledge.
However, existing studies have primarily focused on either excessive or insufficient confidence in soft labels, often relying on manual optimizations. To address this, we propose an IS module that comprehensively optimizes soft labels at the instance level, ensuring that they capture both the correct knowledge from the target class and rich information from the non-target classes.
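As a rough illustration only, the sketch below shows one possible way to realize such instance-level scaling with a small perceptron and an entropy term; the module name, the network shape, and the exact form of the regularizer are our assumptions for exposition and do not reproduce the formulation of the IS module in Equation (8).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InstanceScaling(nn.Module):
    """Illustrative sketch (not the exact ISDM formulation): a small perceptron
    predicts a per-instance factor that scales the teacher's target-class logit,
    and an entropy term keeps the resulting soft label informative."""

    def __init__(self, num_classes, temperature=4.0):
        super().__init__()
        self.scaler = nn.Sequential(
            nn.Linear(num_classes, num_classes), nn.ReLU(), nn.Linear(num_classes, 1)
        )
        self.temperature = temperature

    def forward(self, teacher_logits, targets):
        # One positive scaling factor per sample, predicted from the teacher logits.
        factor = F.softplus(self.scaler(teacher_logits)).squeeze(-1)      # (B,)
        scaled = teacher_logits.clone()
        rows = torch.arange(teacher_logits.size(0), device=teacher_logits.device)
        scaled[rows, targets] = teacher_logits[rows, targets] * factor    # scale target logit
        soft_labels = F.softmax(scaled / self.temperature, dim=1)
        # Entropy of the scaled soft labels, usable as a regularization term.
        entropy = -(soft_labels * soft_labels.clamp_min(1e-8).log()).sum(dim=1).mean()
        return soft_labels, entropy
```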
4. Results
4.1. Datasets
We evaluate our method on three popular RSI scene classification benchmark datasets and two widely-used general image classification benchmark datasets.
4.1.1. NWPU-RESISC45 Dataset
The NWPU-RESISC45 dataset [59] is a comprehensive resource for remote sensing image classification, featuring 31,500 images from over 100 countries. It includes 45 scene categories with 700 images per category, each image sized at 256 × 256 pixels in RGB format. The dataset’s challenge lies in its varying spatial resolutions (300 cm to 20 cm per pixel), which can result in significant inter-class similarities, necessitating advanced classification methods.
4.1.2. Aerial Image Dataset (AID)
The AID dataset [60] contains 10,000 high-resolution aerial images across 30 scene types. Each type is represented by 200 to 400 images at a resolution of 600 × 600 pixels in RGB format. The images have varying spatial resolutions (800 cm to 50 cm per pixel), adding complexity and relevance for testing sophisticated aerial image classification algorithms.
4.1.3. UC Merced Land-Use Dataset (UCM)
The UCM dataset [61] comprises 2100 images across 21 land-use categories, with 100 images per category. The images are uniformly sized at 256 × 256 pixels and are in RGB format. With a consistent spatial resolution of 30 cm per pixel, this dataset simplifies analysis while providing detailed land-use patterns for research purposes.
4.2. Settings and Implementation Details
For the three remote sensing (RS) datasets, ResNet34 is used as the teacher model and ResNet18 as the student model. We apply two data splitting ratios: one with 80% of the data for training and 20% for testing, and another with 50% for both training and testing, to evaluate the model performance with less training data. The top-1 and top-5 accuracy on the test set are used as evaluation metrics.
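For reference, top-1 and top-5 accuracy can be computed with a standard helper such as the one below (a conventional definition, not code from our implementation):

```python
import torch

def topk_accuracy(logits, targets, ks=(1, 5)):
    """Fraction of samples whose true label appears among the top-k predictions."""
    max_k = max(ks)
    _, pred = logits.topk(max_k, dim=1)          # (N, max_k) predicted class indices
    hits = pred.eq(targets.view(-1, 1))          # (N, max_k) boolean matches
    return [hits[:, :k].any(dim=1).float().mean().item() for k in ks]
```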
We compare our method with various SOTA methods, including logits-based methods, such as KD [31], DKD [32], MLLD [62], LS [63], and SDD [64], and feature-based methods, such as FitNet [47], AT [50], RKD [65], OFD [66], CRD [67], ReviewKD [49], and CAT [68].
We set the training batch size to 64 and the testing batch size to 128. The temperature parameter is set to 4. The initial learning rate is 0.1, with a total of 200 epochs; the learning rate is reduced by a factor of 10 at epochs 60, 120, and 160. We use the SGD optimizer with a momentum of 0.9 and a weight decay of 5 × 10−4. The base weight for the cross-entropy loss is set to 1, and the base weight for the knowledge distillation loss is set to 4. The base balance weight for the entropy regularization loss, as defined in Equation (8), is set to 0.1. Experiments are conducted using Python 3.7 with PyTorch on an NVIDIA V100 GPU.
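A condensed sketch of this optimization setup is shown below; the model is a placeholder, the data pipeline is omitted, and the weight-decay value is our assumption.

```python
import torch.nn as nn
from torch.optim import SGD
from torch.optim.lr_scheduler import MultiStepLR

# Placeholder model; in the experiments the student is a ResNet18 adapted to the
# number of scene classes.
student = nn.Linear(512, 45)

# SGD with momentum 0.9 and an initial learning rate of 0.1, decayed by 10x at
# epochs 60, 120, and 160; weight decay assumed to be 5e-4.
optimizer = SGD(student.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
scheduler = MultiStepLR(optimizer, milestones=[60, 120, 160], gamma=0.1)

for epoch in range(200):
    # One training epoch would go here, minimizing
    # ce_weight * CE + kd_weight * KD + reg_weight * entropy regularization.
    scheduler.step()
```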
4.3. Main Results
In our experiments across the NWPU-RESISC45, AID, and UCM datasets, the ISDM method consistently outperforms other techniques, using ResNet34 as the teacher network and ResNet18 as the student network. On the NWPU-RESISC45 dataset, presented in Table 1, ISDM achieves the highest top-1 and top-5 accuracy across both split ratios, demonstrating its robustness in knowledge distillation.
Similarly, on the AID dataset in Table 2, ISDM surpasses all competing methods, including both feature-based and logits-based distillation approaches, with significant improvements in top-1 accuracy (95.55%) and top-5 accuracy (99.75%) under the 8:2 split ratio.
The results are even more pronounced on the UCM dataset in Table 3, where ISDM sets new benchmarks with a top-1 accuracy of 92.62% and a top-5 accuracy of 99.76%, surpassing the second-best method by 1.43% and highlighting its efficacy in leveraging teacher–student networks. This superior performance across multiple datasets underscores ISDM’s effectiveness in distilling knowledge.
4.4. Ablation Study
The results of the ablation experiments are shown in Table 4. The first row presents the results of the full ISDM. Removing the IS component from ISDM decreases performance by 0.65% (see ① and ②), removing the DM component results in a drop of 0.73% (see ① and ③), and removing both IS and DM leads to a significant decrease of 2.79% (see ① and ④).
Under equivalent conditions, we also compare DM with FM (fixed margin-alignment). We fix the margin in Equation (9) to 3, 5, and 7 for FM and contrast the best result, obtained with a margin of 5, against DM (see ① and ⑤). The results indicate that DM improves performance over FM, demonstrating its superiority.
4.5. Sensitivity of Hyperparameters
The selection of the loss weights for the cross-entropy loss and the distillation loss is essential for effectively balancing the two objectives. Based on previous studies [31,32,62,64], we keep the cross-entropy weight constant at 1.0. To find the best value for the distillation weight, we conduct a systematic grid search over a range of candidate values and choose the option that yields the highest accuracy, as shown in Table 5. Notably, a distillation weight of 4.0 achieves the highest accuracy, indicating that increasing this weight from 1.0 to 4.0 improves performance, whereas further increases lead to smaller gains. This highlights the need for careful tuning of the distillation weight to optimize knowledge distillation.
In Table 6, we observe that the model’s accuracy peaks at a specific value of the cross-entropy weight for each of the 8:2 and 5:5 splits. Keeping this weight at or below 1.0 generally maintains good performance, but accuracy drops sharply once it exceeds 1.0. This suggests that larger values cause the student to rely too heavily on the cross-entropy loss, especially with the limited training data of the 5:5 split, raising the risk of overfitting. Hence, setting the cross-entropy weight to 1.0 is a sensible choice.
The results in Table 7 indicate that the balance weight of the entropy regularization loss achieves optimal performance at a value of 0.1, with the model attaining accuracies of 92.62% and 88.29% for the 8:2 and 5:5 split ratios, respectively. Although slight variations in accuracy are observed across different values, the overall performance remains relatively stable, suggesting a degree of robustness to this setting.
4.6. Motivation Validation
To better understand the characteristics of the RS dataset, we performed experiments to measure feature similarity between categories in the UCM dataset. We trained ResNet18 models on the UCM dataset using different methods. We then calculated the average logits for each category to use as category centers and computed the cosine similarity between each sample’s logits and all category centers.
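This analysis can be sketched as follows (tensor names are ours; the sketch assumes every class appears in the evaluated set):

```python
import torch
import torch.nn.functional as F

def interclass_similarity(logits, labels, num_classes):
    """Average per-class logits form category centers; each sample's logits are then
    compared to all centers by cosine similarity and averaged per true class."""
    centers = torch.stack([logits[labels == c].mean(dim=0) for c in range(num_classes)])
    sims = F.cosine_similarity(logits.unsqueeze(1), centers.unsqueeze(0), dim=2)  # (N, C)
    return torch.stack([sims[labels == c].mean(dim=0) for c in range(num_classes)])  # (C, C)
```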
Figure 4b–d show the inter-class similarity results for the UCM dataset using models trained with Vanilla (“Vanilla” denotes a standard ResNet18 trained with only the cross-entropy loss), KD, and ISDM, respectively. Figure 4a shows the results of Vanilla on the CIFAR-100 dataset; for comparison, we only use data from 21 of its 100 classes.
Comparing Figure 4a,b, the RS dataset exhibits more complex category-similarity patterns than CIFAR-100. Some categories, such as 12, 19, and 20, are quite similar to other categories, while others, such as 5 and 7, differ more from the rest. In contrast, CIFAR-100 has a more balanced similarity across categories, which helps the model learn features better. As shown in Figure 4b–d, KD can somewhat reduce the negative effects of the RS dataset, but ISDM largely removes these effects, coming close to the results observed on CIFAR-100.
4.7. Effect of Instance-Level Scaling
To further evaluate the effects of our proposed IS module, Figure 5 visualizes the soft labels processed by the IS module for easy and hard samples, respectively.
Figure 5a illustrates the scaling process of soft labels for easy images from the beach and tennis-court classes. The second column shows the values of the target class before and after scaling for an easy sample. The third column shows the original soft labels, which assign a very high value to the target class due to the sample’s low inter-class similarity, thereby depreciating the information about the non-target classes. The last column shows the soft labels processed by our IS module: the overconfidence in the target class is relieved and the information about the non-target classes is enhanced.
Figure 5b illustrates the scaling process of soft labels for hard images from the sparse-residential and baseball-diamond classes. The third column shows the raw soft labels, which have a small target-class value due to their high inter-class similarity. The last column shows the soft labels processed by our IS module, demonstrating how the insufficient confidence in the target class is alleviated.
4.8. Effect of Dynamic Margin-Alignment
To assess the effectiveness of DM, we visualize the learned features with t-SNE. Figure 6a,b showcase the features learned by KD and ISDM, respectively. Figure 6a shows that KD leaves the features of most samples mixed together and hard to distinguish, with more complex decision boundaries and smaller inter-class distances. As demonstrated in Figure 6b, the dynamic margin assigns greater margins to simple samples and negative margins to difficult ones; it achieves more dispersed inter-class distances for the vast majority of samples, resulting in decision boundaries with clearer margins.
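The visualization can be reproduced along the following lines with scikit-learn; the feature-extraction step is replaced here by random placeholders:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# `features` (N, D) stand in for penultimate-layer embeddings of test samples from
# the distilled student; `labels` (N,) are the corresponding class indices.
features = np.random.randn(500, 512)
labels = np.random.randint(0, 21, size=500)   # 21 UCM classes

embedded = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, cmap="tab20", s=5)
plt.title("t-SNE of student features")
plt.show()
```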
4.9. Distillation Fidelity
To provide a comprehensive understanding of distillation fidelity, we follow [32,49] and present our visualizations in Figure 7. Specifically, for the ResNet34–ResNet18 model pair trained on the UCM dataset, we calculate the absolute distance between the correlation matrices of the teacher and student models. Our findings show that ISDM enhances the alignment of the student model’s predictions with those of the teacher model: ISDM yields a maximum difference of 1.71 and a mean difference of 0.35, whereas KD yields a maximum difference of 1.95 and a mean difference of 0.42. The lower difference metrics indicate that ISDM achieves better alignment between the student and teacher models than KD.
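One plausible reading of this fidelity metric is sketched below; the exact definition of the correlation matrices follows [32,49], so the choice of softmax-prediction correlations here is an assumption.

```python
import torch
import torch.nn.functional as F

def class_correlation(logits):
    """Pearson correlation between classes, computed over softmax predictions."""
    probs = F.softmax(logits, dim=1)        # (N, C)
    return torch.corrcoef(probs.t())        # (C, C)

def fidelity_gap(student_logits, teacher_logits):
    """Element-wise absolute gap between student and teacher correlation matrices."""
    gap = (class_correlation(student_logits) - class_correlation(teacher_logits)).abs()
    return gap.max().item(), gap.mean().item()
```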
4.10. Generalization Exploration
In evaluating the generalization of our ISDM across different datasets, we observed its robust performance on two general image classification datasets, beyond its initial remote sensing application.
Our method consistently outperforms or matches state-of-the-art methods across the various teacher–student pairs shown in Table 8, demonstrating competitive results against both feature-based and logits-based methods. ISDM shows notable improvements over other approaches in most configurations, especially for ResNet-56 to ResNet-20 and ResNet-32×4 to ResNet-8×4. This suggests that ISDM not only generalizes well on the CIFAR-100 dataset but also performs comparably to, or better than, existing methods that leverage intermediate feature representations or logits.
On the large-scale ImageNet-1k dataset shown in Table 9, ISDM continues to exhibit superior performance. It outperforms both feature-based and logits-based methods across different teacher–student configurations, including ResNet34 to ResNet18 and ResNet50 to MobileNetV2. This consistent performance across datasets of varying sizes and complexities indicates that ISDM’s effectiveness extends well beyond the original remote sensing tasks, highlighting its strong generalization capability.
In summary, these results confirm that ISDM is not only effective in its primary domain but also shows impressive versatility and robustness in other diverse and challenging scenarios, reinforcing its broad applicability.
4.11. Training Efficiency
We evaluate the training overhead and accuracy of SOTA methods, as shown in Figure 8. Our approach improves KD by enhancing the soft labels and the alignment strategy; consequently, it exhibits a time overhead similar to that of KD, providing a substantial advantage over other methods while achieving the highest model performance.
Furthermore, the ISDM method introduces only a perceptron with fewer than 0.01 M additional parameters, which is negligible compared to the trainable parameters involved in distillation. Notably, this perceptron is used exclusively to optimize soft labels during training; it does not participate in the student’s inference and therefore incurs no deployment overhead.
5. Conclusions
In this paper, we addressed the challenge of high inter-class similarity and large intra-class variability in remote sensing datasets by proposing a distillation method named ISDM. This method optimizes the teacher’s soft labels through instance-level scaling and employs a dynamic margin-alignment strategy during distillation to enhance model generalization. The ISDM method showed significant improvements on the NWPU-RESISC45, AID, and UCM datasets while maintaining lower training costs than competing methods.
Additionally, we validated the effectiveness of our approach through extensive experiments and demonstrated its generalizability on standard datasets such as CIFAR-100 and ImageNet-1k. We hope that this paper will contribute to advancements in scene classification for remote sensing images and improvements in logits-based distillation methods.