1. Introduction
Rapid advancements in remote sensing technology have spurred the development of diverse algorithms designed to efficiently manage extensive Earth observation data. Scene classification, a core component of remote sensing image analysis, is crucial for accurately interpreting land use changes [1,2,3,4], optimizing agricultural practices [5,6,7,8], managing forest resources [9,10], and monitoring hydrological dynamics [11].
Traditional machine learning approaches based on handcrafted features such as SIFT [12,13], HOG [14], and LBP [15] have been employed to determine land cover types. While effective, these methods often require expert knowledge and manual feature extraction, making them costly and inefficient. In contrast, recent advances in deep neural networks have significantly enhanced the performance of tasks such as image classification [16,17,18], object recognition and tracking [19,20,21], and person re-identification [22,23,24]. In particular, deep learning has revolutionized remote sensing image (RSI) scene classification, yielding substantial performance improvements [25,26,27,28,29,30].
Despite their effectiveness, high-performance models often entail significant computational and storage demands, posing substantial challenges for deployment in real-world applications. To mitigate these issues, knowledge distillation has been widely adopted. This method transfers knowledge from a large, complex model (the teacher) to a smaller, more efficient model (the student). Through this technique, the student model can reach performance levels comparable to those of the teacher by minimizing the Kullback–Leibler divergence between their softened output distributions.
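For concreteness, a minimal PyTorch sketch of this standard logit-distillation objective is given below; the function name and the default temperature are illustrative and not taken from our implementation.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """Hinton-style KD: KL divergence between temperature-softened teacher and
    student distributions, scaled by T^2 to keep gradient magnitudes comparable."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

# Typical usage: total_loss = ce_weight * F.cross_entropy(student_logits, labels)
#                           + kd_weight * kd_loss(student_logits, teacher_logits)
```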
However, RSI scene classification is marked by high inter-class similarity and large intra-class variability, presenting significant challenges to existing knowledge distillation methods. In particular, samples with high inter-class similarity often yield soft labels with diminished confidence in the target class, potentially leading to misguidance of the student model.
Conversely, samples with low inter-class similarity may lead to overly sharp soft target distributions, which suppress the effectiveness of logit knowledge distillation [31,32] (the reason overly sharp soft target distributions can diminish the potential benefits of knowledge distillation is explained in Appendix A). Similar to one-hot labels, these soft labels exhibit low confidence for non-target classes and lack rich category knowledge. This underscores the importance of non-target logits, which may contain valuable “dark knowledge” essential for effective model training, as illustrated by the blue dashed box in Figure 1.
Additionally, as shown in the green dashed box within Figure 1, large intra-class variability can compel the student model to develop poor decision boundaries to accommodate diverse within-class samples, adversely affecting the model’s performance during testing.
To address these challenges, we propose a straightforward and efficient logits-based distillation technique termed instance-level scaling and dynamic margin-alignment knowledge distillation (ISDM), tailored for RSI scene classification. ISDM scales the target class of the teacher’s soft label at the instance level via an entropy regularization loss. As shown in the blue solid box within Figure 1, for hard samples with high inter-class similarity, instance-level scaling increases the teacher model’s confidence in the target class to prevent inadvertent misguidance of the student. For easy samples with low inter-class similarity, instance-level scaling reduces the excessive confidence in the target class, amplifying the differences among the non-target classes while preserving the distinctiveness of the target class. Moreover, a dynamic margin is adopted for alignment to accommodate the large intra-class variability, as shown in the green solid box within Figure 1; it dynamically yields more rational and stable decision boundaries based on sample differences.
It should be noted that this article is an expanded version of the paper “Instance-level Scaling and Dynamic Margin-alignment Knowledge Distillation” [33], accepted for presentation at the 7th Chinese Conference on Pattern Recognition and Computer Vision (PRCV 2024). This version analyzes the impact of RSI data on existing knowledge distillation methods, introduces a framework for RSI scene classification, and presents additional datasets and experimental validation to demonstrate the effectiveness of the proposed ISDM approach.
In summary, our contributions are as follows:
We scale the target class of the teacher’s soft label at the instance level via an entropy regularization loss to address the high inter-class similarity inherent in RSI.
We introduce dynamic margin alignment for the probabilistic prediction scores, allowing the student model to establish more logical and adaptive decision boundaries, effectively addressing large intra-class variability.
We propose an effective logits-based distillation method named ISDM, which achieves state-of-the-art performance across all datasets with minimal additional computational cost.
2. Related Work
2.1. Remote Sensing Image Scene Classification
The primary goal of scene classification is to automatically determine land cover types within remote sensing image patches. Scene classification using supervised learning algorithms can be divided into three categories: low-level, mid-level, and high-level methods. Low-level methods focus on extracting handcrafted features, such as SIFT (scale-invariant feature transform) [12,13], HOG (histogram of oriented gradients) [14], and LBP (local binary patterns) [15], and use classifiers like support vector machines (SVM) [34] or K-nearest neighbors (KNN) [35] for scene classification. Although these methods work well for specific structures and arrangements, they often fall short when dealing with the varied and intricate spatial distributions present in images. Mid-level methods create scene representations by encoding low-level local features. The Bag-of-Visual-Words (BoVW) [36] model is frequently utilized and is often augmented with different low-level descriptors [37], along with Gaussian mixture models (GMMs) [38] and pyramid-based approaches [39,40,41]. Moreover, topic models are used to integrate higher-order spatial relationships between local visual words [42,43,44]. High-level methods leverage deep learning models, which have set new benchmarks in image recognition, speech recognition, semantic segmentation, and remote sensing scene classification. Prominent deep learning models, such as VGG [45], ResNet [16], and WRN [46], have demonstrated superior performance in remote sensing scene classification, surpassing both shallow models and low-level techniques by extracting deep visual features from large-scale training datasets.
However, deep neural networks require substantial computational resources and storage due to their large number of parameters. This makes them impractical for resource-constrained environments like embedded systems or real-time processing. To address this, knowledge distillation techniques can be used to compress models, balancing efficiency with performance.
2.2. Knowledge Distillation
As an effective model compression technique, knowledge distillation was first proposed in [31] to transfer the knowledge of a teacher model to a student model. Based on the type of transferred knowledge, knowledge distillation methods are predominantly divided into two categories: (1) logits-based distillation methods and (2) feature-based distillation methods.
Logits-based methods employ only the teacher model’s logits to transfer knowledge. The original knowledge distillation [31] transfers knowledge by minimizing the Kullback–Leibler (KL) divergence between the probabilistic prediction scores of the teacher and the student, an approach adopted by many subsequent works due to its simplicity and efficiency. To exploit the teacher’s knowledge more fully, much research attention has been directed to transferring knowledge from deep intermediate layers [47,48,49,50]. Feature-based methods therefore usually perform well; however, they are often infeasible for practical problems. In some applications, the internal architecture and intermediate features of the teacher model may not be available due to safety and privacy concerns. Moreover, feature-based methods are computationally expensive.
Research on applying knowledge distillation to RSI scene classification is still limited. A key challenge is that existing KD methods struggle with the significant intra-class variability and inter-class similarity present in RSI data. Traditional distillation techniques often generate low-quality soft labels because they do not effectively manage these variations and similarities, which adversely affects the performance of the student model. Additionally, the limited capacity of student models can exacerbate this issue, resulting in overly complex decision boundaries and diminished accuracy. To address these challenges, we propose an effective approach that combines improved soft label techniques with dynamic-alignment distillation methods. This approach aims to enhance both the quality of the soft labels and the decision boundaries in RSI scene classification.
2.3. Optimization of Soft Labels
Soft labels are the predicted probability scores obtained by applying a temperature-scaled softmax to the logits, where the target class reflects the confidence in the correct category and the non-target classes reflect the confidence in the other categories. In logit-based methods, student models rely heavily on soft labels for training, making their quality crucial. Enhancing soft-label quality has thus become a key area of research.
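As a small numerical illustration (with made-up logits), the snippet below shows how the temperature controls how much non-target “dark knowledge” a soft label exposes:

```python
import torch
import torch.nn.functional as F

# Hypothetical teacher logits for one sample over five classes (class 0 is the target).
logits = torch.tensor([[8.0, 4.0, 3.5, 1.0, 0.5]])

for T in (1.0, 4.0):
    probs = F.softmax(logits / T, dim=1)
    print(f"T={T}: {[round(p, 3) for p in probs.squeeze().tolist()]}")

# At T=1 the distribution is nearly one-hot and hides the relative ordering of the
# non-target classes; at T=4 the non-target probabilities become clearly visible.
```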
When soft labels are overly confident in the target class, they degenerate into one-hot labels, preventing full utilization of the knowledge in the non-target classes. Many studies have addressed this issue [51,52,53,54,55,56,57]. To obtain smoother soft labels, ATS [51] and NTCE-KD [52] reduce the target-class values, while SFKD [53] uses attention mechanisms for smoothing. Research on label smoothing [54,55,56,57] shows that it provides benefits similar to knowledge distillation (KD) in optimizing soft labels; for instance, [57] applied label smoothing to obtain softer labels, improving student-model performance. On the other hand, insufficient confidence in the target class also hinders student training [58], as it misleads the student model into incorrect classifications. CKD [58] addresses this by replacing incorrect soft labels with hard labels, avoiding the transfer of erroneous knowledge.
However, existing studies have primarily focused on either excessive or insufficient confidence in soft labels, often relying on manual optimizations. To address this, we propose an IS module that comprehensively optimizes soft labels at the instance level, ensuring that they capture both the correct knowledge from the target class and rich information from the non-target classes.
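As a rough illustration only, the sketch below shows one possible way to realize such instance-level scaling with a small perceptron and an entropy term; the module name, the network shape, and the exact form of the regularizer are our assumptions for exposition and do not reproduce the formulation of the IS module in Equation (8).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InstanceScaling(nn.Module):
    """Illustrative sketch (not the exact ISDM formulation): a small perceptron
    predicts a per-instance factor that scales the teacher's target-class logit,
    and an entropy term keeps the resulting soft label informative."""

    def __init__(self, num_classes, temperature=4.0):
        super().__init__()
        self.scaler = nn.Sequential(
            nn.Linear(num_classes, num_classes), nn.ReLU(), nn.Linear(num_classes, 1)
        )
        self.temperature = temperature

    def forward(self, teacher_logits, targets):
        # One positive scaling factor per sample, predicted from the teacher logits.
        factor = F.softplus(self.scaler(teacher_logits)).squeeze(-1)      # (B,)
        scaled = teacher_logits.clone()
        rows = torch.arange(teacher_logits.size(0), device=teacher_logits.device)
        scaled[rows, targets] = teacher_logits[rows, targets] * factor    # scale target logit
        soft_labels = F.softmax(scaled / self.temperature, dim=1)
        # Entropy of the scaled soft labels, usable as a regularization term.
        entropy = -(soft_labels * soft_labels.clamp_min(1e-8).log()).sum(dim=1).mean()
        return soft_labels, entropy
```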
4. Results
4.1. Datasets
We evaluate our method on three popular RSI scene classification benchmark datasets and two widely-used general image classification benchmark datasets.
4.1.1. NWPU-RESISC45 Dataset
The NWPU-RESISC45 dataset [59] is a comprehensive resource for remote sensing image classification, featuring 31,500 images from over 100 countries. It includes 45 scene categories with 700 images per category, each image sized at 256 × 256 pixels in RGB format. The dataset’s challenge lies in its varying spatial resolutions (300 cm to 20 cm per pixel), which can result in significant inter-class similarities, necessitating advanced classification methods.
4.1.2. Aerial Image Dataset (AID)
The AID dataset [60] contains 10,000 high-resolution aerial images across 30 scene types. Each type is represented by 200 to 400 images at a resolution of 600 × 600 pixels in RGB format. The images have varying spatial resolutions (800 cm to 50 cm per pixel), adding complexity and relevance for testing sophisticated aerial image classification algorithms.
4.1.3. UC Merced Land-Use Dataset (UCM)
The UCM dataset [61] comprises 2100 images across 21 land-use categories, with 100 images per category. The images are uniformly sized at 256 × 256 pixels and are in RGB format. With a consistent spatial resolution of 30 cm per pixel, this dataset simplifies analysis while providing detailed land-use patterns for research purposes.
4.2. Settings and Implementation Details
For the three remote sensing (RS) datasets, ResNet34 is used as the teacher model and ResNet18 as the student model. We apply two data splitting ratios: one with 80% of the data for training and 20% for testing, and another with 50% for both training and testing, to evaluate the model performance with less training data. The top-1 and top-5 accuracy on the test set are used as evaluation metrics.
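For reference, top-1 and top-5 accuracy can be computed with a standard helper such as the one below (a conventional definition, not code from our implementation):

```python
import torch

def topk_accuracy(logits, targets, ks=(1, 5)):
    """Fraction of samples whose true label appears among the top-k predictions."""
    max_k = max(ks)
    _, pred = logits.topk(max_k, dim=1)          # (N, max_k) predicted class indices
    hits = pred.eq(targets.view(-1, 1))          # (N, max_k) boolean matches
    return [hits[:, :k].any(dim=1).float().mean().item() for k in ks]
```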
We compare our method with various SOTA methods, including logits-based methods, such as KD [31], DKD [32], MLLD [62], LS [63], and SDD [64], and feature-based methods, such as FitNet [47], AT [50], RKD [65], OFD [66], CRD [67], ReviewKD [49], and CAT [68].
We set the training batch size to 64 and the testing batch size to 128. The temperature parameter is set to 4. The initial learning rate is 0.1, with a total of 200 epochs; the learning rate is reduced by a factor of 10 at epochs 60, 120, and 160. We use the SGD optimizer with a momentum of 0.9 and a weight decay of 5 × 10−4. The base weight for the cross-entropy loss is set to 1, and the base weight for the knowledge distillation loss is set to 4. The base balance weight for the entropy regularization loss, as defined in Equation (8), is set to 0.1. Experiments are conducted using Python 3.7 with PyTorch on an NVIDIA V100 GPU.
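A condensed sketch of this optimization setup is shown below; the model is a placeholder, the data pipeline is omitted, and the weight-decay value is our assumption.

```python
import torch.nn as nn
from torch.optim import SGD
from torch.optim.lr_scheduler import MultiStepLR

# Placeholder model; in the experiments the student is a ResNet18 adapted to the
# number of scene classes.
student = nn.Linear(512, 45)

# SGD with momentum 0.9 and an initial learning rate of 0.1, decayed by 10x at
# epochs 60, 120, and 160; weight decay assumed to be 5e-4.
optimizer = SGD(student.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
scheduler = MultiStepLR(optimizer, milestones=[60, 120, 160], gamma=0.1)

for epoch in range(200):
    # One training epoch would go here, minimizing
    # ce_weight * CE + kd_weight * KD + reg_weight * entropy regularization.
    scheduler.step()
```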
4.3. Main Results
In our experiments across the NWPU-RESISC45, AID, and UCM datasets, the ISDM method consistently outperforms other techniques, using ResNet34 as the teacher network and ResNet18 as the student network. On the NWPU-RESISC45 dataset, presented in Table 1, ISDM achieves the highest top-1 and top-5 accuracy across both split ratios, demonstrating its robustness in knowledge distillation.
Similarly, on the AID dataset in Table 2, ISDM surpasses all competing methods, including both feature-based and logits-based distillation approaches, with significant improvements in top-1 accuracy (95.55%) and top-5 accuracy (99.75%) under the 8:2 split ratio.
The results are even more pronounced on the UCM dataset in Table 3, where ISDM sets new benchmarks with a top-1 accuracy of 92.62% and a top-5 accuracy of 99.76%, surpassing the second-best method by 1.43% and highlighting its efficacy in leveraging teacher–student networks. This superior performance across multiple datasets underscores ISDM’s effectiveness in distilling knowledge.
4.4. Ablation Study
The results of the ablation experiments are shown in Table 4. The first row presents the results of the full ISDM. Removing the IS component from ISDM decreases performance by 0.65% (see ① and ②), removing the DM component results in a drop of 0.73% (see ① and ③), and removing both IS and DM leads to a significant decrease of 2.79% (see ① and ④).
Under equivalent conditions, we also compare DM with FM (fixed margin-alignment). We fix the margin in Equation (9) to 3, 5, and 7 for FM and contrast the best result, obtained with a margin of 5, against DM (see ① and ⑤). The results indicate that DM improves performance over FM, demonstrating its superiority.
4.5. Sensitivity of Hyperparameters
The selection of the loss weights for the cross-entropy loss and the distillation loss is essential for effectively balancing the two objectives. Based on previous studies [31,32,62,64], we keep the cross-entropy weight constant at 1.0. To find the best value for the distillation weight, we conduct a systematic grid search over a range of candidate values and choose the option that yields the highest accuracy, as shown in Table 5. Notably, a distillation weight of 4.0 achieves the highest accuracy, indicating that increasing this weight from 1.0 to 4.0 improves performance, whereas further increases lead to smaller gains. This highlights the need for careful tuning of the distillation weight to optimize knowledge distillation.
In Table 6, we observe that the model’s accuracy peaks at a specific value of the cross-entropy weight for each of the 8:2 and 5:5 splits. Keeping this weight at or below 1.0 generally maintains good performance, but accuracy drops sharply once it exceeds 1.0. This suggests that larger values cause the student to rely too heavily on the cross-entropy loss, especially with the limited training data of the 5:5 split, raising the risk of overfitting. Hence, setting the cross-entropy weight to 1.0 is a sensible choice.
The results in Table 7 indicate that the balance weight of the entropy regularization loss achieves optimal performance at a value of 0.1, with the model attaining accuracies of 92.62% and 88.29% for the 8:2 and 5:5 split ratios, respectively. Although slight variations in accuracy are observed across different values, the overall performance remains relatively stable, suggesting a degree of robustness to this setting.
4.6. Motivation Validation
To better understand the characteristics of the RS dataset, we performed experiments to measure feature similarity between categories in the UCM dataset. We trained ResNet18 models on the UCM dataset using different methods. We then calculated the average logits for each category to use as category centers and computed the cosine similarity between each sample’s logits and all category centers.
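This analysis can be sketched as follows (tensor names are ours; the sketch assumes every class appears in the evaluated set):

```python
import torch
import torch.nn.functional as F

def interclass_similarity(logits, labels, num_classes):
    """Average per-class logits form category centers; each sample's logits are then
    compared to all centers by cosine similarity and averaged per true class."""
    centers = torch.stack([logits[labels == c].mean(dim=0) for c in range(num_classes)])
    sims = F.cosine_similarity(logits.unsqueeze(1), centers.unsqueeze(0), dim=2)  # (N, C)
    return torch.stack([sims[labels == c].mean(dim=0) for c in range(num_classes)])  # (C, C)
```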
Figure 4b–d show the inter-class similarity results for the UCM dataset using models trained with Vanilla (“Vanilla” denotes a standard ResNet18 trained with only the cross-entropy loss), KD, and ISDM, respectively. Figure 4a shows the results of Vanilla on the CIFAR-100 dataset; for comparison, we only use data from 21 of its 100 classes.
Comparing Figure 4a,b, the RS dataset exhibits more complex category-similarity patterns than CIFAR-100. Some categories, such as 12, 19, and 20, are quite similar to other categories, while others, such as 5 and 7, differ more from the rest. In contrast, CIFAR-100 has a more balanced similarity across categories, which helps the model learn features better. As shown in Figure 4b–d, KD can somewhat reduce the negative effects of the RS dataset, but ISDM largely removes these effects, coming close to the results observed on CIFAR-100.
4.7. Effect of Instance-Level Scaling
To further evaluate the effects of our proposed IS module, Figure 5 visualizes the soft labels processed by the IS module for easy and hard samples, respectively.
Figure 5a illustrates the scaling process of soft labels for easy images from the beach and tennis-court classes. The second column shows the values of the target class before and after scaling for an easy sample. The third column shows the original soft labels, which assign a very high value to the target class due to the sample’s low inter-class similarity, thereby depreciating the information about the non-target classes. The last column shows the soft labels processed by our IS module: the overconfidence in the target class is relieved and the information about the non-target classes is enhanced.
Figure 5b illustrates the scaling process of soft labels for hard images from the sparse-residential and baseball-diamond classes. The third column shows the raw soft labels, which have a small target-class value due to their high inter-class similarity. The last column shows the soft labels processed by our IS module, demonstrating how the insufficient confidence in the target class is alleviated.
4.8. Effect of Dynamic Margin-Alignment
To assess the effectiveness of DM, we visualize the learned features with t-SNE. Figure 6a,b showcase the features learned by KD and ISDM, respectively. Figure 6a shows that KD leaves the features of most samples mixed together and hard to distinguish, with more complex decision boundaries and smaller inter-class distances. As demonstrated in Figure 6b, the dynamic margin assigns greater margins to simple samples and negative margins to difficult ones; it achieves more dispersed inter-class distances for the vast majority of samples, resulting in decision boundaries with clearer margins.
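The visualization can be reproduced along the following lines with scikit-learn; the feature-extraction step is replaced here by random placeholders:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# `features` (N, D) stand in for penultimate-layer embeddings of test samples from
# the distilled student; `labels` (N,) are the corresponding class indices.
features = np.random.randn(500, 512)
labels = np.random.randint(0, 21, size=500)   # 21 UCM classes

embedded = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, cmap="tab20", s=5)
plt.title("t-SNE of student features")
plt.show()
```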
4.9. Distillation Fidelity
To provide a comprehensive understanding of distillation fidelity, we follow [32,49] and present our visualizations in Figure 7. Specifically, for the ResNet34–ResNet18 model pair trained on the UCM dataset, we calculate the absolute distance between the correlation matrices of the teacher and student models. Our findings show that ISDM enhances the alignment of the student model’s predictions with those of the teacher model: ISDM yields a maximum difference of 1.71 and a mean difference of 0.35, whereas KD yields a maximum difference of 1.95 and a mean difference of 0.42. The lower difference metrics indicate that ISDM achieves better alignment between the student and teacher models than KD.
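One plausible reading of this fidelity metric is sketched below; the exact definition of the correlation matrices follows [32,49], so the choice of softmax-prediction correlations here is an assumption.

```python
import torch
import torch.nn.functional as F

def class_correlation(logits):
    """Pearson correlation between classes, computed over softmax predictions."""
    probs = F.softmax(logits, dim=1)        # (N, C)
    return torch.corrcoef(probs.t())        # (C, C)

def fidelity_gap(student_logits, teacher_logits):
    """Element-wise absolute gap between student and teacher correlation matrices."""
    gap = (class_correlation(student_logits) - class_correlation(teacher_logits)).abs()
    return gap.max().item(), gap.mean().item()
```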
4.10. Generalization Exploration
In evaluating the generalization of our ISDM across different datasets, we observed its robust performance on two general image classification datasets, beyond its initial remote sensing application.
Our method consistently outperforms or matches state-of-the-art methods across the various teacher–student pairs shown in Table 8, demonstrating competitive results against both feature-based and logits-based methods. ISDM shows notable improvements over other approaches in most configurations, especially for ResNet-56 to ResNet-20 and ResNet-32×4 to ResNet-8×4. This suggests that ISDM not only generalizes well on the CIFAR-100 dataset but also performs comparably to, or better than, existing methods that leverage intermediate feature representations or logits.
On the large-scale ImageNet-1k dataset shown in Table 9, ISDM continues to exhibit superior performance. It outperforms both feature-based and logits-based methods across different teacher–student configurations, including ResNet34 to ResNet18 and ResNet50 to MobileNetV2. This consistent performance across datasets of varying sizes and complexities indicates that ISDM’s effectiveness extends well beyond the original remote sensing tasks, highlighting its strong generalization capability.
In summary, these results confirm that ISDM is not only effective in its primary domain but also shows impressive versatility and robustness in other diverse and challenging scenarios, reinforcing its broad applicability.
4.11. Training Efficiency
We evaluate the training overhead and accuracy of SOTA methods, as shown in Figure 8. Our approach improves KD by enhancing the soft labels and the alignment strategy; consequently, it exhibits a time overhead similar to that of KD, providing a substantial advantage over other methods while achieving the highest model performance.
Furthermore, the ISDM method introduces only a perceptron with fewer than 0.01 M additional parameters, which is negligible compared to the trainable parameters involved in distillation. Notably, this perceptron is used exclusively to optimize soft labels during training; it does not participate in the student’s inference and therefore incurs no deployment overhead.
5. Conclusions
In this paper, we addressed the challenge of high inter-class similarity and large intra-class variability in remote sensing datasets by proposing a distillation method named ISDM. This method optimizes the teacher’s soft labels through instance-level scaling and employs a dynamic margin-alignment strategy during distillation to enhance model generalization. The ISDM method showed significant improvements on the NWPU-RESISC45, AID, and UCM datasets while maintaining lower training costs than competing methods.
Additionally, we validated the effectiveness of our approach through extensive experiments and demonstrated its generalizability on standard datasets such as CIFAR-100 and ImageNet-1k. We hope that this paper will contribute to advancements in scene classification for remote sensing images and improvements in logits-based distillation methods.