Article

Simplified Knowledge Distillation for Deep Neural Networks Bridging the Performance Gap with a Novel Teacher–Student Architecture

by Sabina Umirzakova 1, Mirjamol Abdullaev 2, Sevara Mardieva 1, Nodira Latipova 3 and Shakhnoza Muksimova 1,*

1 Department of Computer Engineering, Gachon University, Sujeong-gu, Seongnam-si 461-701, Gyeonggi-do, Republic of Korea
2 Department of Information Systems and Technologies, Tashkent State University of Economics, Tashkent 100066, Uzbekistan
3 Department of Systematic and Practical Programming, Tashkent University of Information Technologies Named After Muhammad Al-Khwarizmi, Tashkent 100200, Uzbekistan
* Author to whom correspondence should be addressed.
Electronics 2024, 13(22), 4530; https://doi.org/10.3390/electronics13224530
Submission received: 14 October 2024 / Revised: 14 November 2024 / Accepted: 15 November 2024 / Published: 18 November 2024

Abstract: The rapid evolution of deep learning has led to significant achievements in computer vision, primarily driven by complex convolutional neural networks (CNNs). However, the increasing depth and parameter count of these networks often result in overfitting and elevated computational demands. Knowledge distillation (KD) has emerged as a promising technique to address these issues by transferring knowledge from a large, well-trained teacher model to a more compact student model. This paper introduces a novel knowledge distillation method that simplifies the distillation process and narrows the performance gap between teacher and student models without relying on intricate knowledge representations. Our approach leverages a unique, streamlined teacher network architecture designed to enhance the efficiency and effectiveness of knowledge transfer, enabling the student model to achieve high accuracy with reduced computational demands. Comprehensive experiments conducted on the CIFAR-10 dataset demonstrate that our proposed model achieves superior performance compared to traditional KD methods and established architectures such as ResNet and VGG networks. The proposed method not only maintains high accuracy but also significantly reduces training and validation losses. Key findings highlight the optimal hyperparameter settings (temperature T = 15.0 and smoothing factor α = 0.7), which yield the highest validation accuracy and lowest loss values. This research contributes to the theoretical and practical advancement of knowledge distillation, providing a robust framework for future applications and research in neural network compression and optimization. The simplicity and efficiency of our approach pave the way for more accessible and scalable solutions in deep learning model deployment.

1. Introduction

In recent years, knowledge distillation (KD) has emerged as a transformative technique in deep learning, particularly for neural network compression and optimization. However, while many KD methods achieve impressive theoretical results, their real-world applications remain constrained by practical limitations, including computational demands and scalability challenges. Addressing these gaps, our study introduces a simplified KD approach that effectively reduces the computational burden while achieving performance comparable to more complex methods. This method is especially suitable for scenarios where high accuracy is needed on resource-constrained devices, such as mobile applications and embedded systems in IoT devices, paving the way for broad adoption in academia and industry. The rapid advancements in deep learning have led to remarkable success for convolutional neural networks in a range of computer vision tasks [1]. In vision datasets, each category often encompasses features from multiple perspectives that are straightforward to classify. Basic models may capture only a subset of these features, but deep neural networks are adept at addressing this challenge. Nevertheless, as the depth and parameter count of these networks grow, there is an increased risk of overfitting, which can detrimentally impact their effectiveness. To minimize resource usage while preserving optimal performance, researchers have turned to knowledge distillation methods [2]. This approach involves transferring valuable insights from a highly trained, complex teacher network to a more compact and initially less proficient student network. The objective of KD is to enable a smaller, less-parameterized student model to acquire a level of generalization capability comparable to that of a larger, more powerful teacher model, which possesses a substantial number of parameters. KD represents a crucial technique for knowledge transfer, wherein a smaller, more efficient model assimilates valuable information from a larger, more complex model, thereby improving its performance. This architectural framework is commonly referred to as a teacher–student structure. Within this setup, the less advanced student network learns from the more experienced teacher network through the process of knowledge distillation. This transfer of valuable information enables the student network to enhance its performance. The KD technique has seen considerable success across diverse applications, including object detection [3], semantic segmentation [4], and the training of transformer models [5]. A notable limitation of the standard KD approach is the substantial performance disparity that often remains between the original, more complex teacher model and the distilled, simpler student model. To address this shortcoming, several strategies have been developed in recent years, as evidenced by various studies [6].
The majority of current knowledge distillation methods focus on guiding the student network to emulate the representational space of the teacher network, aiming to achieve a level of performance comparable to that of the more proficient teacher. These methods typically involve manually defining different types of knowledge derived from the responses of the teacher network. This includes softened outputs [7], attention maps [8], the flow of solution procedure [9], and various relational aspects [10]. In contrast, the DTL method in [11] transfers learned features from pre-trained models to new tasks, optimizing accuracy for specific fault detection tasks. Many of these methods gain advantages by utilizing extra supervision from the pre-trained teacher model, particularly focusing on the information provided by its intermediate layers. In addition to aligning straightforward intermediate features, current approaches often rely on intricately crafted knowledge representations. These include methods like mimicking spatial attention maps [12], replicating pairwise similarity patterns [13], or maximizing the mutual information between the features of the teacher and student networks [14]. These various forms of knowledge support the distillation process by introducing additional distillation losses, thereby enhancing the effectiveness of knowledge transfer. We recognize that the acoustic spectral imaging (ASI) approach used in [15] can be highly effective for domain-specific tasks such as incipient fault diagnosis under variable operating conditions. Moreover, the diversity in the types of knowledge being transferred complicates the formation of a unified and straightforward understanding of how these transfers ultimately contribute to the improved performance of the student model. In this paper, we introduce a straightforward knowledge distillation technique, showing that it can substantially narrow the performance gap between teacher and student models without requiring complex knowledge representations.
The paper presents a groundbreaking knowledge distillation algorithm that effectively narrows the performance gap between complex teacher models and simpler student models. This is achieved without relying on complex knowledge representations, simplifying the distillation process. Additionally, it introduces a unique teacher network architecture, deviating from traditional methods that use pre-existing models as teacher networks. This new architecture is designed to enhance the effectiveness of knowledge transfer. The paper also details a comprehensive workflow for the proposed model, including data preprocessing, training, testing, and evaluation phases, which demonstrates the practical application of the proposed method. Furthermore, the study’s extensive experimentation to determine optimal parameters for the distillation process adds a significant empirical aspect to the research. Overall, the paper contributes to both the theoretical and practical aspects of knowledge distillation in neural networks, offering insights that could be beneficial for future research and applications in the field.
The structure of this paper is outlined as follows:
Section 2 reviews prior knowledge distillation techniques and their limitations. Section 3 details the new knowledge distillation approach and the teacher–student model architecture. Section 4 describes the experimental methodology and presents the findings. Section 5 concludes the paper and discusses the implications of the findings.

2. Related Work

In ref. [16], the concept of a teacher guiding a student to compress models was presented, and numerous adaptations and variations have been developed using their teacher–student learning framework. To attain notable results in model compression, these studies concentrate on improving the methods for capturing the knowledge from the larger teacher network to guide the training of the smaller student network. In KD, the student model can improve its performance by leveraging guidance in various forms from the teacher model. This can involve methods like soft logit-based distillation [17], relation-based distillation [18], or intermediate feature-based distillation [19]. Among these methods, feature-based knowledge distillation offers flexibility in creating distillation mechanisms. In particular, [20] presented Interactive Knowledge Distillation (IAKD), a method enhancing traditional knowledge distillation by incorporating interactive teaching strategies. IAKD employs swapping-in operations where parts of a student network are replaced with those from a teacher network, significantly improving student performance. While IAKD is shown to be effective, the process of swapping blocks and the added interactivity potentially increase the computational overhead, affecting the scalability and efficiency of the training process, especially with very large networks and datasets. Ref. [21] introduced Semantic Calibration for Cross-Layer Knowledge Distillation (SemCKD), an approach that improves knowledge transfer between teacher and student models by automatically matching student layers to the most relevant teacher layers using an attention mechanism. The use of an attention mechanism to automatically align student and teacher layers increases the computational overhead, particularly during the initial setup phase where the optimal layer mappings are determined. Ref. [22] presents cGAN-KD, a cutting-edge knowledge distillation framework utilizing conditional generative adversarial networks (cGANs) for effective knowledge transfer in both classification and regression deep learning tasks. cGAN-KD distinguishes itself by generating cGAN-based samples for distillation, offering flexibility across different teacher–student architectures and eliminating the need for manual adjustments. The performance of cGAN-KD is sensitive to the choice of hyperparameters, including those related to the cGAN architecture and training regimen. Finding the optimal set of hyperparameters requires extensive experimentation, which can be time-consuming. Ref. [23] introduced a simple yet effective knowledge distillation approach that closes the performance gap between complex teacher models and simpler student models. By reusing the teacher model classifier and aligning features with a single ℓ2 loss, this method enables the student model to match the teacher performance when features are perfectly aligned. Relying solely on a single ℓ2 loss for feature alignment does not capture all aspects of the teacher model knowledge, potentially limiting the depth of knowledge transferred to the student model. Ref. [24] presents a method to enhance cloud-based visual analytics by optimizing image transmission from edge devices to the cloud through Split-DNN computing. It introduces bottleneck units to efficiently compress data and a neural rate-estimator for accurate compression rate estimation. 
The method applies variational auto-encoder principles for rate-distortion optimization and uses distillation-based losses for unsupervised training, bypassing the need for labeled data. While the method shows promise for reducing the need for labeled data, deploying it in real-world scenarios reveals issues related to scalability, integration with existing cloud infrastructure, and adaptability to varying network conditions and device capabilities. Ref. [25] introduced an innovative knowledge distillation technique that emphasizes the importance of cross-level connection paths between teacher and student networks, marking a departure from conventional approaches focused on same-level feature transformations. Introducing the novel concept of cross-stage connection paths, the approach features a simple yet effective review mechanism within a compact framework that requires minimal computational overhead. Despite the framework being described as compact and requiring minimal computational overhead, the implementation of cross-stage connection paths introduces complexity in configuring and optimizing these connections, particularly for those without deep expertise in knowledge distillation techniques.
The researchers in [26] present a groundbreaking framework that merges multi-teacher knowledge distillation with network quantization to create efficient, low-bit-width DNNs. By promoting collaborative and mutual learning among quantized teacher networks and between teachers and a quantized student, the method advances beyond traditional approaches that use separately trained teachers. Integrating multiple quantized teachers and a quantized student into a cohesive learning framework significantly increases the complexity of the training process. Managing and optimizing the collaborative and mutual learning processes require sophisticated training strategies and hyperparameter tuning. Ref. [8] introduces a new approach to knowledge distillation using a frequency domain-based attention mechanism. It features a learnable global filter that aligns the student model features with those of the teacher model, capturing global information more effectively than traditional spatial domain methods. Operating in the frequency domain adds complexity to the distillation process. The transformation of features into the frequency domain and the application of the global filter require additional computational resources and expertise in signal processing. Ref. [27] proposes a Teacher–Student Collaborative Knowledge Distillation (TSKD) method, combining knowledge distillation with self-distillation to enhance model performance with limited data. It features a two-step process where the student network learns from both the teacher network and through self-teaching using a multi-exit network structure. An ensemble of sub-models in the student network is used for improved classification during inference. The combination of knowledge distillation with self-distillation and the multi-exit network structure increases the complexity of the model implementation and training process. Ref. [28] presents a simple yet effective model compression technique using knowledge distillation, focusing on optimizing the penultimate layer of the student network. The approach involves direct feature matching and a novel strategy that separates representation learning from classification, using the teacher classifier to train the student feature layer. Concentrating solely on the penultimate layer for optimization does not fully leverage the deeper, more nuanced aspects of knowledge present in other layers of the teacher network. Ref. [29] introduces Decoupled Knowledge Distillation (DKD), a novel approach in knowledge distillation that separates the process into Target Class Knowledge Distillation (TCKD) and Non-Target Class Knowledge Distillation (NCKD). This decoupling allows for more effective and flexible learning, addressing limitations in classical knowledge distillation. While DKD offers a novel perspective on logit distillation, this specific focus overlooks the potential benefits of distilling knowledge from other aspects of the teacher model, such as intermediate features or attention mechanisms. Ref. [30] presents a novel KD method that combines individual and relational knowledge through an attributed graph, using graph neural networks. This approach creates a unified graph-based embedding for more effective knowledge transfer in a contrastive manner. Building and managing the attributed graph among instances adds complexity to the KD process. This requires more computational resources and sophisticated algorithms to effectively manage the graph structure.

3. The Proposed Method

In contrast to traditional approaches that rely on pre-trained, complex teacher models like ResNet or VGG, our method leverages a unique teacher architecture designed to maximize knowledge transfer efficiency. This simplifies the distillation process and offers significant benefits in deployment, as it can be effectively implemented on devices with limited processing power. Moreover, our KD approach reduces the need for extensive computational resources during training and inference, directly addressing the industry’s increasing demand for efficient yet high-performing models (Algorithm 1). Our proposed teacher–student model adopts a streamlined architecture specifically designed to optimize the knowledge distillation process while minimizing computational demands. The teacher network, responsible for transferring high-level knowledge, consists of five convolutional layers with batch normalization and max pooling layers, accompanied by ReLU activations. This configuration allows the teacher model to extract detailed features from the input data, providing a robust basis for the student network to learn from. The teacher model contains approximately 1 million trainable parameters, reflecting its ability to capture complex patterns. In contrast, the student network is intentionally designed with fewer layers and a reduced parameter count to ensure a lightweight structure that is well-suited for deployment in resource-constrained environments. The student model is composed of three convolutional layers, with batch normalization and ReLU activations, achieving a parameter count of approximately 0.5 million, significantly less than the teacher network. This reduction in trainable parameters, along with a simplified architecture, enables the student model to achieve high accuracy while maintaining computational efficiency, affirming its lightweight nature. By presenting the layer configuration and parameter counts of both the teacher and student models, we demonstrate the efficiency of the proposed approach, which achieves substantial reductions in model complexity without sacrificing performance. This information strengthens the claim that our model is not only effective but also practical for applications with limited computational resources.
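For concreteness, a minimal PyTorch sketch of such a teacher–student pair is given below. It follows the layer types described above (convolution, batch normalization, ReLU, and max pooling, with five blocks for the teacher and three for the student), but the channel widths, the 32 × 32 input resolution, and the linear classifier head are illustrative assumptions rather than the exact configuration used in this work.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, pool=False):
    # Convolution + batch normalization + ReLU, optionally followed by 2x2 max pooling.
    layers = [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
              nn.BatchNorm2d(out_ch),
              nn.ReLU(inplace=True)]
    if pool:
        layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

class TeacherNet(nn.Module):
    # Five convolutional blocks; the widths are placeholders, not the paper's exact
    # configuration (which totals roughly 1 million trainable parameters).
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(3, 32, pool=True),     # 32x32 -> 16x16
            conv_block(32, 64, pool=True),    # 16x16 -> 8x8
            conv_block(64, 128, pool=True),   # 8x8 -> 4x4
            conv_block(128, 128),
            conv_block(128, 128),
        )
        self.classifier = nn.Linear(128 * 4 * 4, num_classes)

    def forward(self, x):
        return self.classifier(torch.flatten(self.features(x), 1))

class StudentNet(nn.Module):
    # Three convolutional blocks with a much smaller parameter budget
    # (roughly 0.5 million parameters in the paper's configuration).
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(3, 32, pool=True),
            conv_block(32, 64, pool=True),
            conv_block(64, 64, pool=True),
        )
        self.classifier = nn.Linear(64 * 4 * 4, num_classes)

    def forward(self, x):
        return self.classifier(torch.flatten(self.features(x), 1))
```

The actual parameter count of any such instantiation can be verified with `sum(p.numel() for p in model.parameters() if p.requires_grad)`.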
Algorithm 1. Proposed Method chart.
Teacher Network Training
Initialize Teacher Network with five convolutional layers, batch normalization, max pooling, and ReLU activations
Set hyperparameters (learning rate, batch size, number of epochs)
For each epoch:
  For each batch in training data:
    Forward pass through Teacher Network
    Calculate loss (cross-entropy)
    Backpropagate and update model weights
  Evaluate Teacher Network on validation data
  Record validation accuracy and loss
Knowledge Distillation to Student Network
Initialize Student Network (smaller model)
Set distillation parameters (temperature T, smoothing factor α)
For each epoch:
  For each batch in training data:
    Forward pass through Teacher Network to get logits
    Forward pass through Student Network
    Calculate distillation loss using Teacher’s softened logits and Student’s predictions:
      Softened logits = softmax(Teacher logits/T)
      Distillation loss = cross-entropy (softened logits, Student predictions) * α + original loss * (1 − α)
    Backpropagate and update Student Network weights
  Evaluate Student Network on validation data
  Record validation accuracy and loss
Evaluation and Sensitivity Analysis
For each model pair (e.g., ResNet152/ResNet50, ResNet152/ResNet18, etc.):
  Train and validate under identical distillation parameters (T, α)
  Record and compare training accuracy, validation accuracy, training loss, and validation loss
For each combination of temperature T and smoothing factor α:
  Train and validate Student Network
  Record performance metrics (training accuracy, validation accuracy, training loss, validation loss)
  Identify optimal values of T and α for best performance
This section describes the basic idea of the knowledge distillation algorithm and introduces the proposed novel image classification model based on a teacher–student network.

3.1. The Knowledge Distillation

The main idea of the knowledge distillation algorithm is to use a pre-trained teacher network to transfer its logits (knowledge) to a simpler student network and to train the latter, as shown in Figure 1.
Recent works show that many approaches use existing, well-known image classification models, such as ResNet, VGG, and AlexNet (typically pre-trained on ImageNet), as the teacher network when training simpler student networks. Utilizing existing networks makes it easier to train smaller models while reducing memory requirements and saving training time. The teacher and student networks are represented as t and s, respectively:
$$ t = f_t(x, w_t) \qquad (1) $$
$$ s = f_s(x, w_s) \qquad (2) $$
where x denotes the input of the network, and $w_t$ and $w_s$ are the parameters of the teacher and student networks, respectively. Moreover, t and s represent the output logits of the teacher and student networks. In deep learning, neural networks produce class probabilities using the ‘softmax’ activation function, which converts each logit $z_i$ into a probability for each class:
$$ q_i = \frac{\exp(z_i / T)}{\sum_{j \in C} \exp(z_j / T)} \qquad (3) $$
where T is a temperature that is normally set to 1; however, ref. [31] emphasizes that a higher value of T yields softer probability distributions over the classes. The temperature parameter T is used to soften the probability distribution of the teacher network output logits. In the fundamental approach to distillation, the distilled model acquires knowledge through training on a transfer set, wherein each instance within this set is associated with a soft target distribution. This distribution is generated by a cumbersome model operating with an elevated ‘softmax’ temperature, and the same elevated temperature is applied during the training of the distilled model. Once training is complete, however, the distilled model operates with a ‘softmax’ temperature set to unity. The distance between the output distributions of the teacher and student networks is measured using the Kullback–Leibler divergence:
$$ \frac{1}{N}\sum_{j=1}^{N} L_{KL}\left(f_s(x_j, w_s),\, f_t(x_j, w_t)\right) = \frac{1}{N}\sum_{j=1}^{N} L_{KL}(s, t) \qquad (4) $$
where $L_{KL}$ denotes the Kullback–Leibler divergence, used here to quantify the divergence between the probability distributions of the teacher and student networks, and $x_j$ is the j-th input image among the N samples of the dataset. The cross-entropy loss is denoted as $L_{CE}$, with $y_j$ symbolizing the true label of the j-th input image. The discrepancy between the student network's predictions and the actual labels is characterized as follows:
$$ \frac{1}{N}\sum_{j=1}^{N} L_{CE}\left(f_s(x_j, w_s),\, y_j\right) = \frac{1}{N}\sum_{j=1}^{N} L_{CE}(s, y_j) \qquad (5) $$
The primary objective in knowledge distillation is to reduce the disparity between the output of the student and the prediction of the teacher network, alongside narrowing the difference between the student’s output and the actual label.
$$ \arg\min_{w_s}\left( \alpha T^{2} \cdot L_{KL} + (1-\alpha) \cdot L_{CE} \right) \qquad (6) $$
Expression (6) represents the combined loss function used in the knowledge distillation process, incorporating both the Kullback–Leibler divergence $L_{KL}$ and the cross-entropy loss $L_{CE}$. This combined loss aligns the teacher and student network outputs while ensuring accuracy with respect to the actual labels. The balance between $L_{KL}$ and $L_{CE}$ is controlled by the weighting factor α, and the $T^{2}$ factor compensates for the reduced gradient scale of the temperature-softened term.
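As an illustration, Expression (6) can be written as a single loss function; the sketch below assumes both terms are computed from raw logits in PyTorch, and the $T^{2}$ factor follows the standard practice of keeping the gradient magnitude of the softened KL term comparable to that of the cross-entropy term.

```python
import torch.nn.functional as F

def kd_objective(student_logits, teacher_logits, labels, T=15.0, alpha=0.7):
    # L_KL: KL divergence between the temperature-softened teacher and student distributions.
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    l_kl = F.kl_div(log_p_student, p_teacher, reduction="batchmean")

    # L_CE: cross-entropy between the student's unsoftened predictions and the true labels.
    l_ce = F.cross_entropy(student_logits, labels)

    # alpha * T^2 * L_KL + (1 - alpha) * L_CE, minimized with respect to the student weights w_s.
    return alpha * (T ** 2) * l_kl + (1.0 - alpha) * l_ce
```

Minimizing this quantity over the student parameters with any standard optimizer realizes Expression (6).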

3.2. The Proposed Model Architecture

The architecture of the proposed model is designed around the knowledge distillation algorithm, which facilitates the transfer of logits from the teacher network to the student network. Unlike conventional approaches that utilize pre-existing models as a teacher network, this study introduces a novel network architecture aimed at enhancing overall network performance. The teacher network is characterized by its complexity and depth, comprising five convolutional layers. This is supplemented by batch normalization and max pooling operations, coupled with the Rectified Linear Unit (ReLU) activation function. A detailed schematic representation of the teacher network is provided in Figure 2.
$$ t_i = \frac{\exp(z_i / T)}{\sum_{j \in C} \exp(z_j / T)} \qquad (7) $$
Equation (7) illustrates the representation of $t_i$, signifying the softened logits derived from the pre-trained teacher network. Here, C denotes the set of categories in the dataset, which comprises N data samples, and $z_i$ represents the output of the fully connected layer. The output from the teacher network is then incorporated into the knowledge distillation loss, as delineated in Equation (8). Furthermore, extensive experimentation was conducted to ascertain the optimal values of the temperature parameter T and the balancing coefficient α that yield more refined softened logits. The optimal parameters are identified as T = 15 and α = 0.7, with the corresponding results presented in Table 1. Upon the successful transference of knowledge from the teacher network, the student network is subjected to training. Concurrently, the hard predictions emanating from both networks are channeled into a loss function. This function is predicated on the KL divergence, quantifying the discrepancy between the probability distributions of the teacher and student networks, as well as the cross-entropy that measures the divergence between the output of the student and the actual labels, as delineated in Equation (8). Furthermore, the output of the student network is processed through the standard softmax function to yield a probabilistic representation. In addition, a clear explanation of the architecture is provided in Algorithm 2.
$$ L_{KD}(W_s) = \alpha T^{2}\, L_{KL}(s, t) + (1-\alpha)\, L_{CE}(s, y) \qquad (8) $$
Algorithm 2 shows the workflow of the proposed model.
Algorithm 2. Algorithm of the proposed teacher–student model
1: Preprocess the data.
2: Create model input:
3: Input image (shape (height, width,3))
4: for i = 1 to len(train_loader) do
    train on the N data samples $x = \{x_i\}_{i=1}^{N}$;
    test the validation results of the proposed model;
    output = student(x);
    teacher_output = teacher(x);
    loss = knowledge_distillation(output, teacher_output, y)
    $L_{KD}(W_s) = \alpha T^{2}\, L_{KL}(s, t) + (1-\alpha)\, L_{CE}(s, y)$
5: end for
6: Evaluate the model by selecting the two best-performing results
  BT_1 = $\max\{T_i \mid T_i \in \{T_1, T_2, T_3, T_4, T_5\}\}$
  BL_1 = $\min\{L_i \mid L_i \in \{L_1, L_2, L_3, L_4, L_5\}\}$
7: Report the maximum BT_1 (best training accuracy) and the minimum BL_1 (best loss result)
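The per-batch loop of Algorithm 2 corresponds closely to the following PyTorch sketch. The teacher, student, data loader, and optimizer objects are assumed to exist (for example, the modules sketched earlier in this section and a standard CIFAR-10 DataLoader), so this is an illustration of the workflow rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def distill_one_epoch(teacher, student, train_loader, optimizer, device="cpu", T=15.0, alpha=0.7):
    """One pass over the training data, following the per-batch loop of Algorithm 2 (a sketch)."""
    teacher.eval()   # the pre-trained teacher only supplies soft targets
    student.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)

        with torch.no_grad():
            teacher_logits = teacher(images)   # teacher_output = teacher(x)
        student_logits = student(images)       # output = student(x)

        # loss = knowledge_distillation(...), i.e. Equation (8):
        # alpha * T^2 * L_KL(s, t) + (1 - alpha) * L_CE(s, y)
        l_kl = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                        F.softmax(teacher_logits / T, dim=1),
                        reduction="batchmean")
        l_ce = F.cross_entropy(student_logits, labels)
        loss = alpha * (T ** 2) * l_kl + (1.0 - alpha) * l_ce

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

A typical setup would pair this loop with, for example, `torch.optim.SGD(student.parameters(), lr=0.01)` and record validation accuracy and loss after every epoch before selecting the best results, as in steps 6 and 7.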

4. Materials and Experimental Setup

In this section, we describe the datasets used for the proposed model, the data preprocessing pipeline, and the experimental setup.

4.1. Dataset

To comprehensively evaluate the proposed knowledge distillation method, we utilized three datasets of increasing complexity: CIFAR-10, CIFAR-100, and Tiny ImageNet. Each dataset serves a specific role in demonstrating the method’s adaptability, generalization capabilities, and performance across varied classification tasks. CIFAR-10, a widely used benchmark in computer vision, contains 60,000 32 × 32 color images evenly distributed across 10 classes. This dataset provides a controlled, straightforward environment ideal for establishing baseline results and tuning model parameters. Its uniformity and simplicity allow for an initial assessment of the method’s effectiveness in handling well-defined categories with balanced data distribution. To further challenge the method’s scalability, CIFAR-100 introduces a more complex classification task, featuring 100 classes, each with 600 images. This dataset, with its finer-grained distinctions across a larger number of categories, tests the method’s ability to capture nuanced variations and assesses its performance as the complexity of the classification problem increases. The Tiny ImageNet dataset offers a closer approximation to real-world scenarios with its higher resolution and greater diversity. Comprising 200 classes with 500 training images and 50 validation images per class, Tiny ImageNet expands the method’s applicability by introducing more variability in both image content and visual features. The greater challenge posed by Tiny ImageNet enables a thorough examination of the model’s robustness and generalization in conditions that better reflect practical applications. This progressive evaluation strategy, from CIFAR-10 to CIFAR-100 to Tiny ImageNet, highlights the versatility of the proposed method and demonstrates its capability to perform well across diverse datasets. Such a range of testing establishes the method’s suitability for real-world tasks that involve complex, varied, and often less-controlled data environments (Figure 3).
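As an illustration, the three benchmarks can be loaded with torchvision as follows. CIFAR-10 and CIFAR-100 are available as built-in datasets, whereas Tiny ImageNet is assumed here to have been downloaded separately and arranged in an ImageFolder-compatible directory layout; the paths are placeholders.

```python
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()  # the full preprocessing pipeline is described in Section 4.2

cifar10_train = datasets.CIFAR10(root="./data", train=True, download=True, transform=to_tensor)
cifar100_train = datasets.CIFAR100(root="./data", train=True, download=True, transform=to_tensor)

# Tiny ImageNet is not bundled with torchvision; it is assumed to have been extracted to
# ./data/tiny-imagenet-200/train in the usual one-subdirectory-per-class layout.
tiny_imagenet_train = datasets.ImageFolder("./data/tiny-imagenet-200/train", transform=to_tensor)
```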

4.2. Data Preprocessing

To maximize the robustness and generalization capability of the proposed model, we applied a comprehensive data preprocessing pipeline that enhances dataset variability and helps the model learn to recognize a broader range of visual features. By carefully preparing the training and testing datasets, we ensure that the model can better generalize to new, unseen data. For the training set, we introduced a series of transformations designed to augment the dataset without altering the inherent characteristics of each class. This process begins with random horizontal flips, which create mirrored versions of images and effectively double the training data’s diversity. Following this, we performed random cropping with a padding of 4 pixels to simulate translation and scale variations, allowing the model to become more resilient to minor positional shifts within images. Next, we applied color jittering, subtly altering brightness, contrast, saturation, and hue by a factor of 0.1. This modification introduces variability in color information, helping the model remain invariant to lighting conditions and other environmental factors that may impact an image’s appearance. Finally, normalization was employed, standardizing the pixel values based on the mean and standard deviation across the RGB channels in the dataset. This normalization step is critical for maintaining a consistent input distribution, which aids in stabilizing model training and accelerating convergence. For the testing dataset, we applied the same normalization procedure, ensuring consistency between training and testing environments. The testing set was not subjected to additional augmentations, as the goal was to evaluate the model on data that closely resemble real-world conditions without artificial modifications. This data preprocessing strategy, combining augmentation techniques with normalization, contributes to the model’s ability to generalize effectively. By simulating diverse visual conditions in the training data, we enhance the model’s adaptability to complex and varied real-world scenarios.
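A torchvision sketch of this pipeline is shown below. The normalization statistics are the commonly used CIFAR-10 channel means and standard deviations; they are an assumption for illustration, since the text states only that per-channel mean and standard deviation normalization was applied.

```python
from torchvision import transforms

# Commonly used CIFAR-10 channel statistics (assumed values for illustration).
CIFAR10_MEAN = (0.4914, 0.4822, 0.4465)
CIFAR10_STD = (0.2470, 0.2435, 0.2616)

train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),                 # mirrored variants of the training images
    transforms.RandomCrop(32, padding=4),              # translation jitter via 4-pixel padding
    transforms.ColorJitter(brightness=0.1, contrast=0.1,
                           saturation=0.1, hue=0.1),   # mild photometric variation
    transforms.ToTensor(),
    transforms.Normalize(CIFAR10_MEAN, CIFAR10_STD),
])

test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(CIFAR10_MEAN, CIFAR10_STD),   # same normalization, no augmentation
])
```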

4.3. Metrics

To evaluate the performance of the proposed model, we used several standard classification metrics that provide a comprehensive analysis of its effectiveness across various datasets and configurations. These metrics include Accuracy, Precision, Recall, F1-Score, and their macro and weighted averages. Accuracy measures the proportion of correct predictions among the total predictions, offering an overall assessment of the model performance. It is calculated as:
$$ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $$
Precision indicates the model’s accuracy in predicting each class and is defined as the ratio of true positive predictions to the total number of positive predictions, i.e., true positives plus false positives. Precision for a given class i is calculated as:
$$ \text{Precision}_i = \frac{\text{True Positives}_i}{\text{False Positives}_i + \text{True Positives}_i} $$
Recall represents the model’s ability to identify all relevant instances of a class. It is calculated as the ratio of true positives to the sum of true positives and false negatives for each class i:
$$ \text{Recall}_i = \frac{\text{True Positives}_i}{\text{False Negatives}_i + \text{True Positives}_i} $$
The F1-Score is the harmonic mean of Precision and Recall, balancing the two metrics to provide an aggregate measure of model performance, especially in cases of class imbalance. For each class i, the F1-Score is given by:
$$ \text{F1-Score}_i = \frac{2 \cdot \text{Precision}_i \cdot \text{Recall}_i}{\text{Precision}_i + \text{Recall}_i} $$
The evaluation of our model performance also includes macro and weighted averages to give a balanced view of the metrics across classes. The macro average calculates each metric—such as precision, recall, and F1-score—for each class individually, then averages these values without accounting for the size of each class. This approach provides an unweighted view of model performance, treating each class equally, which can highlight how well the model performs on less common classes. In contrast, the weighted average takes class distribution into account by weighting each metric according to the number of instances in each class. This method offers a more accurate reflection of the model’s overall performance, particularly in datasets with class imbalances, as it ensures that metrics from more prevalent classes contribute proportionally more to the final score. Together, these two averaging methods allow us to assess the model performance across all classes, both in a balanced manner and with attention to class representation within the dataset.
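These per-class scores and both averaging schemes correspond to standard scikit-learn utilities; the sketch below uses placeholder label arrays in place of the model's collected test-set predictions.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report

y_true = [0, 1, 2, 2, 1, 0]   # placeholder ground-truth labels
y_pred = [0, 1, 2, 1, 1, 0]   # placeholder model predictions

accuracy = accuracy_score(y_true, y_pred)

# Macro average: every class contributes equally, regardless of its frequency.
p_macro, r_macro, f1_macro, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)

# Weighted average: classes contribute in proportion to their number of instances.
p_w, r_w, f1_w, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)

# Per-class precision, recall, and F1 (as in Table 2), plus both averages (as in Table 3).
print(classification_report(y_true, y_pred, zero_division=0))
```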

4.4. Results

Our experiments demonstrate that the proposed model substantially improves key metrics such as training and validation accuracy while reducing computational load. Table 4 provides evidence that our model consistently outperforms traditional teacher–student pairings, and Table 1 documents its behavior across a range of hyperparameters. Notably, the model’s high validation accuracy of 92.94% and low validation loss of 0.51 indicate robust performance under varied conditions, underscoring its practical applicability. These results signify not only theoretical advancements but also mark a shift towards more accessible AI solutions for industries where computational efficiency and rapid deployment are critical. Table 1 reports the results of the proposed teacher–student model across various settings of the temperature parameter T and smoothing factor α, while maintaining a constant learning rate of 0.01. The smoothing factor α balances the weight between the knowledge distillation loss and the standard cross-entropy loss for the student model. It is a scalar value between 0 and 1 that allows control over the emphasis placed on the distillation component versus the original ground truth labels. We conduct the experiments using three distinct temperature values (20.0, 15.0, and 10.0) and four different α values (0.7, 0.8, 0.9, and 1.0), resulting in twelve unique parameter configurations. Performance metrics are evaluated in terms of training accuracy (Train acc), validation accuracy (Val acc), training loss (Train loss), and validation loss (Val loss), over a consistent training duration of 70 epochs.
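The twelve configurations can be enumerated as a simple grid, as sketched below; train_and_validate is a hypothetical helper that runs the distillation procedure for the stated number of epochs and returns the four recorded metrics, and is named here purely for illustration.

```python
from itertools import product

temperatures = [20.0, 15.0, 10.0]
alphas = [0.7, 0.8, 0.9, 1.0]

results = {}
for T, alpha in product(temperatures, alphas):
    # train_and_validate is a hypothetical wrapper around the distillation loop
    # (constant learning rate 0.01 and a fixed number of epochs, as in Table 1).
    metrics = train_and_validate(T=T, alpha=alpha, lr=0.01, epochs=70)
    results[(T, alpha)] = metrics   # e.g. {"train_acc": ..., "val_acc": ..., "train_loss": ..., "val_loss": ...}

# Select the configuration with the highest validation accuracy.
best_T, best_alpha = max(results, key=lambda k: results[k]["val_acc"])
```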
The table highlights the most successful and least successful results in blue and red, respectively. The top validation accuracy is observed at T = 15.0 and α = 0.7, reaching 85.11%, whereas the lowest validation accuracy, 82.45%, occurs at T = 15.0 and α = 1.0. Training accuracy peaks at 86.45% for T = 10.0 and α = 0.7 and reaches its minimum of 84.56% at T = 15.0 and α = 0.7. Validation loss fluctuates only slightly, with a minimum of 0.60 (at T = 20.0, α = 0.9 and at T = 15.0, α = 0.7) and a maximum of 0.68 at T = 15.0 and α = 1.0. All configurations are subjected to the same number of training epochs, ensuring a fair comparison across the different experimental conditions. Furthermore, we train our model using T = 15.0 and α = 0.7 for 100 epochs. Moreover, we evaluate the accuracy of the proposed model using the metrics shown in Table 2.
The metrics include precision, recall, and the F1-score for each class, providing a detailed evaluation of the proposed model across the ten CIFAR-10 categories: ‘airplane’, ‘automobile’, ‘bird’, ‘cat’, ‘deer’, ‘dog’, ‘frog’, ‘horse’, ‘ship’, and ‘truck’. The model achieves high precision in classifying ‘automobile’ and ‘frog’, with scores of 0.93 and 0.95, respectively. It exhibits strong recall, particularly in the ‘automobile’, ‘ship’, and ‘truck’ categories, with scores above 0.9. The F1-scores, which balance precision and recall, are relatively uniform across classes, with the ‘automobile’, ‘frog’, ‘ship’, and ‘truck’ classes achieving the highest F1-scores of 0.92, 0.89, 0.89, and 0.89, respectively. The ‘cat’ class has the lowest precision and F1-score, indicating that the model may struggle with accurately identifying this class compared to others. Overall, the model demonstrates robust performance, with precision, recall, and F1-scores mostly above 0.8 across all classes. In addition, Table 3 summarizes the overall performance of the model, reporting the macro average and weighted average of these metrics; both averaging methods yield identical precision and recall scores of 0.86 and 0.85, respectively, and the F1-scores are likewise consistent at 0.85. The accuracy of the model, which is the proportion of correct predictions out of the total number of instances, is reported as 0.8542, indicating a high level of correctness across the various classes when predictions are aggregated.
Further, we tested the proposed model and, as shown in Figure 3, obtained 2 errors out of 18 given images. The images in Figure 3 are from the CIFAR-10 dataset, which contains low-resolution (32 × 32 pixel) images. As a standard benchmark dataset, CIFAR-10 is limited in resolution, which can result in blurriness when the images are enlarged for display in the paper. We did not deliberately select blurry images; rather, the low resolution is inherent to the dataset itself. Correct predictions are marked with blue squares, while errors are marked with red squares. Table 4 presents a comparative analysis of the proposed model against established architectures, namely variations of ResNet and VGG networks, in a teacher–student (T/S) setup.
The evaluation metrics include training accuracy, validation accuracy, training loss, and validation loss, along with the number of epochs and the hyperparameters T and α, which are consistent across all models at 100 epochs, T = 15, and α = 0.7. The teacher–student pairings of ResNet152/ResNet50, ResNet152/ResNet18, and ResNet152/ResNet10 show descending performance in both training and validation accuracy, with the ResNet152/ResNet50 pairing achieving the highest validation accuracy of 91.70% and the lowest validation loss of 0.45 among these pairings, as shown in Table 4. As the student models decrease in depth from ResNet50 to ResNet10, there is an observable trend of increased training and validation loss. In comparison, the cross-family pairings of ResNet50 with VGG8 and VGG13 with VGG8 exhibit lower performance metrics than the ResNet-only combinations, with VGG13/VGG8 showing the lowest validation accuracy of 83.78% and the highest validation loss of 2.56 among the compared methods. The proposed model outperforms the other networks with the highest training accuracy of 90.89% and validation accuracy of 92.94%. It also reports a competitive training loss of 3.11 and a low validation loss of 0.51, indicating its effectiveness and efficiency over the compared teacher–student configurations within the given experimental conditions.
In Table 4, the focus is on evaluating the proposed model alongside established models under consistent hyperparameters across multiple architecture pairings. Here, accuracy and loss values reflect results obtained with a specific training and validation pipeline, aimed at a comprehensive comparison with ResNet and VGG architectures. In Table 4, our proposed model achieves the highest training and validation accuracy compared to the ResNet152/ResNet50 pairing, yet it shows slightly higher training and validation losses. This outcome arises because accuracy and loss measure different performance aspects: accuracy counts correct predictions, while loss reflects confidence in these predictions. Our model often identifies the correct class accurately but with slightly lower confidence in its probability scores, leading to marginally higher loss. This design choice is intentional, as it enables the model to generalize better by focusing on accurate classification rather than overly confident probability distributions. Regularization techniques, aimed at preventing overfitting, also contribute to these results by maintaining robust performance across varied conditions. This balance between accuracy and loss highlights the model’s effectiveness in achieving high accuracy with strong generalization capabilities.
Table 5 provides an in-depth sensitivity analysis of the proposed model across various temperature (T) and smoothing (α) values, examining how these parameters impact model performance independently. This table reflects a more granular analysis, isolated from comparisons to other architectures. The lower performances in Table 5 result from testing each parameter setting in isolation rather than under optimal conditions, thus explaining the difference in performance. The comparative analysis and detailed results provided demonstrate the effectiveness and robustness of our proposed model. By presenting clear and significant improvements in key performance metrics, the proposed model stands out as a viable and efficient solution for knowledge distillation in neural networks. This detailed analysis of Table 5 shows that the proposed model’s performance is sensitive to the hyperparameters T and α. Optimal performance is achieved by carefully tuning these parameters, with T = 15.0 and α = 0.7 emerging as the best combination for achieving high validation accuracy and low validation loss. These findings underscore the importance of hyperparameter optimization in knowledge distillation to maximize model performance. While both tables share parameter labels, they serve different experimental purposes: Table 4 provides comparative results across model pairings, and Table 5 focuses on tuning specific parameters within the proposed model.
To validate the robustness and generalizability of our proposed knowledge distillation method, we extended our experiments beyond the CIFAR-10 dataset. CIFAR-10, with its 10 distinct classes and balanced distribution, serves as an excellent benchmark for initial evaluation, but additional datasets allow us to assess the method’s adaptability to diverse data characteristics. Table 6 summarizes the results of our experiments on CIFAR-100 and Tiny ImageNet. The proposed knowledge distillation model demonstrates comparable, and in some cases superior, performance across different datasets, underscoring its flexibility and effectiveness across varied visual tasks. On CIFAR-100, our model achieved a validation accuracy of 78.3%, which is competitive with state-of-the-art KD methods while maintaining lower computational complexity. On Tiny ImageNet, our approach also maintained high accuracy while minimizing training and validation loss, highlighting its capability to generalize across datasets of increasing complexity.
The results on CIFAR-100 and Tiny ImageNet validate the versatility of our proposed approach, suggesting that it can be effectively applied across various image classification tasks. The consistent performance across datasets with differing numbers of classes and complexity levels indicates that our method does not rely on dataset-specific optimizations, thus making it suitable for real-world applications where data diversity is a key factor. By expanding our evaluation to include multiple datasets, we provide a comprehensive analysis of the proposed method’s effectiveness. These results demonstrate that our knowledge distillation technique generalizes well beyond the CIFAR-10 benchmark, making it a robust and scalable solution for various classification tasks in computer vision. Future research could explore its application on even larger datasets, such as the full ImageNet dataset, to further establish the model’s efficacy in handling diverse and high-complexity data.

5. Conclusions

In this study, we presented a novel knowledge distillation approach using a streamlined teacher–student model architecture to enhance computational efficiency without compromising performance. Our model was evaluated on benchmark datasets, including CIFAR-10, CIFAR-100, and Tiny ImageNet, where it demonstrated superior accuracy and reduced resource demands, making it ideal for resource-limited environments, such as IoT and embedded systems. This work contributes to the practical deployment of deep learning models by providing an accessible and efficient solution for model compression. Future research will explore adaptations of this method to additional domains, optimizing the distillation process further. Our findings support the development of scalable AI solutions that balance accuracy and efficiency, broadening the applicability of deep learning in real-world scenarios.

Author Contributions

Methodology, S.U., S.M. (Shakhnoza Muksimova), S.M. (Sevara Mardieva), N.L. and M.A.; Software, S.U., S.M. (Shakhnoza Muksimova) and S.M. (Sevara Mardieva); Validation, S.U., S.M. (Shakhnoza Muksimova), S.M. (Sevara Mardieva), N.L. and M.A.; Formal analysis N.L. and M.A.; Resources, S.U., S.M. (Shakhnoza Muksimova) and S.M. (Sevara Mardieva); Data curation, N.L. and M.A.; Writing—original draft, S.U., S.M. (Shakhnoza Muksimova), S.M. (Sevara Mardieva), N.L. and M.A.; Writing—review and editing, S.U., S.M. (Shakhnoza Muksimova), S.M. (Sevara Mardieva), N.L. and M.A.; Supervision, S.U., S.M. (Shakhnoza Muksimova) and S.M. (Sevara Mardieva); Project administration, N.L. and M.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

All datasets used are available online with open access.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Moein, M.M.; Saradar, A.; Rahmati, K.; Mousavinejad, S.H.G.; Bristow, J.; Aramali, V.; Karakouzian, M. Predictive models for concrete properties using machine learning and deep learning approaches: A review. J. Build. Eng. 2023, 63, 105444. [Google Scholar] [CrossRef]
  2. Muksimova, S.; Umirzakova, S.; Mardieva, S.; Cho, Y.I. Enhancing Medical Image Denoising with Innovative Teacher–Student Model-Based Approaches for Precision Diagnostics. Sensors 2023, 23, 9502. [Google Scholar] [CrossRef] [PubMed]
  3. Zhang, J.; Shi, Y.; Yang, J.; Guo, Q. KD-SCFNet: Towards more accurate and lightweight salient object detection via knowledge distillation. Neurocomputing 2024, 572, 127206. [Google Scholar] [CrossRef]
  4. Liu, L.; Wang, Z.; Phan, M.H.; Zhang, B.; Ge, J.; Liu, Y. BPKD: Boundary Privileged Knowledge Distillation for Semantic Segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 1062–1072. [Google Scholar]
  5. Chen, W.; Rojas, N. TraKDis: A Transformer-based Knowledge Distillation Approach for Visual Reinforcement Learning with Application to Cloth Manipulation. IEEE Robot. Autom. Lett. 2024, 9, 2455–2462. [Google Scholar] [CrossRef]
  6. Wang, Z.; Ren, Y.; Zhang, X.; Wang, Y. Generating long financial report using conditional variational autoencoders with knowledge distillation. IEEE Trans. Artif. Intell. 2024, 5, 1669–1680. [Google Scholar] [CrossRef]
  7. Alzahrani, S.M.; Qahtani, A.M. Knowledge distillation in transformers with tripartite attention: Multiclass brain tumor detection in highly augmented MRIs. J. King Saud Univ.-Comput. Inf. Sci. 2024, 36, 101907. [Google Scholar] [CrossRef]
  8. Pham, C.; Nguyen, V.A.; Le, T.; Phung, D.; Carneiro, G.; Do, T.T. Frequency Attention for Knowledge Distillation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 2277–2286. [Google Scholar]
  9. Gou, J.; Xiong, X.; Yu, B.; Du, L.; Zhan, Y.; Tao, D. Multi-target knowledge distillation via student self-reflection. Int. J. Comput. Vis. 2023, 131, 1857–1874. [Google Scholar] [CrossRef]
  10. Yang, S.; Yang, J.; Zhou, M.; Huang, Z.; Zheng, W.S.; Yang, X.; Ren, J. Learning from Human Educational Wisdom: A Student-Centered Knowledge Distillation Method. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 4188–4205. [Google Scholar] [CrossRef] [PubMed]
  11. Zabin, M.; Choi, H.J.; Uddin, J. Hybrid deep transfer learning architecture for industrial fault diagnosis using Hilbert transform and DCNN–LSTM. J. Supercomput. 2023, 79, 5181–5200. [Google Scholar] [CrossRef]
  12. Feng, J.; Wang, Q.; Zhang, G.; Jia, X.; Yin, J. CAT: Center Attention Transformer with Stratified Spatial-Spectral Token for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15. [Google Scholar] [CrossRef]
  13. Tejasree, G.; Agilandeeswari, L. An extensive review of hyperspectral image classification and prediction: Techniques and challenges. Multimed. Tools Appl. 2024, 83, 80941–81038. [Google Scholar] [CrossRef]
  14. Jiang, Y.; Feng, C.; Zhang, F.; Bull, D. MTKD: Multi-Teacher Knowledge Distillation for Image Super-Resolution. arXiv 2024, arXiv:2404.09571. [Google Scholar]
  15. Hasan, M.J.; Islam, M.M.; Kim, J.M. Acoustic spectral imaging and transfer learning for reliable bearing fault diagnosis under variable speed conditions. Measurement 2019, 138, 620–631. [Google Scholar] [CrossRef]
  16. Allen-Zhu, Z.; Li, Y. Towards understanding ensemble, knowledge distillation and self-distillation in deep learning. arXiv 2020, arXiv:2012.09816. [Google Scholar]
  17. Yuan, M.; Lang, B.; Quan, F. Student-friendly knowledge distillation. Knowl.-Based Syst. 2024, 296, 111915. [Google Scholar] [CrossRef]
  18. Yang, C.; Yu, X.; An, Z.; Xu, Y. Categories of Response-Based, Feature-Based, and Relation-Based Knowledge Distillation. In Advancements in Knowledge Distillation: Towards New Horizons of Intelligent Systems; Springer International Publishing: Cham, Switzerland, 2023; pp. 1–32. [Google Scholar]
  19. Huang, T.; Zhang, Y.; Zheng, M.; You, S.; Wang, F.; Qian, C.; Xu, C. Knowledge diffusion for distillation. Adv. Neural Inf. Process. Syst. 2024, 36, 65299–65316. [Google Scholar]
  20. Fu, S.; Li, Z.; Liu, Z.; Yang, X. Interactive knowledge distillation for image classification. Neurocomputing 2021, 449, 411–421. [Google Scholar] [CrossRef]
  21. Chen, D.; Mei, J.P.; Zhang, Y.; Wang, C.; Wang, Z.; Feng, Y.; Chen, C. Cross-layer distillation with semantic calibration. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, No. 8. pp. 7028–7036. [Google Scholar]
  22. Ding, X.; Wang, Y.; Xu, Z.; Wang, Z.J.; Welch, W.J. Distilling and transferring knowledge via cGAN-generated samples for image classification and regression. Expert Syst. Appl. 2023, 213, 119060. [Google Scholar] [CrossRef]
  23. Chen, D.; Mei, J.P.; Zhang, H.; Wang, C.; Feng, Y.; Chen, C. Knowledge distillation with the reused teacher classifier. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11933–11942. [Google Scholar]
  24. Ahuja, N.; Datta, P.; Kanzariya, B.; Somayazulu, V.S.; Tickoo, O. Neural Rate Estimator and Unsupervised Learning for Efficient Distributed Image Analytics in Split-DNN Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 2022–2030. [Google Scholar]
  25. Chen, P.; Liu, S.; Zhao, H.; Jia, J. Distilling knowledge via knowledge review. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 5008–5017. [Google Scholar]
  26. Pham, C.; Hoang, T.; Do, T.T. Collaborative Multi-Teacher Knowledge Distillation for Learning Low Bit-width Deep Neural Networks. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 6435–6443. [Google Scholar]
  27. Xu, C.; Gao, W.; Li, T.; Bai, N.; Li, G.; Zhang, Y. Teacher-student collaborative knowledge distillation for image classification. Appl. Intell. 2023, 53, 1997–2009. [Google Scholar] [CrossRef]
  28. Yang, J.; Martinez, B.; Bulat, A.; Tzimiropoulos, G. Knowledge distillation via softmax regression representation learning. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 3–7 May 2021. [Google Scholar]
  29. Zhao, B.; Cui, Q.; Song, R.; Qiu, Y.; Liang, J. Decoupled knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11953–11962. [Google Scholar]
  30. Zhou, S.; Wang, Y.; Chen, D.; Chen, J.; Wang, X.; Wang, C.; Bu, J. Distilling holistic knowledge with graph neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10387–10396. [Google Scholar]
  31. Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531. [Google Scholar]
Figure 1. The architecture of teacher–student models using the knowledge distillation algorithm.
Figure 2. The architecture of the novel teacher model.
Figure 3. The results of the proposed method on the CIFAR-10 dataset; blue boxes show correct predictions, while red boxes show incorrect ones.
Table 1. The results of the proposed teacher–student model for different T and α values with 100 epochs. The best and worst two results are highlighted in blue and red, respectively.
T     α    lr    Train acc  Val acc  Train loss  Val loss
20.0  0.7  0.01  85.90      82.96    4.90        0.65
20.0  0.8  0.01  86.09      83.82    4.69        0.63
20.0  0.9  0.01  86.05      84.09    4.65        0.60
20.0  1.0  0.01  85.45      82.77    4.74        0.64
15.0  0.7  0.01  84.56      85.11    4.76        0.60
15.0  0.8  0.01  85.95      83.24    4.78        0.63
15.0  0.9  0.01  85.58      83.23    4.79        0.64
15.0  1.0  0.01  85.68      82.45    4.67        0.68
10.0  0.7  0.01  86.45      83.63    4.58        0.63
10.0  0.8  0.01  85.67      83.17    4.69        0.62
10.0  0.9  0.01  86.21      83.51    4.47        0.62
10.0  1.0  0.01  85.74      83.02    4.55        0.61
Table 2. The evaluation of the proposed model using precision, recall, and F1-score for each class.
Class       Precision  Recall  F1-Score
airplane    0.91       0.82    0.86
automobile  0.93       0.92    0.92
bird        0.81       0.82    0.81
cat         0.72       0.76    0.74
deer        0.84       0.85    0.84
dog         0.81       0.80    0.80
frog        0.95       0.85    0.89
horse       0.87       0.89    0.88
ship        0.86       0.93    0.89
truck       0.87       0.92    0.89
Table 3. Overall performance metrics of the proposed model.
Metric Type       Precision  Recall  F1-Score  Accuracy
Macro Average     0.86       0.85    0.85      -
Weighted Average  0.86       0.85    0.85      -
Overall Accuracy  -          -       -         0.8542
Table 4. The comparison results of the proposed model with famous ResNet and VGG networks.
Teacher–Student Network (T/S)  Train Accuracy  Validation Accuracy  Train Loss  Validation Loss  Epochs  T   α
ResNet152/ResNet50             89.91           91.70                3.03        0.45             100     15  0.7
ResNet152/ResNet18             88.99           90.45                4.46        0.56             100     15  0.7
ResNet152/ResNet10             85.78           88.34                6.45        1.34             100     15  0.7
ResNet50/VGG8                  82.45           84.56                7.33        1.98             100     15  0.7
VGG13/VGG8                     79.23           83.78                8.32        2.56             100     15  0.7
The proposed model             90.89           92.94                3.11        0.51             100     15  0.7
Table 5. Performance of the proposed model for different T and α values over 100 epochs.
T     α    Train Accuracy  Validation Accuracy  Train Loss  Validation Loss
20.0  0.7  85.90           82.96                4.90        0.65
20.0  0.8  86.09           83.82                4.69        0.63
20.0  0.9  86.05           84.09                4.65        0.60
20.0  1.0  85.45           82.77                4.74        0.64
15.0  0.7  84.56           85.11                4.76        0.60
15.0  0.8  85.95           83.24                4.78        0.63
15.0  0.9  85.58           83.23                4.79        0.64
15.0  1.0  85.68           82.45                4.67        0.68
10.0  0.7  86.45           83.63                4.58        0.63
10.0  0.8  85.67           83.17                4.69        0.62
10.0  0.9  86.21           83.51                4.47        0.62
10.0  1.0  85.74           83.02                4.55        0.61
Table 6. Additional results on CIFAR-10, CIFAR-100, and Tiny ImageNet.
Dataset        Validation Accuracy  Validation Loss
CIFAR-10       92.94%               0.51
CIFAR-100      78.3%                1.02
Tiny ImageNet  65.7%                1.54