Simplified Knowledge Distillation for Deep Neural Networks Bridging the Performance Gap with a Novel Teacher–Student Architecture
Abstract
1. Introduction
2. Related Work
3. The Proposed Method
Algorithm 1. Proposed method.

Teacher network training
1. Initialize the teacher network with five convolutional layers, batch normalization, max pooling, and ReLU activations.
2. Set the hyperparameters (learning rate, batch size, number of epochs).
3. For each epoch:
   - For each batch in the training data:
     - Forward pass through the teacher network.
     - Calculate the cross-entropy loss.
     - Backpropagate and update the model weights.
   - Evaluate the teacher network on the validation data and record the validation accuracy and loss.

Knowledge distillation to the student network
1. Initialize the student network (smaller model).
2. Set the distillation parameters (temperature T, smoothing factor α).
3. For each epoch:
   - For each batch in the training data:
     - Forward pass through the teacher network to obtain its logits.
     - Forward pass through the student network.
     - Calculate the distillation loss from the teacher's softened logits and the student's predictions:
       - Softened logits = softmax(teacher logits / T)
       - Distillation loss = cross-entropy(softened logits, student predictions) * α + original loss * (1 − α)
     - Backpropagate and update the student network weights.
   - Evaluate the student network on the validation data and record the validation accuracy and loss.

Evaluation and sensitivity analysis
1. For each model pair (e.g., ResNet152/ResNet50, ResNet152/ResNet18): train and validate under identical distillation parameters (T, α), then record and compare training accuracy, validation accuracy, training loss, and validation loss.
2. For each combination of temperature T and smoothing factor α: train and validate the student network and record the performance metrics (training accuracy, validation accuracy, training loss, validation loss).
3. Identify the values of T and α that yield the best performance.
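A minimal PyTorch-style sketch of the distillation loss described in Algorithm 1 is given below. The function name `knowledge_distillation_loss` and its argument names are illustrative assumptions, and the T² scaling often applied to the soft term is omitted because the pseudocode above does not include it; the default values T = 15 and α = 0.7 correspond to the configuration used in the results of Section 4.4.

```python
import torch
import torch.nn.functional as F

def knowledge_distillation_loss(student_logits, teacher_logits, labels,
                                temperature=15.0, alpha=0.7):
    """Weighted sum of the softened-teacher term and the hard-label term (Algorithm 1)."""
    # Soften the teacher logits with the temperature T.
    soft_targets = F.softmax(teacher_logits / temperature, dim=1)
    # Cross-entropy between the softened teacher distribution and the
    # student's temperature-scaled predictive distribution.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=1)
    soft_loss = -(soft_targets * student_log_probs).sum(dim=1).mean()
    # Original cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    # Distillation loss = soft term * alpha + original loss * (1 - alpha).
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```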
3.1. Knowledge Distillation
3.2. The Proposed Model Architecture
Algorithm 2. Algorithm of the proposed teacher–student model.
1: Preprocess the data.
2: Create the model input:
3:    input image of shape (height, width, 3).
4: for i = 1 to len(train_loader) do
      train on the N samples of the current batch;
      test the validation results of the proposed model;
      output = student(input);
      teacher_output = teacher(input);
      loss = knowledge_distillation(output, teacher_output, labels)
5: end for
6: Evaluate the model by selecting the two best-performing results:
      BT_1 = max{training accuracies},  BL_1 = min{losses}
7: Report the maximum in BT_1 (best training accuracy) and the minimum in BL_1 (lowest loss).
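The loop in Algorithm 2 can be sketched as follows, reusing the `knowledge_distillation_loss` helper from Section 3. The loader names, the SGD optimizer, and the accuracy bookkeeping are assumptions for illustration; the final selection of BT_1 (best training accuracy) and BL_1 (lowest loss) mirrors steps 6 and 7.

```python
import torch

def train_student(student, teacher, train_loader, val_loader,
                  epochs=100, temperature=15.0, alpha=0.7, lr=0.01, device="cuda"):
    """Sketch of Algorithm 2: distill a frozen teacher into the student network."""
    optimizer = torch.optim.SGD(student.parameters(), lr=lr, momentum=0.9)
    teacher.eval()                            # the teacher is used only for inference
    train_accuracies, batch_losses = [], []

    for epoch in range(epochs):
        student.train()
        correct, total = 0, 0
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():
                teacher_output = teacher(images)      # teacher logits
            output = student(images)                  # student logits
            loss = knowledge_distillation_loss(output, teacher_output,
                                               labels, temperature, alpha)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            batch_losses.append(loss.item())
            correct += (output.argmax(dim=1) == labels).sum().item()
            total += labels.size(0)
        train_accuracies.append(100.0 * correct / total)
        # Validation of the student on val_loader would be evaluated and recorded here.

    # Steps 6-7: keep the best training accuracy (BT_1) and the lowest loss (BL_1).
    BT_1 = max(train_accuracies)
    BL_1 = min(batch_losses)
    return BT_1, BL_1
```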
4. Materials and Experimental Setup
4.1. Dataset
4.2. Data Preprocessing
4.3. Metrics
4.4. Results
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
T | α | Learning Rate | Train Accuracy | Validation Accuracy | Train Loss | Validation Loss |
---|---|---|---|---|---|---|
20.0 | 0.7 | 0.01 | 85.90 | 82.96 | 4.90 | 0.65 |
20.0 | 0.8 | 0.01 | 86.09 | 83.82 | 4.69 | 0.63 |
20.0 | 0.9 | 0.01 | 86.05 | 84.09 | 4.65 | 0.60 |
20.0 | 1.0 | 0.01 | 85.45 | 82.77 | 4.74 | 0.64 |
15.0 | 0.7 | 0.01 | 84.56 | 85.11 | 4.76 | 0.60 |
15.0 | 0.8 | 0.01 | 85.95 | 83.24 | 4.78 | 0.63 |
15.0 | 0.9 | 0.01 | 85.58 | 83.23 | 4.79 | 0.64 |
15.0 | 1.0 | 0.01 | 85.68 | 82.45 | 4.67 | 0.68 |
10.0 | 0.7 | 0.01 | 86.45 | 83.63 | 4.58 | 0.63 |
10.0 | 0.8 | 0.01 | 85.67 | 83.17 | 4.69 | 0.62 |
10.0 | 0.9 | 0.01 | 86.21 | 83.51 | 4.47 | 0.62 |
10.0 | 1.0 | 0.01 | 85.74 | 83.02 | 4.55 | 0.61 |
Class | Precision | Recall | F1-Score |
---|---|---|---|
airplane | 0.91 | 0.82 | 0.86 |
automobile | 0.93 | 0.92 | 0.92 |
bird | 0.81 | 0.82 | 0.81 |
cat | 0.72 | 0.76 | 0.74 |
deer | 0.84 | 0.85 | 0.84 |
dog | 0.81 | 0.80 | 0.80 |
frog | 0.95 | 0.85 | 0.89 |
horse | 0.87 | 0.89 | 0.88 |
ship | 0.86 | 0.93 | 0.89 |
truck | 0.87 | 0.92 | 0.89 |
Metric Type | Precision | Recall | F1-Score | Accuracy |
---|---|---|---|---|
Macro Average | 0.86 | 0.85 | 0.85 | - |
Weighted Average | 0.86 | 0.85 | 0.85 | - |
Overall Accuracy | - | - | - | 0.8542 |
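For reference, per-class precision, recall, and F1-score, together with the macro/weighted averages and overall accuracy reported in the two tables above, can be obtained with scikit-learn's `classification_report`; the label arrays in this sketch are random stand-ins for predictions collected on the CIFAR-10 validation set.

```python
import numpy as np
from sklearn.metrics import classification_report

CIFAR10_CLASSES = ["airplane", "automobile", "bird", "cat", "deer",
                   "dog", "frog", "horse", "ship", "truck"]

# y_true / y_pred would be the ground-truth and predicted label indices collected
# over the validation set; random labels stand in here so the snippet runs as-is.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 10, size=10_000)
y_pred = rng.integers(0, 10, size=10_000)

# Prints per-class precision, recall, and F1-score, plus macro and weighted
# averages and the overall accuracy.
print(classification_report(y_true, y_pred, target_names=CIFAR10_CLASSES, digits=2))
```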
Teacher–Student Network (T/S) | Train Accuracy | Validation Accuracy | Train Loss | Validation Loss | Epochs | T | α |
---|---|---|---|---|---|---|---|
ResNet152/ResNet50 | 89.91 | 91.70 | 3.03 | 0.45 | 100 | 15 | 0.7 |
ResNet152/ResNet18 | 88.99 | 90.45 | 4.46 | 0.56 | 100 | 15 | 0.7 |
ResNet152/ResNet10 | 85.78 | 88.34 | 6.45 | 1.34 | 100 | 15 | 0.7 |
ResNet50/VGG8 | 82.45 | 84.56 | 7.33 | 1.98 | 100 | 15 | 0.7 |
VGG13/VGG8 | 79.23 | 83.78 | 8.32 | 2.56 | 100 | 15 | 0.7 |
The proposed model | 90.89 | 92.94 | 3.11 | 0.51 | 100 | 15 | 0.7 |
T | α | Train Accuracy | Validation Accuracy | Train Loss | Validation Loss |
---|---|---|---|---|---|
20.0 | 0.7 | 85.90 | 82.96 | 4.90 | 0.65 |
20.0 | 0.8 | 86.09 | 83.82 | 4.69 | 0.63 |
20.0 | 0.9 | 86.05 | 84.09 | 4.65 | 0.60 |
20.0 | 1.0 | 85.45 | 82.77 | 4.74 | 0.64 |
15.0 | 0.7 | 84.56 | 85.11 | 4.76 | 0.60 |
15.0 | 0.8 | 85.95 | 83.24 | 4.78 | 0.63 |
15.0 | 0.9 | 85.58 | 83.23 | 4.79 | 0.64 |
15.0 | 1.0 | 85.68 | 82.45 | 4.67 | 0.68 |
10.0 | 0.7 | 86.45 | 83.63 | 4.58 | 0.63 |
10.0 | 0.8 | 85.67 | 83.17 | 4.69 | 0.62 |
10.0 | 0.9 | 86.21 | 83.51 | 4.47 | 0.62 |
10.0 | 1.0 | 85.74 | 83.02 | 4.55 | 0.61 |
Dataset | Validation Accuracy | Validation Loss |
---|---|---|
CIFAR-10 | 92.94% | 0.51 |
CIFAR-100 | 78.3% | 1.02 |
Tiny ImageNet | 65.7% | 1.54 |