Multiloss Joint Gradient Control Knowledge Distillation for Image Classification
Abstract
1. Introduction
- We present a novel method for independently controlling the momentum of task loss, feature distillation loss, and logit distillation loss. This innovation facilitates a more effective transfer of knowledge, resulting in enhanced student network performance.
- We empirically validate the efficacy of the proposed MJKD method on two widely used image classification datasets, CIFAR-100 and Tiny-ImageNet. The results demonstrate that student networks trained with MJKD outperform those trained with traditional knowledge distillation methods, and our findings indicate that logits may carry richer network information. Furthermore, the robustness and efficiency of MJKD have been substantiated.
- We provide a comprehensive analysis of the MJKD method, including visualizations of loss landscapes and correlation matrices between student and teacher logits. These insights offer a more profound understanding of the mechanisms underlying the improved performance of student networks trained with MJKD.
2. Main Methods
2.1. Feature Distillation Loss
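As a rough illustration, a FitNet-style [16] feature distillation term matches an intermediate student feature map to the corresponding teacher feature map through a learned adapter. The sketch below assumes a 1×1-convolution adapter, an MSE objective, and feature maps of equal spatial size; these are illustrative choices rather than the exact design used here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDistillLoss(nn.Module):
    """FitNet-style sketch: MSE between adapted student features and teacher features."""

    def __init__(self, student_channels: int, teacher_channels: int):
        super().__init__()
        # A 1x1 convolution projects student features to the teacher's channel width.
        self.adapter = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, feat_s: torch.Tensor, feat_t: torch.Tensor) -> torch.Tensor:
        # The teacher feature map is treated as a fixed target (no gradient to the teacher).
        return F.mse_loss(self.adapter(feat_s), feat_t.detach())
```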
2.2. Logit Distillation Loss
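For reference, the classical logit distillation objective of Hinton et al. [3] softens both output distributions with a temperature T and penalizes their KL divergence, scaled by T² so that gradient magnitudes remain comparable across temperatures. A minimal sketch (the default temperature is an assumed value, not necessarily the setting used here):

```python
import torch
import torch.nn.functional as F

def logit_distill_loss(logits_s: torch.Tensor, logits_t: torch.Tensor, T: float = 4.0) -> torch.Tensor:
    """Temperature-scaled KL divergence between student and teacher logits (standard KD)."""
    log_p_s = F.log_softmax(logits_s / T, dim=1)
    p_t = F.softmax(logits_t.detach() / T, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * (T * T)
```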
2.3. Multiloss Joint Gradient Control
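As a hedged sketch of the idea stated in the contributions (independent momentum control per loss, in the spirit of the distillation-oriented trainer DOT [20]), the class below keeps a separate momentum buffer for the task, feature-distillation, and logit-distillation gradients and offsets each buffer's momentum from a shared base value. The class name, the assumed (task, feature, logit) offset order, and the reading of the ±0.05 settings in Section 3 as momentum offsets are illustrative assumptions, not the authors' implementation.

```python
import torch

class PerLossMomentumSGD:
    """Sketch: SGD-like step with one momentum buffer per loss term.

    buffers[k][j] is the momentum buffer of loss k for parameter j; each buffer
    decays with its own coefficient (base momentum plus a per-loss offset), and
    the buffers are summed to form the final parameter update.
    """

    def __init__(self, params, lr=0.05, base_momentum=0.9,
                 offsets=(-0.05, 0.05, -0.05)):  # assumed order: (task, feature, logit)
        self.params = list(params)
        self.lr = lr
        self.momenta = [base_momentum + o for o in offsets]
        self.buffers = [[torch.zeros_like(p) for p in self.params]
                        for _ in self.momenta]

    @torch.no_grad()
    def step(self, per_loss_grads):
        """per_loss_grads: one list of gradients per loss, each aligned with self.params."""
        for bufs, m, grads in zip(self.buffers, self.momenta, per_loss_grads):
            for buf, g in zip(bufs, grads):
                buf.mul_(m).add_(g)            # v_k <- m_k * v_k + g_k
        for j, p in enumerate(self.params):
            total = sum(bufs[j] for bufs in self.buffers)
            p.add_(total, alpha=-self.lr)      # w <- w - lr * sum_k v_k
```

In this sketch, the per-loss gradient lists can be obtained with `torch.autograd.grad(loss_k, params, retain_graph=True)` for each of the three losses before calling `step`.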
3. Experiments and Discussion
3.1. Dataset and Experimental Setup
3.2. Main Results
3.2.1. Motivation Validations
3.2.2. Comparative Results
3.2.3. Visualization Analysis
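As one plausible reading of the "correlation matrices between student and teacher logits" mentioned in the contributions, the helper below computes the Pearson correlation between every student logit dimension and every teacher logit dimension over a set of evaluation samples. The standardization and per-dimension pairing are assumptions rather than the documented procedure.

```python
import torch

def logit_correlation_matrix(logits_s: torch.Tensor, logits_t: torch.Tensor) -> torch.Tensor:
    """Cross-correlation between student and teacher logits.

    logits_s, logits_t: (N, C) logits collected on the same N samples.
    Returns a (C, C) matrix whose (i, j) entry is the Pearson correlation between
    student logit dimension i and teacher logit dimension j across the N samples.
    """
    s = (logits_s - logits_s.mean(dim=0)) / (logits_s.std(dim=0) + 1e-8)
    t = (logits_t - logits_t.mean(dim=0)) / (logits_t.std(dim=0) + 1e-8)
    return s.T @ t / (logits_s.shape[0] - 1)
```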
4. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
2. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90.
3. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531.
4. Anwar, S.; Hwang, K.; Sung, W. Structured pruning of deep convolutional neural networks. ACM J. Emerg. Technol. Comput. Syst. 2017, 13, 1–18.
5. Liu, Z.; Sun, M.; Zhou, T.; Huang, G.; Darrell, T. Rethinking the value of network pruning. arXiv 2018, arXiv:1810.05270.
6. Han, S.; Pool, J.; Tran, J.; Dally, W. Learning both weights and connections for efficient neural network. Adv. Neural Inf. Process. Syst. 2015, 28, 1135–1143.
7. Courbariaux, M.; Bengio, Y.; David, J.P. BinaryConnect: Training deep neural networks with binary weights during propagations. Adv. Neural Inf. Process. Syst. 2015, 28, 3123–3131.
8. Jacob, B.; Kligys, S.; Chen, B.; Zhu, M.; Tang, M.; Howard, A.; Adam, H.; Kalenichenko, D. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2704–2713.
9. Rastegari, M.; Ordonez, V.; Redmon, J.; Farhadi, A. XNOR-Net: ImageNet classification using binary convolutional neural networks. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 525–542.
10. Passalis, N.; Tzelepi, M.; Tefas, A. Probabilistic knowledge transfer for lightweight deep representation learning. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 2030–2039.
11. Tian, Y.; Krishnan, D.; Isola, P. Contrastive representation distillation. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020.
12. Zagoruyko, S.; Komodakis, N. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv 2016, arXiv:1612.03928.
13. Peng, B.; Jin, X.; Liu, J.; Li, D.; Wu, Y.; Liu, Y.; Zhou, S.; Zhang, Z. Correlation congruence for knowledge distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 5007–5016.
14. Park, W.; Kim, D.; Lu, Y.; Cho, M. Relational knowledge distillation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3967–3976.
15. Heo, B.; Kim, J.; Yun, S.; Park, H.; Kwak, N.; Choi, J.Y. A comprehensive overhaul of feature distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 1921–1930.
16. Romero, A.; Ballas, N.; Kahou, S.E.; Chassang, A.; Gatta, C.; Bengio, Y. FitNets: Hints for thin deep nets. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015.
17. Zhao, B.; Cui, Q.; Song, R.; Qiu, Y.; Liang, J. Decoupled knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11953–11962.
18. Hao, Z.; Guo, J.; Han, K.; Hu, H.; Xu, C.; Wang, Y. VanillaKD: Revisit the power of vanilla knowledge distillation from small scale to large scale. arXiv 2023, arXiv:2305.15781.
19. Zheng, Z.; Ye, R.; Hou, Q.; Ren, D.; Wang, P.; Zuo, W.; Cheng, M.M. Localization distillation for object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10070–10083.
20. Zhao, B.; Cui, Q.; Song, R.; Liang, J. DOT: A distillation-oriented trainer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Vancouver, BC, Canada, 18–22 June 2023; pp. 6189–6198.
21. Scott, D.W. Multivariate Density Estimation: Theory, Practice, and Visualization; John Wiley & Sons: Hoboken, NJ, USA, 2015.
22. Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605.
23. Wang, D.; Lu, H.; Bo, C. Visual tracking via weighted local cosine similarity. IEEE Trans. Cybern. 2014, 45, 1838–1850.
24. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; University of Toronto: Toronto, ON, Canada, 2009.
25. Le, Y.; Yang, X. Tiny ImageNet visual recognition challenge. CS 231N 2015, 7, 3.
26. Ahn, S.; Hu, S.X.; Damianou, A.; Lawrence, N.D.; Dai, Z. Variational information distillation for knowledge transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 9163–9171.
27. Keskar, N.S.; Mudigere, D.; Nocedal, J.; Smelyanskiy, M.; Tang, P.T.P. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv 2016, arXiv:1609.04836.
28. Dinh, L.; Pascanu, R.; Bengio, S.; Bengio, Y. Sharp minima can generalize for deep nets. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, Australia, 6–11 August 2017; pp. 1019–1028.
29. Izmailov, P.; Podoprikhin, D.; Garipov, T.; Vetrov, D.; Wilson, A.G. Averaging weights leads to wider optima and better generalization. arXiv 2018, arXiv:1803.05407.
30. Cho, J.H.; Hariharan, B. On the efficacy of knowledge distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 4794–4802.
31. Li, H.; Xu, Z.; Taylor, G.; Studer, C.; Goldstein, T. Visualizing the loss landscape of neural nets. Adv. Neural Inf. Process. Syst. 2018, 31, 6391–6401.
 | −0.05 −0.05 0.05 | −0.05 0.05 −0.05 | 0.05 −0.05 −0.05 | −0.05 0.05 0.05 | 0.05 −0.05 0.05 | 0.05 0.05 −0.05
---|---|---|---|---|---|---
Top-1 | 75.41 ± 0.21 | 76.52 ± 0.37 | 75.26 ± 0.29 | 76.08 ± 0.20 | 74.64 ± 0.22 | 75.47 ± 0.33

 | 0.00 0.00 0.00 | −0.0125 0.0125 −0.0125 | −0.025 0.025 −0.025 | −0.0375 0.0375 −0.0375 | −0.05 0.05 0.05 | −0.0625 0.0625 −0.0625 | −0.0750 0.0750 −0.0750 | −0.0875 0.0875 −0.0875
---|---|---|---|---|---|---|---|---
Top-1 | 75.78 ± 0.37 | 76.17 ± 0.31 | 76.15 ± 0.12 | 76.44 ± 0.23 | 76.42 ± 0.26 | 76.33 ± 0.27 | 76.15 ± 0.19 | 75.29 ± 0.13

 | 0.1 | 0.25 | 0.5 | 0.75 | 0.9 | MJKD
---|---|---|---|---|---|---
Top-1 | 74.88 ± 0.19 | 75.14 ± 0.39 | 74.87 ± 0.20 | 75.01 ± 0.15 | 74.93 ± 0.23 | 76.42 ± 0.26

 | 0.1 | 0.25 | 0.5 | 0.75 | 0.9 | MJKD
---|---|---|---|---|---|---
Top-1 | 60.25 ± 0.31 | 60.98 ± 0.24 | 60.78 ± 0.22 | 60.47 ± 0.05 | 59.97 ± 0.16 | 63.53 ± 0.26
Teacher | ResNet32x4 | VGG13 | VGG13 | WRN-40-2 | ResNet50 | ResNet32x4 | ResNet32x4 |
---|---|---|---|---|---|---|---|
 | 79.42 | 74.64 | 74.64 | 75.61 | 79.34 | 79.42 | 79.42
Student | ResNet8x4 | VGG8 | MobileNet-V2 | WRN-16-2 | MobileNet-V2 | ShuffleNet-V1 | ShuffleNet-V2 |
 | 73.06 ± 0.16 | 70.94 ± 0.24 | 65.85 ± 0.22 | 73.50 ± 0.25 | 65.85 ± 0.22 | 72.40 ± 0.38 | 73.80 ± 0.22
AT [12] | 73.61 ± 0.26 | 71.76 ± 0.17 | 60.42 ± 0.47 | 74.30 ± 0.17 | 58.06 ± 1.44 | 73.57 ± 0.34 | 73.66 ± 0.24 |
VID [26] | 72.98 ± 0.12 | 71.00 ± 0.30 | 65.72 ± 0.53 | 73.87 ± 0.15 | 65.77 ± 0.47 | 72.78 ± 0.23 | 73.84 ± 0.38 |
FITNET [16] | 73.71 ± 0.17 | 71.48 ± 0.32 | 64.44 ± 0.99 | 73.61 ± 0.21 | 64.33 ± 0.56 | 74.39 ± 0.35 | 75.09 ± 0.15 |
PKT [10] | 74.10 ± 0.35 | 73.36 ± 0.13 | 68.34 ± 0.18 | 75.12 ± 0.18 | 68.59 ± 0.82 | 75.71 ± 0.35 | 76.10 ± 0.20 |
KD [3] | 73.70 ± 0.37 | 73.31 ± 0.22 | 68.02 ± 0.27 | 75.07 ± 0.23 | 68.50 ± 0.42 | 74.88 ± 0.24 | 75.35 ± 0.12 |
RKD [14] | 72.72 ± 0.16 | 71.71 ± 0.39 | 65.97 ± 0.29 | 73.82 ± 0.09 | 65.95 ± 0.52 | 73.84 ± 0.23 | 74.87 ± 0.37 |
DKD [17] | 76.06 ± 0.25 | 74.66 ± 0.17 | 67.11 ± 0.51 | 75.40 ± 0.19 | 68.23 ± 0.43 | 74.34 ± 0.21 | 76.97 ± 0.46 |
MJKD | 76.42 ± 0.26 | 74.75 ± 0.23 | 68.76 ± 0.69 | 75.52 ± 0.23 | 69.78 ± 0.55 | 76.38 ± 0.09 | 77.25 ± 0.43 |
 | Teacher | Student | AT [12] | KD [3] | PKT [10] | DKD [17] | MJKD
---|---|---|---|---|---|---|---|
ResNet18 as the teacher, MobileNet-V2 as the student | |||||||
Top-1 | 63.74 | 55.51 ± 0.48 | 57.20 ± 0.44 | 57.26 ± 0.25 | 57.43 ± 0.16 | 61.17 ± 0.22 | 63.53 ± 0.26 |
Top-5 | 83.55 | 79.76 ± 0.21 | 80.91 ± 0.24 | 80.80 ± 0.37 | 81.37 ± 0.22 | 83.55 ± 0.19 | 84.64 ± 0.18 |
ResNet18 as the teacher, ShuffleNet-V2 as the student | | | | | | |
Top-1 | 63.74 | 51.65 ± 0.36 | 53.56 ± 0.11 | 52.32 ± 0.42 | 52.50 ± 0.45 | 56.09 ± 0.18 | 57.68 ± 0.23 |
Top-5 | 83.55 | 76.67 ± 0.30 | 78.16 ± 0.21 | 77.31 ± 0.23 | 77.40 ± 0.37 | 79.96 ± 0.13 | 80.80 ± 0.27 |