DE-MKD: Decoupled Multi-Teacher Knowledge Distillation Based on Entropy
Abstract
1. Introduction
- We propose a novel multi-teacher knowledge distillation method, DE-MKD, which decouples the original KD loss and assigns each teacher a sample-aware weight based on the entropy of its predictions (a brief sketch of this weighting follows this list);
- To further improve the performance of the student network, we also transfer the teachers' intermediate-layer features as additional knowledge;
- Extensive experiments on the CIFAR-100 image classification dataset validate the effectiveness and flexibility of our approach.
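As a rough illustration of the entropy-based weighting in the first bullet, the minimal PyTorch sketch below derives per-sample teacher weights from prediction entropy, so that more confident (lower-entropy) teachers contribute more to a given sample. The softening temperature, the softmax over negated entropies, and the function name are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def entropy_teacher_weights(teacher_logits, temperature=4.0):
    """Per-sample teacher weights from prediction entropy (illustrative sketch).

    teacher_logits: list of [batch, num_classes] tensors, one per teacher.
    Returns: [num_teachers, batch] weights that sum to 1 over the teacher axis.
    """
    entropies = []
    for logits in teacher_logits:
        p = F.softmax(logits / temperature, dim=1)           # softened prediction [B, C]
        h = -(p * torch.log(p.clamp_min(1e-8))).sum(dim=1)   # Shannon entropy     [B]
        entropies.append(h)
    H = torch.stack(entropies, dim=0)                        # [T, B]
    # Lower entropy (higher confidence) -> larger weight.
    return F.softmax(-H, dim=0)                              # [T, B]
```

For example, with three teachers `t1, t2, t3` and a batch `x` of CIFAR-100 images, `entropy_teacher_weights([t1(x), t2(x), t3(x)])` returns a 3 × batch matrix that can scale each teacher's per-sample distillation term.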
2. Related Work
3. Method
3.1. The Loss of Decoupled Logit
3.2. The Loss of Intermediate Features
3.3. The Overall Loss
Algorithm 1: Our proposed DE-MKD on CIFAR-100.
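The sketch below outlines one DE-MKD-style training step consistent with the components above: cross-entropy on the labels, an entropy-weighted decoupled logit distillation term per teacher, and an MSE loss on intermediate features. It reuses the `entropy_teacher_weights` helper sketched in the Introduction. The decoupled-KD hyperparameters, the `(logits, features)` model interface, the `proj` projection module, and the loss weight `w_feat` are assumptions for illustration, not the authors' exact algorithm.

```python
import torch
import torch.nn.functional as F

def decoupled_kd_loss(s_logits, t_logits, target, T=4.0, alpha=1.0, beta=8.0):
    """Decoupled KD (target-class + non-target-class terms), per sample.
    Compact sketch in the spirit of Zhao et al. (2022); hyperparameters assumed."""
    mask = F.one_hot(target, s_logits.size(1)).bool()
    p_s = F.softmax(s_logits / T, dim=1)
    p_t = F.softmax(t_logits / T, dim=1)
    # Binary (target vs. non-target) distributions -> TCKD.
    b_s = torch.stack([p_s[mask], 1 - p_s[mask]], dim=1)
    b_t = torch.stack([p_t[mask], 1 - p_t[mask]], dim=1)
    tckd = F.kl_div(b_s.log(), b_t, reduction="none").sum(dim=1)
    # Distributions over non-target classes only -> NCKD.
    nt_s = F.log_softmax(s_logits / T - 1000.0 * mask, dim=1)
    nt_t = F.softmax(t_logits / T - 1000.0 * mask, dim=1)
    nckd = F.kl_div(nt_s, nt_t, reduction="none").sum(dim=1)
    return (alpha * tckd + beta * nckd) * T * T               # shape [B]

def train_step(student, teachers, proj, x, y, w_feat=1.0):
    """One training step: CE + entropy-weighted decoupled logits + feature MSE.
    Assumes models return (logits, intermediate_features)."""
    s_logits, s_feat = student(x)
    ce = F.cross_entropy(s_logits, y)
    with torch.no_grad():                                     # teachers are frozen
        t_outs = [t(x) for t in teachers]
    t_logits, t_feats = zip(*t_outs)
    # entropy_teacher_weights: helper sketched in the Introduction, shape [T, B].
    w = entropy_teacher_weights(list(t_logits))
    dkd = sum((w[i] * decoupled_kd_loss(s_logits, t_logits[i], y)).mean()
              for i in range(len(teachers)))
    feat = sum(F.mse_loss(proj(s_feat), tf) for tf in t_feats) / len(teachers)
    return ce + dkd + w_feat * feat
```

One detail worth noting: the decoupled term is computed per sample so that the entropy-based weight `w[i]` can rescale each teacher's contribution sample by sample before averaging over the batch.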
4. Experiments
4.1. Dataset and Details
4.2. Main Results
4.3. Ablation Study
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Hinton, G.E.; Osindero, S.; Teh, Y.W. A fast learning algorithm for deep belief nets. Neural Comput. 2006, 18, 1527–1554.
2. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
3. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141.
4. Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131.
5. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28.
6. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
7. Shelhamer, E.; Long, J.; Darrell, T. Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 640–651.
8. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890.
9. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531.
10. Gou, J.; Yu, B.; Maybank, S.J.; Tao, D. Knowledge distillation: A survey. Int. J. Comput. Vis. 2021, 129, 1789–1819.
11. Romero, A.; Ballas, N.; Kahou, S.E.; Chassang, A.; Gatta, C.; Bengio, Y. Fitnets: Hints for thin deep nets. arXiv 2014, arXiv:1412.6550.
12. Zagoruyko, S.; Komodakis, N. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv 2016, arXiv:1612.03928.
13. You, S.; Xu, C.; Xu, C.; Tao, D. Learning from multiple teacher networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 13–17 August 2017; pp. 1285–1294.
14. Fukuda, T.; Suzuki, M.; Kurata, G.; Thomas, S.; Cui, J.; Ramabhadran, B. Efficient knowledge distillation from an ensemble of teachers. In Proceedings of the Interspeech, Stockholm, Sweden, 20–24 August 2017; pp. 3697–3701.
15. Wu, M.-C.; Chiu, C.-T.; Wu, K.-H. Multi-teacher knowledge distillation for compressed video action recognition on deep neural networks. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 2202–2206.
16. Liu, Y.; Zhang, W.; Wang, J. Adaptive multi-teacher multi-level knowledge distillation. Neurocomputing 2020, 415, 106–113.
17. Du, S.; You, S.; Li, X.; Wu, J.; Wang, F.; Qian, C.; Zhang, C. Agree to disagree: Adaptive ensemble knowledge distillation in gradient space. Adv. Neural Inf. Process. Syst. 2020, 33, 12345–12355.
18. Zhang, H.; Chen, D.; Wang, C. Confidence-aware multi-teacher knowledge distillation. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtual, 23–27 May 2022; pp. 4498–4502.
19. Zhao, B.; Cui, Q.; Song, R.; Qiu, Y.; Liang, J. Decoupled knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11953–11962.
20. Tang, J.; Liu, M.; Jiang, N.; Cai, H.; Yu, W.; Zhou, J. Data-free network pruning for model compression. In Proceedings of the 2021 IEEE International Symposium on Circuits and Systems (ISCAS), Daegu, Republic of Korea, 22–28 May 2021; pp. 1–5.
21. Tang, J.; Chen, S.; Niu, G.; Sugiyama, M.; Gong, C. Distribution shift matters for knowledge distillation with webly collected images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 17470–17480.
22. Tian, Y.; Krishnan, D.; Isola, P. Contrastive representation distillation. arXiv 2019, arXiv:1910.10699.
23. Chen, D.; Mei, J.P.; Zhang, H.; Wang, C.; Feng, Y.; Chen, C. Knowledge distillation with the reused teacher classifier. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11933–11942.
24. Song, J.; Chen, Y.; Ye, J.; Song, M. Spot-adaptive knowledge distillation. IEEE Trans. Image Process. 2022, 31, 3359–3370.
25. Liu, J.; Li, B.; Lei, M.; Shi, Y. Self-supervised knowledge distillation for complementary label learning. Neural Netw. 2022, 155, 318–327.
26. Shen, C.; Wang, X.; Song, J.; Sun, L.; Song, M. Amalgamating knowledge towards comprehensive classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 29–31 January 2019; Volume 33, pp. 3068–3075.
27. Yuan, F.; Shou, L.; Pei, J.; Lin, W.; Gong, M.; Fu, Y.; Jiang, D. Reinforced multi-teacher selection for knowledge distillation. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 14284–14291.
28. Kwon, K.; Na, H.; Lee, H.; Kim, N.S. Adaptive knowledge distillation based on entropy. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 7409–7413.
29. Zhao, H.; Sun, X.; Dong, J.; Chen, C.; Dong, Z. Highlight every step: Knowledge distillation via collaborative teaching. IEEE Trans. Cybern. 2020, 52, 2070–2081.
30. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; Technical Report; University of Toronto: Toronto, ON, Canada, 2009.
31. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6848–6856.
32. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520.
33. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
34. Zagoruyko, S.; Komodakis, N. Wide residual networks. arXiv 2016, arXiv:1605.07146.
35. Zhang, L.; Song, J.; Gao, A.; Chen, J.; Bao, C.; Ma, K. Be your own teacher: Improve the performance of convolutional neural networks via self distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3713–3722.
Top-1 accuracy (%) on CIFAR-100; the teacher and student share the same architecture style.

| Method | VGG13 → VGG8 | ResNet32x4 → ResNet8x4 |
|---|---|---|
| Teacher | 74.89 ± 0.18 | 79.45 ± 0.19 |
| Student | 70.70 ± 0.26 | 72.97 ± 0.22 |
| AVER-KD [9] | 74.08 ± 0.09 | 75.01 ± 0.41 |
| AVER-FitNet [11] | 73.99 ± 0.18 | 74.78 ± 0.04 |
| AEKD [17] | 73.90 ± 0.19 | 74.82 ± 0.10 |
| EBKD [28] | 73.89 ± 0.34 | 74.44 ± 0.33 |
| CA-MKD [18] | 74.30 ± 0.24 | 75.66 ± 0.13 |
| DE-MKD | 75.25 ± 0.17 | 76.75 ± 0.13 |
Top-1 accuracy (%) on CIFAR-100; the teacher and student have different architecture styles.

| Method | WRN40-2 → ShuffleNetV2 | ResNet56 → MobileNetV2 | VGG13 → MobileNetV2 | ResNet32x4 → ShuffleNetV1 | ResNet32x4 → VGG8 |
|---|---|---|---|---|---|
| Teacher | 76.62 ± 0.17 | 73.19 ± 0.30 | 74.89 ± 0.18 | 79.45 ± 0.19 | 79.45 ± 0.19 |
| Student | 73.07 ± 0.06 | 65.46 ± 0.10 | 65.46 ± 0.10 | 71.58 ± 0.30 | 70.70 ± 0.26 |
| AVER-KD [9] | 76.98 ± 0.19 | 70.68 ± 0.11 | 68.89 ± 0.10 | 75.02 ± 0.25 | 73.51 ± 0.22 |
| AVER-FitNet [11] | 77.29 ± 0.14 | 70.63 ± 0.23 | 68.87 ± 0.06 | 74.75 ± 0.27 | 73.00 ± 0.16 |
| AEKD [17] | 77.02 ± 0.17 | 70.36 ± 0.19 | 69.07 ± 0.22 | 75.11 ± 0.19 | 73.21 ± 0.04 |
| EBKD [28] | 76.75 ± 0.13 | 69.89 ± 0.14 | 68.09 ± 0.26 | 74.95 ± 0.14 | 73.01 ± 0.01 |
| CA-MKD [18] | 77.64 ± 0.19 | 71.19 ± 0.28 | 69.29 ± 0.09 | 76.37 ± 0.51 | 75.02 ± 0.12 |
| DE-MKD | 78.86 ± 0.15 | 71.48 ± 0.23 | 70.05 ± 0.16 | 77.43 ± 0.16 | 75.39 ± 0.20 |
Top-1 accuracy (%) on CIFAR-100 when distilling from three teachers of different capacities (ResNet8x4: 72.69, ResNet20x4: 78.28, ResNet32x4: 79.31) into a VGG8 student (baseline: 70.70 ± 0.26).

| Method | Top-1 |
|---|---|
| AVER-KD [9] | 74.53 ± 0.17 |
| AVER-FitNet [11] | 74.38 ± 0.23 |
| AEKD [17] | 74.75 ± 0.21 |
| EBKD [28] | 74.27 ± 0.14 |
| CA-MKD [18] | 75.21 ± 0.16 |
| DE-MKD | 75.56 ± 0.17 |
Ablation study of the loss components on CIFAR-100 (ResNet32x4 → VGG8).

| Variant | Cross-entropy | Averaged vanilla KD | Entropy-weighted decoupled logits | Intermediate features | Top-1 |
|---|---|---|---|---|---|
| A | ✔ | ✘ | ✘ | ✘ | 70.70 ± 0.26 |
| B | ✔ | ✔ | ✘ | ✘ | 73.51 ± 0.22 |
| C | ✔ | ✘ | ✔ | ✘ | 75.10 ± 0.18 |
| D | ✔ | ✘ | ✔ | ✔ | 75.39 ± 0.20 |
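Reading variants A–D as cumulative loss terms, one plausible composition of the overall objective is sketched below; the weights λ₁ and λ₂, the per-teacher sums, and the projection φ are illustrative assumptions rather than the paper's stated equation.

```latex
\mathcal{L}
= \mathcal{L}_{\mathrm{CE}}\!\left(z^{s}, y\right)
+ \lambda_{1}\sum_{k=1}^{K} w_{k}\,\mathcal{L}_{\mathrm{DKD}}\!\left(z^{s}, z^{t_{k}}\right)
+ \lambda_{2}\,\frac{1}{K}\sum_{k=1}^{K}\bigl\lVert \phi\!\left(F^{s}\right) - F^{t_{k}}\bigr\rVert_{2}^{2}
```

Here $z^{s}$ and $z^{t_k}$ are the student and k-th teacher logits, $w_k$ is the entropy-based weight of teacher $k$, $F^{s}$ and $F^{t_k}$ are intermediate features, and $\phi$ is a projection that aligns feature dimensions.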