Matching the Ideal Pruning Method with Knowledge Distillation for Optimal Compression
Abstract
1. Introduction
- We begin with a comprehensive comparison of two widely used pruning methods, weight pruning and channel pruning, in combination with knowledge distillation (KD). Both methods use the L1 norm as the shared criterion, allowing a fair and rigorous evaluation of how well each compresses models while preserving performance; a minimal sketch contrasting the two criteria appears directly after this list.
- Furthermore, our framework introduces “Performance Efficiency”, a novel metric that quantifies model efficiency by considering the reduction in parameters alongside accuracy. This metric serves as a practical tool for assessing the effectiveness of compression techniques in real-world deployment scenarios.
- Additionally, we employ rigorous statistical analysis, including t-tests, to evaluate whether the differences between the pruning methods' effects on model performance are significant; a sketch of such a test follows the pruning sketch below. Subjecting our findings to statistical scrutiny ensures the reliability and robustness of our conclusions and strengthens the credibility of our research outcomes.
- To demonstrate the effectiveness of our proposed pipeline, we conducted evaluations involving 10 model combinations on the CIFAR-10 and CIFAR-100 datasets. These experiments provide empirical evidence of the advantages and limitations of each approach, shedding light on their applicability in practical settings.
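To make the two criteria concrete, the following minimal sketch applies both to a single convolutional layer using PyTorch's `torch.nn.utils.prune` utilities. It illustrates the criteria only, not the paper's full pipeline; the layer shape and the 30% pruning ratio are placeholders.

```python
# Minimal sketch (not the paper's pipeline): L1 weight pruning vs. L1 channel pruning
# applied to a single Conv2d layer with PyTorch's pruning utilities.
# The layer shape (64 -> 128, 3x3) and the 30% pruning ratio are illustrative placeholders.
import torch.nn as nn
import torch.nn.utils.prune as prune

conv_weight = nn.Conv2d(64, 128, kernel_size=3)   # layer for unstructured (weight) pruning
conv_channel = nn.Conv2d(64, 128, kernel_size=3)  # layer for structured (channel) pruning

# Weight pruning: zero the 30% of individual weights with the smallest absolute value (L1).
prune.l1_unstructured(conv_weight, name="weight", amount=0.3)

# Channel pruning: zero the 30% of output channels (dim=0) with the smallest L1 norm.
prune.ln_structured(conv_channel, name="weight", amount=0.3, n=1, dim=0)

# Weight pruning scatters zeros across the tensor; channel pruning zeroes whole filters,
# so the pruned structure can later be dropped to shrink the architecture itself.
sparsity = (conv_weight.weight == 0).float().mean().item()
zeroed_filters = int((conv_channel.weight.abs().sum(dim=(1, 2, 3)) == 0).sum())
print(f"weight pruning:  {sparsity:.0%} of individual weights zeroed")
print(f"channel pruning: {zeroed_filters}/{conv_channel.out_channels} filters zeroed")
```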
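The statistical comparison reduces to a paired test over the same teacher–student combinations. The snippet below illustrates the kind of test meant here, assuming a paired t-test via `scipy.stats.ttest_rel` over the student breakpoint accuracies reported for the two pruning methods in the first results table later in this article; it is a sketch, not necessarily the exact analysis performed in the paper.

```python
# Paired t-test comparing the two pruning methods over matched model combinations.
# The accuracies are the student breakpoint values from the first results table below.
from scipy import stats

acc_weight_pruning = [58.92, 53.47, 55.25, 55.10, 57.61, 55.23, 58.63, 55.91, 54.80, 59.80]
acc_channel_pruning = [56.78, 50.70, 53.78, 52.89, 56.89, 53.67, 55.00, 55.87, 51.98, 56.98]

t_stat, p_value = stats.ttest_rel(acc_weight_pruning, acc_channel_pruning)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# A p-value below the chosen significance level (e.g., 0.05) indicates that the accuracy
# gap between the two pruning methods is unlikely to be due to chance alone.
```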
2. Related Works
2.1. Knowledge Distillation
2.2. Pruning
3. Materials and Methods
3.1. Simple Knowledge Distillation (SimKD)
3.2. Weight Pruning
3.3. Channel Pruning
- Architectural Integrity: By focusing on entire channels, channel L1 pruning preserves the underlying architecture of the network. This ensures that critical information flow patterns, which contribute to the model’s effectiveness, are maintained.
- Resource Efficiency: Removing less influential channels yields a leaner model that requires less computation and memory during both training and inference; a sketch of this channel-selection step follows this list.
- Regularization and Generalization: Channel L1 pruning encourages the network to rely on essential features while diminishing reliance on redundant or less informative channels. This regularization process helps improve the generalization capabilities of the model and reduces overfitting, resulting in better performance on unseen data.
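As a concrete illustration of the selection step described above, the sketch below ranks the output channels of a convolution by the L1 norm of their filters and keeps only the strongest ones. It is an assumed procedure, not the paper's implementation: the layer shape, the 50% keep ratio, and the rebuilt layer are placeholders standing in for the full architecture surgery a real pipeline would perform.

```python
# Minimal sketch (assumed procedure, not the paper's implementation): rank the output
# channels of a convolution by the L1 norm of their filters and keep only the strongest.
# The layer shape and the 50% keep ratio are illustrative placeholders.
import torch
import torch.nn as nn

conv = nn.Conv2d(64, 128, kernel_size=3, bias=True)
keep_ratio = 0.5

# L1 norm of each filter (one value per output channel).
l1_per_channel = conv.weight.detach().abs().sum(dim=(1, 2, 3))       # shape: [128]
n_keep = int(conv.out_channels * keep_ratio)
keep_idx = torch.topk(l1_per_channel, n_keep).indices.sort().values  # keep original order

# Rebuild a slimmer layer containing only the retained filters.
pruned = nn.Conv2d(conv.in_channels, n_keep, kernel_size=3, bias=True)
pruned.weight.data = conv.weight.data[keep_idx].clone()
pruned.bias.data = conv.bias.data[keep_idx].clone()

print(f"kept {n_keep}/{conv.out_channels} channels; parameters "
      f"{sum(p.numel() for p in conv.parameters())} -> "
      f"{sum(p.numel() for p in pruned.parameters())}")
# In a full pipeline, the next layer's input channels must be sliced with keep_idx as well,
# which is what preserves the architecture while shrinking it.
```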
3.4. Efficiency Metric
4. Experiments
4.1. Pruning
4.2. Impact of Pruning Methods on Efficiency
4.3. How to Select the Student?
4.4. Comparing the Results
5. Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Malihi, L.; Heidemann, G. Efficient and Controllable Model Compression through Sequential Knowledge Distillation and Pruning. Big Data Cogn. Comput. 2023, 7, 154.
2. Hinton, G.E.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531.
3. Zagoruyko, S.; Komodakis, N. Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer. arXiv 2017, arXiv:1612.03928.
4. Ahn, S.; Hu, S.X.; Damianou, A.; Lawrence, N.D.; Dai, Z. Variational Information Distillation for Knowledge Transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019.
5. Romero, A.; Ballas, N.; Kahou, S.E.; Chassang, A.; Gatta, C.; Bengio, Y. FitNets: Hints for Thin Deep Nets. arXiv 2015, arXiv:1412.6550.
6. Tian, Y.; Krishnan, D.; Isola, P. Contrastive Representation Distillation. arXiv 2022, arXiv:1910.10699.
7. Tung, F.; Mori, G. Similarity-Preserving Knowledge Distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019.
8. Pham, T.X.; Niu, A.; Kang, Z.; Madjid, S.R.; Hong, J.W.; Kim, D.; Tee, J.T.J.; Yoo, C.D. Self-Supervised Visual Representation Learning via Residual Momentum. arXiv 2022, arXiv:2211.09861.
9. Xu, K.; Lai, R.; Li, Y.; Gu, L. Feature Normalized Knowledge Distillation for Image Classification. In Computer Vision – ECCV 2020; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2020; pp. 664–680. ISBN 978-3-030-58594-5.
10. Chen, D.; Mei, J.-P.; Zhang, Y.; Wang, C.; Wang, Z.; Feng, Y.; Chen, C. Cross-Layer Distillation with Semantic Calibration. Proc. AAAI Conf. Artif. Intell. 2021, 35, 7028–7036.
11. Chen, D.; Mei, J.-P.; Zhang, H.; Wang, C.; Feng, Y.; Chen, C. Knowledge Distillation with the Reused Teacher Classifier. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022.
12. Han, S.; Mao, H.; Dally, W.J. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. arXiv 2016, arXiv:1510.00149.
13. Han, S.; Pool, J.; Tran, J.; Dally, W.J. Learning Both Weights and Connections for Efficient Neural Networks. Adv. Neural Inf. Process. Syst. 2015, 28.
14. Li, H.; Kadav, A.; Durdanovic, I.; Samet, H.; Graf, H.P. Pruning Filters for Efficient ConvNets. arXiv 2017, arXiv:1608.08710.
15. Lin, S.; Ji, R.; Yan, C.; Zhang, B.; Cao, L.; Ye, Q.; Huang, F.; Doermann, D. Towards Optimal Structured CNN Pruning via Generative Adversarial Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019.
16. Molchanov, P.; Tyree, S.; Karras, T.; Aila, T.; Kautz, J. Pruning Convolutional Neural Networks for Resource Efficient Inference. arXiv 2017, arXiv:1611.06440.
17. Ding, X.; Ding, G.; Guo, Y.; Han, J.; Yan, C. Approximated Oracle Filter Pruning for Destructive CNN Width Optimization. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019.
18. Aghli, N.; Ribeiro, E. Combining Weight Pruning and Knowledge Distillation for CNN Compression. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Nashville, TN, USA, 19–25 June 2021; pp. 3185–3192.
19. Xie, H.; Jiang, W.; Luo, H.; Yu, H. Model Compression via Pruning and Knowledge Distillation for Person Re-Identification. J. Ambient Intell. Humaniz. Comput. 2021, 12, 2149–2161.
20. Cui, B.; Li, Y.; Zhang, Z. Joint Structured Pruning and Dense Knowledge Distillation for Efficient Transformer Model Compression. Neurocomputing 2021, 458, 56–69.
21. Kim, J.; Chang, S.; Kwak, N. PQK: Model Compression via Pruning, Quantization, and Knowledge Distillation. arXiv 2021, arXiv:2106.14681.
22. Wang, R.; Wan, S.; Zhang, W.; Zhang, C.; Li, Y.; Xu, S.; Zhang, L.; Jin, X.; Jiang, Z.; Rao, Y. Progressive Multi-Level Distillation Learning for Pruning Network. Complex Intell. Syst. 2023, 9, 5779–5791.
23. Wen, W.; Wu, C.; Wang, Y.; Chen, Y.; Li, H. Learning Structured Sparsity in Deep Neural Networks. Adv. Neural Inf. Process. Syst. 2016, 29.
24. Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images; Technical Report; University of Toronto: Toronto, ON, Canada, 2009.
25. Huang, G.; Liu, Z.; van der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. arXiv 2018, arXiv:1608.06993.
26. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
27. Furlanello, T.; Lipton, Z.C.; Tschannen, M.; Itti, L.; Anandkumar, A. Born Again Neural Networks. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018.
Teacher, Student | Res101, Res50 | Res152, Res34 | Res101, Res18 | Dens169, Mobile | Dens161, Dens169 | Dens161, Dens201 | Effici b1, Effici b0 | Effici b3, Effici b2 | Effici b3, Mobile3 | Effici b5, Effici b4 |
---|---|---|---|---|---|---|---|---|---|---|
Weight Pruning [1] | | | | | | | | | | |
ACC(t) | 54.07 | 52.2 | 54.07 | 57.2 | 58.2 | 58.2 | 59.2 | 58.1 ± 0.5 | 58.1 | 59.5 |
ACC(s) distillation | 59.12 | 51.3 | 55.6 | 52.1 | 56.45 | 54.78 | 57.1 | 55.87 | 52.07 | 59.4 |
P(t) before pruning (× 10⁷) | 4.45 | 6.02 | 4.45 | 1.42 | 2.68 | 2.68 | 0.78 | 1.22 | 1.22 | 3.04 |
P(s) before pruning (× 10⁷) | 2.56 | 2.13 | 1.17 | 0.42 | 1.42 | 2 | 0.56 | 0.91 | 0.42 | 1.93 |
ACC(t) breakpoint | 54.34 | 47.48 | 53.15 | 57.95 | 57.71 | 58.94 | 60.17 | 60 | 60 | 60.06 |
ACC(s) breakpoint | 58.92 | 53.47 | 55.25 | 55.1 | 57.61 | 55.23 | 58.63 | 55.91 | 54.8 | 59.8 |
P(t) breakpoint (× 10⁷) | 0.91 | 1.1 | 0.91 | 0.35 | 0.5 | 0.52 | 0.39 | 0.6 | 0.55 | 1.42 |
P(s) breakpoint (× 10⁷) | 0.5 | 0.37 | 0.32 | 0.18 | 0.29 | 0.41 | 0.27 | 0.42 | 0.15 | 0.8 |
Channel Pruning | | | | | | | | | | |
ACC(t) | 54.07 | 52.2 | 54.7 | 57.2 | 58.2 | 58.2 | 59.2 | 58.1 ± 0.5 | 58.1 | 59.5 |
ACC(s) distillation | 59.12 | 51.3 | 55.6 | 52.1 | 56.45 | 54.78 | 57.1 | 55.87 | 52.07 | 59.4 |
P(t) before pruning (× 10⁷) | 4.45 | 6.02 | 4.45 | 1.42 | 2.68 | 2.68 | 0.78 | 1.22 | 1.22 | 3.04 |
P(s) before pruning (× 10⁷) | 2.56 | 2.13 | 1.17 | 0.42 | 1.42 | 2 | 0.56 | 0.91 | 0.42 | 1.93 |
ACC(t) breakpoint | 54.07 | 52.2 | 54.1 | 57.2 | 57.87 | 57.87 | 54.1 | 56.2 ± 0.5 | 58.1 | 56.2 |
ACC(s) breakpoint | 56.78 | 50.7 | 53.78 | 52.89 | 56.89 | 53.67 | 55 | 55.87 | 51.98 | 56.98 |
P(t) breakpoint (× 10⁷) | 3.5 | 2.7 | 3.5 | 1.19 | 1.5 | 1.5 | 0.65 | 1 | 1 | 2.5 |
P(s) breakpoint (× 10⁷) | 2.1 | 1.8 | 0.8 | 0.35 | 1.1 | 1.7 | 0.45 | 0.75 | 0.27 | 1.6 |
Teacher, Student | Res101, Res50 | Res152, Res34 | Res101, Res18 | Dens169, Mobile | Dens161, Dens169 | Dens161, Dens201 | Effici b1, Effici b0 | Effici b3, Effici b2 | Effici b3, Mobile | Effici b5, Effici b4 |
---|---|---|---|---|---|---|---|---|---|---|
Weight Pruning [1] | | | | | | | | | | |
ACC(t) | 81.1 | 83.3 | 81.7 | 83.15 | 83.07 | 83.07 | 81.1 | 80.5 | 80.5 | 83.4 |
ACC(s) distillation | 81.7 | 80.1 | 80.56 ± 0.75 | 78.2 | 82.76 | 82.33 | 80.2 | 81.14 | 80.2 | 83.3 |
P(t) before pruning (× 10⁷) | 4.45 | 6.02 | 4.45 | 1.42 | 2.68 | 2.68 | 0.78 | 1.22 | 1.22 | 3.04 |
P(s) before pruning (× 10⁷) | 2.56 | 2.13 | 1.17 | 0.42 | 1.42 | 2 | 0.56 | 0.91 | 0.42 | 1.93 |
ACC(t) breakpoint | 82.7 | 83.15 | 83.32 | 83.53 | 82.86 | 83.53 | 81.62 | 80.52 | 80.12 | 83.1 |
ACC(s) breakpoint | 83.51 | 80.71 | 80.78 | 82.93 | 83.02 | 83.02 | 80.74 | 81.15 | 80.13 | 83.78 |
P(t) breakpoint (× 10⁷) | 0.91 | 1.1 | 0.91 | 0.25 | 0.38 | 0.58 | 0.37 | 0.47 | 0.22 | 1.4 |
P(s) breakpoint (× 10⁷) | 0.6 | 0.51 | 0.34 | 0.16 | 0.2 | 0.41 | 0.27 | 0.27 | 0.18 | 0.8 |
Channel Pruning | | | | | | | | | | |
ACC(t) | 81.1 | 83.3 | 81.7 | 83.15 | 83.07 | 83.07 | 81.1 | 80.5 | 80.5 | 83.4 |
ACC(s) distillation | 81.7 | 80.1 | 80.65 ± 0.75 | 78.2 | 82.76 | 82.33 | 80.2 | 81.14 | 80.2 | 83.3 |
P(t) before pruning (× 10⁷) | 4.45 | 6.02 | 4.45 | 1.42 | 2.68 | 2.68 | 0.78 | 1.22 | 1.22 | 3.04 |
P(s) before pruning (× 10⁷) | 2.56 | 2.13 | 1.17 | 0.42 | 1.42 | 2 | 0.56 | 0.91 | 0.42 | 1.93 |
ACC(t) breakpoint | 80.2 | 80.1 | 80.2 | 79.12 | 82.45 | 82.45 | 80.1 | 80.65 | 80.5 | 83.4 |
ACC(s) breakpoint | 81.7 | 80 | 79.56 | 80 | 81.3 | 82.33 | 80.3 | 79.1 | 80.2 | 83.3 |
P(t) breakpoint (× 10⁷) | 2.1 | 3 | 2.1 | 0.73 | 1.5 | 1.5 | 0.67 | 1 | 1 | 2.5 |
P(s) breakpoint (× 10⁷) | 2 | 1 | 0.6 | 0.33 | 0.75 | 1.58 | 0.43 | 0.75 | 0.27 | 1.6 |
Teacher | Res101 | Res152 | Dens169 | Dens161 | Effici b1 | Effici b3 | Effici b5 |
---|---|---|---|---|---|---|---|
Channel pruning (× 10⁷) | 0.8 | 1.8 | 0.35 | 1.1 | 0.45 | 0.27 | 1.6 |
Weight pruning [1] (× 10⁷) | 0.32 | 0.37 | 0.18 | 0.29 | 0.27 | 0.15 | 0.8 |
[2] (× 10⁷) | 1.87 | 2.42 | 0.41 | 1.95 | 0.52 | 0.43 | 1.92 |
[10] (× 10⁷) | 1.12 | 2.1 | 0.36 | 1.76 | 0.51 | 0.38 | 1.81 |
[11] (× 10⁷) | 2.23 | 1.2 | 0.68 | 1.52 | 0.58 | 1 | 2.4 |
[13] (× 10⁷) | 3.5 | 2.8 | 0.71 | 1.43 | 0.65 | 1.1 | 2.51 |
[17] (× 10⁷) | 0.81 | 1.15 | 0.37 | 1.09 | 0.25 | 0.3 | 1.54 |
[21] (× 10⁷) | 0.74 | 0.83 | 0.32 | 0.85 | 0.43 | 0.21 | 1.67 |