Bi-Level Orthogonal Multi-Teacher Distillation
Abstract
1. Introduction
- Our work introduces BOMD (Bi-Level Orthogonal Multi-Teacher Distillation), a novel approach that combines orthogonal projections and bi-level optimization for effective knowledge transfer from an ensemble of diverse teacher models.
- A key component of our BOMD method is the use of bi-level optimization to learn optimal weighting factors for combining knowledge from multiple teachers. Unlike heuristic weighting strategies, our approach treats the weighting factors as upper-level variables and the student’s parameters as lower-level variables in a nested optimization problem (a minimal code sketch of this nesting follows this list).
- Through extensive experiments on benchmark datasets, we validate the effectiveness and flexibility of our BOMD approach. Our method achieves state-of-the-art performance on the CIFAR-100 benchmark for multi-teacher knowledge distillation, consistently outperforming existing approaches across diverse teacher–student scenarios, including homogeneous and heterogeneous teacher ensembles.
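To make the nested structure concrete, the sketch below shows one way the upper-level update of the teacher weights could be approximated with a one-step unrolled student update in PyTorch (2.x, for `torch.func.functional_call`). It is an illustration under assumed names and hyperparameters (`weighted_kd`, `hypergradient_step`, the inner learning rate, and the temperature are our own choices), not the paper's exact formulation or released code.

```python
# Hedged sketch: teacher weighting factors as upper-level variables, student
# parameters as lower-level variables, with a one-step unrolled approximation
# of the lower-level solution. Names and hyperparameters are illustrative.
import torch
import torch.nn.functional as F
from torch.func import functional_call  # requires PyTorch >= 2.0


def weighted_kd(student_logits, teacher_logits_list, weights, T=4.0):
    """Weighted sum of temperature-scaled KL divergences to each teacher."""
    loss = 0.0
    for w, t_logits in zip(weights, teacher_logits_list):
        p_t = F.softmax(t_logits / T, dim=1)
        log_p_s = F.log_softmax(student_logits / T, dim=1)
        loss = loss + w * F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T
    return loss


def hypergradient_step(student, teachers, alpha, batch_train, batch_val, inner_lr=0.05):
    """Return d(validation loss)/d(alpha) after a virtual one-step SGD update
    of the student on the weighted distillation loss (upper-level gradient)."""
    x_tr, y_tr = batch_train
    x_val, y_val = batch_val
    weights = torch.softmax(alpha, dim=0)       # upper-level variables

    params = dict(student.named_parameters())   # lower-level variables
    s_tr = functional_call(student, params, (x_tr,))
    with torch.no_grad():
        t_tr = [t(x_tr) for t in teachers]      # frozen teacher predictions
    inner_loss = F.cross_entropy(s_tr, y_tr) + weighted_kd(s_tr, t_tr, weights)

    # Virtual SGD step kept inside the autograd graph (create_graph=True),
    # so the validation loss below depends on alpha through the update.
    grads = torch.autograd.grad(inner_loss, list(params.values()), create_graph=True)
    new_params = {name: p - inner_lr * g
                  for (name, p), g in zip(params.items(), grads)}

    s_val = functional_call(student, new_params, (x_val,))
    outer_loss = F.cross_entropy(s_val, y_val)
    return torch.autograd.grad(outer_loss, alpha)[0]
```

A full training schedule (again an assumption, not the paper's) would start from `alpha = torch.zeros(num_teachers, requires_grad=True)`, alternate ordinary student steps on the weighted distillation loss with updates of `alpha` driven by `hypergradient_step`, and use `softmax(alpha)` as the final per-teacher weighting.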
2. Related Work
2.1. Knowledge Distillation
2.2. Multi-Teacher Knowledge Distillation
2.3. Difference of Our Method vs. Existing Methods
3. Bi-Level Orthogonal Multi-Teacher Distillation
3.1. Multi-Teacher Feature-Based Distillation
3.2. Multi-Teacher Logit-Based Distillation
3.3. Multiple Orthogonal Projections
3.4. Benefits and Limitations
3.5. Bi-Level Optimization for Weighting Factors
4. Experiments
4.1. Datasets and Implementation Details
4.2. Settings and Hyperparameters
4.3. Experimental Framework and Devices
5. Experiment Results
5.1. Distillation Performance of Multi-Teacher KD Methods on CIFAR-100
5.2. Compared to Single-Teacher Methods
5.3. Distillation Performance on Large-Scale Datasets
5.4. Results on CIFAR-100 with Three Teachers
5.5. Results on CIFAR-100 with Five Teachers
5.6. Advantages over Other Methods
5.7. Analysis of Our Method
5.8. Limitations of Our Method
6. Conclusions
- Align teacher features with the student’s feature space through orthogonal projections, preserving structural properties during knowledge transfer (an illustrative sketch follows this list).
- Optimize weighting factors for combining teacher knowledge using a principled bi-level optimization approach.
- Achieve significant performance improvements even when distilling to very compact student models.
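As a concrete illustration of the first point, the sketch below projects each teacher's penultimate features into the student's feature dimension through a semi-orthogonal linear map and penalizes the weighted feature discrepancy. The module name, the direction of projection, and the MSE objective are our own illustrative assumptions rather than the paper's exact design.

```python
# Hedged sketch of an orthogonal feature projector for multi-teacher
# feature distillation; names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils.parametrizations import orthogonal


class OrthogonalProjector(nn.Module):
    """Maps d_teacher-dim features to d_student-dim through a semi-orthogonal
    matrix, so relational structure is preserved up to the projected subspace."""
    def __init__(self, d_teacher: int, d_student: int):
        super().__init__()
        # The parametrization keeps the weight semi-orthogonal while it is trained.
        self.proj = orthogonal(nn.Linear(d_teacher, d_student, bias=False))

    def forward(self, f_teacher: torch.Tensor) -> torch.Tensor:
        return self.proj(f_teacher)


def feature_distill_loss(f_student, teacher_feats, projectors, weights):
    """Weighted MSE between the student's features and each teacher's
    orthogonally projected features."""
    loss = 0.0
    for w, f_t, proj in zip(weights, teacher_feats, projectors):
        loss = loss + w * F.mse_loss(f_student, proj(f_t))
    return loss


# Example: two teachers with 2048- and 1024-dim penultimate features and a
# student with 512-dim features; the projectors would be trained jointly
# with the student.
projectors = nn.ModuleList([OrthogonalProjector(2048, 512),
                            OrthogonalProjector(1024, 512)])
f_s = torch.randn(8, 512)
f_ts = [torch.randn(8, 2048), torch.randn(8, 1024)]
loss = feature_distill_loss(f_s, f_ts, projectors, weights=[0.5, 0.5])
```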
Limitations and Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Dong, P.; Niu, X.; Li, L.; Xie, L.; Zou, W.; Ye, T.; Wei, Z.; Pan, H. Prior-Guided One-shot Neural Architecture Search. arXiv 2022, arXiv:2206.13329.
2. Dong, P.; Li, L.; Wei, Z.; Niu, X.; Tian, Z.; Pan, H. EMQ: Evolving Training-free Proxies for Automated Mixed Precision Quantization. In Proceedings of the International Conference on Computer Vision (ICCV), Paris, France, 4–6 October 2023.
3. Zhu, C.; Li, L.; Wu, Y.; Sun, Z. Saswot: Real-time semantic segmentation architecture search without training. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 7722–7730.
4. Wei, Z.; Dong, P.; Hui, Z.; Li, A.; Li, L.; Lu, M.; Pan, H.; Li, D. Auto-prox: Training-free vision transformer architecture search via automatic proxy discovery. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 15814–15822.
5. Wei, Z.; Pan, H.; Li, L.; Dong, P.; Tian, Z.; Niu, X.; Li, D. TVT: Training-Free Vision Transformer Search on Tiny Datasets. arXiv 2023, arXiv:2311.14337.
6. Lu, L.; Chen, Z.; Lu, X.; Rao, Y.; Li, L.; Pang, S. UniADS: Universal Architecture-Distiller Search for Distillation Gap. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024.
7. Dong, P.; Li, L.; Pan, X.; Wei, Z.; Liu, X.; Wang, Q.; Chu, X. ParZC: Parametric Zero-Cost Proxies for Efficient NAS. arXiv 2024, arXiv:2402.02105.
8. Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531.
9. Fukuda, T.; Suzuki, M.; Kurata, G.; Thomas, S.; Cui, J.; Ramabhadran, B. Efficient Knowledge Distillation from an Ensemble of Teachers. In Proceedings of the Interspeech, Stockholm, Sweden, 20–24 August 2017; pp. 3697–3701.
10. Chen, D.; Mei, J.P.; Wang, C.; Feng, Y.; Chen, C. Online knowledge distillation with diverse peers. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 3430–3437.
11. Zhang, H.; Chen, D.; Wang, C. Confidence-aware multi-teacher knowledge distillation. In Proceedings of the ICASSP 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 4498–4502.
12. Kwon, K.; Na, H.; Lee, H.; Kim, N.S. Adaptive knowledge distillation based on entropy. In Proceedings of the ICASSP 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 7409–7413.
13. Du, S.; You, S.; Li, X.; Wu, J.; Wang, F.; Qian, C.; Zhang, C. Agree to disagree: Adaptive ensemble knowledge distillation in gradient space. Adv. Neural Inf. Process. Syst. 2020, 33, 12345–12355.
14. Li, L. Self-regulated feature learning via teacher-free feature distillation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 347–363.
15. Dong, P.; Li, L.; Wei, Z. Diswot: Student architecture search for distillation without training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 11898–11908.
16. Liu, X.; Li, L.; Li, C.; Yao, A. Norm: Knowledge distillation via n-to-one representation matching. arXiv 2023, arXiv:2305.13803.
17. Li, L.; Liang, S.N.; Yang, Y.; Jin, Z. Teacher-free distillation via regularizing intermediate representation. In Proceedings of the 2022 International Joint Conference on Neural Networks (IJCNN), Padua, Italy, 18–23 July 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–6.
18. Li, L.; Dong, P.; Wei, Z.; Yang, Y. Automated knowledge distillation via monte carlo tree search. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 17413–17424.
19. Li, L.; Dong, P.; Li, A.; Wei, Z.; Yang, Y. Kd-zero: Evolving knowledge distiller for any teacher–student pairs. Adv. Neural Inf. Process. Syst. 2023, 36, 69490–69504.
20. Buciluǎ, C.; Caruana, R.; Niculescu-Mizil, A. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, 20–23 August 2006; pp. 535–541.
21. Romero, A.; Ballas, N.; Kahou, S.E.; Chassang, A.; Gatta, C.; Bengio, Y. Fitnets: Hints for thin deep nets. arXiv 2014, arXiv:1412.6550.
22. Yim, J.; Joo, D.; Bae, J.; Kim, J. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4133–4141.
23. Zagoruyko, S.; Komodakis, N. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv 2016, arXiv:1612.03928.
24. Tian, Y.; Krishnan, D.; Isola, P. Contrastive Representation Distillation. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 30 April 2020.
25. Yuan, F.; Shou, L.; Pei, J.; Lin, W.; Gong, M.; Fu, Y.; Jiang, D. Reinforced multi-teacher selection for knowledge distillation. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 14284–14291.
26. Ahn, S.; Hu, S.X.; Damianou, A.; Lawrence, N.D.; Dai, Z. Variational information distillation for knowledge transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9163–9171.
27. Yang, J.; Martinez, B.; Bulat, A.; Tzimiropoulos, G. Knowledge distillation via softmax regression representation learning. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual Event, Austria, 3–7 May 2021.
28. Chen, D.; Mei, J.; Zhang, Y.; Wang, C.; Wang, Z.; Feng, Y.; Chen, C. Cross-Layer Distillation with Semantic Calibration. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; pp. 7028–7036.
Table: Top-1 test accuracy (%) of multi-teacher KD methods on CIFAR-100 (Section 5.1).

| Teacher | VGG13 | ResNet32x4 | ResNet32x4 | WRN-40-2 | WRN-40-2 | ResNet20x4 | ARI (%) |
|---|---|---|---|---|---|---|---|
| Teacher accuracy | 75.17 ± 0.18 | 79.31 ± 0.14 | 79.31 ± 0.14 | 76.62 ± 0.26 | 76.62 ± 0.26 | 78.632 ± 0.24 | |
| Ensemble | 77.07 | 81.16 | 81.16 | 79.62 | 79.62 | 80.81 | |
| Student | VGG8 | MobileNetV2 | VGG8 | MobileNetV2 | WRN-40-1 | ShuffleNetV1 | / |
| Student accuracy | 70.74 ± 0.40 | 65.64 ± 0.19 | 70.74 ± 0.40 | 65.64 ± 0.19 | 71.93 ± 0.22 | 71.70 ± 0.43 | |
| AVER [9] | 73.98 ± 0.13 | 68.42 ± 0.06 | 73.23 ± 0.35 | 69.67 ± 0.01 | 74.56 ± 0.13 | 75.73 ± 0.02 | 49.97% |
| AEKD-logits [13] | 73.82 ± 0.09 | 68.39 ± 0.13 | 73.22 ± 0.29 | 69.56 ± 0.34 | 74.18 ± 0.25 | 75.93 ± 0.32 | 54.87% |
| FitNet-MKD [21] | 74.05 ± 0.07 | 68.46 ± 0.49 | 73.24 ± 0.24 | 69.29 ± 0.42 | 74.95 ± 0.30 | 75.98 ± 0.06 | 46.97% |
| AEKD-feature [13] | 73.99 ± 0.15 | 68.18 ± 0.06 | 73.38 ± 0.16 | 69.44 ± 0.25 | 74.96 ± 0.18 | 76.86 ± 0.03 | 43.16% |
| CA-MKD [11] | 74.27 ± 0.16 | 69.19 ± 0.04 | 75.08 ± 0.07 | 70.87 ± 0.14 | 75.27 ± 0.21 | 77.19 ± 0.49 | 11.98% |
| BOMD | 74.90 ± 0.07 | 69.88 ± 0.04 | 75.86 ± 0.18 | 71.56 ± 0.03 | 75.78 ± 0.11 | 77.98 ± 0.35 | / |
Table: Comparison with single-teacher KD methods on CIFAR-100 (Section 5.2).

| Teacher | ResNet32x4 | WRN-40-2 | WRN-40-2 |
|---|---|---|---|
| Teacher accuracy | 79.31 ± 0.14 | 76.62 ± 0.26 | 76.62 ± 0.26 |
| Student | MobileNetV2 | MobileNetV2 | WRN-40-1 |
| Student accuracy | 65.64 ± 0.19 | 65.64 ± 0.19 | 71.93 ± 0.22 |
| KD [8] | 67.57 ± 0.10 | 69.31 ± 0.20 | 74.22 ± 0.09 |
| AT [23] | 67.38 ± 0.21 | 69.18 ± 0.37 | 74.83 ± 0.15 |
| VID [26] | 67.78 ± 0.13 | 68.57 ± 0.11 | 74.37 ± 0.22 |
| CRD [24] | 69.04 ± 0.16 | 70.14 ± 0.06 | 74.82 ± 0.06 |
| SRRL [27] | 68.77 ± 0.06 | 69.44 ± 0.13 | 74.60 ± 0.04 |
| SemCKD [28] | 68.86 ± 0.26 | 69.61 ± 0.05 | 74.41 ± 0.16 |
| BOMD | 69.89 ± 0.12 | 71.45 ± 0.12 | 75.76 ± 0.15 |
Table: Distillation performance on Stanford Dogs and Tiny-ImageNet (Section 5.3).

| Dataset | Stanford Dogs | Stanford Dogs | Tiny-ImageNet | Tiny-ImageNet |
|---|---|---|---|---|
| Teacher | ResNet101 | ResNet34x4 | ResNet32x4 | VGG13 |
| Teacher accuracy | 68.39 ± 1.44 | 66.07 ± 0.51 | 53.38 ± 0.11 | 49.17 ± 0.33 |
| Student | ShuffleNetV2x0.5 | ShuffleNetV2x0.5 | MobileNetV2 | MobileNetV2 |
| Student accuracy | 59.36 ± 0.73 | 59.36 ± 0.73 | 39.46 ± 0.38 | 39.46 ± 0.38 |
| AVER [9] | 65.13 ± 0.13 | 63.46 ± 0.21 | 41.78 ± 0.15 | 41.87 ± 0.11 |
| EBKD [12] | 64.28 ± 0.13 | 64.19 ± 0.11 | 41.24 ± 0.11 | 41.46 ± 0.24 |
| CA-MKD [11] | 64.09 ± 0.35 | 64.28 ± 0.20 | 43.90 ± 0.09 | 42.65 ± 0.05 |
| AEKD-feature [13] | 64.91 ± 0.21 | 62.13 ± 0.29 | 42.03 ± 0.12 | 41.56 ± 0.14 |
| AEKD-logits [13] | 65.18 ± 0.24 | 63.97 ± 0.14 | 41.46 ± 0.28 | 41.19 ± 0.23 |
| BOMD | 65.54 ± 0.12 | 64.67 ± 0.18 | 44.21 ± 0.04 | 44.35 ± 0.12 |
Table: Results on CIFAR-100 with three teachers of different architectures (Section 5.4); teacher accuracy (%) in parentheses.

| | Setting 1 | Setting 2 | Setting 3 |
|---|---|---|---|
| Teachers | ResNet56 (73.47), ResNet20x4 (78.39), VGG13 (75.19) | ResNet8 (59.32), WRN-40-2 (76.51), ResNet20x4 (78.39) | VGG11 (71.52), VGG13 (75.19), ResNet32x4 (79.31) |
| Student | VGG8 (70.74 ± 0.40) | ResNet8x4 (72.79 ± 0.14) | VGG8 (70.74 ± 0.40) |
| FitNet-MKD [21] | 75.06 ± 0.13 | 75.21 ± 0.12 | 73.43 ± 0.08 |
| AVER [9] | 75.11 ± 0.57 | 75.16 ± 0.11 | 73.59 ± 0.06 |
| EBKD [12] | 74.18 ± 0.22 | 75.44 ± 0.29 | 73.45 ± 0.08 |
| AEKD-feature [13] | 74.69 ± 0.57 | 73.98 ± 0.18 | 73.40 ± 0.06 |
| AEKD-logits [13] | 75.17 ± 0.30 | 73.93 ± 0.17 | 74.15 ± 0.08 |
| CA-MKD [11] | 75.53 ± 0.14 | 75.27 ± 0.18 | 74.63 ± 0.17 |
| BOMD | 76.42 ± 0.15 | 76.49 ± 0.14 | 75.98 ± 0.14 |
Table: Results on CIFAR-100 with five teachers of different architectures (Section 5.5); teacher accuracy (%) in parentheses.

| | Setting 1 | Setting 2 | Setting 3 |
|---|---|---|---|
| Teachers | ResNet8 (59.32), VGG11 (71.52), ResNet56 (73.47), VGG13 (75.19), ResNet32x4 (79.31) | VGG11 (71.52), ResNet56 (73.47), VGG13 (75.19), ResNet20x4 (78.39), ResNet32x4 (79.31) | ResNet8 (59.32), VGG11 (71.52), VGG13 (75.19), WRN-40-2 (76.51), ResNet20x4 (78.39) |
| Student | VGG8 (70.74 ± 0.40) | VGG8 (70.74 ± 0.40) | MobileNetV2 (65.64 ± 0.19) |
| AEKD-feature [13] | 74.02 ± 0.08 | 75.06 ± 0.03 | 69.41 ± 0.21 |
| AVER [9] | 74.47 ± 0.47 | 74.48 ± 0.12 | 69.41 ± 0.04 |
| AEKD-logits [13] | 73.53 ± 0.10 | 74.90 ± 0.17 | 69.28 ± 0.21 |
| EBKD [12] | 74.37 ± 0.07 | 73.94 ± 0.29 | 69.26 ± 0.64 |
| CA-MKD [11] | 74.64 ± 0.23 | 75.02 ± 0.21 | 70.30 ± 0.51 |
| BOMD | 75.56 ± 0.34 | 75.32 ± 0.13 | 71.46 ± 0.26 |
Share and Cite
Gong, S.; Wen, W. Bi-Level Orthogonal Multi-Teacher Distillation. Electronics 2024, 13, 3345. https://doi.org/10.3390/electronics13163345