Knowledge Distillation in Image Classification: The Impact of Datasets
Abstract
1. Introduction
- Teacher model training: The first step is to train a large and complex model (the teacher) on a given dataset to achieve high accuracy.
- Generation of soft targets: The trained teacher model is then used to make predictions on the training data, producing probability distributions (soft targets) over the possible classes. These soft targets contain more information than the hard targets (i.e., the actual labels), as they reflect the relative confidence of the teacher model in its predictions. The soft targets are obtained with a softmax function scaled by a temperature $T$, $p_i = \frac{\exp(z_i/T)}{\sum_j \exp(z_j/T)}$, where $z_i$ is the teacher logit for class $i$; higher temperatures produce softer distributions.
- Student model training: The smaller student model is trained using a combination of the original true labels and the soft targets generated by the teacher model. The loss function typically combines a standard classification term with a distillation term that measures the difference between the student and teacher probability distributions, usually the Kullback–Leibler (KL) divergence. For a teacher distribution $p$ and a student distribution $q$, the KL divergence is defined as $\mathrm{KL}(p \,\|\, q) = \sum_i p_i \log \frac{p_i}{q_i}$. The final loss is typically defined as $\mathcal{L} = \alpha \, \mathcal{L}_{\mathrm{CE}}(y, q) + (1-\alpha)\, T^2 \, \mathrm{KL}\big(p^{(T)} \,\|\, q^{(T)}\big)$, where $\mathcal{L}_{\mathrm{CE}}$ is the cross-entropy with the true labels $y$, $p^{(T)}$ and $q^{(T)}$ are the temperature-softened teacher and student distributions, $\alpha$ balances the two terms, and the $T^2$ factor keeps the gradients of both terms on a comparable scale. A code sketch of this combined loss is given after this list.
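To make the three-step recipe above concrete, the following is a minimal PyTorch sketch of the combined distillation loss (a standard formulation following Hinton et al. [8], not the authors' exact code); the temperature `T` and weight `alpha` values shown are illustrative placeholders.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Hard-label cross-entropy combined with a temperature-softened KL distillation term."""
    # Standard classification loss against the true labels (hard targets).
    ce_loss = F.cross_entropy(student_logits, labels)

    # Temperature-softened teacher (soft targets) and student distributions.
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    log_soft_student = F.log_softmax(student_logits / T, dim=1)

    # KL divergence between teacher and student; the T^2 factor keeps the
    # gradient magnitude comparable to the cross-entropy term.
    kd_loss = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T * T)

    return alpha * ce_loss + (1.0 - alpha) * kd_loss
```

During training, `teacher_logits` come from a forward pass of the frozen teacher on the same batch (typically under `torch.no_grad()`), so only the student's parameters are updated.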
2. Related Work
2.1. Knowledge Distillation in the Literature
Ref | Year | Type | Dataset | Data Type | Teacher | Student | Acc
---|---|---|---|---|---|---|---
[8] | 2015 | Article | JFT, MNIST | Images | DNN | DNN | ✓
[28] | 2019 | Article | PPMI, Willow, UIUC-Sport | Images | AlexNet | AlexNet | ✓
[29] | 2021 | Article | | Images | DNN | DNN | ✓
[30] | 2017 | Conf. | CIFAR100 | Images | VGG13, ResNet32x4 | VGG13, ResNet32x4 | ✓
[31] | 2018 | Conf. | CIFAR100 | Images | ResNet34 | VGG9 | ✓
[32] | 2020 | Conf. | CIFAR10, CIFAR100 | Images | ResNet26 | ResNet8, ResNet14 | ✓
[20] | 2014 | Article | CIFAR10, CIFAR100, SVHN, MNIST, AFLW | Images | | FitNet | ✓
[33] | 2021 | Article | CIFAR100 | Images | ResNet20 | ResNet8 | ✓
[34] | 2024 | Article | CIFAR100, ImageNet | Images | ResNet152 | ResNet18 | ✓
[35] | 2023 | Article | CIFAR100 | Images | ResNet50, ResNet34 | ResNet18 | ✓
[36] | 2023 | Article | CIFAR100 | Images | ResNet18 | ResNet18 | ✓
2.2. Role of Datasets for Model Training by KD
2.3. Research Gap and Motivation
3. Research Method
3.1. Datasets Selection
Dataset Description and Complexity Classification
- CIFAR-10 [5]: This dataset consists of 60,000 32 × 32 color images across ten different classes, each containing 6000 images. The classes include common objects like cars, dogs, and cats. The addition of color and more diverse objects increases the complexity compared to MNIST and USPS. Criteria: larger image size (32 × 32 pixels), three-channel color images, more diverse classes, and significant background variations.
- CIFAR-100 [5]: Similar to CIFAR-10, CIFAR-100 has 100 classes, with 600 images per class. It covers a broader range of object categories; the increased number of classes and the finer distinctions between categories make it a more complex classification task than the previous datasets. Criteria: same image size (32 × 32 pixels) and color channels as CIFAR-10, but a much larger number of classes (100), increasing variability and the challenge of classification.
- USPS [11] is a digit dataset automatically scanned from envelopes by the U.S. Postal Service containing a total of 9298 16 × 16 pixel grayscale samples; the images are centered and normalized and show a broad range of font styles. Similar to MNIST, USPS contains images of handwritten digits. It is slightly more challenging than MNIST but still relatively simple. Criteria: small image size (16 × 16 pixels), same number of classes (10 digits), and slight variations in style and noise compared to MNIST.
- MNIST [41] is a dataset of 28 × 28 grayscale images of handwritten digits across ten classes. The dataset is relatively simple and is often used as a beginner's benchmark for image classification tasks. Criteria: small image size (28 × 28 pixels), a limited number of classes (10 digits), and a simple, uniform structure with minimal noise.
- Fashion MNIST [7] is a dataset of 28 × 28 grayscale images of fashion items, such as clothing and accessories. It consists of ten classes and is often used as a drop-in replacement for the traditional MNIST dataset in image classification tasks. It is more complex than MNIST because the model must distinguish various types of clothing items. Criteria: same image size (28 × 28 pixels) as MNIST, but with 10 classes of clothing, introducing more variability in shapes and textures. (A minimal loading sketch for these five datasets is given after the complexity criteria below.)
- Dimensionality: the resolution and color channels of the images. Higher resolution and multiple color channels generally increase the complexity of the dataset, as they require more sophisticated models to capture detail.
- Class diversity: the number and variability of classes within the dataset. A larger number of classes with significant differences between them increases complexity because the model has to distinguish between a larger set of categories.
- Data volume: the size of the dataset in terms of the number of samples. Larger datasets can be more complex to manage and require more computing resources, but they also provide more information for training robust models.
- Variability: the level of noise, background variation, and object diversity within the dataset. Datasets with high variability in object appearance, backgrounds, and noise levels are more difficult for models to learn and generalize.
- Domain specificity: the within-domain specificity and variability of the dataset (e.g., handwritten digits versus real-world objects). Datasets from domains with high intra-class variability and inter-class similarity are considered more complex due to the subtler distinctions that need to be learned.
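All five datasets are available through torchvision, which offers one convenient way to load them for the experiments described above. The snippet below is an illustrative sketch under that assumption (the paper does not state its data-loading tooling), and the `./data` root path is a placeholder.

```python
from torchvision import datasets, transforms

# Minimal common transform; in practice, per-dataset normalization and channel
# replication (grayscale -> 3 channels for ResNet inputs) may also be needed.
to_tensor = transforms.ToTensor()
root = "./data"  # hypothetical local path

train_sets = {
    "CIFAR-10":      datasets.CIFAR10(root, train=True, download=True, transform=to_tensor),
    "CIFAR-100":     datasets.CIFAR100(root, train=True, download=True, transform=to_tensor),
    "USPS":          datasets.USPS(root, train=True, download=True, transform=to_tensor),
    "MNIST":         datasets.MNIST(root, train=True, download=True, transform=to_tensor),
    "Fashion MNIST": datasets.FashionMNIST(root, train=True, download=True, transform=to_tensor),
}

for name, ds in train_sets.items():
    image, _ = ds[0]
    print(f"{name}: {len(ds)} training images, tensor shape {tuple(image.shape)}")
```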
3.2. Model Architecture Details
3.3. Knowledge Distillation Processes
3.3.1. Response-Based Knowledge Distillation (RKD)
3.3.2. Intermediate Knowledge Distillation
4. Experimental Setup and Results Analysis
4.1. Experimental Setup
4.2. Results Analysis
4.2.1. Analysis of the Results of the Teacher and Student Models from Scratch
4.2.2. RKD Performance Results Analysis
4.2.3. IKD Performance Results Analysis
4.2.4. Analysis of the Impact of the Database on Knowledge Distillation
- Knowledge distillation has a considerable effect on problems involving complex databases. The more complex the database, the deeper and more powerful the model required for training; with a powerful teacher model capable of capturing the relevant knowledge, the transfer to the student model is ensured.
- Comparing the performance obtained with RKD to that obtained with IKD on the different databases, we conclude that IKD is preferable to RKD when dealing with complex databases.
5. Discussion
5.1. Impact of Database Complexity on Distillation
5.2. Optimisation of Distillation Strategies
5.3. Limitation of the Study
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A
References
- Chen, G.; Choi, W.; Yu, X.; Han, T.; Chandraker, M. Learning efficient object detection models with knowledge distillation. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30.
- Feng, Y.; Wang, H.; Hu, H.R.; Yu, L.; Wang, W.; Wang, S. Triplet distillation for deep face recognition. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Virtual, 25–28 October 2020; pp. 808–812.
- Jiao, X.; Yin, Y.; Shang, L.; Jiang, X.; Chen, X.; Li, L.; Wang, F.; Liu, Q. TinyBERT: Distilling BERT for natural language understanding. arXiv 2019, arXiv:1909.10351.
- Wang, H.; Li, Y.; Wang, Y.; Hu, H.; Yang, M.H. Collaborative distillation for ultra-resolution universal style transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1860–1869.
- Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images. Master’s Thesis, Department of Computer Science, University of Toronto, Toronto, ON, Canada, 2009. Available online: https://www.cs.toronto.edu/~kriz/learningfeatures-2009-TR.pdf (accessed on 1 June 2024).
- LeCun, Y. The MNIST Database of Handwritten Digits. 1998. Available online: http://yann.lecun.com/exdb/mnist/ (accessed on 15 July 2024).
- Xiao, H.; Rasul, K.; Vollgraf, R. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv 2017, arXiv:1708.07747.
- Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531.
- Ba, J.; Caruana, R. Do deep nets really need to be deep? arXiv 2014, arXiv:1312.6184.
- Radosavovic, I.; Dollár, P.; Girshick, R.; Gkioxari, G.; He, K. Data distillation: Towards omni-supervised learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4119–4128.
- Hull, J. A database for handwritten text recognition research. IEEE Trans. Pattern Anal. Mach. Intell. 1994, 16, 550–554.
- Che, Z.; Purushotham, S.; Khemani, R.; Liu, Y. Distilling knowledge from deep networks with applications to healthcare domain. arXiv 2015, arXiv:1512.03542.
- Tian, Y.; Krishnan, D.; Isola, P. Contrastive representation distillation. arXiv 2019, arXiv:1910.10699.
- Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31.
- Wang, H.; Lohit, S.; Jones, M.N.; Fu, Y. What makes a “good” data augmentation in knowledge distillation: A statistical perspective. Adv. Neural Inf. Process. Syst. 2022, 35, 13456–13469.
- Das, D.; Massa, H.; Kulkarni, A.; Rekatsinas, T. An empirical analysis of the impact of data augmentation on knowledge distillation. arXiv 2020, arXiv:2006.03810.
- Shorten, C.; Khoshgoftaar, T.M. A survey on image data augmentation for deep learning. J. Big Data 2019, 6, 1–48.
- Tung, F.; Mori, G. Similarity-preserving knowledge distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1365–1374.
- Alabbasy, F.M.; Abohamama, A.; Alrahmawy, M.F. Compressing medical deep neural network models for edge devices using knowledge distillation. J. King Saud Univ.-Comput. Inf. Sci. 2023, 35, 101616.
- Romero, A.; Ballas, N.; Kahou, S.E.; Chassang, A.; Gatta, C.; Bengio, Y. FitNets: Hints for thin deep nets. arXiv 2014, arXiv:1412.6550.
- Zhang, X.; Chang, H.; Hao, Y.; Chang, D. MKTN: Adversarial-based multifarious knowledge transfer network from complementary teachers. Int. J. Comput. Intell. Syst. 2024, 17, 72.
- Zhou, T.; Chiam, K.H. Synthetic data generation method for data-free knowledge distillation in regression neural networks. Expert Syst. Appl. 2023, 227, 120327.
- Zhang, J.; Tao, Z.; Zhang, S.; Qiao, Z.; Guo, K. Soft hybrid knowledge distillation against deep neural networks. Neurocomputing 2024, 570, 127142.
- Wang, C.; Wang, Z.; Chen, D.; Zhou, S.; Feng, Y.; Chen, C. Online adversarial knowledge distillation for graph neural networks. Expert Syst. Appl. 2024, 237, 121671.
- Guermazi, E.; Mdhaffar, A.; Jmaiel, M.; Freisleben, B. MulKD: Multi-layer knowledge distillation via collaborative learning. Eng. Appl. Artif. Intell. 2024, 133, 108170.
- Ojha, U.; Li, Y.; Sundara Rajan, A.; Liang, Y.; Lee, Y.J. What knowledge gets distilled in knowledge distillation? Adv. Neural Inf. Process. Syst. 2023, 36, 11037–11048.
- Zagoruyko, S.; Komodakis, N. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv 2016, arXiv:1612.03928.
- Li, H.T.; Lin, S.C.; Chen, C.Y.; Chiang, C.K. Layer-level knowledge distillation for deep neural network learning. Appl. Sci. 2019, 9, 1966.
- Chen, L.; Chen, Y.; Xi, J.; Le, X. Knowledge from the original network: Restore a better pruned network with knowledge distillation. Complex Intell. Syst. 2022, 8, 709–718.
- Yim, J.; Joo, D.; Bae, J.; Kim, J. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4133–4141.
- Srinivas, S.; Fleuret, F. Knowledge transfer with Jacobian matching. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 4723–4731. Available online: https://proceedings.mlr.press/v80/srinivas18a.html (accessed on 25 June 2024).
- Mirzadeh, S.I.; Farajtabar, M.; Li, A.; Levine, N.; Matsukawa, A.; Ghasemzadeh, H. Improved knowledge distillation via teacher assistant. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 5191–5198.
- Bang, D.; Lee, J.; Shim, H. Distilling from professors: Enhancing the knowledge distillation of teachers. Inf. Sci. 2021, 576, 743–755.
- Li, Z.; Li, X.; Yang, L.; Song, R.; Yang, J.; Pan, Z. Dual teachers for self-knowledge distillation. Pattern Recognit. 2024, 151, 110422.
- Shang, R.; Li, W.; Zhu, S.; Jiao, L.; Li, Y. Multi-teacher knowledge distillation based on joint guidance of probe and adaptive corrector. Neural Netw. 2023, 164, 345–356.
- Cho, Y.; Ham, G.; Lee, J.H.; Kim, D. Ambiguity-aware robust teacher (ART): Enhanced self-knowledge distillation framework with pruned teacher network. Pattern Recognit. 2023, 140, 109541.
- Torralba, A.; Efros, A.A. Unbiased look at dataset bias. In Proceedings of the CVPR 2011, Colorado Springs, CO, USA, 20–25 June 2011; pp. 1521–1528.
- Oliva, A.; Torralba, A. Modeling the shape of the scene: A holistic representation of the spatial envelope. Int. J. Comput. Vis. 2001, 42, 145–175.
- Basu, M.; Ho, T.K. Data Complexity in Pattern Recognition; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2006.
- Shah, B.; Bhavsar, H. Time complexity in deep learning models. Procedia Comput. Sci. 2022, 215, 202–210.
- Deng, L. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Process. Mag. 2012, 29, 141–142.
- Albawi, S.; Mohammed, T.A.; Al-Zawi, S. Understanding of a convolutional neural network. In Proceedings of the 2017 International Conference on Engineering and Technology (ICET), Antalya, Turkey, 21–23 August 2017; pp. 1–6.
- O’Shea, K.; Nash, R. An introduction to convolutional neural networks. arXiv 2015, arXiv:1511.08458.
- Sharma, N.; Jain, V.; Mishra, A. An analysis of convolutional neural networks for image classification. Procedia Comput. Sci. 2018, 132, 377–384.
- Rawat, W.; Wang, Z. Deep convolutional neural networks for image classification: A comprehensive review. Neural Comput. 2017, 29, 2352–2449.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Kullback, S. Information Theory and Statistics; Courier Corporation: North Chelmsford, MA, USA, 1997.
- Gou, J.; Yu, B.; Maybank, S.J.; Tao, D. Knowledge distillation: A survey. Int. J. Comput. Vis. 2021, 129, 1789–1819.
- Bengio, Y.; Courville, A.; Vincent, P. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 1798–1828.
Dataset | Image Size (pixels) | No. of Classes | No. of Images | Complexity Level |
---|---|---|---|---|
CIFAR-10 | 32 × 32 | 10 | 60,000 | Moderate to high |
CIFAR-100 | 32 × 32 | 100 | 60,000 | High |
USPS | 16 × 16 | 10 | 9298 | Moderate |
MNIST | 28 × 28 | 10 | 60,000 | Low |
Fashion MNIST | 28 × 28 | 10 | 60,000 | Low to moderate |
Feature | ResNet50 (Teacher Model) | ResNet18 (Student Model) |
---|---|---|
Total layers | 50 | 18 |
Initial Conv Layer | 7 × 7, 64, stride 2 | 7 × 7, 64, stride 2 |
Initial Pooling Layer | 3 × 3 Max Pool, stride 2 | 3 × 3 Max Pool, stride 2 |
Residual Block 1/Channels | 3 Bottleneck Units/256 | 2 Basic Units/64 |
Residual Block 2/Channels | 4 Bottleneck Units/512 | 2 Basic Units/128 |
Residual Block 3/Channels | 6 Bottleneck Units/1024 | 2 Basic Units/256 |
Residual Block 4/Channels | 3 Bottleneck Units/2048 | 2 Basic Units/512 |
Pooling Layer | Global Avg Pool | Global Avg Pool |
Fully Connected Layer | 1000-d FC Layer | 1000-d FC Layer |
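For reference, the following is a minimal PyTorch/torchvision sketch of how a ResNet50 teacher and ResNet18 student pair matching the table can be instantiated, with the 1000-d ImageNet head replaced by a layer sized to the target dataset (e.g., 100 classes for CIFAR-100). The exact construction used by the authors is not specified, so treat this as an assumption.

```python
import torch.nn as nn
from torchvision import models

def build_teacher_student(num_classes: int = 100):
    """Instantiate a ResNet50 teacher and a ResNet18 student for `num_classes` classes."""
    teacher = models.resnet50(weights=None)  # trained from scratch on the target dataset
    teacher.fc = nn.Linear(teacher.fc.in_features, num_classes)

    student = models.resnet18(weights=None)
    student.fc = nn.Linear(student.fc.in_features, num_classes)
    return teacher, student

teacher, student = build_teacher_student(num_classes=100)  # e.g., CIFAR-100
```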
Dataset | Training | Validation | Test |
---|---|---|---|
CIFAR-10 | 50,000 | 7000 | 3000 |
CIFAR-100 | 50,000 | 7000 | 3000 |
USPS | 7291 | 1404 | 603 |
MNIST | 60,000 | 7000 | 3000 |
Fashion MNIST | 60,000 | 7000 | 3000 |
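The validation/test figures above suggest that each dataset's held-out set was partitioned into validation and test subsets. As an illustration, here is a hedged sketch of one way to obtain the CIFAR-10 split of 7000 validation and 3000 test images from its 10,000-image test set; the actual split procedure and random seed used by the authors are not stated.

```python
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

# CIFAR-10 ships 50,000 training and 10,000 held-out images.
held_out = datasets.CIFAR10("./data", train=False, download=True,
                            transform=transforms.ToTensor())

# 7000 validation / 3000 test images, matching the table above.
# The fixed generator seed is an assumption made for reproducibility.
val_set, test_set = random_split(
    held_out, [7000, 3000], generator=torch.Generator().manual_seed(0)
)
print(len(val_set), len(test_set))  # 7000 3000
```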
Dataset | ResNet50 Accuracy (%) | ResNet18 Accuracy (%) | Difference |
---|---|---|---|
MNIST | 98.33 | 97.9 | −0.43 |
FashionMNIST | 89.9 | 88.47 | −1.43 |
USPS | 90.22 | 86.07 | −4.15 |
CIFAR10 | 75.23 | 63.47 | −11.76 |
CIFAR100 | 48.7 | 34.6 | −14.1 |
Dataset | ResNet18 Scratch Accuracy (%) | ResNet18 RKD Accuracy (%) | Difference |
---|---|---|---|
MNIST | 97.9 | 98 | +0.1 |
FashionMNIST | 88.47 | 87.6 | −0.87 |
USPS | 86.07 | 88.72 | +2.65 |
CIFAR10 | 63.47 | 64.13 | +0.66 |
CIFAR100 | 34.6 | 35.03 | +0.43 |
Dataset | ResNet18 Scratch Accuracy (%) | ResNet18 IKD Accuracy (%) | Difference |
---|---|---|---|
MNIST | 97.9 | 98.43 | +0.53 |
FashionMNIST | 88.47 | 89.97 | +1.5 |
USPS | 86.07 | 88.72 | +2.65 |
CIFAR10 | 63.47 | 74.7 | +11.23 |
CIFAR100 | 34.6 | 49.83 | +15.23 |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).