1. Introduction
Diabetes is a chronic disease that appears when the body cannot produce enough insulin or cannot use it effectively. There are two types of diabetes: type 1, which is not preventable and requires the patient to be provided with external insulin, and type 2, which is the most common (90% of patients) and is preventable [1].
According to the 2021 Diabetes Atlas of the International Diabetes Federation (IDF) [1], 537 million people worldwide between 20 and 79 years of age have diabetes, and by 2030, the number could increase to 643 million. Following this trend, by 2045 this figure will reach 783 million people in that age range. Recently, Professor Andrew Boulton, President of the IDF, declared that diabetes is a “pandemic of unprecedented magnitude”, because an estimated 10.5% of the worldwide adult population (around half a billion people) lives with diabetes [2].
On the other hand, a possible complication of diabetes is diabetic retinopathy (DR), which is caused by elevated glucose concentrations in the blood, resulting in damage to blood vessels and the abnormal growth of new vessels in the retina [3]. This disease occurs in 30–40% of diabetic patients, so the prevalence of DR rises in parallel with that of diabetes. A recent global analysis estimates that the worldwide prevalence of DR is around 103 million patients and will rise to 161 million by 2045 [2].
Diabetic retinopathy has been studied for a long time. For example, Porta et al. [4] presented the history of diabetic eye disease as follows: in 1876, Wilhelm Manz described some pathologies associated with diabetic retinopathy, such as fibrovascular deterioration of the optic disc, adhesions in the vitreous and retina, detachment of the retina, and hemorrhages in the vitreous. In 1877, Mackenzie described the finding of microaneurysms with vitreous and retinal bleeding. In 1944, Ballantyne and Lowenstein were the first to use the term “diabetic retinopathy”. More recently, in 2017, Ometto et al. [5] studied the effects of factors unrelated to DR and how lesion distribution accounts for discrepancies between the decisions made by clinicians and the predictions of a risk model.
Because DR directly affects the blood vessels, it is the main cause of preventable blindness, with a prevalence of 22% in people with diabetes [6]. However, early detection and treatment can prevent vision loss due to diabetes, and current treatments can considerably reduce the risk of blindness through early diagnosis. For example, retinography is a non-invasive and accessible method that uses a color camera to generate high-quality images of the inner and back surface of the eye, showing the blood vessels and revealing possible lesions due to DR. Figure 1a shows the retinography of a healthy patient.
Diabetic retinopathy is characterized by microangiopathy in the venous capillaries, manifesting as microaneurysms, microvascular occlusions, arteriovenous short circuits, neovascularization, exudates, and hemorrhages [3]. In addition, retinal venous beading and cotton wool spots can be found. Neovascularization consists of the generation of new blood vessels due to a prolonged absence of blood in certain regions of the eye; these vessels generally have irregular, weak walls that are prone to rupture, and they are identified as prominences in the contour of the optic nerve. Retinal hemorrhages (Figure 1b) appear when the blood vessels weaken and begin to bleed, producing irregular reddish spots that cause either partial or total blindness by occluding light from the photoreceptors. Exudates (Figure 1c) occur when protein-rich substances leak from the vessels into the eye, forming brilliant deposits of various sizes, from small spots to large areas. Microaneurysms (Figure 1d) appear as small reddish spots in different areas, which can easily be mistaken for small hemorrhages but are induced by dilations of the blood vessels rather than by blood draining into the vitreous body; in diagnosis, these structures are used to evaluate the DR grade [3]. Cotton wool spots (Figure 1e) appear when the retinal arteries are constricted due to chronic hypertension and atherosclerosis [7]. Venous beading (Figure 1f) is characterized by the beaded appearance of the retinal venules due to long hypoxia cycles and the alternating dilation and constriction of the lumen.
The damage that DR causes to small retinal vessels also increases the risk of glaucoma (GLA), cataracts, and other eye problems. GLA (Figure 1g) is an eye condition that damages the optic nerve and is caused by higher-than-normal pressure in the eye. Open-angle glaucoma is characterized by irregular blind spots in peripheral or central vision, often occurring in both eyes, or by tunnel vision in very advanced stages.
Furthermore, age-related macular degeneration (AMD) is an important cause of irreversible blindness worldwide. It is a multi-factorial disease with causes such as age, race, gender, diet, cardiovascular illnesses, and genetic risk factors. In its early stages, AMD presents extracellular deposits in Bruch’s membrane. In late stages, it is characterized by geographic atrophy (GA), involving photoreceptor loss, retinal thinning, and a progressive, irreversible loss of central vision. In many cases, GA can evolve into choroidal neovascularization, the formation of new blood vessels in Bruch’s membrane, which causes hemorrhages that lead to further loss of vision [8]. Figure 1h shows a patient with macular degeneration.
Because DR is characterized by several retinal lesions, it is necessary to develop new multi-class image classification algorithms and to address the imbalance between the number of images per class found in most datasets. In addition, these algorithms must provide accurate classification to support medical diagnosis. Therefore, in this study, we present a knowledge distillation (KD) model using a new combination of loss functions. First, we used a pre-trained teacher model to distill the target dataset. Then, a student model was trained on the distilled dataset to match the performance of the teacher model. For evaluation, the proposed method was tested on two datasets: an imbalanced dataset containing 1000 images across 39 classes and a balanced dataset with 3592 images across eight categories. Experimental results show that the DR lesion classification system surpasses the accuracy of some state-of-the-art methods. The contributions of this work are summarized as follows:
Knowledge Distillation Model and Combination of Loss Functions:
– We present a knowledge distillation model and develop a new combination of the Kullback–Leibler divergence and categorical cross-entropy loss functions to address the problem of sample-number imbalance in the dataset.
– To the best of our knowledge, this is the first time this combination of loss functions has been used.
Teacher–Student Training Framework:
– We utilize a pre-trained teacher model to distill the dataset, providing a refined set of data for training.
– We train a student model on the distilled dataset to achieve performance comparable to the teacher model, ensuring effective knowledge transfer and maintaining a compact and efficient final model.
Dataset and Methodology:
– We test and evaluate different CNN architectures to determine the network that generates the best classification results.
Experimental Setup and Results:
– Experimental results on a public and an in-house dataset demonstrate that the proposed knowledge distillation model significantly reduces overfitting compared with baseline models. In addition, the distilled model shows robust generalization and surpasses some state-of-the-art methods.
The rest of the paper is organized as follows: Section 2 presents a review of recent works on DR detection, DR grading, and DR lesion classification. Section 3 describes the fundus image datasets used, the CNN architectures, and the loss functions utilized to classify DR lesions. Section 4 details the proposed knowledge distillation-based classification model for DR lesions. Section 5 presents the experimental setup, evaluation metrics, DR lesion classification performance analysis, and a comparison against other works in the literature, followed by a discussion of the results. Finally, the conclusions and future work are presented in Section 6.
2. Related Work
Various methods have been proposed to detect DR in fundus images, for example, by extracting blood vessels and using their information for DR diagnosis. Sundaram et al. [9] presented a hybrid blood vessel segmentation method that fuses morphological operators, vessel enhancement at multiple scales, and the bottom-hat algorithm. In comparison, Colomer et al. [10] proposed a blood vessel segmentation approach using sparse representations and dictionary-based learning. The proposed method was evaluated using RGB and intensity fundus images from two public databases.
Other works use a machine learning approach to grade DR according to the severity of the illness. For example, in [11], Sharif and Shah applied feature extraction to classify the grade of DR: contrast-limited adaptive histogram equalization (CLAHE), Gaussian curve fitting, and independent component analysis (ICA) were used as a pre-processing stage, and a multi-class Gaussian Bayes classifier and a multi-class support vector machine (SVM) were applied to classify the grade of DR. Kaur and Mittal [12] reported a lesion segmentation method based on an iterative clustering approach that takes into account heterogeneity and bright and weak edges, which is used to grade non-proliferative diabetic retinopathy.
On the other hand, other works directly classified retinal lesions to identify DR through generic classifiers. For example, in [13], Estudillo-Ayala et al. presented a method using multi-directional fractional-order Gaussian filters, the differential evolution algorithm, and the Kittler thresholding method to detect microaneurysms and hemorrhages in fundus images; the extracted structures were then classified by means of an SVM. In another work, Wang et al. [14] proposed a semi-supervised method using a series of both healthy and ill fundus images without annotation. The method separated blood vessels from the background and background noise by applying image processing steps. Furthermore, background noise was modeled as a stochastic variable through a mixture of Gaussians for normal and abnormal images, respectively. Both the image background and background noise models were then joined into a whole model to detect lesions such as haemorrhages, exudates, and cotton wool spots. In [15], Biswas et al. developed an intelligent system for the detection of DR using an SVM to detect foveal avascular zones and microaneurysms. In [16], an automatic method for detecting exudates in fundus images was presented, involving the following steps: shifting color correction, optic disc removal, and exudate detection. Afrin and Shill [17] presented a DR grading system based on retinal lesion detection (e.g., microaneurysms, blood vessels, and exudates). Similarly, in [18], Adal et al. reported an automatic lesion detection method based on longitudinal retinal changes caused by small red lesions in normalized images; this method reduces illumination variations and improves the contrast of small features in the retina. Sidibé et al. [19] used a sparse coding approach for retinal lesion classification, specifically for detecting fundus images with exudates or drusen and images without lesions; they applied a linear SVM over the sparse-coded features, improving on the bag-of-visual-words approach. In [20], Ghasemi Falavarjan et al. reported an advance in analyzing ultra-wide-field retinal images, enabling accurate measurements of peripheral retinal lesions. Finally, some less recent techniques are those of [21,22]. The first work presented a method to detect exudates by applying adaptive histogram equalization and thresholding for area calculation; micro-aneurysms were then detected using top-hat and bottom-hat transforms and finally classified using Otsu thresholding and the Hough transform. The second method used m-Mediods-based modeling together with a Gaussian mixture model.
Although the above methods obtained promising results, they are not suitable for classifying multiple DR lesions. Thus, CNNs have been widely used for multi-class problems, improving DR lesion diagnosis. Ashraf et al. [23] focused on identifying small red lesions that are less discriminative for early DR detection; a modified ResNet50 model was used to mitigate transfer learning issues such as over-fitting, domain adaptation, and performance degradation. In another work, Alsubai et al. [24] proposed a quantum-based deep CNN approach for DR classification in fundus images because of its parallel computing capacity. Priya et al. [25] and Bilal et al. [26] presented reviews of current methods for detecting non-proliferative diabetic retinopathy, exudates, hemorrhages, and microaneurysms using deep, intelligent systems. Abdelmaksoud et al. [27] presented a comprehensive computer-aided diagnosis system using deep learning for DR classification: first, a pre-processing stage removes noise, performs quality enhancement, and resizes fundus images for standardization; next, the system classifies healthy and DR patients; finally, blood vessels, exudates, hemorrhages, and microaneurysms are extracted. Hassan et al. [28] evaluated semantic segmentation, scene analysis, and hybrid DL systems for retinal lesion detection using optical coherence tomography (OCT) images; among the lesions extracted were intra-retinal fluid, drusen, hard exudates, chorioretinal anomalies, and sub-retinal fluid. In [29], Playout et al. presented a convolutional multi-task architecture with reinforcement learning for the simultaneous segmentation of red and bright lesions.
In contrast to traditional deep learning methods, which are resource-intensive, contain large numbers of parameters, and are not suitable for direct clinical application, this work proposes the use of the knowledge distillation (KD) algorithm. KD can be considered a dataset reduction method that focuses on creating more compact and representative sets from the original dataset, since it takes into account the most important characteristics of the images and preserves their information. This method not only reduces computational complexity and storage requirements but can also improve the efficiency and generalization of models trained on the reduced sets. Thus, Luo et al. [30] proposed a knowledge distillation approach in which a large teacher network guides a student network using image labels for DR grading; in addition, the authors incorporated class activation mapping (CAM), including an attention module and a mimicking stage. Abbasi et al. [31] presented a knowledge distillation approach to transfer the knowledge of a complex model to a simple model with fewer parameters; the method uses unlabeled data to transfer DR knowledge and incorporate it into a low-resource embedded system. A VGG network was used to train the teacher model, and then a simple model (the student model) was trained based on the teacher’s knowledge; this approach was used for binary DR classification. In another work, Gao et al. [32] reported a collaborative learning-based knowledge distillation framework used to enhance fundus images, increasing retinopathy detection accuracy and reducing the running time of the model; the collaborative strategy incorporated several student models of different scales and architectures to extract relevant diagnostic information, after which a transfer learning approach was applied. In [33], Islam et al. proposed a knowledge distillation approach in which two models (ResNet152V2 and a Swin transformer) were fused as teacher models; the knowledge learned from the teacher model was then used to train a lightweight student model (Xception) for the diabetic retinopathy classification task. In addition, a pre-processing stage comprising several methods, such as image denoising, ROI selection, thresholding, bounding box selection, cropping, unsharp masking, and gamma transformation, was applied to the APTOS and IDRiD datasets for both binary and multi-class classification. Ju et al. [34] proposed a DL strategy to learn from long-tailed fundus datasets, where a hybrid knowledge distillation method was used for DR lesion classification. In [35], Wang et al. presented a knowledge distillation approach focusing on the transfer of lesion knowledge applied to DR lesion segmentation. Finally, Salguero et al. [36] analyzed an approach similar to the one proposed in this paper: they processed the information with different unsupervised classification methods and relied on natural language processing with latent semantic analysis. Their dataset consisted of ovarian cancer pathology images, and when comparing the results obtained with 40% of the data against those obtained with the entire dataset, they reported a score of 87%.
3. Materials and Methods
In this section, the fundus image datasets used in this work are introduced. Two datasets are presented: the JSIEC1K dataset, containing 1000 images across 39 categories, and an in-house dataset with 3592 images balanced across eight categories. The characteristics, sources, and composition of these datasets are detailed.
Additionally, the fundamentals of convolutional neural networks are explained, including their architecture, key components such as convolutional and pooling layers, and the process of training these networks. Following this, two specific CNN architectures are discussed: Inception-v3 and YOLOv8. The Inception-v3 model, with its innovative use of inception modules, and the YOLOv8 architecture, known for its real-time object detection capabilities, are described in detail. Finally, the theory behind the loss functions used in the study, specifically the Kullback–Leibler divergence and categorical cross-entropy loss, is presented to provide a comprehensive understanding of the proposed method’s theoretical foundations.
3.1. Fundus Image Datasets
For the development of this project, we explored multiple fundus image datasets and found that the images tend to be highly consistent across different sources; the specialized equipment used for capturing these images generally produces very similar outputs, regardless of the specific device (see [37]). However, the queried datasets lacked sufficient images or labels to train deep learning models. Therefore, in this work, we selected two datasets with the characteristics described below. The first includes 1000 images belonging to 39 different categories but presents an imbalance between them, with a non-uniform distribution. The second contains 3592 images across eight categories and maintains a balance between them.
3.1.1. JSIEC1K Dataset
For the initial tests, the JSIEC1K dataset [38] was used, obtained from Kaggle public datasets (https://www.kaggle.com/datasets/linchundan/fundusimage1000/, accessed on 13 February 2024). It is an imbalanced collection of 1000 fundus images divided into 39 different categories. It is important to mention that the distribution of images is not uniform across categories, as some have a significantly higher number of samples than others. This imbalance, coupled with the large number of classes being trained with so few samples, may affect the test results. The images come from the Joint Shantou International Eye Center (JSIEC) in Guangdong Province, China, and are a small portion of a larger set of 209,494 fundus images. The main objective of this dataset is to be used for training and testing deep learning platforms. Table 1 shows the classes and the number of images per class of the JSIEC1K dataset.
3.1.2. In-House Dataset
On the other hand, an in-house, balanced dataset was also used for the tests. It consists of images from the “Messidor-2” dataset [39,40], downloaded from Kaggle public datasets (https://www.kaggle.com/datasets/geracollante/messidor2, accessed on 13 February 2024), containing 1748 fundus images at two different resolutions. Out of the 1748 images, 1705 were grouped into eight classes: macular degeneration (55), exudates (149), hemorrhages (82), cotton wool spots (56), microaneurysms (156), glaucoma (69), venous beading (59), and healthy patients (449). Later, a data augmentation strategy using the Python package Augmentor was applied to balance the set with respect to the number of healthy-patient images, yielding 3592 images (449 for each group). In contrast to the JSIEC1K dataset, this dataset shows less diversity but a more balanced distribution. Table 2 shows the classes and the number of images per class of the in-house dataset.
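A minimal sketch of how such a balancing step could look with the Augmentor package is shown below; the directory names, operation parameters, and sample counts are illustrative assumptions, not the exact pipeline used to build the in-house dataset:
```python
# Hypothetical class-balancing sketch with the Augmentor package.
# Paths and augmentation parameters are illustrative assumptions.
import Augmentor

for class_dir in ["macular_degeneration", "exudates", "hemorrhages",
                  "cotton_wool_spots", "microaneurysms", "glaucoma",
                  "venous_beading"]:
    p = Augmentor.Pipeline(f"messidor2/{class_dir}")  # hypothetical folder layout
    # Mild geometric perturbations that preserve retinal structures.
    p.rotate(probability=0.7, max_left_rotation=10, max_right_rotation=10)
    p.flip_left_right(probability=0.5)
    p.zoom_random(probability=0.5, percentage_area=0.9)
    p.sample(449)  # number of augmented images to generate for this class
```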
3.2. Convolutional Neural Networks
A CNN is composed of so-called convolutional layers, which are in charge of extracting local features from a set of images by applying different convolution kernels (filters). As a result of these convolutions, feature maps are obtained. Thus, a convolutional layer containing $K$ kernels can detect $K$ different features (feature maps) after being trained. In Figure 2, a convolutional layer and its corresponding feature maps are presented using a gray-scale image and a bank of $K$ filters.
In a nutshell, each feature map $y_k$, with $k = 1, \dots, K$, can be mathematically described as shown in Equation (1):
$$y_k = \sum_{n=1}^{N} w_k^{(n)} * x^{(n)}, \qquad (1)$$
where $x$ is a multi-spectral image with $N$ color bands, $n$ is the color band number, $w_k^{(n)}$ is the sub-kernel applied to band $n$, and $*$ represents the 2D convolution operator.
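To make Equation (1) concrete, the following minimal sketch (with random data standing in for a real fundus image) computes a single feature map by summing the per-band convolutions:
```python
# Illustrative computation of one feature map y_k from Equation (1).
import numpy as np
from scipy.signal import convolve2d

N = 3                           # number of color bands
x = np.random.rand(64, 64, N)   # stands in for a multi-spectral image
w_k = np.random.randn(3, 3, N)  # one kernel with N sub-kernels

# y_k = sum over bands of (sub-kernel * band), * being 2D convolution
y_k = sum(convolve2d(x[:, :, n], w_k[:, :, n], mode="same") for n in range(N))
print(y_k.shape)  # (64, 64): a single feature map
```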
Furthermore, a pooling layer, or downsampling layer, is commonly incorporated at the output of each convolutional layer. This dimensionality reduction decreases the number of neurons in the convolutional network and reduces overfitting and computational complexity.
Mathematically, the pooling layer slides a two-dimensional filter over each feature map. Thus, for a feature map of dimension $h \times w \times c$, the output of the pooling layer will be $\left\lfloor \frac{h - f}{s} + 1 \right\rfloor \times \left\lfloor \frac{w - f}{s} + 1 \right\rfloor \times c$, where $f$ is the size of the filter and $s$ is the stride length.
Pooling layers can be classified into three categories: max-pooling, min-pooling, and average-pooling layers, with max-pooling being the most commonly used. The max-pooling layer reduces the dimension of the feature maps and improves robustness to changes in the position of a feature in the image (local translation invariance). This is achieved by selecting the maximum response in each patch of each feature map.
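As a quick illustration of the output-size relation above (using Keras, on which the models in this work are built; the tensor sizes are arbitrary), a 2 × 2 max-pooling layer with stride 2 halves each spatial dimension:
```python
# Check of the pooling output size floor((h - f)/s) + 1 for f = 2, s = 2.
import tensorflow as tf

feature_maps = tf.random.normal((1, 32, 32, 16))  # batch, h, w, channels
pool = tf.keras.layers.MaxPooling2D(pool_size=2, strides=2)
print(pool(feature_maps).shape)  # (1, 16, 16, 16): (32 - 2)/2 + 1 = 16
```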
A simple CNN architecture for classification contains several pairs of convolutional–pooling layers, followed by another kind of layer, the fully connected layer, in which each output of the previous layer is connected by a weight to each neuron in the layer. Finally, the output classification labels of the CNN are generated by an activation function.
Figure 3 shows a typical CNN architecture.
Training a CNN consists of adjusting the unknown parameters of the network, i.e., the weights and biases of all connections. In this regard, the backpropagation algorithm is the most frequently used process to iteratively adjust these weights through the minimization of a loss function, for example, the difference between the estimated classes (outputs) and the references (labels).
On the other hand, pre-trained CNN models can be used when a sufficiently large dataset is not available. This strategy is known as transfer learning, and it allows us to fine-tune only the last layers, e.g., the last fully connected layer, instead of training the whole model from scratch.
In transfer learning, the weights of a pre-trained model are transferred to the model to be trained, in contrast to randomly initializing its weights [41]. For many image classification problems, such as medical image classification, this approach is sufficient. In this regard, VGG16, Inception-v3, and ResNet50 are three of the models with the highest scores in the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC2014) [42].
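As a generic illustration of this strategy (a minimal sketch, not the exact configuration used later in this work; the class count is arbitrary), the convolutional base of a pre-trained model is frozen and only a new classification head is trained:
```python
# Transfer-learning sketch: reuse ImageNet weights, train only a new head.
import tensorflow as tf

base = tf.keras.applications.InceptionV3(weights="imagenet", include_top=False)
base.trainable = False  # freeze the pre-trained convolutional base

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(8, activation="softmax"),  # arbitrary class count
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
```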
3.2.1. Inception-v3
Inception-v1 is a 22-layer network that started as a module for GoogLeNet, a CNN developed by researchers at Google that used deeper networks than its predecessors for image classification. The Inception-v3 CNN [43], developed by the same group as the original Inception, is composed of 48 convolutional layers and “Inception modules”, which are used to reduce the number of parameters while maintaining the efficiency of the network. Furthermore, inception modules handle information at different scales efficiently, improving the performance of image recognition tasks.
In Inception-v3, the authors showed how large convolution kernels can be expressed more efficiently by a series of smaller convolutions, surpassing its ancestor GoogLeNet on the ImageNet benchmark. Thus, Inception-v3 implements three types of Inception modules (Table 3), which replace large convolutions with convolutions of different, smaller sizes.
Inception-v3 introduces several improvements over its predecessors, such as the factorization of large convolutions into smaller ones and the use of asymmetric convolutions, reducing both the number of parameters and the complexity of the model. Figure 4 shows the Inception-v3 architecture.
3.2.2. YOLOv8
Another popular CNN architecture is YOLO (You Only Look Once), a family of real-time object detection models that has been key in the evolution of computer vision. It was introduced in a paper titled “You Only Look Once: Unified, Real-Time Object Detection”, published in 2016 by Joseph Redmon et al. [44]. That work presented a novel approach to object detection in images, which stood out for its ability to perform real-time detection with a single pass of the convolutional neural network.
In 2023, YOLOv8 was released by Ultralytics, the developers of YOLOv5 [45]. YOLOv8 builds upon previous versions and brings in novel features and enhancements in performance and flexibility. It was designed for accuracy, speed, and ease of use across a wide range of tasks, e.g., object detection and image segmentation and classification, even with large amounts of data. YOLOv8 is composed of two parts: the so-called backbone and the head. Figure 5 shows the YOLO architecture.
A modified CSPDarknet53 architecture is used as the backbone of YOLOv8. It is composed of 53 convolutional layers and cross-stage partial connections, which improve the information flow between layers. The head section is composed of several convolutional layers followed by fully connected layers; these layers predict the bounding boxes, the probabilities of the identified objects, and the objectness score.
As a main feature, a self-attention mechanism is used in the head of YOLOv8. This mechanism focuses on different sections of the image and adjusts the relevance of the features according to their importance for the task performed.
In addition, YOLOv8 can detect objects of different sizes in the image using a multi-scale approach. This is made possible by a pyramid network composed of multiple layers that detect objects at different scales (see Figure 5).
3.3. Kullback–Leibler Divergence and Categorical Cross-Entropy Loss
We used a combination of two different loss functions, Kullback–Leibler divergence [
46] and the categorical cross-entropy loss function [
47].
On the one hand, KL divergence allows us to measure the deviation between two probability distributions through their relative entropy. KL divergence is shown in Equation (2):
$$D_{KL}(P \parallel Q) = \sum_{x \in X} P(x) \log \frac{P(x)}{Q(x)}, \qquad (2)$$
where $P$ and $Q$ are two probability distributions of a random variable $x$ over the discrete sample space $X$.
It is important to note that KL divergence is not a distance measure, because it is not symmetric, i.e., $D_{KL}(P \parallel Q) \neq D_{KL}(Q \parallel P)$.
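A small numeric check of this asymmetry, with two arbitrary example distributions:
```python
# D_KL(P || Q) generally differs from D_KL(Q || P), so KL is not a distance.
import numpy as np

def kl_divergence(p, q):
    """Relative entropy between two discrete distributions (Equation (2))."""
    p, q = np.asarray(p), np.asarray(q)
    return float(np.sum(p * np.log(p / q)))

P = [0.7, 0.2, 0.1]
Q = [0.4, 0.4, 0.2]
print(kl_divergence(P, Q))  # ~0.184
print(kl_divergence(Q, P))  # ~0.192 -> not symmetric
```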
On the other hand, categorical cross-entropy is a loss function widely used in multi-class classification problems. It combines the softmax activation function with the classic cross-entropy; therefore, it is commonly called softmax loss.
The softmax activation function is given by Equation (3):
$$\sigma(Z)_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}, \quad i = 1, \dots, C, \qquad (3)$$
where $Z$ is an input vector of $C$ real values and $\sigma(Z)$ is the output vector, whose elements range between 0 and 1 and sum up to 1. Thus, $\sigma(Z)$ consists of $C$ probabilities, which are proportional to the exponentials of their real input values.
Finally, the categorical cross-entropy definition is shown in Equation (4):
$$CE = -\sum_{i=1}^{C} t_i \log\left(\sigma(Z)_i\right), \qquad (4)$$
where $t$ is the target vector (ground-truth labels).
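A short worked example of Equations (3) and (4), with arbitrary class scores:
```python
# Softmax turns raw scores into probabilities; categorical cross-entropy
# then penalizes the probability assigned to the true class.
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # shift for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])  # C = 3 raw class scores
target = np.array([1.0, 0.0, 0.0])  # one-hot ground-truth vector t

probs = softmax(logits)                 # ~[0.659, 0.242, 0.099], sums to 1
loss = -np.sum(target * np.log(probs))  # CE = -log(0.659) ~ 0.417
print(probs, loss)
```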
4. Knowledge Distillation-Based Classification Model
In this work, a knowledge distillation approach to classify DR lesions in fundus images is proposed. The main aim is to develop a simple deep learning model capable of running on smaller devices that lack the computational capacity to process all fundus images, using instead a reduced set of images that captures the most important characteristics of the dataset.
The proposed lesion classification framework using knowledge distillation is a two-stage model, as shown in Figure 6. It is composed of two main phases: a teacher model and a student model. In the first phase, the teacher model extracts the relevant information related to DR lesions by applying a deep CNN approach. In the second phase, through transfer learning, the extracted DR diagnostic information is used to train the student model.
We used the Keras implementation of Inception-v3 as a base and loaded it without its top layer, allowing it to adapt to our data with a different number of target classes. In this case, the model is prepared to work with RGB images (three color channels) of a fixed input size. Two tests were conducted to determine whether or not to use pre-trained weights: one with pre-trained weights from the ImageNet benchmark and another without pre-trained weights, training from scratch.
We observed that using the pre-trained ImageNet weights improved the model’s accuracy by 1.1% to 2.3%, depending on whether data augmentation was used to counteract the imbalanced dataset (1.1% without augmentation, 2.3% with it).
On top of this base Inception-v3 model, custom layers were added to adapt it to our task. First, the output of Inception-v3 was flattened into a vector. Then, a dense (fully connected) layer with 1024 neurons and ReLU activation was added, followed by a dropout layer with a rate of 0.2 to reduce overfitting. Finally, an output dense layer was added with as many neurons as classes to be predicted (in this case, 39), using a softmax activation function for multi-class classification.
The model was compiled with the RMSprop optimizer, a learning rate of 0.0001, and the categorical cross-entropy loss, which is suitable for multi-class classification tasks. Tests were also conducted with the Adam optimizer, which provided better results, so Adam was adopted in subsequent iterations of the model.
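For clarity, the following sketch reproduces the architecture and compilation just described; the 299 × 299 input size is an assumption (the Keras default for Inception-v3), as the exact size is not restated here:
```python
# Sketch of the classification head built on Inception-v3 without its top.
import tensorflow as tf
from tensorflow.keras import layers, models

base = tf.keras.applications.InceptionV3(
    weights="imagenet",   # pre-trained weights improved accuracy by 1.1-2.3%
    include_top=False,
    input_shape=(299, 299, 3))  # assumed input size (Keras default)

model = models.Sequential([
    base,
    layers.Flatten(),                        # flatten the base output
    layers.Dense(1024, activation="relu"),   # dense layer with 1024 neurons
    layers.Dropout(0.2),                     # dropout rate of 0.2
    layers.Dense(39, activation="softmax"),  # one output neuron per class
])

model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-4),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```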
The knowledge distillation method employed in this study consists of several key stages (Figure 6), each contributing to the overall goal of creating a compact, efficient representation of the original dataset. These stages encompass the entire process from initial model preparation to final evaluation. We will now describe each of these stages in detail.
Stage 1: Teacher Model Preparation. This stage involves initializing an Inception-v3 model that has been pre-trained on ImageNet. The use of a pre-trained model leverages transfer learning, providing a strong foundation of general visual features. This model is then fine-tuned on the fundus image dataset to be distilled, allowing it to specialize in retinal image analysis while retaining its broader visual understanding. The resulting fine-tuned model becomes the teacher model, embodying the comprehensive knowledge that we aim to distill into a more compact form.
Stage 2: Student Model Initialization. In this stage, another Inception-v3 model architecture is initialized. Like the teacher model, it starts with weights from ImageNet pre-training. However, unlike the teacher model, this student model will not be trained on the full original dataset. Instead, it is prepared to learn from the distilled dataset that will be created in subsequent stages. This approach aims to create a model that can achieve performance comparable to that of the teacher model while using significantly less training data.
Stage 3: Synthetic Dataset Initialization. This stage marks the beginning of the knowledge distillation process. A small subset of samples is randomly selected from the original training dataset. These samples form the initial synthetic dataset, which will serve as the starting point for the distillation process. The goal is to refine this small set of samples to encapsulate the essential information from the entire original dataset.
Stage 4: Knowledge Distillation Process. This is the core of the knowledge distillation process. It involves an iterative procedure to optimize the synthetic dataset. In each iteration, the student model, using the custom loss function, is trained on the current version of the synthetic dataset for one epoch. The student’s performance is then evaluated on the original training dataset to assess how well it has captured the knowledge. Gradients of the loss with respect to the synthetic dataset are computed using automatic differentiation. These gradients guide the update of the synthetic dataset through gradient descent, aiming to minimize the loss on the real dataset. This procedure is carried out for a specified number of iterations, progressively refining the synthetic dataset to better represent the knowledge contained in the original dataset; a sketch of one such iteration is given below.
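The following is a simplified, hypothetical sketch of one outer iteration using TensorFlow's GradientTape. For brevity, the inner training is reduced to a single differentiable SGD step, `apply_weights` is a hypothetical helper that runs a forward pass with an updated weight list, and all names and hyperparameters are illustrative rather than the exact implementation:
```python
# Hypothetical sketch: refine (syn_x, syn_y) so that a student updated on
# them performs better on the real dataset (performance matching).
import tensorflow as tf

cce = tf.keras.losses.CategoricalCrossentropy()
syn_opt = tf.keras.optimizers.SGD(learning_rate=0.1)  # updates the synthetic set
inner_lr = 0.01                                       # student step size

def distillation_iteration(student, syn_x, syn_y, real_x, real_y):
    # syn_x and syn_y are tf.Variables, watched automatically by the tape.
    with tf.GradientTape() as outer_tape:
        # Inner step: one manual, differentiable SGD update of the student
        # on the synthetic data, so gradients can flow back to syn_x/syn_y.
        with tf.GradientTape() as inner_tape:
            inner_loss = cce(syn_y, student(syn_x, training=True))
        grads = inner_tape.gradient(inner_loss, student.trainable_variables)
        updated_weights = [w - inner_lr * g
                           for w, g in zip(student.trainable_variables, grads)]
        # Evaluate the virtually updated student on real data; apply_weights
        # is a hypothetical helper running the forward pass with these weights.
        outer_loss = cce(real_y, apply_weights(student, updated_weights, real_x))
    dx, dy = outer_tape.gradient(outer_loss, [syn_x, syn_y])
    syn_opt.apply_gradients([(dx, syn_x), (dy, syn_y)])
    return outer_loss
```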
Stage 5: Student Model Training. Once the knowledge distillation process is complete, the student model is trained on the final distilled synthetic dataset. This training uses a custom loss function that combines the Kullback–Leibler divergence and categorical cross-entropy. This combined loss helps the student model not only learn the correct classifications but also mimic the probability distributions of the teacher model’s outputs, potentially capturing some of the teacher’s nuanced knowledge.
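A possible form of this combined loss is sketched below; the weighting factor `alpha` and the softening `temperature` are illustrative assumptions not specified in the text:
```python
# Sketch of a combined KL + categorical cross-entropy distillation loss.
import tensorflow as tf

kld = tf.keras.losses.KLDivergence()
cce = tf.keras.losses.CategoricalCrossentropy()

def distillation_loss(y_true, student_logits, teacher_logits,
                      alpha=0.5, temperature=3.0):
    # KL term: mimic the teacher's softened output distribution (Equation (2)).
    soft_teacher = tf.nn.softmax(teacher_logits / temperature)
    soft_student = tf.nn.softmax(student_logits / temperature)
    kl_term = kld(soft_teacher, soft_student)
    # CCE term: fit the ground-truth labels (Equation (4)).
    ce_term = cce(y_true, tf.nn.softmax(student_logits))
    return alpha * kl_term + (1.0 - alpha) * ce_term
```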
Stage 6: Evaluation. In the final stage, the trained student model is evaluated on a separate validation dataset corresponding to the dataset it was trained on, and its performance is compared to that of the teacher model. This comparison allows us to assess how well the knowledge distillation process has worked. Ideally, the student model should achieve performance comparable to the teacher model despite being trained on a much smaller, synthetic dataset.
The overall method applies knowledge distillation to compress a large training dataset into a smaller synthetic dataset while aiming to maintain model performance. It leverages transfer learning through ImageNet pre-training and uses meta-learning-based performance matching in the distillation process. The goal is to create a compact, efficient training dataset that encapsulates the essential information from the original dataset, allowing for faster and more resource-efficient model training without significant loss in performance.
The teacher model approach in this knowledge distillation process demonstrates several significant strengths across its various stages. In the preparation phase, the model leverages the powerful Inception-v3 architecture, pre-trained on ImageNet, providing a robust foundation of general image features. This transfer learning approach allows the model to start with a rich set of visual representations, which is particularly beneficial for medical imaging tasks where large-scale, domain-specific datasets may be limited. The strategy of freezing pre-trained layers while adding custom top layers strikes a balance between preserving general image understanding and adapting to the specific nuances of retinal image classification.
The student model initialization phase shows strength in its flexibility, setting all layers to be trainable. This approach allows the student model to fully adapt to the task at hand, potentially refining the transferred knowledge from ImageNet to be more specific to retinal imagery. The use of the same Inception-v3 architecture for both teacher and student models facilitates direct knowledge transfer, ensuring that the student can effectively learn from the teacher’s representations.
The knowledge distillation process itself demonstrates a sophisticated approach to learning, employing a custom loss function that combines Kullback–Leibler divergence and categorical cross-entropy. This dual-objective function balances the goals of mimicking the teacher’s soft predictions and maintaining high task-specific performance. By using the teacher’s predictions as soft targets through the KL term, the student model can capture nuanced information beyond the hard class labels, including the teacher’s uncertainties, which can lead to improved generalization. The categorical cross-entropy component ensures the student maintains high task performance by focusing on correct classifications. This combination also provides a regularizing effect, which is crucial when training on a small synthetic dataset, helping to prevent overfitting.
6. Conclusions
The knowledge distillation approach employed in this study has demonstrated remarkable effectiveness in enhancing the performance of retinal image classification for diabetic retinopathy detection. One of the most significant achievements of the distilled model is its superior generalization capability. The narrow gap between training (99.01%) and validation (97.30%) accuracies suggests that the model has effectively mitigated overfitting issues observed in baseline models.
This substantial improvement enables more effective and efficient learning, which is crucial in the context of medical image analysis. The high accuracy and robust generalization of the distilled model suggest strong potential for clinical application in DR screening.
The distilled model shows a significant reduction in overfitting compared to the baseline models, as evidenced by the close alignment of training and validation performance metrics. Such a characteristic is particularly important in medical applications, where generalizability is crucial for accurate diagnosis across diverse patient populations. Finally, the proposed method was compared against some state-of-the-art works, outperforming them while maintaining a compact and efficient final model.
Future research directions could explore the application of this knowledge distillation technique to other medical imaging tasks or to larger, more diverse datasets. This could include adapting the method for the detection and classification of other retinal diseases, such as age-related macular degeneration or glaucoma, and extending its application to different medical imaging modalities, e.g., fluorescein angiography or optical coherence tomography. Additionally, investigating the model’s interpretability and its performance in real-world clinical settings would be valuable next steps. This could involve collaborating with ophthalmologists to assess the model’s performance in comparison to human experts and to understand how the model’s predictions align with clinical decision-making processes. Finally, the integration of this model into existing clinical workflows could be analyzed. This could be achieved through several steps: (1) developing a user-friendly interface for clinicians, (2) ensuring seamless connectivity with current imaging systems, and (3) implementing robust data handling and privacy measures.