1. Introduction
Eyelid tumors are complicated and diverse, including tumors of the eyelid, conjunctiva, various layers of ocular tissues (cornea, sclera, uvea, and retina), and ocular appendages (lacrimal apparatus, orbit, and periorbital tissues) [1,2,3]. Primary malignant tumors of the eye can spread to the periorbital area, invade intracranially, or metastasize systemically, and malignant tumors of other organs and tissues throughout the body can also metastasize to the eye. Therefore, eyelid tumors cover almost all histological types of tumors in the whole body and are widely representative, making them an ideal object of study for the pathological diagnosis of tumors.
Basal cell carcinoma (BCC) is a type of skin cancer that originates in the basal cells of the epidermis. It is the most common type of skin cancer, often occurring on sun-exposed areas of the body. Meibomian gland carcinoma (MGC) is a rare form of cancer affecting the meibomian glands in the eyelid, which secrete an oily substance for eye lubrication. MGC typically presents as a slow-growing lump on the eyelid, potentially mistaken for a benign cyst. Cutaneous melanoma (CM) is a type of skin cancer arising from pigment-producing cells known as melanocytes. It is less common than BCC, but more aggressive and capable of spreading to other parts of the body if left untreated. CM typically appears as a dark-colored new or changing mole or patch of skin, but may also present as a pink or red patch. According to morbidity studies, BCC is the most common malignant eyelid tumor, followed by CM and MGC [4,5,6,7].
Computed Tomography (CT) and Magnetic Resonance Imaging (MRI) have limitations that affect their respective clinical applications. Biopsy is an important tool for physicians to definitively diagnose eyelid tumors: pathological diagnosis is the "gold standard", and the observation and analysis of histopathological images of biopsies is an important basis on which physicians formulate the best treatment plan [8,9,10]. This observation and analysis generally require judgments about the nature, location, and extent of the tumor. However, the extreme shortage of human resources and the overload of pathology departments are far from meeting clinical patients' needs for accurate and efficient diagnostic pathology. Accurate diagnosis of BCC, MGC, and CM is essential for optimal patient outcomes, as early diagnosis is a key factor in determining the likelihood of a cure. While the physical appearance of these skin cancers may be distinctive, a biopsy is typically required for a definitive diagnosis. Histologically, these types of eyelid tumors can be similar, making misdiagnosis possible when relying on histological slides alone. The importance of accurate diagnosis cannot be overstated, as 90% patient survival has been associated with early detection in cases involving pathology-based diagnosis.
Inspired by the concept of knowledge distillation [11], we train a teacher-student model to classify and segment eyelid tumors, achieving good performance with a smaller, more efficient student network. In this paper, we study the classification and segmentation of tumors in eyelid tumor pathology images using deep learning methods; the overall flowchart of the network is shown in Figure 1. The main contributions of this paper include the following:
- (1) We propose a network model called Multiscale-Attention-Crop-ResNet (MAC-ResNet), which achieves 96.8%, 94.6%, and 90.8% accuracy in automatically classifying the three ocular malignancies, respectively.
- (2) With the help of knowledge distillation, we train the student network ResNet50 using MAC-ResNet as the teacher network, enabling the smaller-scale model to obtain better classification results on our eyelid tumor dataset, which we call the ZLet dataset.
- (3) We train three targeted segmentation networks, one for each of the three malignant tumors, which segment the corresponding tumor locations well. With the help of the classification and segmentation networks, we can diagnose the disease and rapidly localize the lesion area.
2. Related Work
The pathological segmentation and classification of eyelid tumors are crucial aspects of ocular oncology, as early diagnosis and treatment can significantly improve patient outcomes. One of the most common types of skin cancer that can occur on the eyelid is basal cell carcinoma (BCC). This type of cancer arises from the basal cells in the skin and is often caused by prolonged exposure to ultraviolet radiation. While BCC is not typically life-threatening, if left untreated it can cause significant damage to the skin and surrounding tissues. Cutaneous melanoma (CM), on the other hand, is a more aggressive form of skin cancer that originates from the pigment-producing cells in the skin. While less common than BCC, it has a higher likelihood of spreading to other parts of the body and can be deadly if not caught early. A rare type of cancer that can affect the eyelid is meibomian gland carcinoma (MGC), which arises from the meibomian glands that produce oil to keep the eye moist. MGC is generally more aggressive than BCC and can spread to other parts of the body if not treated promptly.
Accurately distinguishing between these three types of tumors is vital for treatment planning and research. Patients diagnosed with BCC may be treated with surgical or other local interventions to remove the tumor, while those diagnosed with cutaneous melanoma may require more aggressive treatment approaches, such as surgery, radiation therapy, or chemotherapy, in order to prevent the spread of the cancer. In addition, accurate classification and segmentation of eyelid tumors have significant value for research, including the study of the biology and genetics of these tumors, the evaluation of treatment response and disease progression, and the development of diagnostic and treatment algorithms. Therefore, a reliable method for classifying and segmenting eyelid tumors is necessary.
In recent years, with the development of deep learning in the field of computer vision, medical image processing based on deep learning has become a popular research topic in computer-aided diagnosis [12,13,14]. Deep learning methods are gradually being used for the diagnosis and screening of a variety of ophthalmic diseases; however, little research has been conducted on eyelid tumors.
In 2019, Hekler et al. used a pre-trained ResNet50 [15] network, trained on 695 whole slide images (WSIs) by transfer learning, to reduce diagnostic errors between benign moles and malignant melanoma [16]. Xie et al. used the VGG19 [17] and ResNet50 networks to classify patches generated from histopathological images [18]. In 2022, Wei-Wen Hsu et al. proposed a CNN for the classification of glioma subtypes using mixed data of WSIs and mpMRIs under weakly supervised learning [19], and Nancy et al. proposed the DenseNet-II [20] model, trained on the HAM10000 dataset and compared with various deep learning models, to improve the accuracy of melanoma detection. At the 2019 ICCV conference, Chan et al. proposed the HistoSegNet method for semantic segmentation of tissue types, which uses an annotated digital pathology atlas (ADP) for patch-level training and computes gradient-weighted class activation maps, outperforming other more complex weakly supervised semantic segmentation methods [21]. Wang et al., based on the idea of model ensembling, designed two complementary models built on SKM and scSEM to extract features from different spaces and scales; the method directly segments patches of digital pathology images pixel by pixel and no longer depends on a classification model [22].
Although computer vision has made some progress in the field of tumor segmentation, automated analysis studies based on eyelid tumor pathology remain very rare due to the lack of datasets. In 2018, Ding et al. designed a study using a CNN for the binary classification of malignant melanoma (MM), and whole slide image-level classification was realized using a random forest classifier to assist pathologists in diagnosis [23]. In 2020, Wang et al. trained a CNN for patch-level classification and used the malignant probabilities of the patches embedded in each WSI to generate visualized heatmaps, and also established a random forest model for WSI-level diagnosis [24]. Luo et al. performed patch prediction with a network model based on the DenseNet-161 architecture and WSI-level differentiation with an integration module based on an average probability strategy to differentiate between eyelid BCC and sebaceous carcinoma (SC) [25]. Parajuli et al. proposed a novel fully automated framework, including DeeplabV3 for WSI segmentation and a pre-trained VGG16 model, among others, to identify melanocytes and keratinocytes and support the diagnosis of melanoma [26]. Ye et al. first proposed a cascade network that uses features from both histologic patterns and cellular atypia in a holistic manner to detect and recognize malignant tumors in pathological slices of eyelid tumors with high accuracy [27]. Most of the above studies are based on existing methods and do not make significant modifications to the segmentation network. Some studies focus only on the recognition task and assist doctors in diagnosis through classification, without involving tumor region segmentation, due to the lack of a large-scale segmentation dataset for this task. The segmentation task is an important factor in evaluating tumor stage and is also the basis for quantitative analysis. Our proposed method is able to perform eyelid tumor classification and segmentation simultaneously on histology slides through the design of the network architecture.
Various factors can increase the complexity of segmenting BCC, CM, and MGC in histology slides. These tumors may exhibit only subtle differences in appearance compared to normal tissue, which can make them difficult to distinguish. Additionally, early-stage cancers may be more challenging to detect due to their small size and potential lack of discernible differences from normal tissue. To address these issues, we propose MAC-ResNet, based on the teacher-student model, for accurate classification and segmentation of eyelid tumors.
The teacher-student model is a machine learning paradigm in which a model, referred to as the “teacher”, is trained to solve a task and then another model, referred to as the “student”, is trained to mimic the teacher’s behavior and solve the same task. The student model is typically trained on a smaller dataset and with fewer resources (e.g., fewer parameters or lower computational power) than the teacher, with the goal of achieving similar or improved performance at a lower cost.
The teacher-student model is also known as the knowledge distillation or model compression approach. It is often used to improve the efficiency and performance of machine learning models, particularly when deploying them in resource-constrained environments such as mobile devices or Internet of Things (IoT) devices. In the teacher-student model, the teacher model is first trained on a large dataset and then used to generate "soft" or "distilled" labels for the student model, which are more informative than the one-hot labels typically used for training. The student model is then trained using these soft labels and the original dataset, with the goal of learning to mimic the teacher's behavior. There are several variations of the teacher-student model, which can be divided into logits-based distillation and feature-based distillation according to how knowledge is transferred. In this study, we adopt logits-based distillation. The concept of knowledge distillation and the teacher-student model first appeared in "Distilling the Knowledge in a Neural Network" by Hinton et al. and was used for image classification. Later, knowledge distillation was widely used in various fields of computer vision, such as face recognition [28] and image/video segmentation [29]. It has also been applied in natural language processing (NLP) tasks such as text generation [30] and question answering systems [31], as well as in areas such as speech recognition [32] and recommender systems [33]. Finally, knowledge distillation has also been widely used in medical image processing. Qin et al. proposed a new knowledge distillation architecture in [34], achieving an improvement of 32.6% on the student network. Thi Kieu Khanh Ho et al. proposed a self-training KD framework in [35], achieving student network AUC improvements of up to 6.39%. However, to our knowledge, this is the first time that knowledge distillation has been used for the classification of dermatopathology images.
3. Methods
First, we normalize and standardize the input data and use a random combination of image processing operations to expand and augment the images. We then propose a new network structure (MAC-ResNet) that performs well on the classification task on the ZLet dataset; however, the model structure is complex, consumes a lot of computational resources throughout the training process, and is slow at inference. Therefore, we adopt the model compression method of knowledge distillation, using MAC-ResNet as the teacher network and ResNet50 as the student network, and achieve good results with the small student network ResNet50 in the classification of digital pathology images of eyelid tumors by using the knowledge of the teacher network to guide the training of the student network. In this way, this paper achieves automatic classification of the three types of malignant tumors and automatic localization of lesion areas using U-Net [36].
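As an illustration, a minimal random-combination augmentation pipeline could look like the following sketch; the specific operations, probabilities, and normalization statistics are assumptions for illustration rather than the exact pipeline used in this work.

```python
import torchvision.transforms as T

# Hypothetical normalization statistics; the actual values depend on the ZLet dataset.
MEAN, STD = [0.72, 0.55, 0.68], [0.18, 0.22, 0.17]

# Randomly combine several augmentations so that each patch sees a different
# subset of transformations, expanding and enhancing the training data.
train_transform = T.Compose([
    T.RandomApply([T.RandomRotation(degrees=90)], p=0.5),
    T.RandomHorizontalFlip(p=0.5),
    T.RandomVerticalFlip(p=0.5),
    T.RandomApply([T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.1)], p=0.3),
    T.ToTensor(),
    T.Normalize(mean=MEAN, std=STD),  # feature normalization/standardization
])
```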
3.1. MAC-ResNet
To address the problem of low accuracy in fine-grained classification, we first propose the Watching-Smaller-Attention-Crop-ResNet (WSAC-ResNet) structure. It combines the Backbone-Attention-Crop-Model (BACM) module, the nested residual structure Double-Attention-Res-block, the SPP-block module, and the SampleInput module.
For the fine-grained classification problem, this paper refers to the fine-grained classification model WSDAN [37] and modifies it to design the Backbone-Attention-Crop-Model (BACM) module. As shown in Figure 2, the BACM module consists of three parts: the backbone network, the attention module [38], and the AttentionPicture generated by cropping the original image according to the AttentionMap.
We crop and upsample key regions of the images to a fixed size according to the attention parameters, with the aim of guiding data augmentation through the attention mechanism. Before the feature map of the neural network is passed to the fully connected layer, it is fed into the attention model, and X attention maps are obtained by convolution, dimensionality reduction, and other operations. Each attention map represents one feature in the picture, and one attention map is randomly selected from the X attention maps. The selected attention map is then normalized, as in (1).
Elements of the normalized attention map whose values are greater than a threshold are set to 1, and elements at other locations are set to 0, generating a mask of the regions worth attending to. The original image is cropped according to this mask to obtain the image of the important regions, which is upsampled to a fixed size and re-input into the neural network after data augmentation. When calculating the loss of the network model, the final loss is the mean of the loss between the prediction and the label for the original image and the loss between the prediction and the label for the cropped image re-input into the model.
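For concreteness, a minimal sketch of this attention-guided cropping step is given below; the min-max normalization (assumed form of (1)), the threshold value, and the helper names are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def attention_crop(image, attention_maps, threshold=0.5, out_size=448):
    """Crop the region highlighted by one randomly chosen attention map.

    image:          (C, H, W) input tensor
    attention_maps: (X, h, w) attention maps produced by the attention module
    """
    # Randomly pick one of the X attention maps.
    k = torch.randint(attention_maps.size(0), (1,)).item()
    att = attention_maps[k]

    # Normalize to [0, 1] (min-max normalization, assumed form of Eq. (1)).
    att = (att - att.min()) / (att.max() - att.min() + 1e-8)

    # Threshold to build a binary mask of the salient region.
    mask = (att > threshold).float()

    # Resize the mask to the image resolution and find its bounding box.
    mask = F.interpolate(mask[None, None], size=image.shape[1:], mode="nearest")[0, 0]
    ys, xs = torch.nonzero(mask, as_tuple=True)
    if ys.numel() == 0:                      # fall back to the full image
        crop = image
    else:
        crop = image[:, ys.min():ys.max() + 1, xs.min():xs.max() + 1]

    # Upsample the cropped region to a fixed size and feed it back to the network.
    return F.interpolate(crop[None], size=(out_size, out_size),
                         mode="bilinear", align_corners=False)[0]
```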
The backbone network is a neural network based on ResNet50 with a modified input structure named SampleInput: a 7×7 convolutional layer is replaced by three 3×3 convolutional layers to increase the network depth while keeping the same receptive field. The network uses a double-layer nested residual structure, the Double-Attention-Res-block (DARes-block), which fuses the feature maps of the deep, middle, and shallow layers; the SPP-block, which originates from SPPNet [39], is used to handle training with different image sizes.
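A minimal sketch of the SampleInput idea (replacing the 7×7 stem with three stacked 3×3 convolutions covering the same receptive field) might look like the following; the channel widths and strides are assumptions for illustration.

```python
import torch.nn as nn

class SampleInput(nn.Module):
    """Stem that replaces ResNet50's single 7x7 stride-2 convolution with
    three stacked 3x3 convolutions covering the same receptive field."""

    def __init__(self, in_channels=3, out_channels=64):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, out_channels, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.stem(x)
```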
To further improve the classification performance of the network, we also optimize its loss function and learning rate adjustment strategy.
For the classification of imbalanced samples, the focal loss function [40] is used, which is a modification of the cross-entropy loss function, as in (2).
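A minimal sketch of the standard multi-class focal loss is shown below; the focusing parameter gamma and the optional class weights alpha are assumed defaults, and the exact form used in (2) may differ.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=None):
    """Multi-class focal loss: cross-entropy scaled by (1 - p_t)^gamma,
    which down-weights well-classified examples and focuses on hard ones.

    logits:  (N, C) raw network outputs
    targets: (N,)   integer class labels
    """
    ce = F.cross_entropy(logits, targets, weight=alpha, reduction="none")
    p_t = torch.exp(-ce)              # probability assigned to the true class
    loss = (1.0 - p_t) ** gamma * ce
    return loss.mean()
```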
We use CosineAnnealingLR [41] to adjust the learning rate. It varies the learning rate following a cosine function, and each time the minimum is reached, the learning rate is reset to the maximum value to start a new round of decay.
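As an illustration, the restart behavior described above can be obtained in PyTorch with the warm-restart variant of cosine annealing; the optimizer, learning rates, and restart period below are assumed values rather than the exact training configuration.

```python
import torch

model = torch.nn.Linear(2048, 3)                 # placeholder model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Cosine schedule with warm restarts: the learning rate decays along a cosine
# curve for T_0 epochs, then jumps back to the maximum and decays again.
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=10, eta_min=1e-5)

for epoch in range(50):
    # ... one training epoch ...
    scheduler.step()
```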
We name the network that uses the above modules and strategies the Multiscale-Attention-Crop-ResNet (MAC-ResNet).
3.2. Network Optimization Based on Knowledge Distillation
Complex network models consume a large amount of computing resources during training and are slow at inference on large pathology datasets, so it is desirable to compress the model into a smaller network with similar performance. Knowledge distillation is a model compression method proposed by Hinton et al. [42], in which a complex, large model serves as the teacher and a model with a simpler structure serves as the student; by transferring the knowledge it has learned, the teacher assists the training of the student, which has weaker learning ability, and thereby enhances the student model's generalization ability. In the knowledge distillation process, the teacher network is usually a network with a complex structure, slow inference, high consumption of computing resources, and good performance, while the student network is usually a network with a simpler structure, fewer parameters, and poorer performance.
We adopt this model compression method, using the aforementioned MAC-ResNet as the teacher network and the simple, classic ResNet50 as the student network. The process is as follows: first, the complex and well-performing teacher network (MAC-ResNet) is trained; then, the trained teacher network guides the training of the student network (ResNet50); finally, the trained student network is used to classify the eyelid tumor pathology image dataset. The teacher guides the student by providing soft labels, i.e., the class probabilities predicted by the teacher, rather than hard one-hot labels; the teacher's soft labels are combined with the student's own soft-label outputs to coach the student through training on the hard labels (as shown in Figure 3). For soft labels, the predicted output of the network is divided by a temperature coefficient T before the softmax operation is performed, which keeps the resulting values between 0 and 1 with a more moderate distribution, while for hard labels the predicted output is softmaxed directly without dividing by T [43]. This helps the student network learn from the rich information provided by the teacher network. The temperature-scaled softmax can be denoted as q_i = exp(z_i/T) / Σ_j exp(z_j/T), where z_i is the logit for class i.
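A minimal sketch of the soft- and hard-label computation is shown below; the temperature value is an assumption for illustration.

```python
import torch.nn.functional as F

def soft_labels(logits, T=4.0):
    """Soft labels: logits divided by the temperature T before softmax,
    giving a smoother probability distribution over classes."""
    return F.softmax(logits / T, dim=1)

def hard_labels(logits):
    """Hard labels: softmax applied directly to the logits (T = 1)."""
    return F.softmax(logits, dim=1)
```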
The loss of the MAC-ResNet network consists of two parts: the loss between the prediction and the label for the original input picture, and the loss between the prediction and the label for the AttentionPicture generated by attention-guided cropping and fed back into the network; their weighted sum is the final loss. The loss function of the whole training process, with MAC-ResNet as the teacher network and ResNet50 as the student network, is shown in (4) and (5),
where the four quantities in (4) and (5) denote, respectively, the hard-label output of the student network, the soft-label output of the student network, the soft labels generated by the teacher network for the original picture prediction, and the hard labels predicted by the teacher network from the AttentionPicture (the labels are softened only for the results of the original picture prediction). The equations also define the knowledge distillation loss and the total loss; the distillation term is measured with the Kullback-Leibler (KL) divergence loss, and the hard-label term uses the focal loss function. T is the temperature coefficient: the larger the temperature coefficient, the more uniform the output distribution.
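For illustration, a minimal sketch of this distillation loss is given below; the weighting coefficient, the temperature, and the use of the focal loss defined earlier are assumptions about the exact formulation in (4) and (5).

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      T=4.0, alpha=0.7, gamma=2.0):
    """Combine a soft-label KD term with a hard-label focal term.

    student_logits, teacher_logits: (N, C) outputs of the two networks
    targets:                        (N,)   ground-truth class labels
    """
    # Soft-label term: KL divergence between temperature-scaled distributions.
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="batchmean") * (T * T)

    # Hard-label term: focal loss between student predictions and ground truth
    # (focal_loss as sketched in Section 3.1).
    ce = F.cross_entropy(student_logits, targets, reduction="none")
    focal = ((1.0 - (-ce).exp()) ** gamma * ce).mean()

    # Weighted sum of the two terms gives the total loss.
    return alpha * kd + (1.0 - alpha) * focal
```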
After applying knowledge distillation, the lightweight student network ResNet50 showed a significant improvement in classification performance on the ZLet dataset.
5. Conclusions
Segmentation based on pathology slides is usually time-consuming. To improve efficiency, we adopted the knowledge distillation method, inspired by Hinton et al., to train a student network with MAC-ResNet as the teacher network, enabling the student network to achieve good accuracy on the target task even with a small capacity. In addition, by using U-Net to automatically localize the lesion area, we can provide a reliable basis for pathologists' diagnoses and improve the efficiency and accuracy of diagnosis. We applied this method to pathological tumor detection for the first time and verified the practicality of the teacher-student model in the field of pathology image analysis. Finally, the accuracy of MAC-ResNet on the three target tasks was 96.8%, 94.6%, and 90.8%, respectively. One limitation of this study is that we were not able to conduct extensive experiments on these data to broadly verify the performance of different methods under the teacher-student framework. Another limitation is that we only studied BCC, MGC, and CM, while eyelid tumors include other diseases, so more datasets will be needed in the future. We are currently working on a larger dataset, ZLet-large, based on ZLet. ZLet-large includes over a thousand eyelid tumor pathology images and an increased number of disease types, including squamous cell carcinoma (SCC), seborrheic keratosis (SK), and xanthelasma. We hope to conduct more extensive experiments on ZLet-large to further explore the potential of the teacher-student model in the analysis of eyelid tumors.