1. Introduction
China was the first country to recognize and cultivate edible mushrooms, and it is now the world's largest producer, consumer, and exporter of this crop. In recent years, China's edible mushroom sector has maintained consistent growth, and edible mushrooms have become China's "fifth-largest crop" after grain, oil, vegetables, and fruits in terms of agricultural production [1].
Oudemansiella raphanipes is a well-known edible mushroom prized for its rich nutritional content, including proteins, amino acids, vitamins, and trace elements. It also has medicinal value, containing polyphenols, polysaccharides, flavonoids, and various other bioactive compounds that help enhance immunity, regulate body functions, repair damaged organs, and provide analgesic and anti-inflammatory effects [2,3]. Growing consumer awareness of quality has led to significant price variation among grades. However, traditional manual quality grading is time-consuming and labor-intensive, which severely limits economic benefits. There is therefore an urgent need for an algorithm capable of automatically assessing and grading the quality of Oudemansiella raphanipes.
With advancements in deep learning, convolutional neural networks (CNNs) have emerged as powerful tools for extracting high-dimensional features for image classification [4]. In the context of mushroom quality grading, several studies have demonstrated their effectiveness. Tongkai Li et al. developed a quality grading algorithm for Oudemansiella raphanipes using transfer learning and MobileNetV2, achieving a test accuracy of 98.75% [5]. Yanqiang Wu et al. proposed a size-grading method for antler mushrooms using YOLOv5 and PSPNet, achieving a detection accuracy of 94% [6]. Yinhua Zuo et al. introduced a quality grading model for Pleurotus cristatus based on an improved EfficientNet with a self-attention module, attaining an average recognition accuracy of 91.5% [7]. Li Wang et al. proposed a grade recognition method for dried shiitake mushrooms using an improved VGG network (D-VGG), achieving a classification accuracy of 96.21% [8]. Lei Shi et al. presented a lightweight grading detection method for oyster mushrooms using OMC-YOLO, an improvement on YOLOv8n, with an mAP50 of 94.95% [9]. Ziyuan Wei et al. employed a convolutional autoencoder–support vector machine model for the quality classification of Pleurotus eryngii, achieving an accuracy of 91.58% [10].
However, it is essential to develop a model that balances classification accuracy, resource occupation, and response speed, especially for resource-limited devices. Large-parameter models such as VGGNet [11] and ResNet [12] offer high accuracy but fall short in resource efficiency and response speed. Conversely, small-parameter models such as Xception [13] and ShuffleNet [14,15] occupy fewer resources and compute faster but may not achieve satisfactory classification accuracy. To address this trade-off, knowledge distillation (KD) can be employed to enhance the accuracy of small-parameter models while preserving their speed advantages [16,17].
KD is a deep-learning technique for transferring knowledge from a large-parameter model to a small-parameter model. The large-parameter model, typically high in accuracy but heavy in resource occupation and slow in response, serves as the teacher, whereas the small-parameter model, with low resource occupation and fast response but lower accuracy, serves as the student. Without KD, the student learns solely from the ground truth (referred to as hard labels). With KD, the student additionally learns from the teacher, whose softened softmax outputs (referred to as soft labels) are used during training. Consequently, the small-parameter student can approach the accuracy of the large-parameter teacher while retaining low resource occupation and a fast response speed. KD was first proposed with a single teacher model [16,17].
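For concreteness, the following is a minimal sketch of the single-teacher KD loss in the style of Hinton et al. [16], written in PyTorch; the temperature T and the weighting factor alpha are illustrative hyperparameters, not values from this paper:

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Single-teacher KD loss (illustrative sketch): weighted sum of the
    hard-label cross-entropy and the soft-label KL divergence."""
    # Hard-label term: standard cross-entropy against the ground truth.
    hard = F.cross_entropy(student_logits, labels)
    # Soft-label term: KL divergence between the temperature-softened
    # teacher and student distributions; the T^2 factor keeps gradient
    # magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # alpha balances soft-label and hard-label supervision.
    return alpha * soft + (1.0 - alpha) * hard
```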
Since multiple teachers can transfer a broader range of knowledge and mitigate the impact of a single low-quality teacher, multi-teacher KD has been proposed [18,19,20,21,22,23,24,25,26]. In this setting, several teacher models are combined with certain weights to form an ensemble, so the weight assigned to each teacher directly influences the performance of the distilled student. The simplest scheme adopts equal weights for all teachers [18]; however, equal weights fail to distinguish high-quality from low-quality teachers, potentially resulting in unsatisfactory performance. A variety of methods have therefore been developed to adjust the weights, such as manual tuning [19] and adaptive weighting [20,21,22,23,24].
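As an illustration of this ensemble step, the sketch below forms soft labels as a weighted combination of the teachers' softened outputs; the weights argument is a placeholder for whichever scheme is used, and equal weights (1/K for K teachers) recover the simple averaging of [18]:

```python
import torch
import torch.nn.functional as F

def ensemble_soft_labels(teacher_logits_list, weights, T=4.0):
    """Weighted combination of teacher soft labels (illustrative sketch).

    How the weights are chosen (equal, manually tuned, or adaptive) is
    exactly the design question discussed above.
    """
    soft = [w * F.softmax(logits / T, dim=1)
            for w, logits in zip(weights, teacher_logits_list)]
    # With weights summing to 1, the sum remains a valid probability
    # distribution for each sample.
    return torch.stack(soft, dim=0).sum(dim=0)
```

The resulting distribution can then serve as the target in a KD loss such as the single-teacher sketch above, in place of the softened output of one teacher.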
To eliminate the need to select weights, other methods instead select among the teacher models, including a voting strategy that follows the majority side [18], randomly selecting one teacher [25], and applying reinforcement learning to filter out inappropriate teachers [26]. In these schemes, however, some teacher models are discarded and thus wasted. Fully utilizing all available teacher models while mitigating the adverse effects of low-quality teachers therefore remains an unresolved issue.
In this paper, a three-teacher KD algorithm utilizing cascaded teacher models is proposed for the quality grading of Oudemansiella raphanipes, with the aim of improving grading accuracy while maintaining low resource occupation and a fast response speed. The main contributions of this paper are as follows:
(1) Three cascaded structures were investigated: the parallel model, the standard series model, and the series model modified with residual connections (hereafter denoted as the residual-series model).
(2) Compared with a student model distilled with a single teacher or with an ensemble three-teacher model using equal weights (hereafter denoted as the equal-weight three-teacher model), these cascaded structures exhibit improved performance indices. In particular, the residual-series model outperforms the others, achieving the highest grading accuracy.
(3) The superiority of the residual-series model is further demonstrated by comparison with recently published model compression techniques, by its transferability to edge devices with limited computing resources, and by its generalization ability on a public mushroom dataset.