1. Introduction
Apples are among the fruits with the highest consumption and trade volumes globally [1], and ensuring their quality is crucial for satisfying consumer demand and maintaining the reputation of suppliers. However, defects such as scratches, spots and irregular shapes pose significant challenges to producers and consumers alike [2,3]. These defects not only affect the visual appeal of apples but also impact their taste, texture and nutritional value [4]. Thus, the timely and accurate identification of apple defects and quality grading are of great economic value and practical significance.
Traditionally, apple defect detection and quality grading have relied on manual inspection [5], a method that is time-consuming, labor-intensive and highly subjective [6]. Consequently, the demand for automated systems capable of accurately and efficiently detecting defects and grading apples based on quality attributes has been increasing. With the rapid advancements in computer vision and deep learning technologies, machine learning offers new solutions for agricultural disease management. These technologies enable the automatic analysis of crop images, recognizing and categorizing different quality types, and significantly enhancing the efficiency and accuracy of disease detection. Initially, infrared spectroscopy was commonly used, but its measurements of hardness were not very precise [7]. Upgrades to hyperspectral imaging for detection were subsequently proposed [8]. However, information obtained from near-infrared and hyperspectral spectroscopy can easily be obscured by spectral variations caused by the physical properties of food; moreover, most instruments used in this method are complex and expensive [9]. Magnetic methods such as magnetic resonance imaging and electrical conductivity, acoustic methods such as ultrasonography and pulse response, and other dynamic methods such as X-ray and CT scanning have also been utilized.
Recent advances in computer vision, machine learning and artificial intelligence have paved the way for the development of systems for apple defect detection and quality grading [10,11]. These systems utilize image-processing techniques to analyze digital images of apples and accurately identify various defects. Furthermore, machine learning algorithms are employed to grade apples based on predetermined criteria such as size, color and shape. Nie et al. proposed a particle swarm optimization-based support vector machine model for grading apples by fruit shape, color and defect features, achieving a classification accuracy of 92% [12]. Sun et al. introduced a structured illumination reflectance imaging (SIRI) method based on pixel-based convolutional neural networks to detect early fungal infections in peaches, achieving an excellent symptom-detection rate of 97.6% [13]. Anuja et al. further researched and proposed an SVM-based fruit quality-grading system, achieving defect-detection accuracies of 77.24% (k-NN), 82.75% (SRC), 88.27% (ANN) and 95.72% (SVM) [14]. Su et al. developed a band-pass filter fluorescence macro-imaging system for testing the solution safety of celery [15]. Krishna et al. compared manual methods with computer vision in assessing mango attributes, developing multiple linear regression (MLR) models with accuracies exceeding 97.9%, 93.5% and 92.5%, although the monochromatic cameras they used are not suitable for broad scenarios [16]. Alencastre et al., noting that monochromatic images are not suited to current smartphone photography, used color image datasets and convolutional neural networks for sugarcane quality detection, doubling the performance for the L 01-299 variety and increasing it fivefold for the HoCP 09-804 variety, although their data volume was too small [17].
Genze et al. further expanded the dataset in number and variety, utilizing convolutional neural network (CNN) architectures to achieve high mean average precision (mAP) values of approximately 97.9%, 94.2% and 94.3% on the retained test datasets for corn, rye and fescue [18]. Li et al. built upon this foundation, proposing a CNN-based model whose best training and validation accuracies reached 99% and 98.98%, respectively [19]. Zou et al. approached quality detection from an olfactory perspective, proposing an apple quality-grading electronic nose detection system based on computational fluid dynamics simulation and a k-nearest neighbor support vector machine [20]. Hemamalini et al. utilized KNN, SVM, C4.5 and other machine learning methods to classify fruit photos. These algorithms determined whether the fruit was damaged, but their specificity and sensitivity still require improvement [21].
Wieme et al. utilized deep learning technology for the quality assessment of fruits, vegetables and mushrooms, with CNN (ResNet/ResNeXt) F1 scores of 0.8952 and 0.8905, enhancing model specificity and sensitivity. However, this approach has drawbacks: as the problem becomes more complex, more data are usually required, thus increasing training time, especially with hyperspectral imaging [22]. Ismail et al. presented an efficient machine vision system based on state-of-the-art deep learning technologies and stacked ensemble methods, achieving average accuracies of 99.2% and 98.6% for apple and banana test sets, respectively. However, this system was based solely on the appearance of the fruit and used only a single view of fruit images, lacking multi-view vision system training [23].
This article introduces a deep learning-based system for apple defect detection and quality grading, aimed at enhancing the accuracy and efficiency of detection. The main contributions of this study are as follows:
Firstly, the dataset used in this article underwent data augmentation for pixel-level segmentation tasks and was semantically annotated; through pixel-level annotation, high-precision apple quality detection is achieved, providing defect detection and quality grading in practical applications.
Secondly, in terms of model construction, the advantages of Transformers and neural networks were successfully combined. The self-attention mechanism of the Transformer enables the model to capture long-distance dependencies in the image, significantly enhancing the model’s ability to recognize apple defect features. Concurrently, the hierarchical feature extraction capability of neural networks further optimizes the precision of image segmentation.
Lastly, an innovative model architecture was designed, incorporating a jump connection Segment Anything Model (SAM) and maximum entropy selection optimization (targeted at Segment Anything segmentation). Through carefully designed models, not only was high precision in defect detection in complex agricultural environments ensured, but the size and computation of the model were also effectively controlled. This design enables it to operate under limited computational resources and meet the needs for real-time segmentation and processing of images.
3. Materials and Methods
3.1. Dataset Collection
The apple image dataset utilized in this study was primarily collected from two major apple-producing regions in China: the apple orchards in Tianshui, Gansu Province, and Qixia, Yantai City, Shandong Province. These regions are renowned for their unique natural conditions and advanced cultivation techniques, which contribute to the high reputation of their apples in both domestic and international markets. The apple orchards in Tianshui, located in the southeastern part of Gansu Province, offer climatic conditions favorable for apple growth, including significant diurnal temperature variations that are beneficial for the accumulation of sugars in the apples. Apples from Tianshui are known for their bright color and fullness, making them an ideal choice for studying apple quality grading. Conversely, Qixia City, recognized as one of China’s prominent apple production bases, produces apples known for their pleasant taste and crisp texture. The region boasts a long history of apple cultivation, employing representative cultivation techniques and orchard management practices.
Data collection was strategically scheduled during the peak of the apple ripening season to ensure that the images captured reflected the characteristics of the fruit at different stages of maturity. In Tianshui, the collection occurred from mid-September to early October, while in Qixia, it was organized from late August to mid-September. The process involved capturing high-definition images from various angles directly from the apple trees in the orchards. To ensure the diversity and comprehensiveness of the data, the collection team focused on representative apples from each tree, taking at least five different angles (including the top, bottom, shoulders and sides) to obtain a comprehensive set of apple images. Establishing grading standards for apple quality is crucial for subsequent data analysis and model training. Apple grading is typically based on international or national standards, aligned with market demands and consumer preferences. This research adheres to Chinese agricultural industry standards and common international trade criteria, classifying apples into four grades. Grade A (Extra): Apples with no apparent defects, uniform color, regular shape, a diameter greater than 75 mm, no signs of pests or diseases, no mechanical damage and free from cracks, rot or other physiological defects. Grade B (First): Minor imperfections acceptable such as slight bruising or slightly uneven coloration, diameter ranging from 65 mm to 75 mm, with slight signs of pest damage or other minor non-structural defects. Grade C (Second): Apparent mild to moderate defects like blotches, irregular shapes, diameter between 55 mm and 65 mm, with moderate pest damage signs, mechanical injuries or minor cracks. Grade D (Third): Significant visible defects, shapes, colors and textures that do not meet the standards of the other three grades, diameter less than 55 mm or greater than 75 mm, with severe pest damage, mechanical injuries or fruit rot. 
In this study, approximately 5000 apple images were collected: about 1500 for Grade A, 1300 for Grade B, 1200 for Grade C and 1000 for Grade D, as shown in Table 1. This distribution aids the model in learning to recognize features of apples at different quality levels, thus enabling accurate automatic grading in practical applications.
Before the commencement of the data collection activities, an on-site inspection of the orchard was conducted by the collection team to evaluate the growth conditions of the apples and the orchard's lighting conditions, planning the optimal shooting time and angles. Additionally, communication with the orchard managers ensured that the data collection activities would not interfere with the regular operations of the orchard. Professional photography equipment was used for on-site shooting within the orchard. High-quality images were obtained by choosing sunny and well-lit periods for the data collection. During the shooting process, special attention was paid to adjusting the camera settings, such as aperture, shutter speed and ISO sensitivity, to ensure image clarity and color accuracy. Each collection point had a dedicated person responsible for recording data, including the apple variety, tree age and specific shooting time and environmental conditions. The collection team was equipped with high-resolution digital SLR cameras (Nikon D850 (Japan), with a high resolution of 45.75 million pixels) and multiple lenses (including a 50 mm prime lens and a 24–70 mm zoom lens) to meet different shooting needs. Additionally, portable tripods and reflectors were used to stabilize the shots and adjust lighting. All equipment underwent strict dust and moisture protection treatments to ensure stable operation in outdoor environments. The images were taken at high resolution with 24-bit color depth, in JPEG format, ensuring detail richness and usability in subsequent processing. Furthermore, all images underwent preliminary digital processing after collection, including exposure adjustment and contrast optimization, as shown in Figure 1, to improve visual effects and analysis accuracy. Finally, all images were formatted and labeled in preparation for image analysis and model training, ensuring the dataset supported efficient model training and guaranteed the model's broad applicability and accuracy in practical operations.
3.2. Data Augmentation
3.2.1. Basic Enhancement Method
In the fields of computer vision and image processing, data augmentation is a commonly used technique to increase the diversity of a dataset through image transformations. These techniques simulate various shooting conditions that may be encountered in the real world, thereby helping deep learning models to enhance their generalization capabilities to unseen data. This section details four basic image enhancement methods: flipping, cropping, translating and rotating.
Firstly, image flipping operations include horizontal and vertical flips. Mathematically, this can be expressed as a coordinate transformation for each pixel in the image. For an image of size $M \times N$ (where $M$ is the height and $N$ is the width), the new coordinates of any point $(x, y)$ after a horizontal flip are $(x, N - 1 - y)$, and after a vertical flip, they are $(M - 1 - x, y)$. These transformations can be expressed by the following equations:

$$(x', y') = (x,\; N - 1 - y) \quad \text{(horizontal flip)}, \qquad (x', y') = (M - 1 - x,\; y) \quad \text{(vertical flip)}.$$
These operations enable the model to recognize and process objects from different directions without being dependent on a specific orientation of the image. Secondly, the cropping operation is often used to generate a local view of an image. Assuming the new image area after cropping is $h \times w$ with a starting point $(x_0, y_0)$, the new coordinates of any point $(x', y')$ in the cropped image correspond to $(x' + x_0, y' + y_0)$ in the original image. This transformation is mathematically expressed as:

$$(x, y) = (x' + x_0,\; y' + y_0), \qquad 0 \le x' < h,\; 0 \le y' < w,$$

where $(x_0, y_0)$ is randomly selected to ensure $x_0 + h \le M$ and $y_0 + w \le N$. This method not only simulates visibility issues caused by camera angles or obstructions but also enhances the model's ability to recognize different regions of the image. Following this, the translation operation is achieved by moving the image in either the horizontal or vertical direction. Let the translation vector be $(\Delta x, \Delta y)$; then, the new position of a point $(x, y)$ in the original image after translation is $(x + \Delta x, y + \Delta y)$. This can be described by the following equation:

$$(x', y') = (x + \Delta x,\; y + \Delta y),$$

where $\Delta x$ and $\Delta y$ can be positive or negative, indicating the distance the image is moved in the respective direction. Translation operations help the model learn to capture object features at different positions, particularly showing more robust performance in handling image edges. Lastly, the rotation operation involves rotating the image around a point (usually the center) by an angle $\theta$. Let the center coordinates of the image be $(c_x, c_y)$; then, the new coordinates $(x', y')$ of any point $(x, y)$ after rotation can be calculated using the rotation matrix $R(\theta)$, where $R(\theta)$ is defined as:

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = R(\theta) \begin{pmatrix} x - c_x \\ y - c_y \end{pmatrix} + \begin{pmatrix} c_x \\ c_y \end{pmatrix}, \qquad R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}.$$

This transformation enhances the model's adaptability to changes in the orientation of target objects, especially for objects with axial symmetry (such as apples), where rotational enhancement can significantly improve the model's flexibility and accuracy in practical applications.
Through the aforementioned enhancement methods, training datasets can be effectively expanded in terms of coverage and scene complexity without incurring additional data collection costs. This solid data support allows deep learning models to perform more effectively and stably in practical applications.
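As an illustration, the four coordinate transforms described above can be sketched in a few lines of plain Python. This is a minimal sketch with helper names of our choosing (not code from this study), assuming 0-indexed pixel coordinates in an $M \times N$ image:

```python
import math

def hflip(x, y, M, N):
    """Horizontal flip of a point in an M x N image (M rows, N columns)."""
    return x, N - 1 - y

def vflip(x, y, M, N):
    """Vertical flip of a point."""
    return M - 1 - x, y

def crop_to_original(xp, yp, x0, y0):
    """Map a point (xp, yp) in the cropped view back to the original image."""
    return xp + x0, yp + y0

def translate(x, y, dx, dy):
    """Shift a point by the translation vector (dx, dy)."""
    return x + dx, y + dy

def rotate(x, y, cx, cy, theta):
    """Rotate a point about the center (cx, cy) by angle theta (radians)."""
    xr = math.cos(theta) * (x - cx) - math.sin(theta) * (y - cy) + cx
    yr = math.sin(theta) * (x - cx) + math.cos(theta) * (y - cy) + cy
    return xr, yr
```

For example, `hflip(0, 0, 4, 4)` maps the top-left pixel of a 4 × 4 image to `(0, 3)`; applying the same transform to every pixel produces the mirrored image.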
3.2.2. Data Generation Based on Diffusion Models
Diffusion models, as emerging generative models, simulate the forward diffusion of data into pure noise and learn the reverse process that gradually reconstructs high-quality images from it. These models are ideally suited for generating complex natural images, such as apples with specific defects, based on the physical process of a random walk and gradual denoising. Mathematically, diffusion models can be described by a continuous process of noise addition and a reverse denoising process. During the forward process, noise is progressively added to the data until it is completely transformed into Gaussian noise. Specifically, this process can be defined by the following stochastic differential equation:

$$\mathrm{d}x = f(x, t)\,\mathrm{d}t + g(t)\,\mathrm{d}W,$$

where $x$ represents image data, $f(x, t)$ is the drift term related to time $t$, $g(t)$ is the diffusion coefficient and $W$ denotes Brownian motion. In the reverse process, the goal of the diffusion model is to learn how to progressively recover a clear image from a state of pure noise. This is achieved by training a parameterized neural network to estimate the conditional probability $p_\theta(x_{t-1} \mid x_t)$, that is, estimating the distribution of the image at the earlier time point $t - 1$ given the noisy image at time $t$. This step is usually optimized using a variational lower bound, with the specific denoising step expressed as:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I),$$

where $\beta_t$ is the noise level coefficient (with $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$) and $\epsilon_\theta$ is the noise prediction model parameterized by the neural network. In practical applications, the diffusion model is initially pretrained on an existing dataset of apple images. This step involves extensive forward and reverse iterations to ensure the model accurately captures the distribution characteristics of apple images. Subsequently, the model's parameters are adjusted to focus more on generating images with specific defects. Additionally, generated images undergo a series of quality control steps to ensure the generated image quality meets training requirements. Filtered images that pass quality control are merged with the original dataset to train higher-performance segmentation models. This process significantly increases the number of rare defect samples in the dataset, thereby enhancing the model's ability to recognize these challenging categories. By utilizing diffusion-based data generation technology, not only are data collection constraints overcome, but also the diversity and complexity of the dataset are significantly enhanced. This method is particularly suitable for addressing the imbalance in image data, where certain defect types are extremely rare in natural settings and traditional data collection methods struggle to efficiently gather sufficient sample volumes.
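The denoising update above can be illustrated with a toy scalar sketch. All names here are ours, the linear beta schedule is purely illustrative, and a stub stands in for the trained noise predictor $\epsilon_\theta$ — this is not the implementation used in this study:

```python
import math
import random

def ddpm_reverse_step(x_t, t, alphas, alpha_bars, predict_noise, sigma=0.0):
    """One reverse-diffusion step: estimate x_{t-1} from the noisy x_t.

    predict_noise(x, t) stands in for the trained noise model eps_theta.
    """
    a_t = alphas[t]
    ab_t = alpha_bars[t]
    eps = predict_noise(x_t, t)
    # Mean of p(x_{t-1} | x_t) per the denoising equation in the text.
    mean = (x_t - (1.0 - a_t) / math.sqrt(1.0 - ab_t) * eps) / math.sqrt(a_t)
    # Add sigma_t-scaled Gaussian noise z (zero when sigma == 0).
    return mean + sigma * random.gauss(0.0, 1.0)

# Illustrative noise schedule: beta_t is the noise-level coefficient.
betas = [0.01 * (i + 1) for i in range(10)]
alphas = [1.0 - b for b in betas]
alpha_bars = []
prod = 1.0
for a in alphas:
    prod *= a
    alpha_bars.append(prod)
```

With a zero noise prediction and `sigma=0.0`, the step reduces to rescaling by $1/\sqrt{\alpha_t}$, which makes the formula easy to sanity-check by hand.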
3.3. Proposed Method
3.3.1. Overall
The apple defect detection and quality-grading system proposed in this paper is based on a deep learning framework, integrating advanced image analysis techniques and machine learning algorithms, as shown in Figure 2. The model's comprehensive architecture and the cohesive integration of its modules are particularly emphasized. The design aims to address the high accuracy requirements for apple defect identification while also meeting the challenges of processing speed and accuracy in practical applications. This section will detail the model's construction process and the interconnections between its various modules. The core architecture of the system comprises four main parts: the Jump Connection SAM, Jump Connection Attention Mechanism, Maximum Entropy Selection Optimization and Jump Loss calculation. These modules work collaboratively to optimize the entire process from data input to defect detection and quality grading.
The Jump Connection SAM model employs an encoder-decoder structure, utilizing a deep convolutional network for feature extraction and image segmentation. The encoder gradually compresses the image through multiple convolutional and pooling layers, extracting high-dimensional features. Conversely, the decoder progressively restores the image's spatial resolution and details through upsampling and convolutional layers. Importantly, jump connections directly link the low-level features from the encoder with the corresponding layers in the decoder, ensuring that details are not lost during upsampling. The mathematical expression for jump connections is given by:

$$D_l = \mathcal{F}\big(U(D_{l+1}),\; E_l\big),$$

where $E_l$ and $D_l$ represent the features at layer $l$ in the encoder and decoder, respectively, $U$ denotes the upsampling operation and $\mathcal{F}$ signifies the feature fusion function.
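The fusion rule $D_l = \mathcal{F}(U(D_{l+1}), E_l)$ can be sketched in plain Python. Nearest-neighbor upsampling and element-wise summation are one illustrative choice for $U$ and $\mathcal{F}$ (the paper does not fix these operators here), with nested lists standing in for feature maps:

```python
def upsample2x(feat):
    """Nearest-neighbor 2x upsampling U of a 2D feature map (list of lists)."""
    out = []
    for row in feat:
        wide = [v for v in row for _ in (0, 1)]  # duplicate each column
        out.append(wide)
        out.append(list(wide))                   # duplicate each row
    return out

def fuse(decoder_feat, encoder_feat):
    """Feature fusion F: element-wise sum of two same-sized maps."""
    return [[d + e for d, e in zip(dr, er)]
            for dr, er in zip(decoder_feat, encoder_feat)]

# D_l = F(U(D_{l+1}), E_l)
d_next = [[1, 2], [3, 4]]          # decoder features at layer l+1 (2x2)
e_l = [[0] * 4 for _ in range(4)]  # encoder features at layer l (4x4)
d_l = fuse(upsample2x(d_next), e_l)
```

In a real network the fusion is often channel concatenation followed by a convolution; summation keeps the sketch short while preserving the information-flow pattern of the jump connection.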
To further enhance the model's ability to recognize key features, a jump connection attention mechanism has been introduced. This mechanism dynamically adjusts the importance of the feature map by computing attention weights between different feature layers. Specifically, an attention map is calculated for each feature map to either enhance or suppress certain features:

$$A = \sigma(W * X),$$

where $*$ represents the convolution operation, $W$ are the convolution parameters and $\sigma$ is the activation function used to generate attention weights for each channel.
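A minimal sketch of such per-channel gating follows. For brevity, a global average plus one scalar weight per channel stands in for the convolution $W * X$, and the logistic sigmoid plays the role of $\sigma$ — all names and the squeeze step are our illustrative choices, not the paper's exact layer:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def channel_attention(feats, weights, biases):
    """A = sigma(W * X): gate each channel by a learned attention weight.

    feats: list of channels, each a flat list of activations.
    weights/biases: illustrative per-channel parameters standing in for W.
    """
    gated = []
    for ch, w, b in zip(feats, weights, biases):
        mean = sum(ch) / len(ch)           # squeeze: global average per channel
        a = sigmoid(w * mean + b)          # attention weight in (0, 1)
        gated.append([a * v for v in ch])  # enhance or suppress the channel
    return gated
```

With zero parameters the gate is 0.5 for every channel, so `channel_attention([[2.0, 4.0]], [0.0], [0.0])` simply halves the activations.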
During training, the Maximum Entropy Selection Optimization strategy is employed to automatically select the most informative samples for training. This strategy is based on the model's current uncertainty, prioritizing data points with the highest entropy, which are the samples the model finds most challenging to distinguish. The entropy of a sample is calculated using the following formula:

$$H(x) = -\sum_{c=1}^{C} p_c(x) \log p_c(x),$$

where $p_c(x)$ is the probability that the model predicts the sample belongs to class $c$. The Jump Loss is designed to optimize gradient transmission during training, particularly in preventing gradient vanishing in deep networks. The loss at each layer depends not only on the final output layer's loss but also includes the difference between the output at that layer and the true label. By calculating the loss after each jump connection, it ensures that even the deeper parts of the network effectively update gradients:

$$L_{\text{jump}} = \sum_{l=1}^{L} \lambda_l\, \mathcal{L}\big(Y, \hat{Y}_l\big),$$

where $Y$ is the true label, $\hat{Y}_l$ is the predicted output at layer $l$ and $\lambda_l$ is the weight of the loss at that layer. Through these designs and optimizations, the system developed in this study can effectively detect defects in apple images and achieve high accuracy in quality grading. This method, which utilizes a combination of advanced technologies and algorithms, not only enhances processing speed but also significantly improves the reliability and accuracy of the system in practical applications.
3.3.2. Jump Connection SAM Model
In this study, the Jump Connection SAM serves as the core component of the apple defect-detection and quality-grading system, as shown in Figure 3 and Figure 4.
Emphasizing efficiency and performance in handling complex image tasks, the SAM model employs an encoder-decoder structure enhanced by jump connections. These connections facilitate effective information flow throughout the model, enabling more precise image detail restoration and improved segmentation accuracy. The architecture of the Jump Connection SAM model is divided into two major parts: the encoder and the decoder. The encoder is primarily responsible for extracting image features, utilizing a multi-layer convolutional network structure. Each layer consists of two convolutional layers followed by a maximum pooling layer. The convolutional layers employ convolution kernels with ReLU activation functions to ensure nonlinear processing capabilities. The subsequent maximum pooling layers reduce feature dimensions while preserving essential characteristics. The output channel count in the encoder increases layer by layer, starting from 64 channels and doubling after each pooling stage, reaching up to 1024 channels. The decoder's role is to restore the spatial information from the high-dimensional features extracted by the encoder. It comprises alternating upsampling layers and convolutional layers, with each upsampling followed by a convolution to integrate features. Convolutional layers in the decoder also utilize ReLU activation functions, ensuring effective nonlinear feature transformation. The channel count in the decoder decreases symmetrically from the encoder, eventually producing an output matching the input image size.
The design of jump connections is a distinctive feature of the SAM model. These connections directly transmit low-level detailed features from the encoder to the corresponding decoder layers, aiding in the restoration of details potentially lost during upsampling. Specifically, the output from each encoder layer not only proceeds to the next layer for further processing but is also carried over to the corresponding decoder layer via jump connections. This configuration allows the model to utilize a richer context during feature reconstruction, significantly enhancing segmentation precision and image detail restoration. The mathematical foundation of the Jump Connection SAM model relies on an effective feature fusion and information transfer mechanism, expressed mathematically as follows:

$$D_l = \mathcal{F}\big(\mathrm{Conv}_d\big(U(D_{l+1})\big),\; E_l\big),$$

where $D_l$ represents the features at layer $l$ in the decoder, $U$ denotes the upsampling operation, $\mathrm{Conv}_d$ is the convolution operation in the decoder, $E_l$ corresponds to the features from the encoder layer aligned with $D_l$ and $\mathcal{F}$ indicates the feature fusion function in the jump connection. The advantages of the Jump Connection SAM model's design are manifested in several aspects:
Information Integrity: Jump connections allow the model to utilize low-level features from the encoding stage during decoding, aiding in the restoration of detailed image information. This capability is especially beneficial in high-resolution imagery, where it significantly improves segmentation accuracy.
Enhanced Efficiency: By increasing the feature channels layer by layer during the encoding phase, the model deepens its learning capacity. During the decoding phase, the reduction in feature channels layer by layer optimizes computational efficiency, enabling the model to maintain high processing speeds even with complex images.
Generalization Ability: The introduction of jump connections reduces information loss during training, enhancing the model's generalization capability across diverse datasets, thereby stabilizing segmentation outcomes.
Through this efficient and precise network structure design, the Jump Connection SAM model offers a reliable solution for apple defect-detection and quality-grading tasks, demonstrating the significant potential of deep learning applications in the agricultural field.
3.3.3. Jump Connection Attention Mechanism
The Jump Connection Attention Mechanism introduced in this study builds upon the foundation of self-attention, incorporating the characteristics of jump connections to further enhance feature utilization efficiency and precision in image-segmentation tasks, as shown in Figure 5. Unlike the global focus of traditional self-attention, which emphasizes capturing relationships within entire input sequences, the Jump Connection Attention Mechanism is specifically designed for image segmentation. It underscores the connections between local features as well as the integration of features across different layers.
In the context of image segmentation, not only is it crucial to capture global information, but the emphasis is also on the interplay between local features and the integration across hierarchical levels. By embedding the attention scoring mechanism within traditional jump connections, the network is enhanced to recognize and utilize significant features during feature fusion more effectively. Particularly when merging features from different layers, this mechanism dynamically adjusts the contributions of various features based on their relevance to the current task. The mathematical description of the Jump Connection Attention Mechanism is as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V.$$

Here, $Q$, $K$ and $V$ represent the query, key and value, respectively (with $d_k$ the dimension of the key vectors), which are used to generate the attention weights in self-attention mechanisms. Within jump connections, $Q$ typically originates from the current layer of the decoder, while $K$ and $V$ are derived from corresponding layers of the encoder or the output of a preceding layer. This arrangement allows for feature fusion at each stage of jump connections, adapting the significance based on the context.
$$\alpha_{tj} = \frac{\exp\big(a(s_{t-1}, h_j)\big)}{\sum_{k} \exp\big(a(s_{t-1}, h_k)\big)},$$

where $a$ is the learned alignment model, $s_{t-1}$ is the previous decoding state, $h_j$ represents the features from the jump connection and $\alpha_{tj}$ denotes the attention weights. Through this mechanism, each layer of jump connections can dynamically adjust its contribution to the current task based on the relevance of features from different levels. This approach significantly enhances the efficiency of feature utilization and the expressive power of the network. The design advantages of the Jump Connection Attention Mechanism include:
Enhanced Feature Fusion Efficiency: By introducing attention mechanisms within each jump connection, the model not only simply merges features from different layers but dynamically adjusts the importance of each feature according to the current task, focusing more on the features that are crucial for the specific segmentation task.
Improved Segmentation Accuracy: Especially in handling complex or detail-rich images, such as minor defects or diseased areas on the surface of apples, the Jump Connection Attention Mechanism ensures that these details are not overlooked during feature fusion, thus enhancing overall segmentation precision.
Enhanced Model Generalization: The introduction of attention mechanisms allows the model to learn more generalized feature representations during training, reducing the risk of overfitting and enabling the model to perform well on unseen data.
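The scaled dot-product attention at the core of this mechanism can be sketched as a self-contained toy version operating on small lists of vectors (an illustration of the formula, not the study's implementation):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V for lists of vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Scaled dot-product score of this query against every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        w = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(wj * v[i] for wj, v in zip(w, V))
                    for i in range(len(V[0]))])
    return out
```

When all keys are identical, the weights become uniform and the output is simply the average of the values, which is a quick way to sanity-check the routine.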
3.3.4. Maximum Entropy Selection Optimization
The Maximum Entropy Selection Optimization strategy is grounded in the concept of entropy from information theory, which measures the uncertainty of a system and is used here to assess the potential contribution of each data point to model training. This strategy involves calculating the predictive entropy of each sample to determine its level of uncertainty, and then selecting those samples with the highest entropy values for training. The entropy of a sample is calculated using the following formula:

$$H(x) = -\sum_{c=1}^{C} p_c(x) \log p_c(x),$$

where $p_c(x)$ represents the probability that the sample $x$ belongs to category $c$ and $C$ is the total number of categories. This formula computes the entropy of the predicted outcome for sample $x$, reflecting the model's uncertainty about the classification result for that sample. A higher entropy value indicates greater uncertainty in the model's prediction, suggesting that the sample contains a rich amount of information, making it more suitable for training the model. This can enhance the model's ability to learn from complex situations and improve its generalization capabilities.
When applied to the SAM, Maximum Entropy Selection Optimization significantly boosts the model’s performance in image-segmentation tasks, particularly when dealing with images that exhibit high variability and complexity. Traditional training of image-segmentation models often employs random or fixed-interval sample selection strategies, which can lead to inefficient training, especially in cases of unbalanced samples or when certain categories have fewer samples. In contrast, Maximum Entropy Selection Optimization dynamically evaluates the information content of each sample and prioritizes the training of those samples with the highest information content. This method not only accelerates the training process but also enhances the model’s ability to capture and learn discriminative features effectively.
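The selection rule itself is short enough to sketch directly: compute the predictive entropy of each sample and keep the top-k most uncertain ones (function names and the toy predictions are ours, for illustration only):

```python
import math

def entropy(probs):
    """H(x) = -sum_c p_c log p_c: the model's predictive uncertainty."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_most_uncertain(predictions, k):
    """Indices of the k samples whose predictions have the highest entropy."""
    ranked = sorted(range(len(predictions)),
                    key=lambda i: entropy(predictions[i]),
                    reverse=True)
    return ranked[:k]

# Toy 4-class predictions for three samples.
preds = [
    [0.98, 0.01, 0.005, 0.005],  # confident  -> low entropy
    [0.25, 0.25, 0.25, 0.25],    # maximally uncertain -> entropy log(4)
    [0.70, 0.20, 0.05, 0.05],
]
chosen = select_most_uncertain(preds, 2)
```

The uniform prediction attains the maximum entropy $\log C$, so sample 1 is always selected first; the confident prediction is ranked last.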
3.3.5. Jump Loss
In the context of deep learning, particularly in deep networks, the design of the loss function is critical for the training outcomes and overall performance of the model. Traditional loss functions are typically computed at the final layer of the network. Even with the application of the backpropagation algorithm, deep networks can face issues like vanishing or exploding gradients, which limit the stability and efficiency of model training. To address these challenges, a novel loss function, termed Jump Loss, is proposed in this work. It calculates loss at every jump connection, facilitating deeper information transfer and more effective gradient flow. The design of the Jump Loss function is based on the principle that each layer of the decoder should use features transmitted from the corresponding layer of the encoder for accurate reconstruction. This design not only improves the efficiency of information utilization but also strengthens the connections between different layers within the network, particularly beneficial for those deep layers that might otherwise be difficult to train due to gradient vanishing. The mathematical expression for Jump Loss is as follows:
$$\mathcal{L}_{\mathrm{Jump}} \;=\; \sum_{l=1}^{L} \lambda_l\, \mathcal{L}_l, \qquad \mathcal{L}_l \;=\; \frac{1}{N}\sum_{n=1}^{N} \ell\!\left(y_n^{(l)},\, \hat{y}_n^{(l)}\right)$$
Here, $L$ denotes the total number of layers in the network, $\mathcal{L}_l$ represents the local loss at layer $l$, $\lambda_l$ is the loss weight for layer $l$, $N$ is the number of samples in the batch, $y_n^{(l)}$ is the actual label of the $n$th sample at layer $l$ and $\hat{y}_n^{(l)}$ is the corresponding predicted output. This loss function takes into account the outputs of all intermediate layers, focusing not only on the errors of the final output but also optimizing the expressive capabilities of the intermediate layers, thereby enhancing the overall learning effectiveness of the network.
The design of Jump Loss can be interpreted from perspectives of information theory and gradient flow. In traditional single-loss functions, gradients must propagate through multiple layers to reach the bottom of the network, which can lead to gradient vanishing in deep networks as information is progressively lost during transmission, making it challenging to effectively train lower network parameters. Jump Loss, by introducing a loss at every layer, directly reinforces the gradient signal, ensuring that each layer is directly supervised, thereby effectively preventing the problem of gradient vanishing. Furthermore, Jump Loss also aids in enhancing the generalization ability of the model. As each layer is directly accountable for the output, the network can learn more diverse and robust feature representations, which is crucial for detailed image processing, such as identifying subtle defects and quality variations in apple defect-detection and quality-grading tasks. In tasks like apple defect detection and quality grading, Jump Loss offers several advantages:
Improved Accuracy: By computing loss at each layer, the model can learn features at different levels more meticulously, which is crucial for precisely identifying minor surface defects on apples.
Enhanced Model Stability: Jump Loss provides more stable gradient signals during training, avoiding common issues of training instability in traditional deep models.
Accelerated Convergence: Since each layer is directly responsible for the final outcome, the model can quickly adjust its direction early in training, reducing ineffective iterations and speeding up convergence.
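The per-layer formulation described above can be sketched in a few lines. The snippet below is a simplified, framework-free illustration of the weighted sum of local losses; binary cross-entropy stands in for the local loss term purely for concreteness, and the layer weights and shapes are hypothetical.

```python
import math

def bce(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over a flat list of pixel predictions."""
    n = len(y_true)
    return -sum(y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
                for y, p in zip(y_true, y_pred)) / n

def jump_loss(layer_targets, layer_preds, weights):
    """Weighted sum of local losses, one per supervised skip connection,
    mirroring the sum over layers of lambda_l * L_l described in the text."""
    return sum(w * bce(t, p)
               for w, t, p in zip(weights, layer_targets, layer_preds))

# Two supervised levels with hypothetical weights 0.4 and 0.6.
targets = [[1, 0, 1], [1, 1, 0]]
preds   = [[0.9, 0.2, 0.8], [0.7, 0.6, 0.3]]
total = jump_loss(targets, preds, [0.4, 0.6])
```

Because every supervised level contributes its own gradient, shallow decoder layers receive a direct error signal instead of one relayed through many layers, which is the mechanism behind the stability and convergence benefits listed above.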
3.4. Evaluation Metrics
In the study of apple defect detection and quality grading, the metrics used to assess model performance are crucial for validating the effectiveness of the methodologies employed. Precision, recall, accuracy and mean Intersection over Union (mIoU) have been selected as the primary evaluation metrics to measure the model’s performance in identifying apple defects and assessing quality.
These metrics enable a comprehensive evaluation of the model’s performance in tasks related to apple defect detection and quality grading. The balance between precision and recall is particularly crucial, as an excessively high false positive rate can lead to unnecessary losses in practical production, while a high false negative rate could compromise product quality. Accuracy provides a holistic assessment of performance, whereas mIoU focuses more on the accuracy of segmentation in complex image backgrounds.
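Of these metrics, mIoU is the least standard to compute by hand. The following is a minimal sketch of per-class intersection-over-union averaged over classes, applied to flattened segmentation masks; the two-class layout (background vs. defect) is illustrative only.

```python
def confusion_counts(y_true, y_pred, num_classes):
    """Per-class (intersection, union) pixel counts from flat label lists."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for t, p in zip(y_true, y_pred):
        if t == p:
            inter[t] += 1
            union[t] += 1
        else:
            union[t] += 1
            union[p] += 1
    return inter, union

def mean_iou(y_true, y_pred, num_classes):
    """Mean Intersection over Union, skipping classes absent from both masks."""
    inter, union = confusion_counts(y_true, y_pred, num_classes)
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious)

# Toy 2-class mask: background (0) vs. defect (1).
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
print(mean_iou(y_true, y_pred, 2))   # → 0.5
```

Precision, recall and accuracy follow directly from the same confusion counts, which is why all four metrics are typically reported together.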
3.5. Baseline Models
For the validation of the newly proposed method for apple defect detection and quality grading, several deep learning models have been selected as baselines for comparison. These models include U-Net [46], SegNet [47], PSPNet [48], UNet++ [49], DeepLabv3+ [50] and HRNet [51], all of which have demonstrated outstanding performance in the field of image segmentation.
By comparing with these established models, a more comprehensive evaluation of the performance of the newly proposed method is facilitated, and a deeper understanding of its advantages and limitations in practical applications is gained. Not only is the technological advancement of the new method validated, but an in-depth analysis of each model’s performance also reveals their potential applications in tasks related to apple defect detection and quality grading.
3.6. Experimental Setup
3.6.1. Testbed and Platform
Initially, in terms of hardware configuration, the experiments were conducted on a server equipped with an NVIDIA Tesla V100 GPU (NVIDIA Corporation, Santa Clara, CA, USA). The Tesla V100, known for its powerful computing capabilities and efficient parallel processing, provides the hardware support required by deep learning models and is particularly suited to large datasets and complex model architectures. The server was also equipped with ample RAM and high-speed SSD storage to ensure efficient data loading and processing. Regarding software configuration, all models were developed and trained in a Python 3.9 environment using the TensorFlow 2.16.1 and Keras 2.6.0 frameworks. TensorFlow offers a flexible and powerful platform supporting various types of deep learning models, while Keras, with its simplicity and modular design, facilitates model construction and iterative experimentation.
3.6.2. Training and Test Strategy
For the training strategy, the Adam optimizer [52] was chosen to adjust the network weights, owing to its combination of momentum and adaptive per-parameter learning rates, which aids rapid convergence and improves training outcomes. Regarding hyperparameters, the initial learning rate was set to a small fixed value, balancing the speed of convergence while avoiding the instability caused by overly large step sizes. To address potential overfitting, a learning rate decay strategy was also implemented: the learning rate was automatically halved whenever performance on the validation set failed to improve over ten consecutive training epochs, down to a fixed minimum. This dynamic adjustment of the learning rate allows fine-tuning of the model in later training phases to optimize performance. Each model was trained for 50 epochs, a duration sufficient to reach convergence in complex image-segmentation tasks. An early stopping strategy was also applied: if no further improvement on the validation set was observed over 20 consecutive epochs, training was terminated early. This conserves computational resources and prevents the model from overfitting the training data at the expense of generalization capability.
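The plateau-based schedule just described (halve the learning rate after ten stagnant epochs, stop after twenty) can be expressed compactly. The class below is a framework-agnostic sketch of that logic rather than the study's actual training code; in Keras, the equivalent behavior is provided by the `ReduceLROnPlateau(factor=0.5, patience=10)` and `EarlyStopping(patience=20)` callbacks.

```python
class PlateauScheduler:
    """Halve the learning rate after `patience_lr` epochs without
    validation improvement; stop after `patience_stop` stagnant epochs."""
    def __init__(self, lr, patience_lr=10, patience_stop=20, min_lr=0.0):
        self.lr = lr
        self.best = float("inf")
        self.wait = 0
        self.patience_lr = patience_lr
        self.patience_stop = patience_stop
        self.min_lr = min_lr

    def step(self, val_loss):
        """Call once per epoch; returns False when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.wait = 0
            return True
        self.wait += 1
        if self.wait % self.patience_lr == 0:
            self.lr = max(self.lr / 2, self.min_lr)   # halve on plateau
        return self.wait < self.patience_stop

sched = PlateauScheduler(lr=1e-3)
sched.step(0.50)                       # improvement: counters reset
for _ in range(10):
    sched.step(0.60)                   # ten stagnant epochs
print(sched.lr)                        # → 0.0005
```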
In terms of model evaluation, a five-fold cross-validation method was employed to ensure the reliability and consistency of experimental results. In this method, the entire dataset was evenly divided into five subsets, with each subset taking turns serving as the test set while the remaining four subsets were used for training. This approach fully utilizes limited data resources by conducting multiple training and testing iterations to assess the average performance of the model, thus more accurately reflecting its behavior on unseen data. This is particularly crucial for evaluating apple defect-detection and quality-grading models, as it reduces random errors in the model assessment process, providing more robust performance metrics. Through these experimental settings, every research activity was conducted under controlled conditions, minimizing experimental errors and ensuring the reliability of the results.
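The five-fold split itself is simple to sketch. The helper below is hypothetical, not the study's code: it shuffles the indices once and rotates each fold into the held-out position, which is also what scikit-learn's `KFold(n_splits=5, shuffle=True)` does.

```python
import random

def k_fold_indices(n_samples, k=5, seed=42):
    """Yield (train, test) index lists; each of the k folds
    serves as the held-out test set exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)          # shuffle once, up front
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for m, f in enumerate(folds) if m != i for j in f]
        yield train, folds[i]

# Ten samples, five folds: every split covers all indices exactly once.
for train, test in k_fold_indices(10, 5):
    assert len(train) == 8 and len(test) == 2
    assert sorted(train + test) == list(range(10))
```

Per-fold metrics are then averaged to produce the reported scores, which is what makes the evaluation robust to any single lucky or unlucky split.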
4. Results and Discussion
4.1. Defect-Segmentation Experiment Results
In this section, the main objective of the experimental design is to validate the performance of the proposed deep learning models in apple defect detection and quality grading. By performing defect segmentation on apple images, the experiment evaluates the effectiveness of different models in accurately identifying and segmenting surface defects. The results are measured using four key performance indicators: Precision, Recall, Accuracy and mIoU. These metrics collectively reflect the models’ capabilities in identifying defect areas: recognition accuracy, the rate of missed defects, overall performance and the precision of the predicted regions.
Table 2 and Table 3 show varying levels of performance across different models. The U-Net model achieves a Precision of 0.80, Recall of 0.77, Accuracy of 0.79 and mIoU of 0.78, indicating a good baseline performance in apple defect-segmentation tasks, though it may have limitations in handling complex or subtly defined defects. The SegNet model, with improved performance over U-Net, has Precision, Recall, Accuracy and mIoU of 0.82, 0.79, 0.81 and 0.80, respectively. This improvement suggests better feature extraction and spatial information retention, attributable to its unique decoder design, which uses pooling indices from the encoder to guide up-sampling and thus better restores image details. PSPNet further enhances model performance, with Precision, Recall, Accuracy and mIoU reaching 0.85, 0.82, 0.84 and 0.83, respectively. This enhancement primarily originates from PSPNet’s pyramid pooling module, which captures context at various scales, enhancing the model’s comprehension of global image structures. UNet++ and DeepLabv3+ show advances in all metrics, demonstrating the superiority of deep supervision and multi-scale feature fusion in complex image-segmentation tasks. UNet++ enhances feature transmission and reuse through dense connections at each up-sampling node; meanwhile, DeepLabv3+ enhances segmentation accuracy and robustness through dilated convolutions, which expand the receptive field. HRNet further advances Precision, Recall, Accuracy and mIoU to 0.91, 0.88, 0.90 and 0.89, respectively. Its high performance benefits from a unique multi-scale parallel processing architecture that maintains high-resolution information flow, effectively balancing speed and precision.
The method described in this document achieves the best performance in all metrics, with Precision, Recall, Accuracy and mIoU of 0.93, 0.90, 0.91 and 0.92, respectively, demonstrating that advanced feature fusion technology and optimized network architecture design significantly enhance the precision and robustness of defect detection.
Theoretical analysis shows that the differences in performance among the models reflect the innovative aspects of their architectural designs. For instance, innovations in feature fusion and multi-scale processing in UNet++ and DeepLabv3+ make them more effective in handling images with complex backgrounds and subtle differences. HRNet optimizes the efficiency of information transmission and utilization by maintaining high-resolution feature flows. The method presented in this paper further optimizes these strategies, incorporating new network modules and training strategies, achieving the best performance in the apple defect-detection task, fully demonstrating the potential and future prospects of deep learning in image processing. The mathematical characteristics and design philosophies of these models provide robust theoretical support for solving practical problems and offer viable directions for future research.
4.2. Quality-Grading Experiment Results
In this section, the primary objective was to verify and compare the performance of different deep learning models in the task of apple quality grading. This task focuses on the models’ ability to accurately judge the quality levels of apples, which is crucial for agricultural production and commodity classification. The experiment evaluates model performance using three key metrics: precision, recall and accuracy, which collectively describe the effectiveness and reliability of the models in quality grading.
From Table 4 and Table 5, it can be observed how each model performs in the task of apple quality grading. The U-Net model, serving as a baseline, shows a balanced performance with a precision of 0.78, recall of 0.75 and accuracy of 0.77, but leaves room for improvement. The SegNet model slightly outperforms U-Net with precision, recall and accuracy of 0.80, 0.77 and 0.79, respectively. This could be attributed to its unique encoder-decoder architecture and the effective use of pooling indices, which helps preserve more spatial information and thus more accurately restores critical features in quality grading. PSPNet demonstrates further improvements with a precision of 0.83, recall of 0.80 and accuracy of 0.82, thanks to its pyramid pooling module that captures context information at different scales, crucial for identifying and classifying apples of various quality levels. UNet++ and DeepLabv3+ excel in all assessed metrics, with precisions of 0.85 and 0.87, recalls of 0.82 and 0.84 and accuracies of 0.84 and 0.86, respectively. UNet++ enhances information flow and feature integration through deep supervision and nested skip connections, significantly boosting the model’s learning and prediction capabilities. DeepLabv3+, with its atrous convolution strategy, expands the receptive field and enhances the model’s ability to capture image details, thus performing exceptionally well in precise grading. HRNet further enhances performance, achieving precision, recall and accuracy of 0.89, 0.86 and 0.88, benefiting from its high-resolution network structure that maintains high-resolution information throughout the network, aiding classification accuracy. The method described in this paper outperforms all baselines, with precision, recall and accuracy reaching 0.91, 0.88 and 0.90, respectively.
This superior performance is due to the integration of various network optimization techniques and efficient training strategies, such as advanced feature fusion technologies and effective loss functions, significantly enhancing the model’s accuracy in recognizing apple quality levels.
Theoretically, the mathematical characteristics and architectural designs of different models are key factors leading to these variations in results. For instance, PSPNet’s pyramid pooling effectively integrates information at various scales, adapting to the multi-scale feature requirements of apple quality grading. DeepLabv3+ and HRNet, through atrous convolution and high-resolution continuous connections, provide richer contextual information and continuous detail features, crucial for accurately identifying subtle quality differences. The method proposed in this paper, by integrating the advantages of the above technologies and introducing a network structure and algorithms optimized for specific tasks, achieves optimal performance. These models not only reflect the advancement of deep learning in image processing but also highlight the importance of model design and selection in facing complex application scenarios.
4.3. Different Loss Function Ablation Experiment
The main objective of the experiment is to evaluate the impact of various loss functions on model performance in apple defect-segmentation and quality-grading tasks. By comparing the performance of Cross-Entropy Loss, Focal Loss and Jump Loss in these tasks, the experiment aims to reveal the effects and applicability of different loss functions in handling unbalanced datasets and enhancing feature learning. This experiment is crucial for understanding the advantages and limitations of each loss function in practical applications and assists in selecting or designing loss functions better suited to specific tasks, as shown in Table 6.
In the defect-segmentation task, the model utilizing Cross-Entropy Loss demonstrated basic performance, with a precision of 0.83, a recall of 0.80 and an accuracy of 0.82. As a common loss function suitable for multi-class classification problems, Cross-Entropy Loss tends to perform inadequately when faced with class imbalance. Subsequently, the model using Focal Loss showed improved performance, with a precision of 0.88, a recall of 0.85 and an accuracy of 0.87. Focal Loss adjusts the weight of different class samples, reducing the contribution of easy-to-classify samples and thus focusing more on those that are difficult to classify, improving performance under class imbalance conditions. Jump Loss performed best in the defect-segmentation task, achieving a precision of 0.93, a recall of 0.90 and an accuracy of 0.91. By incorporating loss calculations at every network layer, Jump Loss enhances deep learning, ensuring effective transmission and learning of deep features, which is crucial for image-segmentation tasks that require precise pixel-level predictions. In the quality-grading task, the model using Cross-Entropy Loss showed basic classification capabilities with a precision of 0.82, a recall of 0.80 and an accuracy of 0.81, reflecting limited performance improvement potential in the face of class imbalance. The model with Focal Loss performed better, indicating its advantages in dealing with class imbalance, with a precision of 0.86, a recall of 0.84 and an accuracy of 0.85. Jump Loss also demonstrated the best performance in the quality-grading task, with a precision of 0.91, a recall of 0.88 and an accuracy of 0.90, proving its effectiveness in integrating multi-level features and enhancing classification accuracy.
Theoretical analysis reveals that the philosophical design and mathematical characteristics of different loss functions are the fundamental reasons for these experimental results. Cross-Entropy Loss focuses on the probability of each sample being classified correctly, suitable for basic classification tasks but limited under severe class imbalance. Focal Loss modifies Cross-Entropy Loss by reducing the weight of easy samples and increasing the influence of difficult samples, effectively improving the recognition of minority classes, which is crucial in apple quality grading where some quality levels may have significantly fewer samples than others. The design of Jump Loss, by incorporating loss calculations at every layer of the network, not only mitigates the loss of information during transmission but also enhances the network’s ability to capture details, especially in complex image-segmentation tasks where it can better handle details of edges and small regions, thereby achieving higher segmentation and classification accuracy.
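The down-weighting mechanism of Focal Loss is easy to see numerically. The snippet below is a minimal single-sample binary sketch; the values γ = 2 and α = 0.25 are the defaults from the original Focal Loss paper, not values reported in this study.

```python
import math

def cross_entropy(y, p, eps=1e-7):
    """Standard binary cross-entropy for a single sample."""
    p = min(max(p, eps), 1 - eps)
    return -math.log(p if y == 1 else 1 - p)

def focal_loss(y, p, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss: cross-entropy scaled by alpha_t * (1 - p_t)^gamma,
    which shrinks the contribution of well-classified samples."""
    p = min(max(p, eps), 1 - eps)
    p_t = p if y == 1 else 1 - p
    a_t = alpha if y == 1 else 1 - alpha
    return -a_t * (1 - p_t) ** gamma * math.log(p_t)

# An easy positive (p = 0.95) keeps only 0.25 * 0.05^2 ≈ 0.06% of its CE loss,
# while a hard positive (p = 0.30) keeps 0.25 * 0.70^2 ≈ 12% of it.
easy_ratio = focal_loss(1, 0.95) / cross_entropy(1, 0.95)
hard_ratio = focal_loss(1, 0.30) / cross_entropy(1, 0.30)
```

This is exactly the behavior exploited in the ablation: easy majority-class pixels contribute almost nothing, so gradients concentrate on rare, hard defect pixels.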
4.4. Limitations and Future Work
In the current study, an apple defect-detection and quality-grading system based on deep learning was successfully developed, integrating various models and techniques such as the Jump Connection SAM model, Jump Connection Attention Mechanism and Maximum Entropy Sampling Optimization. These significantly enhanced segmentation and classification precision in complex image backgrounds. Despite experimental results demonstrating superior performance over existing methods, several limitations within the practical application of the system remain to be addressed in future work. Firstly, although the system performs well with complex backgrounds and apples at different maturity stages, its generalization capability requires enhancement. The model primarily trains and tests on specific datasets from two particular geographic locations and harvest seasons. Such data limitation might reduce performance when facing broader natural variations, such as apples maturing under different regional climates. Future research could expand data collection to include apple images from more areas and various seasons, enhancing the model’s adaptability and generalization. Secondly, the loss functions used, while showing good performance in experiments, still need further evaluation under specific conditions like highly imbalanced data distributions or extreme noise. For instance, Jump Loss, despite fostering deep feature learning, could lead to overfitting, particularly in categories with sparse data. Future studies could explore integrating Jump Loss with other regularization techniques or novel loss functions to better balance model training stability and prediction accuracy.