1. Introduction
The food industry is concerned with the processing, production, handling, storage, preservation, control, packaging, and distribution of food products made from raw foods [
1]. The early detection of defects in raw food would make food production and selection processes more efficient. Kumar Pothula et al. [
2] propose a new singulating and rotating mechanism for in-field grading and sorting, and Nturambirwe et al. [
3] and Firouz et al. [
4] review the automatic non-destructive techniques for horticultural quality assessment. The detection of defects in this case would allow food prices to be categorized according to product quality, thus better matching consumer expectations with respect to a specific product category [
5].
Detecting defects during production, handling, storage, preservation, etc. is crucial for the prompt and appropriate categorization of products, especially in the food industry. To this end, it is essential to detect defects within a timeframe that is shorter than, or at least comparable to, the product processing time itself, thus ensuring that timely feedback or alerts are provided.
Defect detection can be carried out through various methods. First, human operators can perform manual defect detection to actively monitor the manufacturing process. Alternatively, computer-based systems equipped with imaging devices or sensors can automatically monitor the production process. Lastly, there is the option of semi-automated detection, where humans interact with computer-based monitoring systems [
6] to analyze critical cases.
In this paper, we focus on the quality control of raw food, which is usually based on shapes, colors, and textures [
7], and in particular we address the problem of apple quality control. Defect detection of apples is a relevant problem since the apple is the second most consumed fruit in the world, following the banana [
8]. China is the largest apple producer in the world; in 2013 it produced almost 40 million metric tons on a cultivation area of 2.1 million hectares. Other leading countries include Turkey, the United States, Poland, and India (see
Table 1). The majority of apples produced in the world are destined for fresh consumption, so a defect detection method is crucial to improve the market value of these fruits. Defects, whether serious or slight, can be introduced either naturally or non-naturally during the apple life cycle. Natural defects are produced during the fruit’s growth on the tree, for example, frost damage, rot, hail damage, flesh damage, and scald. Non-natural defects are introduced from the harvest operation onwards, for example, through spoilage, mechanical damage, spillage during transportation, degradation, or handling during grading, sorting, washing, and distribution in the stores. Typical non-natural defects are bruises, russets, scar tissue, limb rubs, and flesh damage.
In this paper, we focus on apple defect detection using automatic visual inspection methodologies based on computer vision and deep learning. The majority of the state-of-the-art methods treat apple defect detection as a classification problem, where each apple sample is classified as defective or non-defective without any information about the defect localization and size [
9]. However, the European Commission in 2004 defined the marketing standard for apples, asserting that apples are classified in three classes on the basis of the grade of defects in terms of size and visual characteristics [
10]. To comply with European standards, automatic methods for apple defect detection should also output the segmentation of the defective regions. Defect segmentation also makes it possible to sell some apples at a lower price when the defect covers only a small portion of the surface.
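To illustrate why a segmentation output is useful for size-based grading, the following minimal sketch computes the fraction of the apple surface covered by defects from two binary masks. The mask representation (nested lists of 0/1), the function names, and the two limits passed to `grade_by_defect_area` are hypothetical; the limits are placeholders supplied by the caller and are not values taken from the European standard.

```python
def defect_fraction(apple_mask, defect_mask):
    """Return the share of apple pixels that are also marked as defective."""
    apple_pixels = 0
    defect_pixels = 0
    for apple_row, defect_row in zip(apple_mask, defect_mask):
        for is_apple, is_defect in zip(apple_row, defect_row):
            if is_apple:
                apple_pixels += 1
                if is_defect:
                    defect_pixels += 1
    if apple_pixels == 0:
        raise ValueError("the apple mask contains no apple pixels")
    return defect_pixels / apple_pixels


def grade_by_defect_area(fraction, limit_first, limit_second):
    """Map a defect fraction to one of three illustrative grades.

    limit_first and limit_second are caller-supplied placeholders
    (hypothetical), with limit_first <= limit_second.
    """
    if fraction <= limit_first:
        return "first"
    if fraction <= limit_second:
        return "second"
    return "third"


# Toy example: 8 apple pixels, 2 of which are defective.
apple = [[0, 1, 1, 0],
         [1, 1, 1, 1],
         [0, 1, 1, 0]]
defect = [[0, 0, 1, 0],
          [0, 0, 1, 0],
          [0, 0, 0, 0]]
fraction = defect_fraction(apple, defect)
```

A binary defective/non-defective label could not support this kind of graded decision, since it carries no information about the extent of the defect.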
Huang et al. [
11] found that the use of hyperspectral and multi-spectral cameras instead of RGB cameras may improve defect detection, since some classes of defects are more visible in specific spectral bands. Multi-spectral imaging for raw foods can be performed using two common measurement modes, namely reflectance and transmittance. The difference between them is related to the lighting and detector configurations. The reflectance mode can obtain information concerning the sample surface in terms of color, size, shape, and surface defects, without any contact with it [
12]. The transmittance mode measures the amount of light that passes through the sample [
13], thus providing information regarding the internal part of the sample. Concerning apples, the reflectance mode is the most common choice because it is non-destructive, fast, simple, low-cost, and environmentally friendly. In this field, Rahi et al. [
14] review the spectroscopy and spectral imaging techniques for non-destructive food microbial assessment, Gui et al. [
15] propose a CNN-SVM model for an SMV classification method trained on hyperspectral images with 256 bands in the range 383.70∼1032.70 nm, while Liu et al.’s [
16] system discriminates and eliminates damaged soybean seeds based on the acquired RGB images.
Notwithstanding the importance of visual-based automatic detection of apple defects, the databases used to develop these types of systems are usually not available for research. In this work, we employ the only publicly available multi-spectral database of apples (reflectance mode: RGB + near-infrared, NIR). The database has precise segmentation and categorization of defects [
17]. It consists of 280 and 256 images of healthy and defective
Jonagold apples, respectively. The multi-spectral (RGB + NIR) images are acquired in a custom setup that includes a four band multi-spectral image acquisition device and a specific set of lighting sources.
Before the emergence of convolutional neural networks (CNNs), traditional computer vision techniques were commonly used for defect segmentation tasks. These methods often relied on handcrafted features and rule-based algorithms to detect and segment defects on apple surfaces. Examples include the following: Mizushima and Lu [
18] propose an image segmentation algorithm for apple sorting and grading based on support vector machine and Otsu’s method; Unay et al. [
19] propose several thresholding and classification-based techniques for defect segmentation on ‘Jonagold’ apples; Kleynen et al. [
17] implement filters of specific spectral bands for a multi-spectral image acquisition device for defect detection on apples.
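As a concrete example of the thresholding family of techniques mentioned above, the sketch below implements Otsu's method on a flat list of 8-bit grey levels. It only illustrates the threshold selection step; it is not the full algorithm of Mizushima and Lu, which also involves a support vector machine. The function name and the toy pixel values are assumptions made for illustration.

```python
def otsu_threshold(pixels, levels=256):
    """Return the grey level t that maximizes the between-class variance.

    Pixels with value <= t form one class, pixels with value > t the other.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_between = 0, -1.0
    w0 = 0    # number of pixels in the lower class
    sum0 = 0  # sum of grey levels in the lower class
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue      # lower class still empty
        w1 = total - w0
        if w1 == 0:
            break         # upper class empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance up to the constant factor 1 / total**2.
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_between:
            best_between, best_t = between, t
    return best_t


# Toy bimodal data: dark background pixels and bright fruit pixels.
pixels = [10] * 50 + [200] * 50
t = otsu_threshold(pixels)
foreground = [p > t for p in pixels]
```

A single global threshold of this kind depends only on the grey-level histogram, which is one reason such methods struggle with small defects whose intensity overlaps that of healthy skin.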
Convolutional neural networks (CNNs) have emerged as a powerful approach for defect segmentation tasks, including the specific task of apple defect segmentation. CNNs are deep learning models that excel at analyzing visual data and have the ability to automatically learn and extract relevant features from images [
20].
By utilizing CNNs for apple defect segmentation, researchers have achieved significant improvements in accuracy and efficiency compared to traditional computer vision techniques. CNNs can handle complex and varied defect patterns, adapt to different lighting conditions, and learn robust representations of the defects, making them a state-of-the-art approach in this field [
21].
To train these CNN models, large datasets of annotated apple images are required. These datasets contain images of apples with different types and severities of defects, along with corresponding manual annotations that indicate the locations and boundaries of the defects. The process of creating these annotated datasets involves expert human annotators who carefully label the defects in the images [
17].
The main limitation of deep neural networks for defect detection is related to the amount and diversity of the data available in the training process. Acharya et al. [
22] generate synthetic data to overcome imbalance problems. Recent works increase the number of training images by applying different traditional augmentation techniques, such as salt-and-pepper noise, Gaussian noise, flips, rotations, and brightening and darkening operations [
23]. However, these augmentation algorithms are not sufficient to increase the diversity of defects in the database, and they can alter the natural appearance of the defects.
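For reference, a few of the traditional augmentation operations listed above can be sketched as follows for a single-channel image stored as nested lists of 8-bit values. This is a generic illustration, not the pipeline of any cited work; the function names and the `amount` parameter are assumptions.

```python
import random


def flip_horizontal(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def rotate_90(img):
    """Rotate the image clockwise by 90 degrees."""
    return [list(col) for col in zip(*img[::-1])]


def adjust_brightness(img, delta):
    """Add delta to every pixel, clipping to the 0..255 range."""
    return [[min(255, max(0, p + delta)) for p in row] for row in img]


def salt_and_pepper(img, amount, rng):
    """Replace roughly a fraction `amount` of pixels with 0 or 255."""
    noisy = []
    for row in img:
        new_row = []
        for p in row:
            r = rng.random()
            if r < amount / 2:
                new_row.append(0)      # pepper
            elif r < amount:
                new_row.append(255)    # salt
            else:
                new_row.append(p)
        noisy.append(new_row)
    return noisy


img = [[1, 2],
       [3, 4]]
rng = random.Random(0)  # fixed seed so the noise is reproducible
flipped = flip_horizontal(img)
rotated = rotate_90(img)
noisy = salt_and_pepper(img, 0.5, rng)
```

In a segmentation setting, geometric operations such as flips and rotations have to be applied identically to the ground-truth mask, whereas photometric operations such as noise and brightness changes leave the mask untouched. None of these operations creates a new defect shape or texture, which is the limitation noted above.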
Moreover, CNNs are often unable to run on devices with low computational power or to handle high-resolution images in real time. The most straightforward way to reduce the computational complexity is to resize the image, but this operation introduces artifacts and can alter the defects [
24].
This paper proposes an advanced deep learning methodology for the automatic segmentation of apple defects. Our approach is based on a U-shaped convolutional neural network (CNN) architecture [
25]. To address the challenge of limited training data, and thus reduce overfitting during training, we propose a novel data synthesis technique that increases the number of samples available for training. To assess the effectiveness of our approach, we conduct evaluations using the multi-spectral apple dataset proposed by Kleynen et al. [
17].
Our proposed methodology achieves higher segmentation accuracy than commonly used deep learning architectures designed for general segmentation tasks, and it improves on existing methods for apple defect segmentation. Additionally, we evaluate the computational cost of our proposal, demonstrating its suitability for real-time applications. On a GPU, our method runs at approximately 100 frames per second, while on a CPU it achieves a quasi-real-time performance of about 7 to 8 frames per second in visual inspection processes. To enhance the versatility of our method, we investigate the feasibility of using only RGB images as input data instead of multi-spectral images. Encouragingly, the results show that the accuracy achieved in this scenario is nearly comparable to that of the multi-spectral approach. The experiments can be reproduced using the code made available at the following address:
https://github.com/cimice15/Quasi_real-time_apple_defect_segmentation (accessed on 8 September 2023).
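Frame rates such as those quoted above are typically obtained by timing repeated inference calls. The sketch below shows one generic way to do this; it is not the measurement code from the repository, and `measure_fps`, `process_frame`, and the `warmup` parameter are hypothetical names introduced for illustration.

```python
import time


def measure_fps(process_frame, frames, warmup=2):
    """Return the number of frames processed per second.

    process_frame is any callable that takes one frame; a few warm-up
    calls are made first so that one-off start-up costs are not timed.
    """
    for frame in frames[:warmup]:
        process_frame(frame)
    start = time.perf_counter()
    for frame in frames:
        process_frame(frame)
    elapsed = time.perf_counter() - start
    if elapsed == 0:
        return float("inf")  # too fast for the timer resolution
    return len(frames) / elapsed


# Stand-in workload: a real pipeline would call the segmentation model here.
frames = [[1, 2, 3]] * 100
fps = measure_fps(lambda frame: sum(frame), frames)
```

When the model runs on a GPU, the framework usually executes work asynchronously, so the device has to be synchronized before the timer is read; otherwise the measured rate is optimistic.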
The paper is organized as follows: Section 2 presents related works, Section 3 presents the database used in our experiments and the method we propose, and Section 4 presents the evaluation metrics and experimental setups. Finally, Section 5 discusses the results of the proposed method in comparison with state-of-the-art methods.
2. Related Works
We approach apple quality assessment as a defect segmentation task because binary classification limits both the performance evaluation that is possible and the market value of the algorithms. In this section, we therefore analyze the methods specifically designed for apple defect segmentation and the methods that perform apple classification by exploiting defect segmentation. The result of this analysis is reported in
Table 2.
Unay et al. [
19] propose an approach that compares several thresholding and classification-based techniques for apple defect segmentation. The method extracts global and local apple features using neighbourhoods of different sizes and shapes. Subsequently, defect detection is conducted using thresholding or classification. Thresholding applies a global or local threshold, while classification uses supervised and unsupervised methods. The results show that the multi-layer perceptron (MLP) is the most promising architecture for the segmentation of surface defects.
Xiaobo et al. [
31] propose a multi-threshold method to segment the apple image from the black background. Then, the algorithm of Yang et al. [
32] identifies patch-like defects, including calyxes and stem-ends. When two or more
regions of interest (ROIs) are identified, the apple is assigned to the rejection class. In this case, the segmentation of the defect is a preliminary stage of the apple classification.
Zhang et al. [
26] propose an approach for identifying defects in apples using a combination of the fuzzy C-means algorithm and the nonlinear programming genetic algorithm (FCM-NPGA), along with multivariate image analysis. Initially, the image was subjected to denoising and enhancement through fractional differentiation. This process eliminated noise and edge points while retaining essential texture details. Subsequently, the FCM-NPGA algorithm was employed to segment potentially defective regions within the apple. Ultimately, a strategy founded on multivariate image analysis was employed to identify flaws within the mapped regions indicative of potential defects in the apples.
Bhargava et al. [
29] apply a threshold to segment the apple from the background, and then segment the defective area using fuzzy c-means. The method is designed for apple classification, so feature extraction is performed using various combinations of statistical, textural, geometrical, Gabor wavelet, and discrete cosine transform features. Finally, three different classifiers are applied for classification, namely KNN, SRC (sparse representation classifier), and SVM (support vector machine).
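Fuzzy c-means, used for defect segmentation in the two preceding methods, assigns each pixel a degree of membership to every cluster rather than a hard label. The following is a minimal sketch of the standard algorithm on scalar grey levels, assuming a fuzzifier m = 2 and a fixed number of iterations; it is not the FCM-NPGA variant, and the function name, initial centers, and toy data are assumptions.

```python
def fuzzy_c_means_1d(values, centers, m=2.0, iterations=50, eps=1e-12):
    """Cluster scalar values with fuzzy c-means.

    Returns (centers, memberships), where memberships[i][k] is the degree
    to which values[i] belongs to cluster k, computed before the final
    center update. Each row of memberships sums to 1.
    """
    centers = list(centers)
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for _ in range(iterations):
        # Membership update: the closer a center, the higher the membership.
        memberships = []
        for x in values:
            dists = [max(abs(x - c), eps) for c in centers]
            row = [1.0 / sum((dk / dj) ** exponent for dj in dists)
                   for dk in dists]
            memberships.append(row)
        # Center update: mean of the values weighted by membership ** m.
        for k in range(len(centers)):
            weights = [row[k] ** m for row in memberships]
            centers[k] = sum(w * x for w, x in zip(weights, values)) / sum(weights)
    return centers, memberships


# Toy grey levels: a dark group (e.g. defect) and a brighter group.
values = [0, 1, 2, 10, 11, 12]
centers, memberships = fuzzy_c_means_1d(values, [0.0, 12.0])
```

A hard segmentation can then be obtained by assigning each pixel to the cluster with the highest membership.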
Huang et al. [
11] developed a multi-spectral imaging system to select the most appropriate wavelengths for apple defect classification using principal component analysis (PCA). Although the method is developed for the classification of normal or bruised apples, the analysis highlights that three effective wavelengths are feasible for bruise segmentation on apples.
Lu et al. [
28] developed a multi-spectral structured illumination reflectance imaging (SIRI) system to acquire near-infrared images of apples with various types of surface and subsurface defects. Direct component (DC) and amplitude component (AC) images are extracted and enhanced using bi-dimensional empirical mode decomposition (BEMD). Defect detection algorithms are developed using random forest (RF), SVM, and CNN.
The most recent method proposed by Fan et al. [
30] combines NIR images provided by three consecutive rubber roller stations. Then, the defect detection and classification are performed using a pruned YOLO V4 network.
Since our method is based on a U-shaped CNN architecture, we compare it with other U-shaped architectures as well as with other relevant CNN architectures specifically designed for image segmentation. In particular, we compare with a traditional U-Net [
33] and a variant, namely U-Net++ [
34], the Pyramid Scene Parsing Network (PSPNet) [
35], DeepLabv3 [
36] and PAN (Pyramid Attention Network) [
37]. The U-Net network consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. Its variant U-Net++ is aimed at reducing the semantic gap between the feature maps of the encoder and decoder sub-networks through a series of nested and dense skip pathways. PSPNet embeds difficult scenery context features in a fully convolutional network-based pixel prediction framework. The DeepLabv3 network employs atrous convolution in cascade or in parallel to capture multi-scale context by adopting multiple rates. Finally, the PAN network combines an attention mechanism and a spatial pyramid to extract precise dense features for pixel labeling instead of complicated dilated convolution and artificially designed decoder networks.
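The notion of atrous (dilated) convolution with multiple rates can be illustrated in one dimension. In the sketch below, which is a simplified illustration and not DeepLabv3 itself, the kernel taps are spaced `rate` samples apart, so the same three weights cover a wider stretch of the input as the rate grows. As is usual in deep learning, the operation is implemented as a correlation (the kernel is not flipped); the function name and the toy signal are assumptions.

```python
def atrous_conv1d(signal, kernel, rate=1):
    """'Valid' 1-D atrous convolution with kernel taps `rate` samples apart."""
    span = (len(kernel) - 1) * rate  # receptive field is span + 1 samples
    output = []
    for start in range(len(signal) - span):
        output.append(sum(k * signal[start + i * rate]
                          for i, k in enumerate(kernel)))
    return output


signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 0, -1]
dense = atrous_conv1d(signal, kernel, rate=1)    # receptive field of 3 samples
dilated = atrous_conv1d(signal, kernel, rate=2)  # receptive field of 5 samples
```

With rate 1 each output compares samples two positions apart, while with rate 2 it compares samples four positions apart, using the same number of weights. Applying several rates in cascade or in parallel is what allows context to be captured at multiple scales.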
6. Final Remarks and Conclusions
In this paper, we proposed a novel deep learning approach for the automated segmentation of apple defects using a convolutional neural network (CNN) based on a U-shaped architecture with skip-connections only within the noise reduction block. To increase the number of samples and at the same time to reduce neural network overfitting, we designed an ad-hoc data synthesis technique to generate new images of apples having a large variety of defects. We compared the proposed approach, both with and without data synthesis, with different CNN-based methods, namely Pyramid Scene Parsing Network (PSPNet) [
35], DeepLabv3 [
36] and PAN (Pyramid Attention Network) [
37], and a hand-crafted method proposed by Unay et al. [
19]. To further improve the applicability of the method, we also investigated the potential of using only RGB images instead of multi-spectral (RGB + NIR) ones.
The results show that all the tested CNN-based approaches outperform, by about 35% in terms of f-score, the hand-crafted algorithm proposed by Unay et al. [
19] in both RGB and RGB + NIR configurations. This behavior was expected since the method by Unay et al. [
19] is pixel-wise so it does not consider spatial correlations that are typical in small defective regions. On the contrary, CNN-based approaches process the input image using different receptive fields that permit taking into account spatial correlations among defective pixels.
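For readers who want to reproduce this kind of comparison, the sketch below computes a pixel-wise f-score between a predicted and a ground-truth binary defect mask. It assumes the f-score is the usual F1 measure, that is, the harmonic mean of precision and recall; the function name and the convention of returning 0.0 when there are no true positives are assumptions.

```python
def f_score(pred_mask, true_mask):
    """Pixel-wise F1 score between two binary masks of the same shape."""
    tp = fp = fn = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            if p and t:
                tp += 1      # defect pixel correctly detected
            elif p and not t:
                fp += 1      # healthy pixel marked as defect
            elif t and not p:
                fn += 1      # defect pixel missed
    if tp == 0:
        return 0.0           # convention chosen for this sketch
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


pred = [[1, 1, 0, 0]]
truth = [[1, 0, 1, 0]]
score = f_score(pred, truth)  # one hit, one false alarm, one miss
```

Because true negatives do not enter the formula, the score is not inflated by the large healthy area that surrounds a small defect.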
Our method outperforms the other CNN-based approaches, in terms of f-score, by an average of about 6% (about 15% over the worst and about 2% over the second best) with the RGB + NIR configuration. The gap between our method and the other CNN-based approaches is more evident in the case of RGB, where it is about 9% on average. Our architecture requires fewer computations than the other CNN-based approaches we experimented with, since it includes a few bottlenecks and uses skip-connections only within the noise reduction block. Since the dataset contains only a small number of defective apples, neural models with a lower capacity are more likely to work better.
To demonstrate that the quality of our approach does not rely only on the proposed data synthesis procedure, we also applied the procedure to the other CNN-based approaches. Results show that, on average, data synthesis improves the performance of all the CNN-based approaches we experimented with. However, the best of these CNN-based approaches still achieves a lower performance than our proposal, which demonstrates the merit of our architecture. In fact, our method achieves an f-score about 4% higher than the state-of-the-art (about 9% over the worst and about 1% over the second best) with the RGB + NIR configuration. In the case of RGB, the average increment is about 3%.
Experimental results confirm that using learning-based methods instead of hand-crafted ones enables the use of RGB alone instead of RGB + NIR, thus enabling the use of conventional cameras in a visual-based apple inspection pipeline. Conventional RGB cameras cost less than multi-spectral ones and, more importantly, they permit faster processing since the amount of information to be processed is lower.
Finally, the approach we present has the potential to significantly impact automatic visual inspection applications that face the following critical challenges: constrained computational resources and a scarcity of annotated data. To satisfy these real-world constraints, hand-crafted computer vision techniques for feature extraction are typically applied in conjunction with traditional machine-learning classifiers such as support vector machines. It is well known that convolutional neural networks can obtain better results. However, these methodologies require large annotated datasets for the training process and significant computational resources at both the training and operating stages. In this regard, our proposed method obtains results that allow us to overcome the aspects mentioned above. Firstly, the superiority of deep neural networks over hand-crafted methodologies demonstrates the capability of our data synthesis technique as a tool to mitigate the lack of training data for these networks. Secondly, our neural network outperforms the state-of-the-art alternatives in terms of both accuracy and efficient utilization of computational resources, demonstrating that a well-engineered neural network architecture can give rise to tailored solutions capable of satisfying the computational constraints commonly encountered in real-time applications.
In this work, we focused on segmenting defects as precisely as possible. Given such a segmentation, defect classification becomes much easier with both neural and hand-crafted methods. Our segmentation and data augmentation methods could also be used on other types of fruit.