1. Introduction
Computer vision is widely used in digital agriculture [1]. As a basic task in computer vision, semantic segmentation is applied in all aspects of agricultural automation. Semantic segmentation technology also provides strong technical support for the work of precision agricultural robots. The semantic segmentation of crop fruits can help robots detect and locate fruit positions and realize automatic picking, thereby reducing manual participation in agriculture, improving agricultural efficiency, and reducing production costs.
A widely used semantic segmentation network is the Fully Convolutional Network (FCN) proposed by Long et al. [2]. Wang et al. used the FCN to realize the recognition of wheat ear targets, which is difficult to achieve with traditional methods [3]. The Pyramid Scene Parsing Network (PSPNet) [4] made some improvements based on the FCN, adding an encoder–decoder structure that allows for more refined deconvolution results, improved classification accuracy, and improved overall efficiency [5]. Deng et al. [6] used PSPNet to segment kiwifruit vines in kiwifruit orchard images. The DeepLab semantic segmentation method uses Atrous Spatial Pyramid Pooling (ASPP), which can expand the receptive field without changing the resolution while fusing features of different levels [7]. DeepLabV3+ was found to achieve better segmentation results than earlier segmentation methods such as FCN and PSPNet. Zhang et al. used DeepLabV3+ to segment the lodging area of wheat at different growth stages with high segmentation accuracy [8]. Using DeepLabV3+, Sharifzadeh et al. detected farm crops in low-resolution satellite images with a better edge segmentation effect [9].
However, the DeepLabV3+ method also has some shortcomings. Firstly, its computational complexity is relatively high: its feature extraction network, Xception, has many layers and parameters, and the ASPP module uses ordinary convolution, which further increases the number of parameters. Secondly, its feature information extraction can be improved. During feature extraction at the encoder, the spatial dimension of the input data is gradually reduced, so useful information is lost and details cannot be recovered well during decoding. Finally, its target edge recognition accuracy is relatively low. Although the ASPP module improves the method's ability to extract target boundaries, it cannot fully model the relationships between the local features of a target, which reduces target segmentation accuracy and leads to problems such as low recognition accuracy and poor edge recognition.
In view of the above issues, in order to achieve apple fruit segmentation more efficiently and accurately, DeepMDSCBA (a network segmentation model based on DeepLabV3+ with the MobileNet, Depthwise Separable Convolution, and Convolutional Block Attention modules) is proposed in this paper. DeepMDSCBA was constructed based on the DeepLabV3+ structure, and the more lightweight MobileNet network [10] was adopted as its backbone network instead of the original Xception network to reduce the amount of parameter calculation and memory usage. In DeepMDSCBA, Depthwise Separable Convolution (DSC) [11] is used to replace the ordinary convolution in the ASPP module to improve the calculation speed of the method. In the feature extraction and ASPP modules of DeepMDSCBA, a Convolutional Block Attention Module (CBAM) [12] is added to filter the background information and reduce the loss of image edge detail information. Because of these improvements, the information processing efficiency and accuracy of DeepMDSCBA were found to be improved, increasing the accuracy of the segmentation model.
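For concreteness, the following is a minimal PyTorch sketch of the two building blocks named above, DSC and CBAM; the channel counts, reduction ratio, and kernel sizes are illustrative assumptions rather than the exact configuration used in DeepMDSCBA.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """A depthwise conv (one filter per input channel) followed by a
    1x1 pointwise conv, cutting parameters versus ordinary convolution."""
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1):
        super().__init__()
        padding = dilation * (kernel_size - 1) // 2
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, padding=padding,
                                   dilation=dilation, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.pointwise(self.depthwise(x))))

class CBAM(nn.Module):
    """Channel attention followed by spatial attention, as in Woo et al. [12]."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(channels, channels // reduction),
                                 nn.ReLU(inplace=True),
                                 nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        b, c, _, _ = x.shape
        # Channel attention: shared MLP over average- and max-pooled features.
        att = torch.sigmoid(self.mlp(x.mean(dim=(2, 3))) + self.mlp(x.amax(dim=(2, 3))))
        x = x * att.view(b, c, 1, 1)
        # Spatial attention: 7x7 conv over channel-wise mean and max maps.
        s = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))
```

In an ASPP branch, an ordinary 3 × 3 atrous convolution would simply be swapped for DepthwiseSeparableConv with the same dilation rate, and a CBAM block would be appended after the feature extraction and ASPP outputs.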
In addition, to verify the robustness of DeepMDSCBA, the influence of rot degree, rot position, apple variety, and background complexity on the performance of apple image semantic segmentation was extensively studied in this paper.
The rest of this article is organized as follows. Section 2 describes the main idea of the DeepMDSCBA method. Section 3 describes the design and setup of our experiments. Section 4 analyzes the results of the comparative experiments. Finally, Section 5 summarizes the work.
3. Experiments
This section introduces the design of the experiments used to test the performance of DeepMDSCBA, the method proposed in this paper. First, the hardware and software configuration of the experiments is introduced; then, the dataset construction and preprocessing required for the experiments are described and the network hyperparameters are established. The designs of the robustness and ablation experiments are introduced last.
3.1. Hardware and Software Configuration
The experiments in this paper used the PyTorch deep learning framework to train and test the performance of the DeepMDSCBA method. The specific configuration of the experiments is shown in Table 2.
3.2. Data Acquisition and Preprocessing
The images in the datasets used for the experiments were mainly obtained from the internet and supplemented with image synthesis, and they were divided into training and test sets with no duplicate images between the two. The training and test sets were classified according to apple variety, degree of rot, position of rot, and background complexity. The training set was enhanced through random rotation, the addition of noise, and mirroring to increase the sample size and improve the model's generalization ability before the model was trained. Furthermore, the apple images in the datasets were labeled with the graphical labeling software LabelMe [23] to generate JSON files. Although DeepLabV3+ does not limit image parameters such as resolution, the images in the datasets were uniformly converted to a resolution of 512 × 512 with a bit depth of 24 and then stored in the PASCAL VOC [24] data format, which allowed the performance of DeepMDSCBA to be compared with that of various other methods.
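As a hedged illustration of the preprocessing described above (random rotation, noise, and mirroring on 512 × 512 inputs), a torchvision-style pipeline might look as follows; the rotation range, noise level, and flip probability are assumptions, since the paper does not state them, and in actual segmentation training the geometric transforms must be applied jointly to the image and its label mask (only the image side is sketched here).

```python
import torch
from torchvision import transforms

def add_gaussian_noise(img, std=0.02):
    """Additive Gaussian pixel noise on a [0, 1] tensor image (std is assumed)."""
    return (img + torch.randn_like(img) * std).clamp(0.0, 1.0)

train_transform = transforms.Compose([
    transforms.Resize((512, 512)),           # uniform 512 x 512 resolution
    transforms.RandomRotation(degrees=30),   # random rotation (assumed range)
    transforms.RandomHorizontalFlip(p=0.5),  # mirroring
    transforms.ToTensor(),
    transforms.Lambda(add_gaussian_noise),   # random noise
])
```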
3.3. Dataset Partitioning
3.3.1. Dataset of Apple Images with Different Rot Degrees
This dataset consists of apple images with different rot degrees, where the degree is expressed as the proportion of the rotten area. All images in the dataset show apples in the front view. Due to the small number of naturally rotten apple images and the difficulty in controlling the degree and position of rot, image synthesis technology was used, in addition to the naturally rotten apple images, to synthesize rotten parts in apple images so that images with different rotten areas could be obtained for the experiments. According to the ratio of the rotten area to the entire apple area, the images in the dataset were divided into five sub-datasets with rotten-area proportions of (0, 20%], (20%, 40%], (40%, 60%], (60%, 80%], and (80%, 100%]. Some samples are shown in Figure 4.
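The five intervals amount to a simple binning rule on the rot-area proportion; a small illustrative helper (not from the paper) makes the half-open intervals explicit:

```python
def rot_degree_bin(rot_ratio: float) -> str:
    """Map a rot-area proportion in (0, 1] to its sub-dataset label."""
    bins = [(0.2, "(0, 20%]"), (0.4, "(20%, 40%]"), (0.6, "(40%, 60%]"),
            (0.8, "(60%, 80%]"), (1.0, "(80%, 100%]")]
    for upper, label in bins:
        if rot_ratio <= upper:
            return label
    raise ValueError("rot_ratio must lie in (0, 1]")
```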
3.3.2. Dataset of Apple Images with Different Rot Positions
This dataset consists of apple images in three different views: the fruit stalk view, the fruit calyx view, and the view with neither fruit stalk nor fruit calyx. In order to unify the standard, apple images with a rotten-area proportion in (0, 40%] were selected, and the dataset was divided into sub-datasets of the three abovementioned views to explore the influence of different rot positions on the apple semantic segmentation results. Some samples are shown in Figure 5.
3.3.3. Dataset of Apple Images of Different Varieties
This dataset consists of images of different varieties of apples. Images of four common apple varieties (Golden Delicious, Fuji, Honey Crisp, and Red Delicious) were selected for the experiments to explore the influence of apple variety on the segmentation results. Some samples are shown in Figure 6.
3.3.4. Dataset of Apple Images with Complex Backgrounds
This dataset is mainly composed of images of multiple apples in the natural state. The images show apples against relatively complex backgrounds with branches and leaves, which were used to explore the effect of complex backgrounds on the semantic segmentation of apple images using DeepMDSCBA. Some samples of the dataset are shown in Figure 7.
3.4. Evaluation Indicators
In the experiments, MIoU (Mean Intersection over Union) and PA (Pixel Accuracy) were used as the evaluation indicators for apple image segmentation to analyze the segmentation performance.
- (1) Pixel Accuracy (PA)
PA is the ratio of correctly predicted pixels to total pixels. The calculation formula is as follows:
$$PA = \frac{\sum_{i=0}^{k} p_{ii}}{\sum_{i=0}^{k} \sum_{j=0}^{k} p_{ij}}$$
In the formula, $k + 1$ denotes the total number of categories, $p_{ij}$ denotes the number of pixels that belong to class $i$ but are predicted to belong to class $j$, $p_{ii}$ denotes the number of correctly predicted pixels, and $p_{ij}$ and $p_{ji}$ denote false positive and false negative results, respectively.
- (2) Mean Intersection over Union (MIoU)
MIoU is the most commonly used metric in semantic segmentation experiments. For each class, the ratio of the intersection to the union of the sets of real and predicted pixels is computed, and MIoU is the average of this ratio over all classes. The calculation formula is as follows:
$$MIoU = \frac{1}{k + 1} \sum_{i=0}^{k} \frac{p_{ii}}{\sum_{j=0}^{k} p_{ij} + \sum_{j=0}^{k} p_{ji} - p_{ii}}$$
In the formula, $k + 1$ denotes the total number of categories, $p_{ij}$ denotes the number of pixels that belong to class $i$ but are predicted to belong to class $j$, $p_{ii}$ denotes the number of correctly predicted pixels, and $p_{ij}$ and $p_{ji}$ denote false positive and false negative results, respectively.
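Both indicators can be computed directly from a confusion matrix whose entry p[i, j] counts pixels of true class i predicted as class j, matching the formulas above. The following NumPy sketch (illustrative, not the authors' code) shows the computation for a two-class apple/background example:

```python
import numpy as np

def confusion_matrix(gt, pred, num_classes):
    """p[i, j] = number of pixels of true class i predicted as class j."""
    idx = num_classes * gt.reshape(-1) + pred.reshape(-1)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def pixel_accuracy(p):
    return np.diag(p).sum() / p.sum()

def mean_iou(p):
    inter = np.diag(p)                              # p_ii
    union = p.sum(axis=1) + p.sum(axis=0) - inter   # row sum + column sum - p_ii
    return np.mean(inter / np.maximum(union, 1))    # guard against empty classes

# Example with background = 0 and apple = 1:
gt   = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 1], [1, 1]])
p = confusion_matrix(gt, pred, num_classes=2)
print(pixel_accuracy(p))  # 0.75
print(mean_iou(p))        # (1/2 + 2/3) / 2 ≈ 0.583
```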
3.5. Experimental Scheme
3.5.1. Determination of Training Parameters
For the original DeepLabV3+ method, with an initial learning rate of 0.007 and a batch size of 16, the mean intersection-over-union values of the method on the PASCAL VOC2012 and Cityscapes [25] datasets were 89.1% and 83.2%, respectively, achieving good segmentation results. On this basis, according to commonly used empirical values of network training hyperparameters and after repeated testing, the network hyperparameters of DeepMDSCBA used in the experiments were established. They are shown in Table 3.
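DeepLab-family networks are conventionally trained with a "poly" learning-rate decay, lr_t = lr_0 · (1 − t/T)^0.9; whether Table 3 adopts this schedule is an assumption here, but a sketch of it in PyTorch, starting from the 0.007 initial rate mentioned above, would be:

```python
import torch

model = torch.nn.Conv2d(3, 2, 3)  # placeholder for the real segmentation network
optimizer = torch.optim.SGD(model.parameters(), lr=0.007, momentum=0.9)
total_iters, power = 30000, 0.9   # assumed iteration budget
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda t: (1.0 - t / total_iters) ** power)

for t in range(total_iters):
    # ... forward pass, loss.backward(), optimizer.step() go here ...
    scheduler.step()
```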
3.5.2. Test Scheme
In order to test the performance of the proposed DeepMDSCBA method in an apple image segmentation task, it was compared with the traditional semantic segmentation methods FCN, SegNet, PSPNet, UNet [26], and DeepLabV3+. MIoU and PA were selected as indicators to test the segmentation performance of each method.
To test the segmentation efficiency of each method, the training time, single-image prediction time, memory occupancy, and parameter quantity were selected as indicators, as sketched below.
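Two of these indicators are straightforward to measure in PyTorch; the following hedged sketch (the model is a placeholder, and timing details such as warm-up counts are assumptions) shows how the parameter quantity and single-image prediction time can be obtained:

```python
import time
import torch

def count_parameters(model):
    """Total number of learnable parameters."""
    return sum(p.numel() for p in model.parameters())

@torch.no_grad()
def predict_time_ms(model, device):
    """Single-image forward-pass time in milliseconds on a 512 x 512 input."""
    model = model.to(device).eval()
    x = torch.randn(1, 3, 512, 512, device=device)
    for _ in range(5):                      # warm-up runs before timing
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) * 1000.0
```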
In order to test the generalization ability of DeepMDSCBA and verify its robustness, segmentation and comparison experiments were performed on the constructed training and test sets, which comprised datasets of apple images with different levels of rot, different rot positions, different apple varieties, and complex backgrounds.
In order to verify the effectiveness of the design choices of DeepMDSCBA, namely adopting a more lightweight network (MobileNet) than the original feature extraction network (Xception), changing the ordinary convolution in the ASPP module to DSC, and adding CBAM to the feature extraction module and the ASPP module, the following ablation experiments were performed on the total test set (a configuration sketch follows the list):
- (1) DeepM: Based on the traditional DeepLabV3+ network, the feature extraction network was changed to the more lightweight MobileNetV2 network.
- (2) DeepMDS: On the basis of DeepM, the ordinary convolution in the ASPP module was changed to DSC.
- (3) DeepMCBA: On the basis of DeepM, CBAM was added to the feature extraction and ASPP modules.
- (4) DeepMDSCBA1: On the basis of DeepMDS, CBAM was added only to the feature extraction module.
- (5) DeepMDSCBA2: On the basis of DeepMDS, CBAM was added only to the ASPP module.
- (6) DeepMDSCBA: On the basis of DeepMDS, CBAM was added to both the feature extraction and ASPP modules; this is the method proposed in this paper.
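These six variants differ only in three switches; an illustrative configuration table (the flag names and builder function are hypothetical, not the authors' code) summarizes the scheme:

```python
ABLATIONS = {
    # name:        (dsc_in_aspp, cbam_in_backbone, cbam_in_aspp)
    "DeepM":       (False, False, False),
    "DeepMDS":     (True,  False, False),
    "DeepMCBA":    (False, True,  True),
    "DeepMDSCBA1": (True,  True,  False),
    "DeepMDSCBA2": (True,  False, True),
    "DeepMDSCBA":  (True,  True,  True),
}
# All variants share the MobileNetV2 backbone, e.g. (hypothetical builder):
# model = build_variant(backbone="mobilenetv2", *ABLATIONS["DeepMDSCBA"])
```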
3.5.3. Dataset Configuration
The training set and the test set contained completely different pictures with no intersection.
The training set comprised a dataset of 212 images of fully healthy apples of different varieties without any rot, a dataset of 240 images of apples with different degrees of rot, a dataset of 180 images of apples with different positions of rot, and a dataset of 216 images of apples with complex backgrounds. The training set details are shown in Table 4.
For the three experiments related to the unseen cases described in Section 4.3.5, the training set for each experiment did not contain the corresponding case in its test set; the unseen cases were a rot degree of (40%, 60%], a rot position in the calyx view, and the Honey Crisp variety, respectively, as can be seen in Table 5.
The test set was divided into four subsets: a dataset of 120 images of fully healthy apples of different varieties without any rot, a dataset of 200 images of apples with different degrees of rot, a dataset of 90 images of apples with different rot positions, and a dataset of 50 images of apples with complex backgrounds. The union of all test subsets formed the total test set. There were no repeated pictures in the test set, and no apple image appeared multiple times. The details of the test set are shown in Table 5.
4. Results and Analysis
4.1. Performance of Segmentation
In order to verify the segmentation performance of the DeepMDSCBA model, the model trained on the training set was used to perform segmentation tests on the previously divided total test set. In the experiment, the FCN, SegNet, PSPNet, UNet, and DeepLabV3+ methods were compared with the proposed DeepMDSCBA method. Some segmentation results are shown in Figure 8.
It can be seen in Figure 8 that, compared with the other five methods, DeepMDSCBA showed the highest degree of recognition of the edges of apples with complex backgrounds in the images, as well as fewer omissions and misclassifications, especially for the rotten parts of apples.
Using MIoU and PA as indicators, the segmentation performance of DeepMDSCBA and the other five methods on apple images was analyzed, and the results are shown in Table 6.
It can be seen in Table 6 that the MIoU of DeepMDSCBA was 87.1%, which was 5.5%, 3.1%, 3.5%, 4.1%, and 3.4% higher than that of FCN, SegNet, PSPNet, UNet, and DeepLabV3+, respectively. The PA of DeepMDSCBA was 95.3%, which was 5.7%, 3.2%, 3.8%, 4.4%, and 3.1% higher than that of FCN, SegNet, PSPNet, UNet, and DeepLabV3+, respectively. These results show that the adoption of CBAM in DeepMDSCBA improved the feature extraction ability and the segmentation accuracy across the various apples in the image test set, further proving the performance of the method.
4.2. Efficiency of Segmentation
It can be seen in Table 7 that the training time of DeepMDSCBA was 3.52 h, which was 33%, 21%, 18%, and 21% shorter than that of FCN, PSPNet, UNet, and DeepLabV3+, respectively. The single-image prediction time of DeepMDSCBA was 32 ms, which was 42%, 15%, 26%, and 42% faster than that of FCN, PSPNet, UNet, and DeepLabV3+, respectively. DeepMDSCBA occupied 7.1 GB of memory, which was 20%, 10%, 12%, 18%, and 8% less than FCN, SegNet, PSPNet, UNet, and DeepLabV3+, respectively. The number of parameters of DeepMDSCBA was 22.6 MB, which was 89%, 23%, 49%, 76%, and 84% lower than that of FCN, SegNet, PSPNet, UNet, and DeepLabV3+, respectively. However, compared with the SegNet method, the training time of DeepMDSCBA was almost 6% longer and its single-image prediction time was 7% longer. This is because SegNet has a simpler structure and fewer parameters than the proposed method [27], which reduces its training time and single-image prediction time. However, it can be seen in Figure 8 and Table 6 that the detection accuracy of SegNet was not as good as that of DeepMDSCBA. In general, because DeepMDSCBA uses the lightweight MobileNet as its feature extraction network and changes the ordinary convolution in the ASPP module to DSC, it showed an improved calculation speed compared with the other tested methods.
4.3. Robustness Verification
In order to test the robustness of DeepMDSCBA, the segmentation performance of our DeepMDSCBA model trained using the training set on the four apple test sets was analyzed in comparison with the other five methods.
4.3.1. Segmentation Performance of Apple Images with Different Rot Degrees
Segmentation experiments were performed on the test set of apple images with different rot degrees, using MIoU and PA as indicators. The comparison results are shown in Table 8.
It can be seen in Table 8 that the MIoU values of DeepMDSCBA on the test sets of apple images with rot degrees of (0, 20%], (20%, 40%], (40%, 60%], (60%, 80%], and (80%, 100%] were 86.7%, 84.8%, 83.9%, 84.4%, and 85.1%, respectively, and the PA values were 94.4%, 92.5%, 92.2%, 92.3%, and 93.4%, respectively, which were higher than those of FCN, SegNet, PSPNet, UNet, and DeepLabV3+. Thus, the segmentation performance of DeepMDSCBA was better than that of the other tested models.
In addition, it can be seen in Table 8 that for all tested methods, the segmentation performance first decreased and then increased as the proportion of the rotten area increased. The PA and MIoU of each method gradually decreased until the proportion of the rotten area reached the interval of (40%, 60%] and then rose again. When the proportion of the rotten area was in the interval of (80%, 100%], the segmentation effect was similar to that for the interval of (0, 20%].
The analysis showed that as the rotten area gradually spread over the apple peel, the normal area of the apple became irregular and its boundary with the rotten area became more difficult to distinguish, resulting in a gradual decrease in segmentation accuracy. When the blackened area spread over almost the entire apple, the overall contour of the apple could again be clearly distinguished from the background color, so the segmentation accuracy increased again.
4.3.2. Segmentation Performance of Apple Images with Different Rot Positions
Segmentation experiments were carried out on the test set of apple images with different rot positions, using MIoU and PA as indicators. The results of the performance comparison are shown in Table 9.
It can be seen in Table 9 that the MIoU of DeepMDSCBA on the test sets of apple images with the view without the stalk or calyx, the calyx view, and the stalk view was 86.7%, 84.6%, and 84.7%, respectively, and the corresponding PA was 94.4%, 92.6%, and 92.8%, respectively. These values were higher than those of FCN, SegNet, PSPNet, UNet, and DeepLabV3+, thus proving that the segmentation performance of DeepMDSCBA was better than that of the other methods.
In addition, it can be seen in Table 9 that the PA and MIoU values of DeepMDSCBA on the test set of apple images with the view without the stalk or calyx were higher than those on the test sets with the calyx view and the stalk view. This is because the presence of the calyx or the stalk had a certain negative impact on the segmentation effect.
4.3.3. Segmentation Performance of Apple Images of Different Varieties
Segmentation experiments were performed on the test set of apple images of different varieties, using MIoU and PA as indicators. The comparison results are shown in Table 10.
It can be seen in Table 10 that the MIoU values of DeepMDSCBA on the test sets of apple images of the Golden Delicious, Fuji, Honey Crisp, and Red Delicious varieties were 87.4%, 87.6%, 87.2%, and 86.8%, respectively, and the corresponding PA values were 95.4%, 95.6%, 94.9%, and 94.7%, respectively; these values were higher than those of FCN, SegNet, PSPNet, UNet, and DeepLabV3+.
In addition, it can be seen in Table 10 that there were no significant differences among the PA and MIoU values of different apple varieties using the same segmentation method, indicating that apple variety had little effect on segmentation performance.
4.3.4. Segmentation Performance of Apple Images with Complex Backgrounds
Segmentation experiments were carried out on the test set of apple images with complex backgrounds, using MIoU and PA as indicators. The comparison results are shown in Table 11.
On the test set of apple images with complex backgrounds, the MIoU and PA of DeepMDSCBA were 86.8% and 94.4%, respectively, which were the highest of the tested methods. The segmentation effect of DeepMDSCBA was better than that of FCN, SegNet, PSPNet, UNet, and DeepLabV3+, showing improved segmentation accuracy on complex backgrounds.
4.3.5. Segmentation Performance for Unseen Cases
To further verify the robustness of DeepMDSCBA, the following three experiments were carried out. The test set for each experiment consisted of a certain case listed in Table 5, and the training set for the experiment consisted of the other cases. The segmentation performance of DeepMDSCBA was analyzed in comparison with the other five methods.
As can be seen in Table 5, apple images with a rot degree of (40%, 60%] were used as the test set, and apple images with rot degrees of (0, 20%], (20%, 40%], (60%, 80%], and (80%, 100%] were used in the training set for this experiment. The results of the performance comparison are shown in Table 12.
It can be seen in Table 12 that the MIoU and PA of DeepMDSCBA were 84.1% and 91.7%, respectively, which were again the highest of the tested methods. This means that for apple images with an unseen rot degree, the segmentation effect of DeepMDSCBA was better than that of FCN, SegNet, PSPNet, UNet, and DeepLabV3+.
For the positions of rot, apple images with the calyx view were used as the test set, and apple images with the other two views (the fruit stalk view and the view without the stalk or calyx) were used in the training set for this experiment. The results of the performance comparison are shown in Table 13.
The experimental results showed that the MIoU and PA of DeepMDSCBA were 84.7% and 92.3%, respectively, which were also the highest of the tested methods. The segmentation effect of DeepMDSCBA was better than that of FCN, SegNet, PSPNet, UNet and DeepLabV3+ for apple images with an unseen rot position.
For apple varieties, images of the Honey Crisp variety were used as the test set, and images of the Golden Delicious, Fuji, and Red Delicious varieties were used as the training set for this experiment. The results of the performance comparison are shown in Table 14.
According to the experimental results, the MIoU and PA of DeepMDSCBA were 85.9% and 93.6%, respectively, which were still the highest of the tested methods. The segmentation effect of DeepMDSCBA was better than that of FCN, SegNet, PSPNet, UNet and DeepLabV3+ for apple images of an unseen variety.
A comparison of the experimental results of each method for each test set showed that the PA and MIoU of DeepMDSCBA were higher than those of the other tested methods, which proved that the segmentation accuracy and effect of the proposed method on all kinds of test sets were improved. At the same time, it was verified that the method had a strong generalization ability and robustness.
4.4. Results of Ablation Experiments
Ablation experiments were carried out according to the ablation experiment scheme described in Section 3.5.2, with MIoU and PA used as indicators. The results of the experiments are shown in Table 15.
According to the ablation experiment results in Table 15, DeepMDSCBA had improved MIoU and PA values compared with those of DeepLabV3+, DeepM, DeepMDS, DeepMCBA, DeepMDSCBA1, and DeepMDSCBA2. In particular, DeepM outperformed DeepLabV3+, indicating that replacing the backbone network with MobileNetV2 improved the segmentation accuracy of the method to a certain extent. At the same time, the MIoU and PA of DeepMCBA, DeepMDSCBA1, DeepMDSCBA2, and DeepMDSCBA were better than those of DeepM, indicating that adding CBAM to the feature extraction module or the ASPP module could improve the segmentation accuracy of the method.
Furthermore, DeepMDSCBA, the method proposed in this paper, showed greater improvements than DeepMDSCBA1 and DeepMDSCBA2, indicating that adding CBAM to the feature extraction and ASPP modules at the same time could further improve the segmentation accuracy. DeepMDSCBA had the highest MIoU and PA values, which proved the effectiveness of the method.
Furthermore, the training and segmentation efficiencies of the ablation methods were compared, and the results are shown in Table 16.
It can be seen in Table 16 that the training time and single-image prediction time of the DeepM, DeepMDS, DeepMCBA, DeepMDSCBA1, DeepMDSCBA2, and DeepMDSCBA methods were all reduced compared with those of DeepLabV3+, indicating that adopting the more lightweight MobileNetV2 network could shorten the training time and improve the prediction speed.
The training time and single-image prediction time of the DeepMDS, DeepMDSCBA1, DeepMDSCBA2, and DeepMDSCBA methods were shorter than those of DeepLabV3+ and DeepM, indicating that changing the ordinary convolution in the ASPP module to DSC could further shorten the training time and improve the prediction speed.
The training time and single-image prediction time of DeepMDSCBA were the lowest of the studied methods. According to the ablation experiments, the DeepMDSCBA method proposed in this paper reduced the computational complexity of the network, shortened the training time, and improved the segmentation accuracy.
5. Conclusions
A network segmentation method based on deep learning, DeepMDSCBA, was proposed in this paper. The method combines DeepLabV3+ with the optimized lightweight MobileNetV2 network and uses DSC to replace the ordinary convolution in the ASPP module, which effectively reduces the number of parameters and improves the calculation speed. CBAM was added to the feature extraction module and the ASPP module to better restore the edge information of objects, improve the feature extraction ability of the method, and reduce omissions and misclassifications. DeepMDSCBA was shown to extract apple areas in images more effectively than the other tested methods. Its PA on the whole apple image dataset reached 95.3% and its MIoU reached 87.1%, demonstrating more efficient and accurate segmentation of apple images compared with the other tested methods, even for images of rotten apples and apples with complex backgrounds.
By comparing it with five other semantic segmentation methods on test sets of different apple rot degrees, rot positions, apple varieties, and complex backgrounds, the robustness of DeepMDSCBA was fully verified. It was also shown that DeepMDSCBA performed better than the other tested methods under the influence of factors such as the degree of rot, position of rot, apple variety, and background complexity.
Although DeepMDSCBA segmented images of rotten apples and apples with complex backgrounds faster and more accurately than the other tested methods, its segmentation of hidden areas was not accurate for apples partially occluded by leaves or branches in more complex situations. It is therefore necessary to construct a relevant dataset and conduct further experimental research in the future.