Cucumber Picking Recognition in Near-Color Background Based on Improved YOLOv5
Abstract
1. Introduction
- BiFPN is introduced to perform multi-scale feature fusion more efficiently, and the effect of two different feature fusion operations, Concat and Add, on model performance is compared (a minimal sketch of the two fusion variants follows this list). To aid interpretation of the results, the Gradient-weighted Class Activation Mapping (Grad-CAM) method is employed to provide a visual explanation for each model.
- The C3CA module is added to enhance feature extraction for cucumber shoulder detection, and a parameter-free hybrid module based on an energy function, the C3SimAM module, is designed. Five hybrid modules, namely C3CA, C3CBAM, C3SE, C3ECA, and C3SimAM, are compared (the CA and SimAM attention operations underlying these hybrids are sketched after this list).
- The Ghost module is added to reduce inference time and floating-point computation, making the model lightweight and easier to deploy on devices with limited computing power (a Ghost convolution sketch also follows this list).
- The contributions of BiFPN, the C3CA module, and the Ghost module to model performance are verified through ablation experiments, and the functional compatibility of the three modules is analyzed. The results are also compared with several mainstream single-stage detection models to validate the performance advantage of the YOLOv5s-S model.
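To make the Concat vs. Add comparison concrete, below is a minimal PyTorch sketch of the two fusion variants; module and parameter names are illustrative, not the authors' exact implementation. Both learn normalized fusion weights, but BiFPN_Add sums the weighted inputs element-wise while BiFPN_Concat stacks them along the channel dimension.

```python
# Minimal sketch of the two BiFPN fusion variants (illustrative names, not the paper's code).
import torch
import torch.nn as nn


class BiFPN_Add2(nn.Module):
    """Fuse two same-shaped feature maps with normalized learnable weights (element-wise add)."""

    def __init__(self):
        super().__init__()
        self.w = nn.Parameter(torch.ones(2), requires_grad=True)
        self.eps = 1e-4

    def forward(self, x0, x1):
        w = torch.relu(self.w)
        w = w / (w.sum() + self.eps)          # fast normalized fusion
        return w[0] * x0 + w[1] * x1


class BiFPN_Concat2(nn.Module):
    """Weight each input, then concatenate along the channel dimension."""

    def __init__(self, dim=1):
        super().__init__()
        self.w = nn.Parameter(torch.ones(2), requires_grad=True)
        self.eps = 1e-4
        self.dim = dim

    def forward(self, x0, x1):
        w = torch.relu(self.w)
        w = w / (w.sum() + self.eps)
        return torch.cat([w[0] * x0, w[1] * x1], dim=self.dim)


if __name__ == "__main__":
    a = torch.randn(1, 128, 40, 40)
    b = torch.randn(1, 128, 40, 40)
    print(BiFPN_Add2()(a, b).shape)      # torch.Size([1, 128, 40, 40])
    print(BiFPN_Concat2()(a, b).shape)   # torch.Size([1, 256, 40, 40])
```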
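For reference, the two attention operations behind the C3CA and C3SimAM hybrids can be sketched as follows, following the published Coordinate Attention and SimAM formulations. The paper embeds these inside the C3 block; class names and hyperparameters here are illustrative assumptions, not the authors' exact code.

```python
# Sketch of Coordinate Attention (CA) and SimAM, the operations wrapped by C3CA / C3SimAM.
import torch
import torch.nn as nn


class CoordAtt(nn.Module):
    """Coordinate Attention: pool along H and W separately, then build direction-aware channel weights."""

    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # keeps height, collapses width
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # keeps width, collapses height
        self.conv1 = nn.Conv2d(channels, mid, 1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        x_h = self.pool_h(x)                           # (n, c, h, 1)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)       # (n, c, w, 1)
        y = self.act(self.bn1(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                      # (n, c, h, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (n, c, 1, w)
        return x * a_h * a_w


class SimAM(nn.Module):
    """Parameter-free attention: weight each neuron by an energy term based on its deviation from the channel mean."""

    def __init__(self, e_lambda=1e-4):
        super().__init__()
        self.e_lambda = e_lambda

    def forward(self, x):
        n = x.shape[2] * x.shape[3] - 1
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
        v = d.sum(dim=(2, 3), keepdim=True) / n
        e_inv = d / (4 * (v + self.e_lambda)) + 0.5
        return x * torch.sigmoid(e_inv)


if __name__ == "__main__":
    x = torch.randn(1, 128, 40, 40)
    print(CoordAtt(128)(x).shape)   # torch.Size([1, 128, 40, 40])
    print(SimAM()(x).shape)         # torch.Size([1, 128, 40, 40])
```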
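The lightweighting idea can likewise be illustrated with a minimal Ghost convolution sketch: a smaller primary convolution produces part of the output channels, and a cheap depthwise convolution generates the remaining "ghost" feature maps. Kernel sizes and names below are assumptions in the spirit of GhostNet and the GhostConv used in YOLOv5 variants.

```python
# Minimal Ghost convolution sketch (illustrative, not the authors' exact implementation).
import torch
import torch.nn as nn


class GhostConv(nn.Module):
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        c_half = c_out // 2
        # Primary convolution: produces half of the output channels.
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, c_half, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_half),
            nn.SiLU(),
        )
        # Cheap operation: depthwise 5x5 conv generates the remaining "ghost" channels.
        self.cheap = nn.Sequential(
            nn.Conv2d(c_half, c_half, 5, 1, 2, groups=c_half, bias=False),
            nn.BatchNorm2d(c_half),
            nn.SiLU(),
        )

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)


if __name__ == "__main__":
    conv = GhostConv(64, 128, k=3, s=1)
    print(conv(torch.randn(1, 64, 80, 80)).shape)   # torch.Size([1, 128, 80, 80])
```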
2. Materials and Methods
2.1. Acquisition and Annotation of Images
2.2. The Improved Model
2.2.1. YOLOv5s-S Model
2.2.2. The Algorithm Principle of YOLOv5
2.2.3. The CA Module
2.2.4. Ghost Module
2.2.5. BiFPN
2.3. Environment Construction and Evaluation Indicators
2.3.1. Environment Construction
2.3.2. Evaluation Indicators
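The indicators reported in the results tables follow their standard definitions, where TP, FP, and FN denote true positives, false positives, and false negatives, and N is the number of classes:

```latex
P   = \frac{TP}{TP + FP}, \qquad
R   = \frac{TP}{TP + FN}, \qquad
F1  = \frac{2 \cdot P \cdot R}{P + R}, \qquad
AP  = \int_{0}^{1} P(R)\,\mathrm{d}R, \qquad
mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i
```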
3. Results
3.1. Comparison of Feature Fusion
3.2. Comparison of Attention Mechanisms
3.3. Ablation Experiment
3.4. Comparison of Traditional Network Models
3.5. Model Detection for Test Set
4. Discussion
- The improved model will be used in a subsequent keypoint study to localize cucumber picking points at the pixel level in the 2D plane and to experimentally verify the accuracy of picking-point localization. Deployment on low-computing-power platforms, such as the Jetson Nano, will be pursued in future work.
- The YOLOv5s-S model will serve as a basis for research on cucumber yield prediction, growth detection, and maturity detection. Its performance will be further exploited to strengthen technical support for the intelligent cultivation of cucumbers.
5. Conclusions
- The YOLOv5s-S model with BiFPN_Concat outperformed both the YOLOv5s model with PAN and the YOLOv5s-S model with BiFPN_Add, as demonstrated through Grad-CAM visualization. It allowed better extraction of cucumber shoulder features and improved the robustness of the detection algorithm in multi-scale and near-color situations.
- The C3CA module outperformed the C3SE, C3CBAM, and C3ECA modules, and also surpassed the parameter-free C3SimAM module in detection accuracy. The C3CA module paid better attention to the cucumber shoulder.
- All three improvements contributed to the model's performance, as verified by the ablation experiments, and the three modules showed a degree of mutual compatibility.
- The YOLOv5s-S model surpassed lightweight models like YOLOv7-tiny and YOLOv8s in terms of performance.
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
Hardware | Configuration | Environment | Version |
---|---|---|---|
System | Windows 10 | Python | 3.8.5 |
CPU | AMD Ryzen 7 5700X | PyTorch | 1.10.0 |
GPU | RTX 3060 Ti | labelImg | 1.8.6 |
RAM | 16 GB | CUDA | 11.0 |
Hard disk | 1.5 TB | cuDNN | 8.4.0 |
Model | mAP (%) | F1 (%) | Parameters (×10⁶) | FLOPs (G) | Size (MB) |
---|---|---|---|---|---|
C3SE | 86.0 | 83.4 | 5.9 | 13.6 | 11.6 |
C3ECA | 86.6 | 84.0 | 5.9 | 13.6 | 11.7 |
C3CBAM | 86.3 | 83.4 | 5.9 | 13.7 | 11.7 |
C3SimAM | 85.8 | 83.0 | 5.1 | 12.8 | 10.2 |
C3CA | 87.6 | 84.7 | 5.8 | 13.5 | 11.6 |
Model | C (C3CA) | G (Ghost) | B (BiFPN) | P (%) | R (%) | mAP (%) | Parameters (×10⁶) | FLOPs (G) | Size (MB) |
---|---|---|---|---|---|---|---|---|---|
YOLOv5s | × | × | × | 86.5 | 78.4 | 83.9 | 7.0 | 15.8 | 13.8 |
IPV 1 | √ | × | × | 84.5 | 81.0 | 86.4 | 7.0 | 15.8 | 14.0 |
IPV 2 | × | √ | × | 86.6 | 78.6 | 86.2 | 5.8 | 13.4 | 11.5 |
IPV 3 | × | × | √ | 84.1 | 81.3 | 85.5 | 7.0 | 16.0 | 13.9 |
IPV 4 | × | √ | √ | 85.5 | 79.5 | 85.4 | 5.9 | 13.6 | 11.6 |
IPV 5 | √ | × | √ | 89.9 | 79.7 | 87.7 | 7.1 | 16.1 | 14.0 |
IPV 6 | √ | √ | × | 88.8 | 81.6 | 87.3 | 5.9 | 13.9 | 11.6 |
IPV 7 | √ | √ | √ | 88.5 | 81.2 | 87.6 | 5.8 | 13.5 | 11.6 |
Model | P (%) | R (%) | mAP (%) | F1 (%) | Parameters (×10⁶) | FLOPs (G) | Size (MB) |
---|---|---|---|---|---|---|---|
YOLOv3-tiny | 82.6 | 74.1 | 76.7 | 78.1 | 8.7 | 12.9 | 16.6 |
YOLOv4-tiny | 63.4 | 86.5 | 84.0 | 73.1 | 5.9 | 16.2 | 23.6 |
YOLOv7-tiny | 82.3 | 79.1 | 83.4 | 80.7 | 6.0 | 13.2 | 12.3 |
YOLOv8s | 86.0 | 80.1 | 85.4 | 82.9 | 11.1 | 28.4 | 21.5 |
YOLOv5s | 86.5 | 78.4 | 83.9 | 82.3 | 7.0 | 15.8 | 13.8 |
YOLOv5m | 82.5 | 79.5 | 84.9 | 80.9 | 20.9 | 49.7 | 40.3 |
YOLOv5x | 83.7 | 79.5 | 83.9 | 81.5 | 86.2 | 203.8 | 165.1 |
YOLOv5l | 86.1 | 80.1 | 86.1 | 83.0 | 47.4 | 115.7 | 91.0 |
YOLOv5s-S | 88.5 | 81.2 | 87.6 | 84.7 | 5.8 | 13.5 | 11.6 |