1. Introduction
Computer vision has been extensively integrated into crop sowing, and more recently, the combination of computer vision and deep learning techniques has found innovative applications, especially in seeding detection [1]. The quality of sowing profoundly influences crop growth; optimal sowing allows crops to access water, sunlight, and nutrients more efficiently during their growth phase [2]. This is particularly true for wheat, whose yield and quality are directly contingent upon the sowing quality. Detecting aspects such as the sowing quantity, uniformity, dispersion, and accuracy is critical during the wheat-sowing process. Real-time, accurate monitoring can drastically reduce seed wastage and mitigate the occurrence of poor seedlings [3]. Before seeders enter mass production, they must undergo performance testing. Conducting these tests in a controlled indoor environment can expedite the product’s test and production cycle [4]. Thus, evaluating a seeder’s sowing performance is crucial to refining its structural design and optimizing its sowing quality.
Traditional sowing test benches are heavily dependent upon manual labor, with sowing performance assessments often requiring manual calculations in the subsequent stages. This detection approach is inefficient and fails to meet the performance detection demands of large-scale seeding equipment. Consequently, advanced sowing detection mechanisms, tailored to assess seeder performance more efficiently, have been continuously developed. Chen [5] optimized a seed-flow-sensing device using the refraction of a double-convex lens and designed a precision sowing monitoring system compatible with the sensing device; this system can detect seed re-sowing and missed sowing. Lu et al. [6] incorporated an intermittent automatic sampling mechanism to devise a test bench for assessing the sowing performance of strip-sowing seeders, achieving timely and evenly spaced automatic sampling along with sowing uniformity detection. Tang [7] designed a seed monitoring system for a maize precision seeder, built around an STM32 microcontroller, that can detect the maize seeding status. Chen [8] introduced a monitoring method for the sowing rate of a rapeseed planter utilizing a linear CCD sensor, achieving the precise measurement of the seeding rate. However, most of the detection equipment mentioned above primarily addresses issues such as repeated and missed sowing, and does not offer real-time, accurate analysis of parameters such as the sowing quantity, uniformity, dispersion, and accuracy.
Currently, the predominant methods for seed quality detection are photoelectric, capacitive, and image-based. Photoelectric sowing detection operates on the principle that the movement of a seed between a transmitter and receiver changes the light reaching the receiver, subsequently altering the electrical signal. This method, however, suffers reduced detection accuracy when seeds overlap, and suboptimal seeding environments can further degrade its performance [9,10,11,12,13,14]. Capacitive seeding detection is grounded on the concept that the equivalent dielectric constant of the capacitive sensor changes when seeds pass through. Yet, this method is less responsive to single seeds, making it unsuitable for the precise counting of small seed quantities, and it displays a high output impedance with a subpar load capacity [15,16,17,18,19]. Machine vision detection captures real-time seed data using an industrial camera, which forwards this information to an image-processing center; the center then analyzes individual seed data to evaluate the seeding performance. It is worth noting that the machine vision approach necessitates a conducive detection environment, which is typically found in laboratory settings [20,21,22,23,24,25,26]. The aforementioned methods cannot achieve the online detection of the uniformity, accuracy, and dispersion of the seeder. This study seeks to address this gap, targeting swift indoor wheat seeder detection. The proposed solution encompasses image capture, wheat seed identification, and performance detection workflows. By using an improved YOLOv5s model, adding an attention mechanism and replacing the original upsampling method, wheat seed recognition can be achieved with high speed and high accuracy. The positions of the seeds are then located to achieve the online detection of the uniformity, accuracy, and dispersion of the seeder.
In this study, we selected the YOLOv5s model and enhanced it by adding four CBAM (Convolutional Block Attention Module) attention mechanism modules to its backbone. The CBAM attention mechanism module allows the network to adapt through weight allocation and information filtering, extracting more relevant information from feature information during training. Additionally, we replaced the upsampling method of the neck layer in the original YOLOv5s with CARAFE (Content-Aware ReAssembly of FEatures) upsampling. This method has a large receptive field, enabling the enhanced utilization of surrounding information and the improved recognition of important features in wheat seed images. Furthermore, CARAFE upsampling is lightweight. We deployed the improved YOLOv5s model on a device for the online detection of the seeding performance, successfully meeting detection requirements.
2. Materials and Methods
2.1. Experimental Apparatus
The experimental sowing bench (Intelligent Agricultural Equipment Laboratory of Yangzhou University, Yangzhou, China) is depicted in Figure 1. It primarily comprises an external groove wheel seeder, a seed box, a seed-dropping conveyor belt, a conveyor-belt-speed-regulating motor, a camera, and a bracket. The seeder is fixed on the bracket, and the seeds fall onto the conveyor belt. The travel speed of the implement is simulated by adjusting the conveyor belt speed, while the sowing rate can be modified by altering the seeder speed. The specific parameters of the experimental bench are presented in Table 1, and the camera parameters are detailed in Table 2.
2.2. Dataset Preparation
According to the running speed of the seed-dropping conveyor belt, the shooting interval is set to 2 s, which ensures that, at the set conveyor belt speed, the photos obtained by the camera neither repeat nor overlap. When performing sowing performance detection, the conveyor belt is started first, and the seeder is started after the belt has begun to rotate stably. The seeder then meters the seeds from the seed box onto the stably rotating conveyor belt.
The seed images were collected at the Intelligent Agricultural Equipment Laboratory of Yangzhou University. The shooting equipment parameters are shown in Table 2. Images were collected under various conditions, including strong light, weak light, normal, blurred, and mixed conditions; collecting images under these varied conditions improves the equipment’s ability to detect the sowing quality in different environments. After screening, 1520 seed images were selected.
The Labelme (MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA) annotation tool was used to annotate the 1520 images. The seeds in the images were framed with horizontal rectangles and labeled with the tag “seed”. The annotations were then converted to txt files. The training and validation sets were divided in a 4:1 ratio, with 1216 images in the training set and 304 images in the validation set. To verify that the division of the dataset had no impact on the experimental results, and to reduce overfitting to a certain extent, 5-fold cross-validation was used: the augmented data were divided into five parts, with four parts used for training and one part used for validation in each fold.
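As a minimal illustration of this split (the file names are hypothetical placeholders, not the paper's actual data), the 4:1 division and the 5-fold rotation can be sketched as follows:

```python
import random

random.seed(0)
images = [f"img_{i:04d}.jpg" for i in range(1520)]  # hypothetical file names
random.shuffle(images)

# 4:1 split: 1216 training images, 304 validation images
train, val = images[:1216], images[1216:]

# 5-fold cross-validation: five equal parts, each fold validates on one part
folds = [images[i::5] for i in range(5)]
for k in range(5):
    val_k = folds[k]
    train_k = [img for j, fold in enumerate(folds) if j != k for img in fold]
```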
2.3. Sowing Performance Evaluation Indicator
This paper uses the sowing uniformity, accuracy, and dispersion as the sowing performance evaluation indicators of the seeder. The sowing uniformity U refers to the uniformity of the seeds falling on the conveyor belt. Five images of the seed belt, each with a resolution of 1920 × 1080 pixels, were taken continuously on the conveyor belt.
U can be expressed as follows:

U = \left(1 - \frac{S}{\bar{X}}\right) \times 100\%, \quad S = \sqrt{\frac{1}{z-1}\sum_{i=1}^{z}\left(X_i - \bar{X}\right)^2}

where X is the number of wheat seeds in the image; X̄ is the mean value of X; S is the standard deviation of X; and z represents the number of images.
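A minimal sketch of this computation, assuming the coefficient-of-variation form of U reconstructed above (the example counts are hypothetical):

```python
import numpy as np

def sowing_uniformity(seed_counts):
    """U from the per-image seed counts X_1..X_z, one count per captured frame."""
    x = np.asarray(seed_counts, dtype=float)
    s = x.std(ddof=1)                      # sample standard deviation over z images
    return (1.0 - s / x.mean()) * 100.0

print(sowing_uniformity([48, 52, 50, 49, 51]))  # counts from five 1920 x 1080 frames
```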
The sowing accuracy μ is the ratio of the number of seeds falling into the seed groove to the total number of seeds discharged. A general seeder opens a seed groove 50 mm wide [27], and the width of the conveyor belt of the sowing experimental bench is 200 mm. Through the accurate calibration of the camera position, the relationship between the sowing drill width and the image pixels is shown in Figure 2. The sowing accuracy in this paper therefore refers to the ratio of the number of seeds falling within the pixel range of 405~675 in the y direction to the total number of seeds in the image.
μ can be expressed as follows:

\mu = \frac{m_1}{m_2} \times 100\%

where m1 is the number of seeds in the specified pixel range in the image, and m2 is the total number of seeds in the image. The larger the μ value, the higher the accuracy of the seeds falling into the seed groove, which is more conducive to precise sowing.
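For illustration, assuming the seed positions are available as (x, y) box centers in image pixels (the function name and defaults simply mirror the definition above):

```python
def sowing_accuracy(centers, y_min=405, y_max=675):
    """mu = m1 / m2: share of detected seeds whose center lies in the groove band."""
    m2 = len(centers)
    m1 = sum(1 for _, y in centers if y_min <= y <= y_max)
    return 100.0 * m1 / m2 if m2 else 0.0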
The sowing dispersion V refers to the degree of dispersion between the seeds that fall into the seed groove. The sowing dispersion affects the later growth of the crop: the higher the dispersion, the better the development of the crop root system. The sowing dispersion in this paper refers to the degree to which the seeds are dispersed within the width of the seed groove. The seeds that fall within the pixel range of 405~675 in the y direction are located by the algorithm and recorded, and the distance yi between each seed and the straight line at pixel value 405 in the y direction is calculated.
V can be expressed as follows:

V = \frac{S_y}{\bar{y}} \times 100\%, \quad S_y = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}

where ȳ is the mean of yi; Sy is the standard deviation of yi; and n is the number of seeds that fall within the pixel range of 405~675 in the y direction. The larger the V value, the higher the degree of sowing dispersion.
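Under the same assumptions as the accuracy sketch, and assuming the coefficient-of-variation form of V reconstructed above, the dispersion can be computed as:

```python
import numpy as np

def sowing_dispersion(centers, y_ref=405, y_max=675):
    """V over in-band seeds, from distances y_i to the groove edge line y = 405."""
    d = np.array([y - y_ref for _, y in centers if y_ref <= y <= y_max], dtype=float)
    if d.size < 2 or d.mean() == 0:
        return 0.0
    return 100.0 * d.std(ddof=1) / d.mean()
```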
2.4. Seed Counting and Coordinate Positioning
The detected target boxes labeled “seed” are counted to obtain an accurate seed count. By calculating the center coordinates of the target boxes, the coordinate positions of the seeds are obtained.
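For illustration, assuming detections are returned as (x1, y1, x2, y2) corner boxes, as in typical YOLOv5 output, counting and center positioning reduce to:

```python
def count_and_locate(boxes):
    """Count "seed" boxes and convert each (x1, y1, x2, y2) box to its center."""
    centers = [((x1 + x2) / 2.0, (y1 + y2) / 2.0) for x1, y1, x2, y2 in boxes]
    return len(centers), centers
```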
2.5. Improved YOLOv5s Model
Object detection models fall into two broad categories. Consider, for instance, the R-CNN model: it first generates target regions, ensuring accuracy and recall, and then classifies the samples. This sequence characterizes it as a two-stage model. While such models possess high accuracy, they are comparatively slow, which led to their exclusion from this study. In contrast, one-stage models, without the need to pre-generate target regions, can directly determine the final object category probability and its associated position coordinates in the image. This direct approach renders them faster, making them apt for lightweight deployment. The YOLO series stands as a representative example of such networks.
At present, mainstream target detection work largely revolves around the YOLOv5 series. This series, in comparison to its predecessors, exhibits significant alterations: there is an enhanced focus on small-target detection, coupled with noteworthy improvements in speed and accuracy. YOLOv5 offers different models to choose from. In this paper, since there is only one detection category (wheat seeds), and considering the need for real-time detection and easy deployment, the YOLOv5s model, which has fewer parameters and computations, is used as the base model. The YOLOv5s target detection model is divided into four parts, as shown in Figure 3. The input layer is responsible for passing the image into the model. The backbone network of YOLOv5s uses the CSPDarkNet53 structure, which extracts features from the image. The neck layer is responsible for fusing the features extracted by the backbone network. The detect layer makes predictions at three different feature scales to obtain the predicted class and position information from the network [28,29,30,31,32,33,34].
2.5.1. Adding CBAM Attention Mechanisms to Backbone
The attention mechanism strengthens the adaptability of the network by assigning weights and filtering information, so that more useful information is extracted from the feature information during model training. The CBAM attention mechanism is composed of a channel attention module and a spatial attention module, which jointly process the input feature layer, as shown in Figure 4. In this paper, a CBAM attention mechanism module is added after each C3 module in the backbone, as shown in Figure 5; these modules make the YOLOv5s model better at extracting features. In the channel attention module, the input feature layer is subjected to global average pooling and global maximum pooling, respectively. The results of the two pooling operations are processed by a shared fully connected layer, and the outputs of the fully connected layer are then added and passed through the Sigmoid activation function. This channel attention module extracts stronger representations than a single pooling method because it compresses the spatial dimensions of the input feature map using both pooling results. The input to the spatial attention module is formed by weighting the initial input feature map with the feature map output by the channel attention module, and the spatial attention module focuses on which parts of the image information are more important. Therefore, after passing through the CBAM module, the generated feature map can highlight key target wheat information.
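A minimal PyTorch sketch of the CBAM block described above (the reduction ratio of 16 and the 7 × 7 spatial kernel follow common CBAM defaults and are assumptions here, not the paper's stated settings):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Shared MLP (1x1 convolutions) applied to both pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        # Compress spatial dimensions with both average and max pooling.
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1))
        mx = self.mlp(F.adaptive_max_pool2d(x, 1))
        return torch.sigmoid(avg + mx)          # per-channel weights

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        # Pool along the channel axis, then learn where to attend spatially.
        avg = x.mean(dim=1, keepdim=True)
        mx, _ = x.max(dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    """Channel attention followed by spatial attention, as in Figure 4."""
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)   # reweight channels
        x = x * self.sa(x)   # reweight spatial locations
        return x
```

In this setting, one CBAM(channels) block would be appended after each C3 module of the backbone.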
2.5.2. Replacing the Upsampling Method with CARAFE Upsampling
The YOLOv5s model employs the nearest neighbor interpolation upsampling algorithm to increase the feature map’s resolution. While this method is computationally efficient and straightforward, it can compromise the detection accuracy to an extent. The CARAFE upsampling technique consists of two primary components: upsampling kernel prediction and feature reassembly, as illustrated in Figure 6. In the upsampling kernel prediction part, the input H × W × C feature map is first compressed to H × W × Cm using a 1 × 1 convolution, which effectively reduces the amount of computation. Content encoding and upsampling kernel prediction are then performed with convolution operations that change the number of channels from Cm to σ² × k_up², where σ is the upsampling rate and k_up × k_up is the size of the upsampling kernel. The channels are then unfolded in the spatial dimension, the prediction results are normalized with SoftMax, and the feature map is passed into the feature reassembly part, where the features on each layer of the feature map are multiplied by the predicted upsampling kernel to obtain the upsampling result. CARAFE upsampling has a large receptive field and can make good use of surrounding information. At the same time, the upsampling kernel and feature map are semantically related, and upsampling is performed based on the input content. In addition, CARAFE upsampling is lightweight and does not introduce a large number of parameters and computations. This study improves the YOLOv5s model by using CARAFE upsampling instead of the original upsampling method, as shown in Figure 7; thus, the ability of the equipment to recognize important features during upsampling is improved.
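The following PyTorch sketch illustrates the two CARAFE stages under stated assumptions (Cm = 64, σ = 2, k_up = 5, and a 3 × 3 content encoder are illustrative choices, not necessarily the paper's exact settings):

```python
import torch.nn as nn
import torch.nn.functional as F

class CARAFE(nn.Module):
    """Sketch of CARAFE: kernel prediction, then content-aware feature reassembly."""
    def __init__(self, c, c_mid=64, scale=2, k_up=5, k_enc=3):
        super().__init__()
        self.scale, self.k_up = scale, k_up
        # 1) Channel compression: C -> Cm, reducing the encoder's computation.
        self.compress = nn.Conv2d(c, c_mid, 1)
        # 2) Content encoding: Cm -> sigma^2 * k_up^2 reassembly kernel channels.
        self.encode = nn.Conv2d(c_mid, (scale * k_up) ** 2, k_enc, padding=k_enc // 2)

    def forward(self, x):
        b, c, h, w = x.shape
        # Predict kernels, unfold channels into space, normalize with SoftMax.
        kernels = self.encode(self.compress(x))         # B x (s^2*k^2) x H x W
        kernels = F.pixel_shuffle(kernels, self.scale)  # B x k^2 x sH x sW
        kernels = F.softmax(kernels, dim=1)
        # Gather each location's k_up x k_up neighborhood of input features.
        patches = F.unfold(x, self.k_up, padding=self.k_up // 2)
        patches = patches.view(b, c * self.k_up ** 2, h, w)
        # Map each output location to its source neighborhood, then reassemble.
        patches = F.interpolate(patches, scale_factor=self.scale, mode="nearest")
        patches = patches.view(b, c, self.k_up ** 2, h * self.scale, w * self.scale)
        return (patches * kernels.unsqueeze(1)).sum(dim=2)
```

In the improved model, this module would stand in for each nn.Upsample step of the neck layer.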
4. Discussion
In this study, we introduce an enhanced model derived from YOLOv5s, specifically designed to detect the sowing performance of seed planters. Our proposed model presents a viable alternative to traditional and less efficient manual inspection methods. Using a wheat planter as our primary test subject, we established a specialized sowing inspection platform. By emulating real-world production conditions, we sought to bolster the robustness of our approach. We curated a dataset comprising images of wheat seeds captured under diverse environmental conditions. After refining and training our model on this dataset, it demonstrated an ability to precisely identify wheat seeds across varying environmental conditions in images, fulfilling the requirements of comprehensive sowing performance evaluation.
In this study, the CBAM attention mechanism module was integrated into the original YOLOv5s model. The incorporation of the CBAM module enhanced the network’s ability to extract and identify object features. By embedding the CBAM attention mechanism into the backbone network, we observed a noticeable improvement in the overall object recognition performance.
Furthermore, within the neck layer of the original YOLOv5s model, the upsampling method was replaced with CARAFE upsampling. CARAFE upsampling offers a more expansive receptive field compared to conventional upsampling techniques and introduces fewer parameters, aligning well with our pursuit of model lightweighting. With the introduction of the CARAFE upsampling method in our study, we anticipate an enhanced recognition performance for wheat seed identification.
While the proposed model demonstrates promising results in detecting wheat seeds, improves on both the model size and detection speed, and has been successfully implemented on the sowing test platform, there is still room for enhancement in the finer details of the model’s network structure. Such refinements could further elevate the network’s recognition capabilities.
In future research, the effectiveness of sowing should be considered from multiple perspectives. For instance, seed overlapping and occlusion can impact the degree of overlap between the predicted bounding boxes and the actual ground truth, subsequently influencing the recognition performance. Introducing other evaluation metrics, such as the Intersection over Union (IoU) value, would be beneficial. The scope should not be limited solely to wheat seeds; expanding the dataset to include other crops will facilitate research on different seeds and help to apply the methodology to sowing detection for a broader range of crops.