1. Introduction
Rice, one of China’s most vital food crops, plays a crucial role in ensuring food security through its high quality and yield [1,2,3,4,5]. During the reproductive period of rice, paddy weeds compete with the crop for soil nutrients and water, depleting water and fertilizer resources. Weed infestations can also promote the development and spread of pests and diseases [6]. Paddy weeds have therefore become a significant biological hazard that reduces both the yield and quality of rice [7,8,9]. The effectiveness of weed control is thus closely tied to rice yield.
At present, rice management still follows the mode of “two closures and one extermination”. The first closure refers to the initial control of weeds by spraying herbicide before transplanting; the second closure refers to a further herbicide application at the regreening stage of the rice; the “one extermination” refers to treating the remaining weeds at the tillering stage to eliminate them [10]. During the “one extermination” period, indiscriminate pesticide spraying is usually used. At this time, Sagittaria trifolia is present in large numbers as plants with small leaves, constituting small detection targets. Moreover, rice is at the tillering stage, and weeds and rice typically obscure one another in an irregular, dense distribution. Some Sagittaria trifolia plants are also highly similar in appearance to rice at this stage, resulting in low weed detection accuracy against complex backgrounds [11]. In addition, the large exposed water surface during this period requires the model to remain accurate and efficient while resisting interference from sunlight reflected off the water.
Crude, large-area spraying of herbicides is costly and has low pesticide utilization. The resulting herbicide residues can contaminate the paddy field aquatic environment, harming ecosystems [12,13], and can affect human health through other media [14]. Variable-rate application by plant protection UAVs can reduce pesticide usage while maintaining effective weed control, in line with the development of green agriculture [15,16]; accurate weed detection is the first step toward accurate application. In variable-rate application, a UAV first detects weeds from a high altitude, the weed distribution in each plot is then obtained from the detection results to formulate the application strategy and draw variable-rate prescription maps, and finally variable spraying is performed by a low-altitude UAV flight according to the prescription maps, i.e., the mode of “high-altitude accurate detection—prescription map generation—low-altitude variable spraying” [17]. Efficient and accurate detection of weeds in rice paddies, as the first link in this chain, is therefore crucial.
At the tillering stage, the exposed water area of the rice field is still large, and sunlight reflects off the water while the UAV captures images. A large proportion of the Sagittaria trifolia at this time consists of plants with small leaves. In addition, Sagittaria trifolia in this period is often obscured by rice leaves, and some plants are similar in appearance to rice. All of these factors make weed detection difficult. UAVs offer high maneuverability and flexibility and collect high-resolution, high-quality images, which supports tasks such as weed species detection. Suitable types of UAVs can be selected according to the actual application scenario, providing strong support for weed control and agricultural production.
In recent years, UAV remote sensing has been used for weed detection. According to the characteristics of the onboard sensors and the data processing methods, weed detection methods can be divided into those based on multispectral imaging [18] and those based on RGB digital imaging [19,20,21]. Spectral imaging combines spectral and image techniques to acquire spectral feature information and map it to spatially located pixel points [22]. The WSRI index was proposed to construct a deep convolutional neural network evaluation model, achieving detection accuracies of 81.1% for barnyard grass and 92.4% for downy leaves in paddy fields [23]. However, multispectral sensors lack application feasibility due to their high cost, complex feature extraction, and complicated data processing for plants against cluttered backgrounds [24].
RGB images acquired by UAVs have high spatial resolution [25] and can be used for accurate detection of weed targets by traditional machine vision-based methods and by deep convolutional neural network-based methods [26]. Huang extracted the color and texture features of paddy field weeds and obtained the best results with a BP classifier at a scale parameter of 100, with an average intersection-over-union of 68.7% and an accuracy of 83.6% [27]. Sheng Zhu et al. used AdaBoost and achieved a combined detection accuracy of 90.25% on 100 × 100 pixel rice weed images [28]. However, these algorithms suffer from low detection accuracy and efficiency, require repeated manual trials for feature selection, and their classification results depend on how well the chosen features perform [29].
Deep learning models can extract key features from complex data through multi-level nonlinear transformations, which greatly improves the accuracy of image recognition [30,31]. A UNet with ResNet34 as the feature extractor was used to detect weed images with motion blur in sorghum fields, achieving an accuracy of 93.01% [32]. Cai et al. improved PSPNet by inserting an ECA module into its SPP layer, and the network achieved an accuracy of 86.18% for detecting weeds in pineapple fields [33]. Jie Kang et al. proposed a weed detection algorithm combining feature enhancement and multi-scale fusion modules, reaching an accuracy of 88.4% for detecting weeds in sugar beet fields [34].
Although the detection accuracy in the above studies can meet the needs of variable-rate spraying, model efficiency has not been sufficiently considered, and weed spraying must often be completed in a timely manner, especially for large paddy field areas [20]. Given this, some scholars have investigated lightweight models. Qingkuan Meng et al. lightened and improved the SSD model, achieving a precision of 88.27% and a detection speed of 32.26 FPS on a 480 × 720 pixel corn weed dataset [35]. These researchers enhanced the model structure to balance detection accuracy and speed, resulting in notable performance improvements. However, such studies have targeted the detection of single weed species with little background interference; in actual farmland environments, the appearance of the same weed class can vary considerably with the growing environment.
In recent years, one-stage models, represented by the YOLO series of algorithms, have extracted features directly through a convolutional network and generated bounding boxes on the predicted feature map while classifying and regressing them, satisfying both accuracy and efficiency requirements from the input image to the final prediction [36]. Guzel et al. used YOLOv5-small to detect weeds in wheat fields at the anthesis stage with 81.0% accuracy [37]. Tetila et al. applied YOLOv5 and YOLOv4 to soybean weed data collected by UAVs; the YOLOv5 model was only about one-tenth the size of YOLOv4 while its accuracy was on par with YOLOv4 [38].
In actual agricultural production, weed control demands both timeliness and accuracy to achieve good control effects [39]. The single-stage structure of this series of models avoids the complex region proposal generation step and significantly reduces computation, making it possible to deploy them on embedded devices with limited computing power [40].
YOLOv10 is currently the latest model in the YOLO series. Some scholars have used YOLOv10n, YOLOv10s, and YOLOv10m to grade yellow cauliflower, achieving mAP50 values of 80.0%, 83.3%, and 83.8%, respectively, with computational costs of 8.2, 24.5, and 63.4 GFLOPs [41]. Although the mAP50 of YOLOv10s is 3.3 percentage points higher than that of YOLOv10n, its computational cost is nearly three times as large, which is unsuitable for embedded devices. This indicates that YOLOv10n is better suited to agricultural equipment with limited computing power. Therefore, this study applies the YOLOv10n model to weed detection in UAV rice field images.
Many excellent YOLO-based improved models have emerged. For example, integrating a genetic algorithm into the YOLO model to optimize its hyperparameters showed strong stability and effectiveness in real-time detection of distribution networks, with a mAP50 of 92.2% and an F1 score of 0.867 [42]. YOLOu-Quasi-ProtoPNet was proposed, combining YOLOv5 for object detection with Quasi-ProtoPNet for classification; various deep learning structures were tried to optimize the classification part, and the model outperformed similar approaches [43].
To improve the detection of small underwater targets, FasterNet was integrated into YOLOv8; evaluation on the URPC2021 dataset showed that the improved model achieved an average accuracy of 84.7%, with FasterNet effectively enhancing the model’s ability to capture detailed features of small targets [44]. The novel semantic segmentation model EGCN proposed by Yang et al. uses CGBlock as the base module and U-Net as the baseline, integrating contextual features and local spatial features at both high and low levels and improving the accuracy of landslide identification [45]. Lu et al. used DySample to replace the up-sampling module in YOLOv8 to address low image quality; experimental results showed that the enhanced algorithm achieved significant improvements in precision, recall, and mAP50 compared to the original model [46]. Zhou et al. used shared convolution to drastically reduce the number of parameters in deep learning models while maintaining model performance [47].
To address the difficulties of weed detection in paddy fields, this study first replaces the YOLOv10n backbone network to improve the model’s feature extraction capability. Second, the up-sampling and down-sampling modules and the conventional convolution module in the neck are improved to enhance feature representation. Then, a lightweight detection head is proposed to reduce the number of parameters and the computational cost, thereby improving inference speed. Finally, based on the detection results of the improved YOLOv10n for weeds in the paddy field, a variable-rate herbicide spraying strategy is developed, and UAV variable-rate prescription maps are generated and discussed.
2. Materials and Methods
2.1. Image Acquisition
The field trial site is located in the rice experimental field at Shengli Village, Aji Town, Tieling City, Liaoning Province, as shown in Figure 1. The altitude is 55 m, and the area has a temperate continental monsoon climate with an average annual precipitation of 678 mm. The soil is nutrient-rich with high fertility and a pH between 6 and 7, and the field is flood-irrigated. The paddy field is planted with rice varieties including Fuhe 258 and Meifeng 336. Sagittaria trifolia is the dominant weed in the area, with a large and uneven distribution.
The experiment was conducted in this demonstration area on 24 June 2024, when most of the Sagittaria trifolia plants were at the 1–3 leaf stage. A DJI M300 drone (manufactured by DJI, Shenzhen, China) equipped with a high-resolution digital camera with a Zenith P1 lens (manufactured by DJI, Shenzhen, China) was used as the remote sensing platform. The drone’s horizontal hovering accuracy was ±0.1 m, its vertical hovering accuracy was ±0.3 m, and it could withstand winds of up to force 4. The digital camera had 45 million effective pixels and a resolution of 8192 × 5640 pixels. Images were acquired from 11:00 to 13:00, and the weather during the test period was clear with weak or light winds.
Orthophoto images of the test area were collected with the UAV flying at a height of 22 m. Digital images taken at this height not only avoid airflow disturbance and meet the requirements of image registration and fusion, but also provide the resolution needed to capture Sagittaria trifolia; in short, this height balances flight efficiency and detection accuracy. The total area of the demonstration area was 300 acres, and two representative fields were randomly selected as Field 1 and Field 2. Both experimental fields were 12 acres in size, totaling 24 acres. To ensure that the UAV images could be successfully registered and fused, the UAV and camera parameters were set so that adjacent photos had an 80% overlap. Image alignment and fusion of the collected paddy field remote sensing images were performed using DJI Terra (v4.3.0, manufactured by DJI, Shenzhen, China).
2.2. Image Labeling and Dataset Construction
When the UAV captures images of a rice field, adjacent images share an 80 percent overlap area. If data annotation were performed directly on the original images, the large overlapping regions between adjacent images would not only cause data redundancy but could also lead to inconsistent or omitted annotations due to human error, and such inconsistencies and omissions would seriously affect the accuracy of the algorithm. To prevent this, the study first used DJI Terra for image alignment and fusion of the weed UAV remote sensing data collected in 2024.
To match the computer’s processing capacity and to speed up model training convergence, the aligned and fused image was cut into non-overlapping 640 × 640 pixel sub-images. A total of 1000 remote sensing images were obtained for Field 1 after cutting and 1008 for Field 2, for a final total of 2008 images. The aligned, fused, and cropped images are shown in Figure 2. At the tillering stage, weeds are not evenly distributed, growing contiguously in dense areas and singly in sparse areas. Therefore, during manual labeling, Sagittaria trifolia was classified into two categories, single plants and contiguous patches, as shown in Figure 3.
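The tiling step above can be reproduced with a few lines of Python. The sketch below is a minimal illustration, assuming the fused orthophoto is available as a single large image file; the file names and the tile-naming scheme are illustrative and not those used in this study.

```python
from pathlib import Path
from PIL import Image

Image.MAX_IMAGE_PIXELS = None  # the fused orthophoto far exceeds PIL's default pixel limit

def tile_orthophoto(src_path, out_dir, tile=640):
    """Cut a fused orthophoto into non-overlapping tile x tile sub-images.

    Edge regions smaller than a full tile are discarded so every sub-image
    fed to the network has the same 640 x 640 size.
    """
    img = Image.open(src_path).convert("RGB")
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    w, h = img.size
    count = 0
    for top in range(0, h - tile + 1, tile):
        for left in range(0, w - tile + 1, tile):
            crop = img.crop((left, top, left + tile, top + tile))
            crop.save(out / f"{Path(src_path).stem}_{top}_{left}.jpg")
            count += 1
    return count

# Example (illustrative file names):
# n = tile_orthophoto("field1_orthophoto.tif", "field1_tiles")
```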
Manual annotation was performed using the image annotation tool LabelImg (v1.8.6). There were three label categories: ridge (ridge), single Sagittaria trifolia (single), and contiguous Sagittaria trifolia (multiple). The .txt files storing the labeled box information were used to produce the YOLO-format dataset, as shown in Figure 2.
Field 1 and Field 2 yielded 503 and 614 images containing Sagittaria trifolia or ridges, respectively, together with the corresponding labels, for a total of 1117 images. To prevent data leakage, the dataset was divided into training, validation, and test sets at a 7:2:1 ratio: 782 images for training, 223 for validation, and 112 for testing. To avoid overfitting caused by the small number of samples and to enhance the model’s generalization ability while keeping the split ratio unchanged, the images and labels of the training, validation, and test sets were each subjected to data augmentation operations such as random translation and rotation, flipping, shear transformation, random noise, random brightness, random cropping, and simulated sun flare. The dataset was thus enlarged to three times its original size, yielding a final dataset of 2346 training images, 670 validation images, and 336 test images. The number of samples in the dataset is shown in Table 1.
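As an illustration of the augmentation step, the following sketch uses the albumentations library to apply bounding-box-aware versions of the listed operations; the specific transforms, parameters, and probabilities are assumptions for demonstration, not the exact settings used in this study.

```python
import albumentations as A

# One augmentation pass per image; running the pipeline three times per image
# triples the dataset while keeping the YOLO-format boxes consistent.
augment = A.Compose(
    [
        A.Affine(translate_percent=0.1, rotate=(-15, 15), shear=(-10, 10), p=0.7),  # translation / rotation / shear
        A.HorizontalFlip(p=0.5),                                                    # flipping
        A.GaussNoise(p=0.3),                                                        # random noise
        A.RandomBrightnessContrast(brightness_limit=0.2, p=0.5),                    # random brightness
        A.RandomSizedBBoxSafeCrop(height=640, width=640, p=0.3),                    # random cropping that keeps boxes
        A.RandomSunFlare(flare_roi=(0, 0, 1, 0.5), p=0.2),                          # simulated sun flare on the water
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_ids"]),
)

# boxes are [x_center, y_center, w, h] normalized to [0, 1], as in YOLO .txt labels
# out = augment(image=image, bboxes=boxes, class_ids=class_ids)
```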
2.3. YOLOv10n Baseline Model and Pre-Test
The paddy field environment is complex, so a pre-test was first conducted to identify the shortcomings of the model before making targeted improvements.
YOLOv10 currently comes in n, s, m, l, and x versions, with increasing size and parameter count in that order. The overall design is divided into three parts: the backbone network (Backbone), the neck network (Neck), and the detection head (Head). The structure of the YOLOv10n network is shown in Figure 4. The Backbone is responsible for feature extraction and consists of three main modules: Conv, C2f, and SPPF. The Neck fuses the feature maps of different sizes extracted by the Backbone so that shallow features are combined with deeper ones. The Head uses a decoupled head to process the classification, regression, and confidence tasks separately and then merges the outputs.
Due to the limited computing power of the equipment in the agricultural environment, a pre-test was conducted on the field weed dataset using the YOLOv10n base model.
Figure 5 presents the test results. The model failed to accurately detect small target weeds when the weed target was too small. Weeds obscured by rice were also not accurately detected, nor were weeds that looked too similar to rice. In addition, the limited resources of the UAV equipment constrain the allowable model size, parameter count, and computation.
2.4. Construction of the Improved Weed Detection Model YOLOv10n-FCDS
The pre-test revealed that although YOLOv10n showed strong adaptability, it did not perform well on problems such as large numbers of small-target weeds, weeds and rice obscuring each other, and high similarity between weeds and rice. Therefore, the following improvements were made to YOLOv10n:
(1) FasterNet replaces the backbone network to reduce redundant computation and improve computational efficiency while improving the model’s feature extraction for small target weeds. (2) CGBlock replaces the regular convolution and SCDown down-sampling modules in the neck network to improve the detection of weeds that overlap with or are obscured by rice by focusing on surrounding and global contextual information. (3) The DySample up-sampling module is introduced; its point-based sampling strategy improves the algorithm’s resistance to interference and its ability to identify weeds that are similar in appearance to rice. (4) A lightweight detection head, SCSD-Head, relying on shared convolution and scale scaling, is designed to further reduce the parameter count and computation, reduce the model size, and improve inference speed. The revised model architecture is displayed in Figure 6.
2.4.1. Backbone Feature Extraction Network Design
The backbone, serving as the model’s feature extraction module, plays a crucial role in the detection results. Therefore, the first improvements were made to the backbone network.
The need to increase UAV operating efficiency requires weeds to be detected from higher altitudes. Because of the flight height and the weeding period, most weeds are small in size, as shown in Figure 5d. A large amount of irrelevant information may therefore be captured during feature extraction, and as the network deepens, the number of feature map channels increases, attaching more redundant information to the multi-channel feature maps. This not only affects the detection speed of the model but also reduces its detection accuracy.
To solve this problem, this study replaces the backbone with the FasterNet network. The FasterNet structure effectively reduces redundant computation and memory access while fully extracting spatial features. The replacement backbone uses PConv, which differs from conventional convolution by applying standard convolution for feature extraction to only a subset of the input channels while keeping the remaining feature map channels intact. This approach exploits the redundancy in the feature map and further reduces the computational cost. The FasterNet backbone built on PConv is shown in Figure 7; it contains four hierarchical stages. Before each stage, there is either an embedding layer (a regular 4 × 4 convolution with a stride of 4) or a merging layer (a regular 2 × 2 convolution with a stride of 2), used for spatial down-sampling and channel expansion. The PConv layers in the FasterNet modules effectively reduce the redundant information generated by the smaller Sagittaria trifolia targets during feature extraction.
The FasterNet Block consists of one 3 × 3 PConv module and two 1 × 1 Conv modules. The initial target features are first extracted by PConv, feature extraction is then enhanced by the two convolutions, and finally the deep-level features are combined with the input features through a residual connection. The working principle of PConv is shown in Figure 7.
PConv [48] takes a feature map of dimension h × w × c as input and extracts features by applying a filter only to c_p of the channels. The remaining channels are kept unchanged, the processed channels are concatenated with the unprocessed channels, and the output is mapped to the same dimensions as the input, so the channel count is preserved. PConv substantially reduces redundant computation while preserving the original number of channels, resulting in higher computational speed than conventional convolution. Therefore, FasterNet was used as the primary feature extraction backbone to make the algorithm more accurate for detecting smaller Sagittaria trifolia.
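A minimal PyTorch sketch of PConv and the FasterNet Block described above is given below; the partial-channel ratio and the expansion factor of the two 1 × 1 convolutions are illustrative assumptions rather than the exact values used in this study.

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    """Partial convolution: a 3x3 conv is applied only to the first c_p = c / ratio
    channels; the remaining channels pass through untouched."""
    def __init__(self, channels, ratio=4):
        super().__init__()
        self.c_p = channels // ratio
        self.conv = nn.Conv2d(self.c_p, self.c_p, kernel_size=3, padding=1, bias=False)

    def forward(self, x):
        x1, x2 = torch.split(x, [self.c_p, x.size(1) - self.c_p], dim=1)
        return torch.cat((self.conv(x1), x2), dim=1)  # same channel count as the input

class FasterNetBlock(nn.Module):
    """PConv 3x3 followed by two 1x1 convs, with a residual connection."""
    def __init__(self, channels, expansion=2):
        super().__init__()
        hidden = channels * expansion
        self.pconv = PConv(channels)
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False),
        )

    def forward(self, x):
        return x + self.mlp(self.pconv(x))  # fuse deep features with the input features

# x = torch.randn(1, 64, 80, 80); FasterNetBlock(64)(x).shape -> torch.Size([1, 64, 80, 80])
```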
2.4.2. Neck Down-Sampling Module Design
The neck of the model is mainly responsible for the fusion of features at different scales, and through better feature fusion, the model can detect and localize objects of different sizes more accurately. Therefore, the down-sampling module of the neck, which is responsible for filtering important features, was improved.
In the complex paddy field environment, weeds and rice often shade each other, as shown in Figure 5e. If the model only focuses on the feature information within the labeled box, detections are often missed or the detection boxes have low confidence. The model is therefore required to observe the surrounding context and the global context while focusing on local information. Accordingly, in this study, the CGBlock from CGNet was used to replace the regular convolution and the SCDown down-sampling module in the YOLOv10n neck (Neck).
The CGBlock includes a local feature extractor f_loc(∗), a surrounding context extractor f_sur(∗), a joint feature extractor f_joi(∗), and a global context extractor f_glo(∗), as depicted in Figure 8.
CGBlock operates in two main stages. First, local features and the corresponding surrounding context are learned by f_loc(∗) and f_sur(∗), respectively. f_loc(∗) is instantiated as a standard 3 × 3 convolutional layer that learns local features from the 8 neighboring feature vectors. Because dilated convolution has a comparatively large receptive field and can efficiently capture the surrounding context, f_sur(∗) is instantiated as a 3 × 3 dilated convolutional layer. f_joi(∗) then obtains the joint features from the outputs of f_loc(∗) and f_sur(∗); it is designed as a concatenation layer followed by Batch Normalization (BN) and Parametric ReLU (PReLU) operators. In the second stage, f_glo(∗) extracts the global context to refine the joint features. The global context serves as a weighting vector and is used for channel-wise refinement of the joint features, emphasizing useful components and suppressing useless ones. f_glo(∗) is instantiated as a global average pooling layer that aggregates the global context, which is then further processed by a multilayer perceptron. Finally, a scale layer re-weights the joint features with the extracted global context.
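The following PyTorch sketch illustrates the CGBlock structure described above (f_loc, f_sur, f_joi, and f_glo); the channel layout, dilation rate, and reduction ratio are assumptions for demonstration and may differ from the original CGNet implementation.

```python
import torch
import torch.nn as nn

class CGBlock(nn.Module):
    """Context Guided block: local 3x3 conv (f_loc), 3x3 dilated conv (f_sur),
    concatenation + BN + PReLU (f_joi), and a global-context channel
    re-weighting branch (f_glo)."""
    def __init__(self, in_ch, out_ch, dilation=2, reduction=16):
        super().__init__()
        half = out_ch // 2
        self.reduce = nn.Sequential(nn.Conv2d(in_ch, half, 1, bias=False),
                                    nn.BatchNorm2d(half), nn.PReLU(half))
        self.f_loc = nn.Conv2d(half, half, 3, padding=1, groups=half, bias=False)
        self.f_sur = nn.Conv2d(half, half, 3, padding=dilation, dilation=dilation,
                               groups=half, bias=False)
        self.f_joi = nn.Sequential(nn.BatchNorm2d(out_ch), nn.PReLU(out_ch))
        self.f_glo = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                   nn.Conv2d(out_ch, out_ch // reduction, 1),
                                   nn.ReLU(inplace=True),
                                   nn.Conv2d(out_ch // reduction, out_ch, 1),
                                   nn.Sigmoid())

    def forward(self, x):
        x = self.reduce(x)
        joint = self.f_joi(torch.cat((self.f_loc(x), self.f_sur(x)), dim=1))
        return joint * self.f_glo(joint)  # channel-wise re-weighting by the global context

# x = torch.randn(1, 64, 40, 40); CGBlock(64, 64)(x).shape -> torch.Size([1, 64, 40, 40])
```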
2.4.3. Neck Up-Sampling Module Design
The up-sampling module of the neck is mainly responsible for recovering the detailed information to better capture the boundaries and shapes of the target, which facilitates the detection of Sagittaria trifolia, and therefore an improvement of the up-sampling is necessary.
In the rice field weed detection task, weeds and rice are similar to some extent due to the complexity of the environment, as shown in Figure 5f. Therefore, this paper introduces the DySample lightweight dynamic up-sampling module to address the heavy workload and strong dependence on feature guidance of kernel-based dynamic up-samplers (e.g., CARAFE, FADE, and SAPA). DySample bypasses dynamic convolution and formulates up-sampling from a point-sampling perspective, which not only effectively enhances the model’s resistance to interference but also improves the detection of weeds that are highly similar to rice. It has fewer parameters than kernel-based dynamic up-sampling modules, saving computational resources and making it more suitable for real-time industrial detection.
The main flow of the DySample up-sampling module is shown in Figure 9, where X is a feature map of size C × H₁ × W₁, δ is a point sampling set of size 2g × H₂ × W₂, and 2g represents the x and y coordinates in the first dimension. The grid_sample function uses the location information in the point sampling set δ to resample the feature map X into X′ of size C × H₂ × W₂, as demonstrated in Equation (1):
X′ = grid_sample(X, δ)   (1)
Figure 10 shows the generation process of the point sampling set δ based on the dynamic range factor. First, the up-sampling scale factor s and the feature map X of size C × H × W are given. Next, a linear layer is used to generate the offset O; the number of input channels of this linear layer is C and the number of output channels is 2gs². The generated offset O therefore has a size of 2gs² × H × W, as shown in Equation (2):
O = linear(X)   (2)
The offset is subsequently reshaped through pixel shuffling (pixel shuffle) into a high-resolution raw sampling grid of size 2g × sH × sW. Finally, the point sampling set δ is obtained by combining the offset O with the original sampling grid G, as shown in Equation (3):
δ = G + O   (3)
2.4.4. Detection Head Design
The accuracy of the model’s detection is closely tied to the detection head, and for different detection tasks, the detection head can be designed to better match the characteristics of the target so that it can ensure accuracy while reducing redundant operations.
The original YOLOv10n detection head is computationally intensive in training and application, and when deployed on resource-constrained platforms such as UAVs, its larger number of parameters may increase inference time and thus affect the responsiveness of the device. Therefore, to improve computational efficiency while affecting accuracy as little as possible, this research introduces a lightweight detection head, SCSD-Head, relying on shared convolution and scale scaling, whose structure is shown in Figure 11.
To minimize the impact of the lightweight design on accuracy, GNConv (Group Normalization convolution) was used. The three GNConv 1 × 1 modules do not share parameters, whereas the six GNConv 3 × 3 modules shown in Figure 11 share parameters. The six red modules in the figure, including RegConv and ClsConv, also share parameters. Using parameter-sharing convolution modules significantly reduces the parameter count, making the model lighter, especially on resource-constrained devices. To cope with the inconsistent target scales handled by each detection head (for example, the scales of field ridges and weeds differ greatly), three Scale layers with unshared parameters are used together with the shared convolutions to scale the features of the different levels separately, ensuring accuracy.
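A simplified PyTorch sketch of the shared-convolution, scale-scaling head described above is shown below; the hidden channel width, the number of shared 3 × 3 GNConv layers, and the regression output format are illustrative assumptions rather than the exact SCSD-Head configuration.

```python
import torch
import torch.nn as nn

class Scale(nn.Module):
    """Learnable per-level scalar used to rescale the shared-branch output."""
    def __init__(self, init=1.0):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(init))

    def forward(self, x):
        return x * self.scale

def gn_conv(in_ch, out_ch, k=3):
    """Conv + GroupNorm + SiLU (GNConv)."""
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False),
                         nn.GroupNorm(16, out_ch), nn.SiLU(inplace=True))

class SCSDHead(nn.Module):
    """Shared-convolution, scale-scaling head (sketch): per-level 1x1 GNConvs are
    unshared, the 3x3 GNConv stack and the final Reg/Cls convs are shared across
    levels, and per-level Scale layers compensate for scale differences."""
    def __init__(self, in_channels=(64, 128, 256), hidden=64, num_classes=3, reg_max=16):
        super().__init__()
        self.align = nn.ModuleList(gn_conv(c, hidden, k=1) for c in in_channels)      # unshared 1x1
        self.shared = nn.Sequential(gn_conv(hidden, hidden), gn_conv(hidden, hidden))  # shared 3x3 stack
        self.reg_conv = nn.Conv2d(hidden, 4 * reg_max, 1)   # shared RegConv
        self.cls_conv = nn.Conv2d(hidden, num_classes, 1)   # shared ClsConv
        self.scales = nn.ModuleList(Scale() for _ in in_channels)                      # unshared Scale layers

    def forward(self, feats):
        outputs = []
        for f, align, scale in zip(feats, self.align, self.scales):
            x = self.shared(align(f))
            outputs.append((scale(self.reg_conv(x)), self.cls_conv(x)))
        return outputs
```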
2.5. Test Environment Configuration and Parameter Setting
The experimental processing platform comprised a 10th Gen Intel(R) Core(TM) i9-10980XE CPU at 3.00 GHz (manufactured by Intel, Santa Clara, CA, USA), 64 GB of RAM, and an NVIDIA Quadro RTX 5000 GPU with 64 G of video memory (manufactured by NVIDIA, Santa Clara, CA, USA). The software environment consisted of CUDA 12.0 + cuDNN 8.7.0 + conda 22.9.0 + Python 3.8.16, and the operating system was Windows 10 Professional 64-bit. During model training, the initial learning rate was set to 0.01, the batch size to 16, and the maximum number of iterations to 300; early stopping was not used. A Stochastic Gradient Descent (SGD) optimizer was used to update the weight parameters, and the decay strategy for the learning rate (lr) can be characterized as follows:
lr = basic_lr × (1 − iter_index / max_iter)^p
where basic_lr represents the basic learning rate, max_iter represents the maximum iteration count, iter_index is the iteration index, and p is the polynomial decay exponent. In this paper, the basic learning rate was set to 0.001, the momentum to 0.9, the weight decay to 1 × 10⁻⁴, and the lower limit for learning rate updates to 0. All models were trained using these settings.
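The polynomial decay schedule above can be expressed directly in code; the value of the exponent p used in the example call below is an assumption, since it is not stated in the text.

```python
def poly_lr(basic_lr, iter_index, max_iter, p=0.9):
    """Polynomial learning-rate decay: lr = basic_lr * (1 - iter_index / max_iter) ** p."""
    return basic_lr * (1.0 - iter_index / max_iter) ** p

# e.g., with basic_lr = 0.001, max_iter = 300, and p = 0.9 (assumed):
# poly_lr(0.001, 0, 300)   -> 0.001
# poly_lr(0.001, 150, 300) -> ~0.00054
# poly_lr(0.001, 300, 300) -> 0.0  (the stated lower limit)
```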
During training, the distance between the probability distribution predicted by the algorithm for each pixel class and the true label class probability distribution was measured by the Cross-Entropy Loss function, which can be calculated as follows:
Loss = −(1/M) ∑_{i=1}^{M} ∑_{c=1}^{N} h(b_i, c) · log(p_ic)
where M refers to the number of pixels, N refers to the number of categories, i refers to the pixel being processed, c indicates the category being considered, b_i indicates the correct label category for pixel i, h is an indicator function that takes the value 1 when b_i equals c and 0 otherwise, and p_ic denotes the probability that pixel i belongs to category c, calculated by applying the Sigmoid function to the predicted category score. During iterative training, the loss value was used to measure the training effect of the model, and backpropagation was then used to optimize the weight parameters so that the loss decreased. The loss finally converged to a relatively stable value, completing model training.
2.6. Evaluation Indicators
To objectively evaluate the detection effect of the algorithm on weed images in rice fields and quantitatively assess its performance, this research used the number of parameters, the number of floating-point operations (GFLOPs), the inference speed (FPS), the model size, the precision (Precision), the recall (Recall), and the mean Average Precision (mAP) to assess the improved algorithm. These metrics are defined as follows:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
mAP = (1/N) ∑_{i=1}^{N} AP_i
where N indicates the total number of detection categories and AP_i is the average precision of category i. Precision and recall are computed at an IoU threshold of 0.5. After the model tests the samples, there are four possible outcomes: true positive (TP), true negative (TN), false positive (FP), and false negative (FN).
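For reference, these metrics reduce to the following simple computations; the example AP values are hypothetical and serve only to illustrate the mAP calculation over the three label categories.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN), at IoU = 0.5."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def mean_average_precision(ap_per_class):
    """mAP = mean of the per-class average precisions over the N categories."""
    return sum(ap_per_class) / len(ap_per_class)

# e.g., for the three categories (ridge, single, multiple) with hypothetical AP values:
# mean_average_precision([0.90, 0.87, 0.85]) -> ~0.873
```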
2.7. The Principle of Grad-CAM
Grad-CAM (Gradient-weighted Class Activation Mapping) is a method for interpreting the decisions of convolutional neural networks. Its key idea is to weight the feature maps of a convolutional layer by the gradients of the target class score with respect to that layer, averaged per channel, to obtain a coarse heat map. This heat map can be enlarged and overlaid on the original image to show which areas the model focuses on most when making its prediction.
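A minimal PyTorch sketch of the Grad-CAM procedure is given below; it assumes the model exposes per-class scores for a chosen target (for a YOLO-style detector, the class score of a selected detection is typically used instead), and the hook-based implementation is illustrative.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx):
    """Minimal Grad-CAM sketch: hook a convolutional layer, back-propagate the target
    class score, average the gradients per channel to obtain weights, and form a
    weighted sum of the activations as a coarse heat map."""
    activations, gradients = {}, {}

    def fwd_hook(_, __, out):
        activations["v"] = out

    def bwd_hook(_, __, grad_out):
        gradients["v"] = grad_out[0]

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)

    score = model(image)[0, class_idx]        # assumes the model returns per-class scores
    model.zero_grad()
    score.backward()

    weights = gradients["v"].mean(dim=(2, 3), keepdim=True)           # channel-wise gradient average
    cam = F.relu((weights * activations["v"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)          # normalize to [0, 1] for overlay

    h1.remove(); h2.remove()
    return cam
```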
2.8. Test Results and Analyses
The improved model requires an ablation test to verify whether each improved module contributes positively, and the regions the model attends to are observed through heat maps. For the deficiencies found in the pre-test, the model’s ability to detect these special cases is then compared before and after the improvement. A targeted test with water-surface reflections is also performed to check whether the model resists interference from sunlight reflected off the water. The proposed model is further compared with classical models to confirm that its performance is adequate. Finally, the spraying strategy is formulated and the variable-rate spraying prescription maps are drawn.
3. Results
3.1. Results of the Ablation Test
The loss values of the algorithm during training, before and after improvement, are shown in Figure 12. In the early stage of training, the loss value decreases rapidly; that is, the model converges quickly. This shows that the model learns the data features well in the first 200 training rounds and can capture the important features in the data. As the epochs increase, the rate of decline of the loss slows and stabilizes after about 200 rounds. Finally, the loss of YOLOv10n-FCDS drops to 1.921. The model weights at the lowest loss value are taken as the optimal weights obtained by training. The loss value of YOLOv10n-FCDS is always lower than that of YOLOv10n throughout training, which shows that the overall improvements enhance model performance.
To verify the effectiveness of each modification, an ablation test was designed against YOLOv10n, with the unimproved model denoted as A. The different modifications to the model are shown in Table 2. The same dataset was used for model training, validation, and testing, and the results are displayed in Table 3.
As can be seen from Table 3, YOLOv10n-FCDS substantially improves the detection accuracy with only a small increase in computation and parameter count.
Using FasterNet as the backbone network improved model accuracy (mAP50) by 1.4 percentage points, owing to its more efficient operators, fewer memory accesses, and in particular its use of PConv, while also increasing inference speed. After further replacing the conventional convolution and SCDown down-sampling in the neck with CGBlock, the surrounding and global feature extractors improve the model’s ability to attend to surrounding and global contextual information, allowing it to better detect weeds intersecting with rice and raising accuracy by a further 0.7 percentage points. After additionally replacing the original neck up-sampling module with DySample, the point-based sampling strategy makes the model more resistant to interference and better able to detect weeds similar in appearance to rice, improving accuracy by a further 0.6 percentage points. With the lightweight detection head SCSD-Head, the extensive use of parameter-sharing convolutions reduced the parameter count by 16% and the computation by 18%, increased the computational speed by 5%, and reduced the model size by 16%, while the use of Group Norm convolutions limited the accuracy drop to only 0.01 percentage points. The performance of SCSD-Head shows that the detection head achieves its original design intention of effectively reducing the parameter count and computation, improving inference FPS, and reducing model size.
The modified model extracts the significant features of paddy weeds more proficiently while having little effect on operating efficiency, and can therefore better complete the weed detection task in rice fields.
3.2. Heat Map Analysis of the Model Before and After Improvement
To observe more intuitively the regions YOLOv10n-FCDS attends to when detecting weeds in paddy fields, and to compare the internal working mechanisms of the model before and after the improvement of YOLOv10n, the feature extraction for paddy field weeds was compared using the Grad-CAM visualization technique. All visualizations were generated from the last encoder layer of each model. Figure 13 displays the results.
As can be seen from the YOLOv10n and YOLOv10n-FCDS ridge sections in Figure 13, YOLOv10n’s attention to the ridge is incomplete. In YOLOv10n-FCDS, thanks to the use of CGBlock for down-sampling, the model fully takes into account both the surrounding and global contextual features; this enlarges the model’s receptive field and thus improves target detection. In addition, as can be seen from the rice region in the ridge section of Figure 13, YOLOv10n focuses too strongly on rice features due to the high similarity between rice and weeds, whereas YOLOv10n-FCDS, using DySample for up-sampling with its point-based sampling strategy, is better able to distinguish weeds with a high similarity to rice.
As can be seen from the smaller weed targets at the bottom of the YOLOv10n and YOLOv10n-FCDS heat maps in the weed section of Figure 13, YOLOv10n’s feature extraction is insufficient for smaller single-plant targets, whereas YOLOv10n-FCDS, using FasterNet as the main feature extraction network, improves feature extraction for small targets relative to the original model. It can also be seen from these heat maps that the area of interest of YOLOv10n-FCDS is larger; owing to the CGBlock, YOLOv10n-FCDS more often attends to the outskirts of contiguous weeds that overlap with or are shaded by rice.
3.3. Comparison of Model Performance Before and After Improvement
To quantitatively describe the change in detection performance for specific types of weeds before and after model improvement, 30 images of small target weeds, 30 images of weeds obscured by rice, and 30 images of weeds similar in appearance to rice were selected from the test set. The mAP50 on these three datasets, before and after model improvement, is displayed in Table 4. It is evident from Table 4 that YOLOv10n-FCDS increased the detection precision, compared to YOLOv10n, by 2.5, 2.8, and 3.0 percentage points for small target weeds, weeds obscured by rice, and weeds similar to rice, respectively.
The detection results of YOLOv10n-FCDS are shown in Figure 14. It can be seen from Figure 14d that the problem of missed recognition of small targets has been alleviated. As can be seen in Figure 14e, weeds obscured by rice can also be accurately detected. It is evident from Figure 14f that the missed detection of weeds similar in appearance to rice has also been resolved. The confidence of the detections has also improved significantly overall, providing ample evidence of the effectiveness of the improvements.
3.4. Tests of Model Immunity Under Different Intensities of Water Reflections
UAVs are susceptible to lighting conditions, and during the rice tillering period, the exposed water surface is still relatively large and prone to sunlight reflection, which may lead to misidentification of rice or missed identification of weeds. To verify the performance of the model under different water reflection intensities, samples under strong and weak light were selected from the test set for further testing. The experimental results are shown in Figure 15 and indicate that the improved model can detect weeds under different intensities of water reflection.
3.5. Performance Comparison with Classical Models
YOLOv10n-FCDS was compared with several mainstream object detection models, Faster R-CNN, SSD, YOLOv8n, and YOLOv9t, all trained with the same training parameters and training set. The parameter counts, GFLOPs, FPS, Recall, and mAP50 values of the different models are compared in Table 5.
As can be seen from Table 5, the parameter counts and GFLOPs of Faster R-CNN and SSD are significantly higher than those of the other models. Faster R-CNN is a two-stage detection algorithm that first uses an RPN (Region Proposal Network) to obtain candidate regions and then classifies them, while SSD relies on a heavy backbone and dense multi-scale prior boxes; the parameter counts and computation of these models are therefore significantly higher than those of the lightweight single-stage YOLO models.
The YOLO series models are single-stage models, omitting the complicated candidate region computation, which reduces model complexity and improves efficiency. The algorithm proposed in this study, YOLOv10n-FCDS, maintains a strong balance between processing speed and precision: the number of parameters is 3,989,749, GFLOPs is 9.7, FPS is 424.0, Recall is 0.806, and mAP50 reaches 0.874. Compared to YOLOv8n, although the parameter count and computation rise slightly, inference speed and accuracy are significantly higher.
In addition, YOLOv10n-FCDS is only 0.1 ms slower than YOLOv9t, while its accuracy is 6.5 percentage points higher. These advantages stem from the adoption of FasterNet as the backbone feature extraction network, whose extensive use of PConv enables the model to better extract features of small target weeds and reduce redundant computation, improving accuracy with only small changes in parameter count and computation. Moreover, the CGBlock down-sampling module allows the model to attend to surrounding and global contextual features, boosting its ability to detect obscured weeds. The point-based sampling strategy in the DySample up-sampling module improves discrimination between rice and weeds and enhances the model’s resistance to interference. Finally, the lightweight detection head based on shared convolution and scale scaling achieves a noteworthy reduction in parameters and operations with almost no loss of accuracy.
All in all, YOLOv10n-FCDS achieves a high detection accuracy (mAP50) while maintaining low latency and a small number of parameters, making it suitable for applications such as rice field weed detection, where efficiency is required alongside the accuracy needed for target detection.
3.6. Development of Spraying Strategies and Mapping of Prescriptions
To show how weeds are distributed in the field, all the non-overlapping sub-images of Fields 1 and 2 were input into the improved model YOLOv10n-FCDS for automatic detection of weeds in the paddy fields. Figure 16 shows the complete weed detection results for Fields 1 and 2, composed of the non-overlapping sub-images detected by YOLOv10n-FCDS and then stitched together. In this figure, single Sagittaria trifolia targets are shown in yellow detection boxes, contiguous Sagittaria trifolia targets in red detection boxes, and ridge targets in white detection boxes.
Both fields selected for this study were 12 acres in size, so each field was divided into 12 plots of one acre each. The division scheme and corresponding numbers are shown by the orange dashed grid in Figure 16. The amount of herbicide needed in each plot was then determined from the number of weeds in that plot. In this investigation, contiguous weed patches mostly contain 3 to 5 plants, so the mean value was taken and each contiguous patch was counted as 4 weeds; the total weed count for each plot is the sum of the single weeds and the weeds converted from contiguous patches. The statistical results are provided in Figure 17.
The herbicide application rates in the experimental plots were adjusted according to the weed populations in Figure 18. Specifically, the conventional application rate for farmers in the Aji Township area is 1.5 L/acre; plots with weed counts above 600 were treated at this rate, corresponding to the red areas in Figure 18a. Plots with weed counts of 450 to 600 were treated at 85 percent of the conventional rate, i.e., 1.27 L/acre, corresponding to the yellow areas in Figure 18a,b. When the weed count was in the range of 300 to 450, the application rate was adjusted to 70% of the conventional rate, i.e., 1.05 L/acre, corresponding to the green areas in Figure 18a,b. If the weed count was in the range of 150 to 300, the rate was reduced to 50 percent of the conventional rate, i.e., 0.75 L/acre, corresponding to the blue areas in Figure 18a,b.
If all plots were treated at the conventional application rate, a total of 36.00 L of pesticide would be consumed, whereas applying pesticide according to the spraying strategy of this study requires a total of 28.53 L for the two test fields. Compared with conventional application, the method of this study saves about 20.75 percent of the herbicide volume.
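The weed-count-to-rate rule described above can be written as a small function, as sketched below; the handling of plots with fewer than 150 weeds is an assumption, since this case is not specified in the text.

```python
def plot_weed_count(single, contiguous):
    """Total weed count for a plot: each contiguous patch is counted as 4 plants."""
    return single + 4 * contiguous

def application_rate(weeds, conventional=1.5):
    """Map a plot's weed count to a herbicide rate (L/acre) using the rules of Section 3.6."""
    if weeds > 600:
        return conventional              # 1.50 L/acre (conventional rate)
    if weeds > 450:
        return 0.85 * conventional       # ~1.27 L/acre
    if weeds > 300:
        return 0.70 * conventional       # 1.05 L/acre
    if weeds > 150:
        return 0.50 * conventional       # 0.75 L/acre
    return 0.50 * conventional           # plots below 150 weeds: assumption, not specified in the text

# Total herbicide for a field = sum of application_rate(count) over its 12 one-acre plots.
```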
4. Discussion
This study aimed to maximize the effectiveness of weed control while reducing the amount of chemicals used. Employing a deep learning algorithm, we detected weeds in rice fields and developed variable-rate application prescription maps based on the results. The experimental findings indicate that the improved algorithm YOLOv10n-FCDS raises mAP50 by 2.5 percentage points over the pre-improvement model YOLOv10n, to 87.4%. Application strategies were developed and variable-rate spraying prescription maps were drawn based on the weed identification results. The application method of this study was calculated to save 20.75 percent of herbicide compared to the traditional crude method of spraying herbicide over a large area.
The large number of plants with small leaves (small target weeds) at the rice tillering stage, the fact that some of the weeds will be shaded by the rice, and that some of the weeds are similar in appearance to the rice, makes the identification of Sagittaria trifolia in paddy fields via UAV-based remote sensing challenging. To tackle the aforementioned issues, this study introduces specific enhancements and creates a rice field weed detection algorithm that optimizes both detection performance and efficiency, enabling effective weed detection in challenging real-world conditions.
Firstly, the use of FasterNet to replace the backbone feature extraction network improved the algorithm’s capacity for detecting small target weeds. Secondly, replacing the regular convolution and down-sampling module in the neck with CGBlock improved the algorithm’s capacity to detect obscured weeds. The model’s ability to detect weeds highly similar to rice was then improved by replacing the neck up-sampling module with DySample. Finally, a lightweight detection head was proposed that drastically reduces the number of model parameters and the computational requirements while hardly affecting accuracy. However, the dataset constructed in this study contained weed samples from fields of only two rice varieties; because different rice varieties also differ in appearance, the generalization ability of the identification model may be limited. Therefore, future studies should collect rice field weed samples from a wider range of rice varieties to improve the generalization ability of the algorithm.
In this research, the two experimental fields were each divided into 12 plots, variable-rate application strategies were developed, and variable-rate prescription maps were drawn based on the weed identification results. The whole process, from collecting rice field weed data to producing the variable-rate spraying prescription maps, took several days of data collection and processing. The complexity and non-real-time nature of this process present challenges to widespread implementation. To address them, our subsequent studies will create a lighter version of the model and incorporate the rice field weed detection algorithm into the UAV’s embedded system, with the aim of significantly improving the efficiency of weed recognition and reducing human intervention. We will also aim to simplify the procedure for generating variable-rate prescription maps. Overall, our approach offers a potential solution for precision agriculture to reduce pesticide use and the risk of environmental pollution by tailoring the application program to the number of weeds in each plot of the rice field.
5. Conclusions
In this study, a weed detection algorithm for rice fields (YOLOv10n-FCDS) was designed for the purpose of detecting Sagittaria trifolia, a major weed in rice fields. A variable spray prescription map was produced based on the detection results. The main conclusions are summarized below:
(1) The weed detection model YOLOv10n-FCDS was constructed. The weed detection accuracy (mAP50) of the baseline model YOLOv10n was 84.8%; the low accuracy was due to the presence of many small target weeds, obscured weeds, and weeds similar to rice. Improvements were made to address these three issues: Firstly, for small target weeds, the original backbone network was replaced by FasterNet, improving accuracy (mAP50) by 1.4%. Secondly, for obscured weeds, the mAP50 was further improved by 0.7% using CGBlock as the neck down-sampling module. Thirdly, for weeds similar to rice, the mAP50 was further improved by 0.6% using DySample as the neck up-sampling module. Finally, a lightweight detection head, SCSD-Head, was proposed, with a 0.1% decrease in accuracy (mAP50), a 16.0% decrease in parameter count, and a 17.8% decrease in computation. YOLOv10n-FCDS achieved a mAP50 of 87.4%, a 2.6% improvement relative to the baseline model.
(2) The weed detection ability and generality of YOLOv10n-FCDS were tested in detail under a complex rice field environment. Small target weeds, obscured weeds, and weeds similar to rice were screened from the weed database to construct separate test datasets, and the model was tested on each. The results showed that, compared to the baseline model, the mAP50 of YOLOv10n-FCDS improved by 2.5% to 87.2% on the small target dataset, by 2.8% to 87.1% on the obscured dataset, and by 3.0% to 86.9% on the similar-weed dataset. These results fully demonstrate that the proposed weed detection model YOLOv10n-FCDS can adapt to complex rice field environments and can meet the needs of variable-rate herbicide UAV spraying and precise management.
(3) This study explored an “efficient weed detection + prescription map generation + near-ground variable spraying” weed control model. Based on rice orthophotos collected efficiently by a UAV at 22 m, the weed detection model YOLOv10n-FCDS was used to achieve effective detection of weed populations. A rule converting weed counts to application rates was developed to generate variable-rate spraying prescription maps to guide near-ground UAV variable spraying operations.