Article

A Lightweight Cotton Field Weed Detection Model Enhanced with EfficientNet and Attention Mechanisms

1 College of Computer Science, South-Central Minzu University, Wuhan 430074, China
2 Hubei Provincial Engineering Research Center of Agricultural Blockchain and Intelligent Management, Wuhan 430074, China
3 School of Media and Communication, Wuhan Textile University, Wuhan 430073, China
4 Institute of Agricultural Economics and Technology, Hubei Academy of Agricultural Sciences, Wuhan 430064, China
* Author to whom correspondence should be addressed.
Agronomy 2024, 14(11), 2649; https://doi.org/10.3390/agronomy14112649
Submission received: 1 October 2024 / Revised: 5 November 2024 / Accepted: 8 November 2024 / Published: 11 November 2024
(This article belongs to the Section Precision and Digital Agriculture)

Abstract

Cotton is a crucial crop in the global textile industry, with major production regions including China, India, and the United States. While smart agricultural mechanization technologies, such as automated irrigation and precision pesticide systems, have improved crop management, weeds remain a significant challenge. These weeds not only compete with cotton for nutrients but can also serve as hosts for diseases, affecting both cotton yield and quality. Existing weed detection models perform poorly in the complex environment of cotton fields, where the visual features of weeds and crops are similar and often overlap, resulting in low detection accuracy. Furthermore, real-time deployment on edge devices is difficult. To address these issues, this study proposes an improved lightweight weed detection model, YOLO-WL, based on the YOLOv8 architecture. The model leverages EfficientNet to reconstruct the backbone, reducing model complexity and enhancing detection speed. To compensate for any performance loss due to backbone simplification, CA (cross-attention) is introduced into the backbone, improving feature sensitivity. Finally, AFPN (Adaptive Feature Pyramid Network) and EMA (efficient multi-scale attention) mechanisms are integrated into the neck to further strengthen feature extraction and improve weed detection accuracy. At the same time, the model maintains a lightweight design suitable for deployment on edge devices. Experiments on the CottonWeedDet12 dataset show that the YOLO-WL model achieved an mAP of 92.30%, reduced the detection time per image by 75% to 1.9 ms, and decreased the number of parameters by 30.3%. After TensorRT optimization, the video inference time was reduced from 23.134 ms to 2.443 ms per frame, enabling real-time detection in practical agricultural environments.

1. Introduction

Cotton serves as an indispensable component of the global textile industry, with major producing regions such as China, India, the United States, Pakistan, Brazil, and Uzbekistan exerting a profound influence on the agricultural sector. The progression of intelligent agricultural mechanization technologies has catalyzed significant opportunities for transformation and upgrading within the cotton cultivation sector, especially in the realms of precision management and automated operations [1]. The preliminary application of intelligent agricultural machinery, such as precision pesticide application robots and automated irrigation systems, in cotton fields has been facilitated through the integration of cutting-edge information and communication technologies, thereby enhancing the precision and efficiency of crop management [2]. Despite the advancements in modern cultivation techniques, including genetically modified cotton and sophisticated irrigation strategies, which have bolstered yields and pest resistance, weeds remain a formidable adversary to healthy cotton cultivation [3]. They not only compete aggressively with cotton plants for essential soil nutrients, water, and sunlight but can also act as hosts for diseases, exacerbating the decline in both the quantity and quality of cotton output [4]. It is estimated that globally, approximately 43% of weeds have the potential to damage crops [5]. Ineffective weed management in cotton cultivation can lead to yield losses exceeding 50%, and in severe cases, this figure can surge to as high as 90% [6]. Currently, chemical herbicides are a common means of weed control; however, their excessive use can lead to soil and water pollution, increased herbicide resistance in weeds, and public health risks, thereby jeopardizing the sustainability of agriculture.
Machine vision technology is increasingly being applied in the agricultural sector, serving as a pivotal component of intelligent agricultural mechanization. It plays an essential role in weed detection and management in cotton fields. By integrating high-resolution imaging systems and sophisticated computer vision models, intelligent agricultural machinery can not only automatically identify crops, weeds, and other targets in the cotton field but also perform precise application operations according to the detection results, significantly reducing pesticide use, improving application efficiency, and minimizing the environmental impact [7].
Weeds in cotton fields occupy nutrients and space, greatly hindering the normal growth of cotton and severely affecting yield, making them a major threat to cotton production. With the continuous advancement of deep learning technology, its application in the field of object recognition is gaining increasing attention. Dang et al. conducted experiments comparing 25 YOLO models across 7 versions, including YOLOv3, YOLOv4, YOLOR, and YOLOv7, on the CottonWeedDet12 dataset. Each model demonstrated strengths on different performance metrics, with YOLOv3 and YOLOv4 outperforming the others under the authors' specific experimental conditions and parameter settings [8]. Ahmad et al. evaluated three different image classification models and, using the YOLOv3 algorithm, achieved an average accuracy of 54.3% in detecting four major types of weeds in the Midwest of the United States [9].
In the application of convolutional neural network (CNN) technology for crop recognition, Peteinatos et al. achieved an accuracy rate of 77–98% in plant detection and weed species differentiation by training RGB camera-captured images of corn and sunflower fields using three types of convolutional neural networks (VGG16, ResNet-50, and Xception), followed by image testing [10]. Farooq et al. employed local binary pattern (LBP) analysis to extract superpixel features from weed imagery and then learned the complex spatial characteristics of these images using a convolutional neural network (CNN). Subsequently, they assessed the efficacy of a Support Vector Machine (SVM) in classifying such advanced spatial features on hyperspectral and remote sensing datasets [11].
In the domain of weed identification using unsupervised frameworks, Shorewala et al. employed convolutional neural networks (CNNs) to perform the unsupervised segmentation of the foreground pixels of crops and weeds [12]. They utilized an enhanced CNN model to achieve the precise delineation of weed areas, and the generalization performance of this method was validated across various crop datasets. Dos Santos Ferreira et al. conducted an in-depth investigation of unsupervised learning algorithms, proposing an unsupervised learning framework that integrates deep feature representation with image clustering techniques [13]. They conducted empirical research on these algorithms and validated the experiments on two public datasets. The results demonstrated that the method could effectively distinguish between Poaceae and broadleaf weeds, highlighting high classification performance and advantages akin to those of supervised learning.
Research on weed identification based on deep learning has achieved significant advancements in model optimization, dataset construction, feature extraction, and hardware acceleration, thereby offering effective technical support and solutions for addressing practical challenges in agriculture. Mu et al. introduced an enhanced Faster R-CNN model designed for detecting weeds in images with complex field backgrounds, facilitating the development of precise automated weeding systems [14]. Ilyas et al. proposed an unsupervised domain adaptation method that learns to disregard minor variations in low-level statistical information, effectively addressing the recognition of crops and weeds in previously unseen fields [15]. Mu et al. employed a local variance preprocessing technique for background segmentation and data augmentation, which amplifies weed features while suppressing background characteristics, achieving an accuracy rate of 97.98% on the dataset [16]. Wang et al. combined the MAVIC AIR drone with the YOLO_CBAM network, achieving the efficient detection of the invasive weed Solanum rostratum Dunal, thereby significantly enhancing detection accuracy and efficiency [17]. Fan et al. integrated a visual perception module on a cotton field spraying robot and used an improved Fast-RCNN algorithm for weed identification, enabling precise spraying on target weeds, which further improved the weeding efficacy [18].
However, the complexity of real-world cotton field environments, such as the similarity of visual features among different types of weeds at various growth stages and the mutual occlusion of weeds, poses significant challenges for weed detection tasks [19]. An analysis of weed detection in real cotton field scenarios, leveraging current deep learning-based crop detection methods, reveals several key issues: The complexity of the cotton field environment, coupled with the similar features among different weed categories and the overlapping of different weed species [20], complicates feature extraction and results in low detection accuracy and a limited number of weed recognition categories [21]. Furthermore, model deployment on edge devices is constrained by limited computational resources, storage capacity, and energy consumption, making it difficult to deploy models with large computational loads and parameter sizes. Optimizing the model by reducing its parameter count and computational complexity not only ensures efficient operation on resource-constrained devices but also helps extend battery life and improve overall system performance in field applications.
This study adopted object detection models in deep learning as the research foundation and addressed the limitations of current detection models in extracting weed features within real cotton field scenarios. Specifically, it tackled the issues of limited detection categories, missed detections, and false positives encountered in these scenarios. Additionally, this paper examines the challenges associated with the improved detection models, such as large memory occupation and a large number of parameters.
The primary objective of this study was to develop an enhanced lightweight weed detection model, YOLO-WL, specifically tailored for the real-time monitoring of weeds in cotton fields. To accomplish this goal, we reconstructed the YOLOv8 architecture by incorporating EfficientNet, which not only reduces model complexity but also enhances detection speed. Furthermore, we introduce cross-attention (CA) to bolster feature sensitivity, and integrate Adaptive Feature Pyramid Network (AFPN) alongside efficient multi-scale attention (EMA) mechanisms to refine feature extraction and improve the detection accuracy. Through comprehensive testing and optimization, including the application of TensorRT to achieve real-time performance, we aimed to deliver an effective weed detection solution that is well-suited for deployment on resource-constrained edge devices.

2. Materials and Methods

2.1. Materials

This study utilized the public cotton field weed dataset, CottonWeedDet12 [6], for model training and evaluation. The dataset was compiled from cotton fields at the research farm of Mississippi State University, employing a range of handheld smartphones and high-resolution digital color cameras. It encompasses a total of 5648 images and 9370 bounding box annotations across 12 distinct categories, including waterhemp, morningglory, purslane, spotted spurge, carpetweed, ragweed, eclipta, prickly sida, palmer amaranth, sicklepod, goosegrass, and cutleaf groundcherry. Example images from the CottonWeedDet12 dataset are shown in Figure 1.
The original data were annotated in JSON format but were converted to the YOLO label format to meet the experimental requirements. Additionally, to align more closely with the performance specifications of the experimental hardware, the images were uniformly cropped. After processing, the dataset was randomly divided into training, validation, and test sets in ratios of 65%, 20%, and 15%, respectively. The adjusted cotton field weed dataset and the distribution of the various weed types are presented in Table 1.
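As a minimal illustration of the label conversion, the sketch below normalizes one bounding box from pixel corner coordinates to the YOLO format (class index followed by the normalized box center and size); the helper name and argument layout are hypothetical and do not reflect the exact CottonWeedDet12 JSON schema.

```python
# Minimal sketch: convert one bounding box from pixel corner coordinates to the
# YOLO label format "class cx cy w h" with all values normalized to [0, 1].
# The function name and argument layout are illustrative only.
def to_yolo_line(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    cx = (x_min + x_max) / 2.0 / img_w   # normalized box center x
    cy = (y_min + y_max) / 2.0 / img_h   # normalized box center y
    w = (x_max - x_min) / img_w          # normalized box width
    h = (y_max - y_min) / img_h          # normalized box height
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"

# Example: a 200 x 300 px box with top-left corner (100, 50) in a 1920 x 1080 image, class 3
print(to_yolo_line(3, 100, 50, 300, 350, 1920, 1080))
```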

2.2. Improved Efficient Convolutional Lightweight Weed Detection Model in Cotton Field

The YOLO-WL model proposed in this paper is an optimized lightweight version of the YOLOv8 model. It maintains the performance of the YOLOv8 model while replacing the backbone network with EfficientNet, an efficient convolutional neural network with exceptional performance in classification tasks [22]. At the same time, cross-attention is incorporated into the improved backbone network. By augmenting the model's capability to extract both the image and location features of weeds in cotton fields, the YOLO-WL model strikes a balance between a lightweight design and accuracy retention. This optimization makes the model more suitable for real-time detection requirements and facilitates its rapid deployment in intelligent cotton field weeding devices.

2.2.1. Overall Structure of the Model

To enhance the real-time capabilities required for weed detection in cotton fields, the YOLOv8 model was optimized by replacing its backbone network with EfficientNet and integrating cross-attention mechanisms into the enhanced backbone. These modifications augment the model's feature extraction capabilities, thereby improving its accuracy and robustness, particularly when processing complex scenes and small targets. The overall structure of the model is depicted in Figure 2, where the dashed lines indicate the improved sections, divided into two parts: the backbone and the head.

2.2.2. Improved Efficient Convolutional Lightweight Backbone Network

The identification and removal of weeds in cotton fields can be realized by edge devices, such as weeding robots. However, the main processing unit of these edge devices is constrained by limitations in computational power and storage capacity. A critical research focus remains the enhancement of weed detection speed by reducing model parameters and simplifying computational processes, without compromising recognition accuracy.
To realize a lightweight model for detecting weeds accompanying seedlings in cotton fields and to meet the real-time operating requirements of robots or automated systems, both the accuracy and the running speed of the model must satisfy certain requirements. This study therefore replaced the original backbone of YOLOv8 with EfficientNet-CA, an efficient convolutional network into which cross-attention (CA) [23] was introduced. By optimizing the network structure, faster inference is achieved while the accuracy of the model is preserved, making it possible to meet real-time requirements without sacrificing accuracy when detecting weeds accompanying seedlings in cotton fields.
The EfficientNet model, optimized through compound scaling, enables a more in-depth learning of image features, including the recognition of subtle differences, thereby achieving higher accuracy in weed and crop identification. This compound scaling approach enables EfficientNet to process high-resolution inputs at a reduced computational expense, yielding clearer image analysis without compromising performance. In compound scaling, the depth, width, and resolution of the network are adjusted by a common scaling coefficient. The compound scaling formulation is shown in Equation (1):
$$\mathrm{depth}: d = \alpha^{\phi}, \quad \mathrm{width}: w = \beta^{\phi}, \quad \mathrm{resolution}: r = \gamma^{\phi}, \quad \text{s.t.} \ \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2, \ \alpha \ge 1, \ \beta \ge 1, \ \gamma \ge 1 \tag{1}$$
where d denotes the network depth, w represents the network width, and r signifies the resolution of the input image. α, β, and γ are scaling factors, while ϕ is a hyperparameter that governs the network’s scale. Under constrained conditions, ϕ is initialized to 1, followed by a meticulous adjustment of the hyperparameters using a grid search method. The optimal combination is determined to be α = 1.2, β = 1.1, and γ = 1.15. The model corresponding to this set of parameters is EfficientNet-B0.
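To make the scaling relationship concrete, the short sketch below (illustrative only, not part of the original implementation) evaluates the depth, width, and resolution multipliers of Equation (1) for several values of ϕ using the factors reported above.

```python
# Illustrative computation of the compound scaling in Equation (1): a single
# coefficient phi jointly scales network depth, width, and input resolution.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # grid-searched base factors reported above

def compound_scaling(phi: float):
    depth_mult = ALPHA ** phi   # multiplier for the number of layers
    width_mult = BETA ** phi    # multiplier for the number of channels
    res_mult = GAMMA ** phi     # multiplier for the input resolution
    return depth_mult, width_mult, res_mult

# phi = 1 corresponds to the EfficientNet-B0 configuration described above;
# the constraint alpha * beta^2 * gamma^2 ≈ 2 keeps the cost roughly doubling
# with each unit increase of phi.
for phi in (1, 2, 3):
    d, w, r = compound_scaling(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
```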
Building upon this foundation, EfficientNet employs a block known as MBConv (Mobile Inverted Bottleneck Convolution) as the fundamental building unit of the network. The structure of the MBConv module is depicted in Figure 3. This block comprises a depthwise separable convolution layer and a linear transformation layer. The depthwise separable convolution operates by first applying convolution to each channel individually and then merging the results across all channels with a pointwise convolution. Given the real-time processing requirements, the model is designed for rapid inference and low latency to facilitate timely image processing and decision making. The use of depthwise separable convolution layers significantly reduces the number of network parameters and the computational load, enhancing the model's efficiency and making it particularly well suited for embedded systems and mobile applications.
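The sketch below shows a minimal PyTorch depthwise separable convolution of the kind used inside MBConv (the expansion and squeeze-and-excitation stages of the full block are omitted); it is an illustrative assumption rather than the exact layer configuration used in EfficientNet.

```python
import torch
import torch.nn as nn

# Minimal sketch of a depthwise separable convolution: a per-channel (depthwise)
# convolution followed by a 1x1 pointwise convolution that mixes channels.
class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, stride,
                                   padding=kernel_size // 2, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# A 3x3 depthwise + 1x1 pointwise pair uses roughly k*k*C + C*C' weights
# instead of k*k*C*C' for a standard convolution, hence the parameter savings.
x = torch.randn(1, 32, 64, 64)
print(DepthwiseSeparableConv(32, 64)(x).shape)  # torch.Size([1, 64, 64, 64])
```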
In real cotton field environments, different categories of weeds may be visually similar, and it is challenging for traditional convolutional networks to distinguish them accurately. The cross-attention mechanism [23] enables the model to focus more intently on features that are particularly significant for differentiating weed types, such as shapes, textures, or color variations. By concentrating on this key information, the mechanism enhances the model's recognition accuracy in complex backgrounds. CA also allows the model to use its computational resources more efficiently by dynamically shifting attention to the key parts of the weed image, reducing the computation spent on non-critical regions and contributing to the lightweight design. Cross-attention can be viewed as an approach similar to Squeeze-and-Excitation (SE) attention [24], in which global average pooling is used to model cross-channel information. Generally, channel-wise statistics are generated through global average pooling, which compresses global spatial location information into channel descriptors. Distinct from SE, CA embeds the spatial location information of the weed feature into the channel attention map to enhance feature aggregation. The computational process of the CA mechanism is shown in Figure 4.
The cross-attention (CA) mechanism decomposes the original input tensor into two parallel one-dimensional feature encoding vectors to model cross-channel dependencies using spatial position information. One of the parallel channels directly originates from a one-dimensional global average pooling along the horizontal dimension, which can be considered a collection of position information along the vertical dimension. Let the original input tensor $X \in \mathbb{R}^{C \times W \times H}$ represent the intermediate feature map; thus, the one-dimensional global average pooling that encodes global information along the horizontal dimension for the c-th channel at height H is computed as shown in Equation (2).
$$z_c^{H}(H) = \frac{1}{W} \sum_{0 \le i < W} x_c(H, i) \tag{2}$$
In this equation, $x_c$ denotes the input feature of the c-th channel; $X \in \mathbb{R}^{C \times W \times H}$ is the input tensor; C is the number of input channels; and H and W are the spatial dimensions of the input features. Through this encoding process, CA captures long-range dependencies along the horizontal dimension while retaining precise positional information along the vertical dimension. Similarly, the other channel is derived directly from one-dimensional global average pooling along the vertical dimension, which can be considered an assembly of positional information along the horizontal dimension. This channel captures long-distance spatial interactions along the vertical dimension and retains precise positional information along the horizontal dimension, thereby enhancing the focus on spatial regions of interest. The pooled output of width W in channel c is calculated as shown in Equation (3).
$$z_c^{W}(W) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, W) \tag{3}$$
The input features can encode the global feature information of weeds in the context of a cotton field and help the model capture global information along two spatial dimensions, neither of which involves convolutions. Additionally, two parallel one-dimensional feature encoding vectors are generated. These vectors share a 1 × 1 convolution with dimensionality reduction capabilities. The 1 × 1 kernel is designed to help the model to capture local cross-channel interactions and share similarities with channel convolutions. CA further decomposes the output of the 1 × 1 convolutional kernel into two parallel one-dimensional feature-encoding vectors, each followed by a separate 1 × 1 convolution, and then a Sigmoid. The attention map weights, learned from the two parallel channels, are utilized to aggregate the original intermediate feature maps into the final output. Consequently, CA preserves long-range dependencies not only by encoding channel and spatial information but also by enhancing the model’s focus on salient spatial regions.
CA achieves an impressive performance by embedding precise weed location information into the channels and capturing long-range spatial interactions. Two one-dimensional global average pools were designed to encode global information along two spatial dimensions and capture remote spatial interactions along different dimensions.
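For concreteness, a minimal PyTorch sketch of this two-branch pooling-and-gating computation is given below; it follows the coordinate-style layout described above (two 1-D global poolings, a shared 1 × 1 reduction convolution, and two sigmoid-gated 1 × 1 convolutions) and is an illustrative assumption rather than the exact module used in YOLO-WL.

```python
import torch
import torch.nn as nn

# Minimal sketch of the CA computation in Figure 4: pool along each spatial
# dimension, pass both encodings through a shared 1x1 reduction, then re-weight
# the input with two direction-specific sigmoid attention maps.
class CAAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        mid = max(8, channels // reduction)
        self.shared = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid),
            nn.SiLU(),
        )
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        z_h = x.mean(dim=3, keepdim=True)                       # B x C x H x 1: pool along width
        z_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)   # B x C x W x 1: pool along height
        y = self.shared(torch.cat([z_h, z_w], dim=2))           # shared 1x1 reduction
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                   # attention along height
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # attention along width
        return x * a_h * a_w

x = torch.randn(1, 64, 32, 32)
print(CAAttention(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```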
Agricultural robots and automated systems require rapid responses for real-time weed detection. Introducing EfficientNet, augmented with the computationally efficient cross-attention mechanism, as the backbone of the improved YOLOv8 enables rapid inference and meets the needs of real-time processing.

2.2.3. Head Network of Efficient Multi-Scale Attention Progressive Feature Pyramid Combination

To address the challenges of high similarity among detection targets and severe overlap and occlusion between plants, the C2f module in the first and second layers of the backbone network was improved based on the efficient multi-scale attention (EMA) module. This improvement enables the extraction of a richer set of features from the target plants while reducing computational complexity. EMA is an advanced feature extraction module designed to retain as much information as possible in each channel while minimizing computational demands. Compared to traditional feature extraction methods, the EMA-enhanced C2f module attends to the information in every channel rather than only a select few. By taking feature grouping and the multi-scale structure into account, the module effectively constructs short- and long-term dependencies, resulting in superior performance. The network diagram of C2f_EMA is shown in Figure 5.
Although the C2f in the backbone network was enhanced with EMA (efficient multi-scale attention) to improve its capability for multi-scale feature extraction, challenges remain in encoding weeds with scale variations for cotton field weed detection tasks. Utilizing a top-down and bottom-up feature pyramid network for feature fusion can result in the loss and degradation of feature information. To better integrate the multi-scale weed feature information extracted by the improved backbone network, a progressive feature pyramid network was incorporated into the original YOLOv8 head network. Fusing two adjacent low-level features and progressively merging higher-level features avoids large semantic gaps between non-adjacent levels, preventing the loss or degradation of cotton and weed feature information during transmission and interaction. This enhancement helps the model better capture the distinctions between different types of weeds in the cotton field, thereby improving the detection accuracy. Because feature fusion at each spatial location can produce conflicts among multiple target cues, an adaptive spatial fusion operation is further employed to mitigate these inconsistencies and refine the precision of weed target detection in cotton field scenarios.
To improve the speed and effectiveness of weed detection in cotton fields, the AFPN (Adaptive Feature Pyramid Network) [25] was introduced and the C2f of the head network was modified based on EMA [26] to improve the feature fusion capability and detection performance of the model. The scale and morphology of weeds in cotton fields are diverse, and the AFPN can effectively capture the feature information of targets at different scales through the adaptive aggregation of multi-scale feature maps and the dynamic adjustment of fusion weights. This adaptability enables the network to better accommodate weeds of different sizes and shapes, improving the accuracy and robustness of detection. The improved head network is shown in Figure 6.
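To illustrate the adaptive spatial fusion step, the sketch below learns a per-pixel softmax weight for each input feature level and sums the levels with those weights; it assumes the levels have already been resized to a common shape and is a simplified stand-in for the actual AFPN fusion module rather than its exact implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of adaptive spatial fusion: per-pixel softmax weights decide
# how much each (already resized) feature level contributes at every location,
# suppressing conflicting multi-target information during fusion.
class AdaptiveSpatialFusion(nn.Module):
    def __init__(self, channels, num_levels=2):
        super().__init__()
        # one spatial weight map per input level, predicted from that level's features
        self.weight_convs = nn.ModuleList(
            [nn.Conv2d(channels, 1, kernel_size=1) for _ in range(num_levels)]
        )

    def forward(self, feats):  # feats: list of tensors with identical shape B x C x H x W
        logits = torch.cat([conv(f) for conv, f in zip(self.weight_convs, feats)], dim=1)
        weights = torch.softmax(logits, dim=1)                     # B x L x H x W
        return sum(weights[:, i:i + 1] * f for i, f in enumerate(feats))

f1 = torch.randn(1, 128, 40, 40)
f2 = torch.randn(1, 128, 40, 40)
print(AdaptiveSpatialFusion(128)([f1, f2]).shape)  # torch.Size([1, 128, 40, 40])
```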
By organically combining the progressive feature pyramid of the head network's AFPN with efficient multi-scale attention (EMA), the feature fusion capability of the model's head network is enhanced, allowing it to better capture the target features of weeds across various scales and shapes and thereby improving detection accuracy and robustness. Refining the C2f module with efficient multi-scale attention (EMA) also accelerates the detection speed of the network and improves the detection of small targets, enhancing the model's efficacy in weed detection tasks within real cotton field scenarios.

2.3. Experimental Environment

The operating system used in the experiment was CentOS, with the CPU model being Intel (R) Xeon (R) CPU E5-2630 v4 @2.20 GHz (10 cores). The hardware configuration included three NVIDIA Tesla P40 GPUs, each with 22 GB of video memory. The system was supported by 32 GB of RAM and a 1 TB disk drive. The experiments were conducted using Python 3.9.17, with the deep learning framework PyTorch 2.0.1 and the GPU acceleration library CUDA 11.4. During the network training phase, the batch size was set to 8, and the number of iterations was set to 300, as shown in Table 2.
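For reference, a training invocation consistent with the settings in Table 2 might look like the sketch below, assuming the Ultralytics YOLOv8 Python API; the dataset configuration file name and the use of the stock yolov8n architecture are illustrative assumptions rather than the exact YOLO-WL setup.

```python
# Illustrative training call matching Table 2 (batch size 8, 300 epochs, three GPUs).
from ultralytics import YOLO

model = YOLO("yolov8n.yaml")          # baseline architecture; YOLO-WL would use a custom model .yaml
model.train(
    data="cottonweeddet12.yaml",      # hypothetical dataset config with the 65/20/15 split
    epochs=300,
    batch=8,
    device=[0, 1, 2],                 # three NVIDIA Tesla P40 GPUs
)
```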

2.4. Model Evaluation Metrics

In this study, we used accuracy (A), precision (P), recall (R), average precision (AP), and frames per second (FPS) to evaluate the model performance. Accuracy is an indicator that evaluates the overall correctness of the model's predictions. Precision reflects the proportion of instances predicted by the model to be positive samples that are truly positive. Recall measures the proportion of all true positive instances that are correctly identified by the model. Average precision is a comprehensive indicator of precision and recall, which can more fully evaluate the model's performance under different recall rates [27]. Frames per second is used to evaluate the model's running speed in practical applications, which is particularly important for scenarios with high real-time requirements. These evaluation metrics together form an important tool for comprehensively evaluating model performance in object detection tasks. Equations (4)–(8) give the definitions of precision, recall, AP, mAP, and the F1-score, where TP indicates that both the true value and the predicted value are positive, and TN indicates that both are negative. FP indicates that the true value is negative and the predicted value is positive; FN indicates that the true value is positive and the predicted value is negative.
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{4}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{5}$$
$$AP = \int_{0}^{1} P(R)\,\mathrm{d}R \tag{6}$$
$$mAP = \frac{1}{n} \sum_{n} AP_{n} \tag{7}$$
$$F1\text{-}score = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{8}$$
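As a minimal numeric illustration of these definitions, the sketch below computes precision, recall, and the F1-score from hypothetical detection counts; in practice, AP and mAP are integrated over the precision-recall curve rather than computed from single counts.

```python
# Minimal sketch of the per-class metrics in Equations (4)-(8) from detection counts.
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def f1_score(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0

def mean_ap(ap_per_class):
    return sum(ap_per_class) / len(ap_per_class)

# Hypothetical counts: 90 true positives, 10 false positives, 15 false negatives
p, r = precision(90, 10), recall(90, 15)
print(round(p, 3), round(r, 3), round(f1_score(p, r), 3))
```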
In addition, Param and weight size were selected as evaluation metrics to evaluate the lightweight performance of the algorithm. Param refers to the number of parameters of the network, and weight size indicates the storage space occupied by all the learnable parameters (weights) in the model. The specific objective of this study was to validate the lightweight performance of the improved YOLOv8n.

3. Results

3.1. Comparative Experiment Analysis of Lightweight Backbone Networks

To verify the effectiveness of EfficientNet for lightweighting the model, we designed comparative experiments on lightweight backbone networks. Comparisons were made between EfficientNet and other lightweight backbones, namely MobileNeXt [28], PP-LCNet [29], and ShuffleNetV2 [30]. The comparative results are listed in Table 3.
From Table 3, it can be seen that, compared to the baseline model, the weight size was reduced when the backbone network was replaced with the MobileNeXt, PP-LCNet, ShuffleNetV2, or EfficientNet model. With MobileNeXt, PP-LCNet, and EfficientNet, the number of parameters decreased by 3.11, 4.21, and 3.48, respectively, whereas with ShuffleNetV2 it increased by 11.05. The weight size was reduced by 1.1, 1.8, 2.6, and 1.4, respectively, and the average accuracy decreased by 0.68, 2.08, 12.68, and 0.18, respectively.
In summary, replacing the backbone network of YOLOv8 with EfficientNet resulted in the lowest accuracy loss; the model size was reduced by 1.1 compared with YOLOv8, the number of parameters decreased, and the inference speed improved. Therefore, the improved cotton field weed detection model with the EfficientNet backbone offers the better lightweight trade-off and is more suitable for the subsequent implementation of unmanned targeted intelligent weeding equipment.

3.2. Comparative Experiment Analysis of Backbone Networks with Attention Mechanisms

To further reduce the loss of detection accuracy while maintaining the lightweight of the model, the cross-attention (CA) mechanism was introduced into the model’s backbone network. To verify the effectiveness of the CA mechanism, it was compared with other common attention mechanisms. The CA module, EMA module, CBAM attention module [31], SE module [24], SimAM attention module [32], SK attention module [33], and CoT attention [34] modules were each integrated into the backbone network EfficientNet, and the comparative experimental results are presented in Table 4.
From Table 4, it can be observed that, compared to the baseline model, the lightweight cotton field weed detection models with the EfficientNet backbone incorporating the EMA, CBAM, SE, SimAM, SK, CoT, and CA mechanisms all exhibited fluctuations in average accuracy, with an increase of 0.32 and decreases of 0.48, 0.08, 0.18, 1.28, 0.42, and 0.12, respectively; the detection time for a single image was reduced by 9.2, 10.8, 10.7, 10.2, 10.7, 10.0, and 8.6, respectively.
In terms of model FPS, the backbone network models with EMA, CBAM, SE, SimAM, SK, CoT, and CA mechanisms all showed an improvement, with increases of 307.01, 447.62, 421.30, 321.30, 421.30, 291.67, and 397.49, respectively.
In terms of the weight of the models, the introduction of EMA, CBAM attention, SE attention, SimAM attention, SK attention, CoT attention and CA decreased the weight by 1.3, 1.3, 1.4, 1.4, 0.9, 1.3, and 1.3, respectively. The number of parameters decreased by 3.09, 3.47, 3.09, 3.10, 2.29, 3.01, and 3.09, respectively.
In summary, compared with YOLOv8, the backbone network detection model with CA increased the average accuracy by 0.12, reduced the single-image detection time by 8.6, increased the FPS by 397.49, decreased the weight by 1.30, and decreased the number of parameters by 3.09. On the basis of a lightweight design, this model achieved the best detection accuracy and performance.

3.3. Comparative Experiment Analysis of the Improved Head Network C2f Module Based on Attention Mechanisms

Since the backbone network of YOLOv8 was replaced with the more lightweight EfficientNet as discussed above, we improved the C2f module in the feature fusion part of YOLOv8 to preserve the detection accuracy of the model while keeping it lightweight. To verify the effectiveness of this improvement, a comparative analysis was conducted on C2f modules of the feature fusion part improved with different attention mechanisms, and the comparison of test results is shown in Table 5.
The comparative data presented in Table 5 demonstrate the effectiveness of the lightweight model proposed in this paper. Relative to the benchmark model, the average accuracy of the C2f module within the head network, when enhanced with SE attention, CA, Triplet attention [35], ECA [36], and EMA, decreased by 0.88, 0.48, 0.08, 0.48, and 0.58, respectively.
In terms of single-image detection time, the C2f modules of the head network improved based on SE attention, CA, Triplet attention, and EMA decreased by 9.7, 10.8, 10.3, and 10.8, respectively, while the C2f module improved based on ECA increased by 6.5; in terms of FPS, the C2f modules of the head network improved based on SE attention, CA, ECA, Triplet attention, and EMA increased by 254.63, 447.62, 26.62, 337.97, and 447.62, respectively.
In terms of weight size, the C2f modules of the head network improved based on SE attention, CA, ECA, and Triplet attention all decreased by 1.3, while the C2f module improved based on EMA decreased by 0.9. In terms of the number of parameters, the C2f modules of the head network improved based on SE attention, CA, ECA, Triplet attention, and EMA decreased by 3.47, 3.47, 3.46, 3.49, and 2.71, respectively.
In summary, the model with the EMA-improved C2f module achieved a 10.8 reduction in single-image detection time, a 447.62 increase in FPS, a 0.9 reduction in weight size, and a 2.71 reduction in the number of parameters, further improving detection performance while keeping the model lightweight.

3.4. Ablation Study Analysis

In order to prove the effectiveness of the various improvement strategies in the YOLO-WL cotton weed detection model, ablation experiments were conducted on each improved strategy, and the specific experimental contents and detection results are shown in Table 6. In the table, EfficientNet and CA represent the improvements made to the model’s backbone network structure, while C2f_EMA refers to the improvements made to the model’s head network.
As can be seen from Table 6, the first set of experiments replaced the backbone network of the YOLOv8 detection model with EfficientNet; a small loss in average accuracy was observed, with a decrease of 0.18%, but there was a significant improvement in the single-image detection time and the weight size, with the detection time per image shortened to 1.7 ms and the weight size reduced to 4.5 MB. The second set of experiments introduced cross-attention into the backbone network; the average accuracy improved by 0.12%, but the detection time per image increased to 2.7 ms, while the weight size was reduced by 1.3 MB. The third set of experiments introduced the C2f module improved with efficient multi-scale attention (EMA) into the head network; the average accuracy of the model decreased by a further 0.58, while the single-image detection time increased by 0.2 ms and the weight size increased by 0.5 MB compared with the first set of experiments.
The last set of experiments represented the YOLO-WL detection model proposed in this paper, which integrated the improvements of the three sets of experiments. Compared with the first set of experiments, the model’s average accuracy increased by 0.5, the detection time per image increased by 0.2 ms, and the weight size increased by 0.1. Compared with the second set of experiments, the model’s average accuracy increased by 0.2, the detection time per image increased by 1.0 ms, and the weight size remained unchanged. Compared with the third set of experiments, the model’s average accuracy increased by 0.9, the detection time per image remained unchanged, and the weight size decreased by 0.4.

3.5. Experimental Analysis of Improved Models

By incorporating the EfficientNet backbone network with cross-attention into the baseline model and integrating the C2f module featuring efficient multi-scale attention, together with the progressive feature pyramid fusion module, into the head network, the YOLO-WL cotton field weed detection model maintains a lightweight architecture while ensuring high accuracy for weed detection in real-world cotton fields. This makes the model particularly suitable for real-time weed detection tasks in real cotton field scenes.
To comprehensively evaluate the performance of the improved model, this study conducted a meticulous evaluation of 566 cotton field weed images from the test dataset. Table 7 details the detection results for various weed categories against the backdrop of cotton fields using the model developed in this research. The findings in Table 7 demonstrate the successful detection of 12 weed species in cotton fields, including waterhemp, morningglory, purslane, spotted spurge, carpetweed, ragweed, eclipta, prickly sida, palmer amaranth, sicklepod, goosegrass, and cutleaf groundcherry. Figure 7 compares the detection results of YOLO-WL and YOLOv8, revealing that the improved model presented in this paper effectively addressed certain issues of missed and false detections.
The mutual occlusion and similar features between weeds and cotton plants increase the difficulty of detection. However, the model enhanced in this study can accurately extract the morphological characteristics of weeds and cotton plants, thereby achieving the precise detection of cotton plants and various weed categories. For instance, certain weeds may be challenging to identify due to occlusion by cotton plants or other weeds, but the improved YOLOv8 model, with advanced fusion strategies, integrates the positional and semantic information of the occluded weeds, significantly enhancing the detection accuracy of such obscured weeds.
In summary, the YOLOv8 model refined in this study demonstrated excellent performance in the task of cotton field weed detection. It consistently achieved satisfactory detection outcomes across a range of scenarios, including small and multiple-target detections, as well as under conditions affected by occlusions and variations in lighting.

3.6. Comparative Experiment Analysis of Detection Networks

To qualitatively evaluate the detection results of the improved model YOLO-WL, the improved model was compared with YOLOv7, YOLOv7-tiny, YOLOv5, YOLOv3, and the original YOLOv8 model on cotton field weed images in the test set. The comparison results are shown in Table 8.
As can be seen from Table 8, on the cotton field weed dataset, the lightweight cotton field weed detection model YOLO-WL based on efficient convolutional improvement proposed in this study achieved an average accuracy of 92.30%, a detection time of 1.9 ms per image, an FPS of 526.32, a weight size of 4.6 MB, and a parameter amount of 7.98 MB. Compared to YOLOv3, YOLOv5, YOLOv7, YOLOv7-tiny, and YOLOv8, the detection time per image was reduced by 52.3, 10.8, 23.6, 118.5, and 10.8, respectively; the FPS was improved by 507.87, 447.58, 487.1, 518.01, and 447.62, respectively; and the weight size was reduced by 786.4 MB, 0.4 MB, 7.2 MB, 2.6 MB, 280.4 MB, and 1.3 MB, respectively. These results are all superior to those of the other detection models, significantly increasing model speed.

3.7. Model Acceleration Test Based on Tensor RT

Model parameters trained on the server side are typically stored in a specific format, and when the model is actually deployed on an edge device, such as a weeding robot, varying degrees of performance loss can occur compared with the server-side model.
In actual cotton field farming activities, real-time detection and convenient deployment are crucial for efficient operation on the edge. In particular, when operations must follow the growth cycle of cotton plants, the requirements for detection speed are even more stringent. To achieve the high-efficiency detection of cotton field weeds, this study employed model acceleration technology based on TensorRT.
As a leading inference optimization engine, TensorRT excels at achieving low-latency and high-throughput inference execution. It speeds up the inference process by deeply analyzing the structure of the network and reconstructing and optimizing it. The strength of TensorRT is that it does not need to compromise between real-time operation and high-precision detection, meeting real-time requirements while preserving detection accuracy.
For the YOLO-WL model, TensorRT implemented multi-dimensional optimization measures, including layer fusion, convolution kernel selection, network pruning, and precision adjustment, which effectively reduced the computational complexity and memory usage and improved the inference speed. After optimization, TensorRT converted the model into a highly optimized computational graph, enabling it to fully exploit the parallel computing capability of the GPU for rapid inference. The network model was converted from the pt file format to the engine file format compatible with TensorRT, and the model computation precision was set to 16-bit floating point. Finally, the optimized detection model was successfully deployed to the cotton field weeding edge device.
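A conversion along these lines can be scripted as in the sketch below, assuming the Ultralytics export API (format="engine" with half-precision enabled); the weight and image file names are hypothetical placeholders.

```python
# Illustrative export of trained weights to a TensorRT engine with FP16 precision.
from ultralytics import YOLO

model = YOLO("yolo_wl_best.pt")                 # trained weights (hypothetical path)
model.export(format="engine", half=True)        # produces a .engine file usable by TensorRT

# The resulting engine can then be loaded for inference on the edge device:
trt_model = YOLO("yolo_wl_best.engine")
results = trt_model("cotton_field_frame.jpg")   # hypothetical test image
```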
TensorRT takes full advantage of GPU parallel computing and model optimization to preserve the model's detection accuracy while greatly improving the real-time performance of weed detection in cotton fields [37]. To verify the generalization ability and practical validity of the optimized model in actual scenes, field tests were conducted using real cotton field weed videos in the cotton fields of Bole City, Bortala Mongol Autonomous Prefecture, Xinjiang Uygur Autonomous Region; the cotton fields in Shuanghe Park of the 5th Division of the Xinjiang Production and Construction Corps; and a cotton field in Dawu County, Xiaogan City, Hubei Province. The experimental results are recorded in Table 9, clearly showing the superior performance of the TensorRT-accelerated model in the task of cotton field weed detection.
The video inference time of the original YOLOv8 model on cotton field weed videos was 30.851 ms per frame. The improved YOLO-WL model reduced this to 23.134 ms per frame, and after acceleration with TensorRT, the inference time was further reduced to 2.443 ms per frame, a 9.47-fold increase in inference speed.
Combining the detection results in Figure 8 with the data in Table 8 and Table 9, the YOLO-WL model exhibits a marked enhancement in detection success rate and speed over the YOLOv8 detection model under real-world conditions. Taking all of the above experiments into account, the YOLO-WL model is shown to perform well in the cotton field weed detection task in real scenarios, and it provides effective theoretical support for realizing intelligent uncrewed targeted weeding in cotton fields.

4. Discussion

As China’s population ages and the agricultural labor force gradually shrinks, improving agricultural productivity has become an urgent challenge. In recent years, deep learning technologies have been widely applied in the agricultural sector, yielding significant advancements, particularly in crop growth monitoring [38], pest and disease detection [39], precision irrigation [40], and soil quality analysis [41]. Many studies have focused on leveraging remote sensing technology, drones [42], and sensors [43] for real-time data acquisition and analysis to enhance crop yield and resource efficiency. However, despite the successful application of these technologies across various fields, existing models often demand substantial computational resources, which limits their deployment on resource-constrained edge devices. Among these applications, weed detection in cotton fields, as a critical task in precision agriculture, requires highly efficient, low-resource solutions to meet the demands of edge device deployment. To address this challenge, this study proposes a lightweight detection model, YOLO-WL, based on an improved YOLOv8 architecture, specifically designed for real-time weed detection in cotton fields.
The proposed YOLO-WL model integrates EfficientNet, cross-attention (CA), efficient multi-scale attention (EMA), and an Adaptive Feature Pyramid Network (AFPN) to significantly reduce the number of parameters and computational load while maintaining a detection performance comparable to that of YOLOv8. Specifically, the model reduces the number of parameters by 30.3%, decreases the weight size by 22.0%, and enhances detection speed by a factor of 5.7. These improvements enable YOLO-WL to be rapidly deployed on resource-constrained edge devices, providing an efficient and real-time solution for weed monitoring in cotton fields.
Although the YOLO-WL model has demonstrated excellent performance in experiments, certain limitations remain. First, the publicly available U.S. datasets used in this study do not fully represent the local agricultural conditions in China. To address this, we plan to develop a dedicated cotton field object detection dataset in Bole City, Bortala Mongol Autonomous Prefecture, Xinjiang Uygur Autonomous Region, to further enhance the model’s adaptability. Second, the model is still in the experimental phase and has not yet been integrated with automated weeding equipment. In the future, we aim to combine YOLO-WL with automated weeding devices to realize a fully automated workflow from weed detection to weeding operations, thereby reducing labor input and increasing agricultural productivity. Additionally, by integrating multi-modal sensor data (e.g., spectral and infrared data), we expect to further improve detection accuracy in complex environments.
In conclusion, this study effectively addresses the pressing need for efficient and adaptable weed detection solutions in precision agriculture. The YOLO-WL model demonstrates that lightweight, real-time detection can be achieved with minimal resource consumption, offering a practical solution for sustainable agriculture. Looking forward, we will continue to innovate and refine the model to enhance its generalizability and further integrate it with advanced agricultural equipment, thereby supporting the transition to fully automated and intelligent agricultural systems.

5. Conclusions

Based on the cotton weed dataset, this study developed a cotton field weed detection model that integrates multi-scale features and multiple attention mechanisms to address the challenges of identifying various weed species and different weed sizes and to realize accurate, intensive detection. The lightweight YOLOv8n model was improved by combining EfficientNet, CA (cross-attention), EMA (efficient multi-scale attention), and an AFPN (Adaptive Feature Pyramid Network). The experimental results show that the improved YOLOv8 can detect weeds accurately and quickly and is suitable for devices with limited memory and computing resources. The results indicate that the improved YOLOv8 achieves an average accuracy, average image detection time, FPS (frames per second), weight size, and parameter number of 92.30%, 1.9 ms, 526.32, 4.6 MB, and 7.98, respectively. Compared to various YOLO versions, the model is lighter while still maintaining accurate detection, which helps in the early detection of weeds, improves monitoring efficiency, and reduces labor costs. Compared to the baseline model, the detection accuracy improved by 0.32%, the number of parameters decreased by 30.3%, the weight size was reduced by 22.0%, and the detection speed increased by 5.7 times. Future research will focus on addressing the limitations and challenges highlighted in this study. Additionally, research will further optimize the model to accommodate a wider range of application scenarios and integrate multimodal sensor data (e.g., spectral and infrared) to enhance detection accuracy in complex environments. Through these initiatives, we aim to advance smart agricultural technologies to achieve more efficient weed management and sustainable agricultural development.

Author Contributions

Conceptualization, L.Z. and L.L.; methodology, L.L. and J.T.; software, L.L. and C.Z.; validation, P.C. and L.Z.; formal analysis, L.Z.; investigation, P.C.; resources, L.Z.; data curation, L.L. and M.J.; writing—original draft preparation, L.L. and L.Z.; writing—review and editing, M.J. and J.T.; visualization C.Z.; supervision, L.Z.; project administration, J.T. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the Hubei Province Key Research and Development Special Project of Science and Technology Innovation Plan under grant number 2023BAB087, Wuhan Knowledge Innovation Special Dawn Project under grant number 2023010201020465, open competition project for selecting the best candidates, Wuhan East Lake High-tech Development Zone under grant number 2024KJB328, and Fund for Research Platform of South-Central Minzu University under grant number CZQ24011.

Data Availability Statement

The data presented in this study are available from [Zenodo] at [https://doi.org/10.5281/zenodo.7535814].

Acknowledgments

Many thanks to Zhiqing Luo and Wei Xia for their help and contributions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Hu, K.; Wang, Z.; Coleman, G.; Bender, A.; Yao, T.; Zeng, S.; Song, D.; Schumann, A.; Walsh, M. Deep learning techniques for in-crop weed recognition in large-scale grain production systems: A review. Precis. Agric. 2024, 25, 1–29. [Google Scholar] [CrossRef]
  2. Rani, S.V.J.; Kumar, P.S.; Priyadharsini, R.; Srividya, S.J.; Harshana, S. Automated weed detection system in smart farming for developing sustainable agriculture. Int. J. Environ. Sci. Technol. 2022, 19, 9083–9094. [Google Scholar] [CrossRef]
  3. Lauwers, M.; De Cauwer, B.; Nuyttens, D.; Cool, S.R.; Pieters, J.G. Hyperspectral classification of Cyperus esculentus clones and morphologically similar weeds. Sensors 2020, 20, 2504. [Google Scholar] [CrossRef]
  4. Xu, K.; Yuen, P.; Xie, Q.; Zhu, Y.; Cao, W.; Ni, J. WeedsNet: A dual attention network with RGB-D image for weed detection in natural wheat field. Precis. Agric. 2024, 25, 460–485. [Google Scholar] [CrossRef]
  5. Li, J.; Chen, D.; Yin, X.; Li, Z. Performance evaluation of semi-supervised learning frameworks for multi-class weed detection. Front. Plant Sci. 2024, 15, 1396568. [Google Scholar] [CrossRef] [PubMed]
  6. MacRae, A.W.; Webster, T.M.; Sosnoskie, L.M.; Culpepper, A.S.; Kichler, J.M. Cotton yield loss potential in response to length of Palmer amaranth (Amaranthus palmeri) interference. J. Cotton Sci. 2013, 17, 227–232. [Google Scholar]
  7. Mendoza-Bernal, J.; González-Vidal, A.; Skarmeta, A.F. A Convolutional Neural Network approach for image-based anomaly detection in smart agriculture. Expert Syst. Appl. 2024, 247, 123210. [Google Scholar] [CrossRef]
  8. Dang, F.; Chen, D.; Lu, Y.; Li, Z. YOLOWeeds: A novel benchmark of YOLO object detectors for multi-class weed detection in cotton production systems. Comput. Electron. Agric. 2023, 205, 107655. [Google Scholar] [CrossRef]
  9. Ahmad, A.; Saraswat, D.; Aggarwal, V.; Etienne, A.; Hancock, B. Performance of deep learning models for classifying and detecting common weeds in corn and soybean production systems. Comput. Electron. Agric. 2021, 184, 1–30. [Google Scholar] [CrossRef]
  10. Peteinatos, G.G.; Weis, M.; Andújar, D.; Rueda Ayala, V.; Gerhards, R. Potential use of ground-based sensor technologies for weed detection. Pest Manag. Sci. 2014, 70, 190–199. [Google Scholar] [CrossRef]
  11. Farooq, A.; Jia, X.; Hu, J.; Zhou, J. Multi-resolution weed classification via Convolutional Neural Network and superpixel based local binary pattern using remote sensing images. Remote Sens. 2019, 11, 1692. [Google Scholar] [CrossRef]
  12. Shorewala, S.; Ashfaque, A.; Sidharth, R.; Verma, U. Weed density and distribution estimation for precision agriculture using semi-supervised learning. IEEE Access 2021, 9, 27971–27986. [Google Scholar] [CrossRef]
  13. Dos Santos Ferreira, A.; Freitas, D.M.; Da Silva, G.G.; Pistori, H.; Folhes, M.T. Unsupervised deep learning and semi-automatic data labeling in weed discrimination. Comput. Electron. Agric. 2019, 165, 104963. [Google Scholar] [CrossRef]
  14. Mu, Y.; Feng, R.; Ni, R.; Li, J.; Luo, T.; Liu, T.; Li, X.; Gong, H.; Guo, Y.; Sun, Y.; et al. A faster R-CNN-based model for the identification of weed seedling. Agronomy 2022, 12, 2867. [Google Scholar] [CrossRef]
  15. Ilyas, T.; Lee, J.; Won, O.; Jeong, Y.; Kim, H. Overcoming field variability: Unsupervised domain adaptation for enhanced crop-weed recognition in diverse farmlands. Front. Plant Sci. 2023, 14, 1234616. [Google Scholar] [CrossRef]
  16. Mu, Y.; Ni, R.; Fu, L.; Luo, T.; Feng, R.; Li, J.; Pan, H.; Wang, Y.; Sun, Y.; Gong, H.; et al. DenseNet weed recognition model combining local variance preprocessing and attention mechanism. Front. Plant Sci. 2023, 13, 1041510. [Google Scholar] [CrossRef]
  17. Wang, Q.; Cheng, M.; Huang, S.; Cai, Z.; Zhang, J.; Yuan, H. A deep learning approach incorporating YOLO v5 and attention mechanisms for field real-time detection of the invasive weed Solanum rostratum Dunal seedlings. Comput. Electron. Agric. 2022, 199, 107194. [Google Scholar] [CrossRef]
  18. Fan, X.; Chai, X.; Zhou, J.; Sun, T. Deep learning based weed detection and target spraying robot system at seedling stage of cotton field. Comput. Electron. Agric. 2023, 214, 108317. [Google Scholar] [CrossRef]
  19. Chen, P.; Xia, T.; Yang, G. A new strategy for weed detection in maize fields. Eur. J. Agron. 2024, 159, 127289. [Google Scholar] [CrossRef]
  20. Jin, X.; Sun, Y.; Che, J.; Bagavathiannan, M.; Yu, J.; Chen, Y. A novel deep learning-based method for detection of weeds in vegetables. Pest Manag. Sci. 2022, 78, 1861–1869. [Google Scholar] [CrossRef]
  21. Singh, V.; Singh, D.; Kumar, H. Efficient application of deep neural networks for identifying small and multiple weed patches using drone images. IEEE Access 2024, 12, 71982–71996. [Google Scholar] [CrossRef]
  22. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. [Google Scholar]
  23. Lin, H.; Cheng, X.; Wu, X.; Shen, D. Cat: Cross attention in vision transformer. In Proceedings of the 2022 IEEE international Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 18–22 July 2022; pp. 1–6. [Google Scholar]
  24. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 13–23 June 2018; IEEE: New York, NY, USA; pp. 7132–7141. [Google Scholar]
  25. Yang, G.; Lei, J.; Zhu, Z.; Cheng, S.; Feng, Z.; Liang, R. AFPN: Asymptotic Feature Pyramid Network for Object Detection. arXiv 2023, arXiv:2306.15988. [Google Scholar]
  26. Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient Multi-Scale Attention Module with Cross-Spatial Learning. In Proceedings of the ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
  27. Javanmardi, S.; Ashtiani, S.H.M.; Verbeek, F.J.; Martynenko, A. Computer-vision classification of corn seed varieties using deep convolutional neural network. J. Stored Prod. Res. 2021, 92, 101800. [Google Scholar] [CrossRef]
  28. Zhou, D.; Hou, Q.; Chen, Y.; Feng, J.; Yan, S. Rethinking bottleneck structure for efficient mobile network design. In Proceedings of the Computer Vision-ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 680–697. [Google Scholar]
  29. Cui, C.; Gao, T.; Wei, S.; Du, Y.; Guo, R.; Dong, S.; Lu, B.; Zhou, Y.; Lv, X.; Liu, Q.; et al. PP-LCNet: A Lightweight CPU Convolutional Neural Network. arXiv 2021, arXiv:2109.15099. [Google Scholar]
  30. Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131. [Google Scholar]
  31. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  32. Qin, X.; Li, N.; Weng, C.; Su, D.; Li, M. Simple attention module based speaker verification with iterative noisy label detection. In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 6722–6726. [Google Scholar]
  33. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective kernel networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 510–519. [Google Scholar]
  34. Li, Y.; Yao, T.; Pan, Y.; Mei, T. Contextual transformer networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 1489–1500. [Google Scholar] [CrossRef]
  35. Misra, D.; Nalamada, T.; Arasanipalai, A.U.; Hou, Q. Rotate to attend: Convolutional triplet attention module. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 3139–3148. [Google Scholar]
  36. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11534–11542. [Google Scholar]
  37. Bolouri, F.; Kocoglu, Y.; Pabuayon, I.L.B.; Ritchie, G.L.; Sari-Sarraf, H. CottonSense: A high-throughput field phenotyping system for cotton fruit segmentation and enumeration on edge devices. Comput. Electron. Agric. 2024, 216, 108531. [Google Scholar] [CrossRef]
  38. Muruganantham, P.; Wibowo, S.; Grandhi, S.; Samrat, N.H.; Islam, N. A systematic literature review on crop yield prediction with deep learning and remote sensing. Remote Sens. 2022, 14, 1990. [Google Scholar] [CrossRef]
  39. Zhu, H.; Lin, C.; Liu, G.; Wang, D.; Qin, S.; Li, A.; Xu, J.-L.; He, Y. Intelligent agriculture: Deep learning in UAV-based remote sensing imagery for crop diseases and pests detection. Front. Plant Sci. 2024, 15, 1435016. [Google Scholar] [CrossRef]
  40. Kashyap, P.K.; Kumar, S.; Jaiswal, A.; Prasad, M.; Gandomi, A.H. Towards precision agriculture: IoT-enabled intelligent irrigation systems using deep learning neural network. IEEE Sens. J. 2021, 21, 17479–17491. [Google Scholar] [CrossRef]
  41. Kussul, N.; Lavreniuk, M.; Skakun, S.; Shelestov, A. Deep learning classification of land cover and crop types using remote sensing data. IEEE Geosci. Remote Sens. Lett. 2017, 14, 778–782. [Google Scholar] [CrossRef]
  42. Maimaitijiang, M.; Sagan, V.; Sidike, P.; Daloye, A.M.; Erkbol, H.; Fritschi, F.B. Crop monitoring using satellite/UAV data fusion and machine learning. Remote Sens. 2020, 12, 1357. [Google Scholar] [CrossRef]
  43. Navaneethan, S.; Sampath, J.L.; Kiran, S.S. Development of a Multi-Sensor Fusion Framework for Early Detection and Monitoring of Corn Plant Diseases. In Proceedings of the 2023 2nd International Conference on Automation, Computing and Renewable Systems (ICACRS), Pudukkottai, India, 11–13 December 2023; pp. 856–861. [Google Scholar]
Figure 1. Example images from the CottonWeedDet12 Dataset: (a) waterhemp; (b) morningglory; (c) purslane; (d) spotted spurge; (e) carpetweed; (f) ragweed; (g) eclipta; (h) prickly sida; (i) palmer amaranth; (j) sicklepod; (k) goosegrass; (l) cutleaf groundcherry.
Figure 2. Structure diagram of the YOLO-WL model.
Figure 3. MBConv network architecture diagram.
Figure 4. Network architecture diagram.
Figure 5. C2f_EMA model structure diagram.
Figure 6. Network structure of the improved feature fusion module.
Figure 7. Comparison of test results before and after model improvement: (a) original; (b) YOLOv8; (c) the improved model.
Figure 8. TensorRT acceleration test.
Table 1. Description of weeds in cotton field.

| English | Latin | Quantity |
| --- | --- | --- |
| Waterhemp | Amaranthus tuberculatus | 1967 |
| Morningglory | Ipomoea nil | 1372 |
| Purslane | Portulaca oleracea | 996 |
| Spotted Spurge | Euphorbia maculata | 1010 |
| Carpetweed | Trigastrotheca stricta | 995 |
| Ragweed | Ambrosia artemisiifolia | 917 |
| Eclipta | Eclipta prostrata | 877 |
| Prickly Sida | Sida spinosa | 509 |
| Palmer Amaranth | Amaranthus palmeri | 350 |
| Sicklepod | Senna obtusifolia | 249 |
| Goosegrass | Eleusine indica | 215 |
| Cutleaf Groundcherry | Physalis angulata | 144 |
Table 2. Experimental environment configuration.

| Environmental Parameter | Value |
| --- | --- |
| Operating system | CentOS |
| CPU | Intel(R) Xeon(R) CPU E5-2630 v4 (2.20 GHz, 10 cores) |
| Memory | 22 GB × 3 |
| GPU | NVIDIA Tesla P40 |
| Programming language | Python 3.8 |
| Experimental framework | PyTorch 2.0.1 |
| GPU acceleration library | CUDA 11.3 |
| Dataset | CottonWeedDet12 |
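For readers who want to reproduce a comparable setup, the sketch below shows how a YOLOv8 baseline could be trained with the Ultralytics Python API on an environment like that in Table 2. The dataset YAML name, epoch count, and image size are illustrative placeholders, not the authors' exact training settings.

```python
# Minimal training sketch using the Ultralytics YOLOv8 API (PyTorch backend).
# The dataset YAML, epochs, and image size are assumptions for illustration,
# not the configuration reported by the authors.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")            # pretrained YOLOv8-nano checkpoint
results = model.train(
    data="cottonweeddet12.yaml",      # hypothetical dataset config for CottonWeedDet12
    epochs=100,                       # placeholder training length
    imgsz=640,                        # input resolution
    device=0,                         # single GPU (e.g., the Tesla P40 in Table 2)
)
```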
Table 3. Experimental comparison of different lightweight backbone networks.

| Models | mAP (%) | Detection Time per Image (ms) | FPS | Model Size (MB) | Params (MB) |
| --- | --- | --- | --- | --- | --- |
| YOLOv8 | 91.98 | 12.7 | 78.70 | 5.9 | 11.45 |
| YOLO-MobileNeXt | 91.30 | 13.6 | 75.53 | 4.8 | 8.34 |
| YOLO-PP-LCNet | 89.90 | 2.3 | 434.78 | 4.1 | 7.24 |
| YOLO-ShuffleNetV2 | 79.30 | 3.1 | 322.58 | 3.3 | 22.5 |
| YOLO-EfficientNet | 91.80 | 1.7 | 588.24 | 4.5 | 7.97 |
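The FPS values reported in Tables 3–5 and 8 follow from the reciprocal of the per-image latency. The short helper below is only a sketch that makes this relationship explicit; it is not part of the authors' code.

```python
def fps_from_latency(latency_ms: float) -> float:
    """Frames per second implied by a per-image latency given in milliseconds."""
    return 1000.0 / latency_ms

# Example: YOLO-EfficientNet's 1.7 ms per image corresponds to roughly 588 FPS,
# matching the value reported in Table 3.
print(round(fps_from_latency(1.7), 2))  # 588.24
```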
Table 4. Experimental comparison of different attention mechanisms integrated into the EfficientNet backbone.

| Models | mAP (%) | Inference Time (ms) | FPS | Model Size (MB) | Params (MB) |
| --- | --- | --- | --- | --- | --- |
| YOLOv8 | 91.98 | 12.7 | 78.70 | 5.9 | 11.45 |
| EfficientB0-EMA | 92.30 | 3.5 | 385.71 | 4.6 | 8.36 |
| EfficientB0-CBAM | 91.50 | 1.9 | 526.32 | 4.6 | 7.98 |
| EfficientB0-SE | 91.90 | 2.0 | 500.00 | 4.5 | 8.36 |
| EfficientB0-SimAM | 91.80 | 2.5 | 400.00 | 4.5 | 8.35 |
| EfficientB0-SK | 90.70 | 2.0 | 500.00 | 5.0 | 9.16 |
| EfficientB0-CoT | 92.40 | 2.7 | 370.37 | 4.6 | 8.44 |
| EfficientB0-CA | 92.10 | 2.1 | 476.19 | 4.6 | 8.36 |
Table 5. Experimental comparison of C2f modules improved with different attention mechanisms.

| Models | mAP (%) | Inference Time (ms) | FPS | Model Size (MB) | Params (MB) |
| --- | --- | --- | --- | --- | --- |
| YOLOv8 | 91.98 | 12.7 | 78.70 | 5.9 | 11.45 |
| C2f_SE | 91.10 | 3.0 | 333.33 | 4.6 | 7.98 |
| C2f_CA | 91.50 | 1.9 | 526.32 | 4.6 | 7.98 |
| C2f_ECA | 91.50 | 19.2 | 52.08 | 4.6 | 7.99 |
| C2f_Triplet | 91.90 | 2.4 | 416.67 | 4.6 | 7.96 |
| C2f_EMA | 91.40 | 1.9 | 526.32 | 5.0 | 8.74 |
| YOLO-WL | 92.30 | 1.9 | 526.32 | 4.6 | 7.98 |
Table 6. Ablation experiment results of the improved cotton field detection model.

| Models | EfficientNet | CA | C2f_EMA | mAP (%) | Inference Time (ms) | Model Size (MB) |
| --- | --- | --- | --- | --- | --- | --- |
| YOLOv8 | X | X | X | 91.98 | 12.70 | 5.9 |
| YOLO-EfficientNet | ✓ | X | X | 91.80 | 1.70 | 4.5 |
| YOLO-CA | X | ✓ | X | 92.10 | 2.70 | 4.6 |
| YOLO-C2f_EMA-AFPN | ✓ | X | ✓ | 91.40 | 1.90 | 5.0 |
| YOLO-WL | ✓ | ✓ | ✓ | 92.30 | 1.90 | 4.6 |
Table 7. Results of the weed detection model for different weed categories.

| Weeds | mAP (%) | Precision (%) | Recall (%) |
| --- | --- | --- | --- |
| Waterhemp | 81.5 | 92.4 | 55.7 |
| Morningglory | 90.0 | 93.0 | 86.4 |
| Purslane | 96.8 | 94.4 | 92.0 |
| Spotted Spurge | 89.8 | 84.8 | 81.2 |
| Carpetweed | 94.9 | 96.1 | 85.8 |
| Ragweed | 85.7 | 96.3 | 76.1 |
| Eclipta | 92.9 | 97.1 | 87.9 |
| Prickly Sida | 99.5 | 96.2 | 100.0 |
| Palmer Amaranth | 93.6 | 93.5 | 90.4 |
| Sicklepod | 93.5 | 90.9 | 94.4 |
| Goosegrass | 92.5 | 92.6 | 90.9 |
| Cutleaf Groundcherry | 96.6 | 97.9 | 89.8 |
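The per-class precision and recall in Table 7 follow their standard detection definitions: precision is the share of predicted boxes that are correct, and recall is the share of ground-truth instances that are found. The snippet below is a small reference sketch of those formulas with hypothetical counts, not output from the authors' evaluation pipeline.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of predicted weed boxes that match a ground-truth instance."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Fraction of ground-truth weed instances that are detected."""
    return tp / (tp + fn)

# Hypothetical example: 92 true positives, 8 false positives, 20 missed instances
# give precision 0.92 and recall about 0.82.
print(round(precision(92, 8), 2), round(recall(92, 20), 2))
```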
Table 8. Performance results of different detection models on the test set.

| Models | mAP (%) | Inference Time (ms) | FPS | Model Size (MB) | Params (MB) |
| --- | --- | --- | --- | --- | --- |
| YOLOv3 | 93.10 | 54.2 | 18.45 | 791.0 | 395.13 |
| YOLOv5 | 91.80 | 12.7 | 78.74 | 5.0 | 9.56 |
| YOLOv7-tiny | 92.87 | 25.5 | 39.22 | 11.8 | 23.04 |
| YOLOv7 | 92.65 | 120.4 | 8.31 | 285.0 | 139.34 |
| YOLOv8 | 91.98 | 12.7 | 78.70 | 5.9 | 11.45 |
| YOLO-WL | 92.30 | 1.9 | 526.32 | 4.6 | 7.98 |
Table 9. Video inference performance of the models before and after TensorRT acceleration.

| Models | mAP (%) | Inference Time per Frame (ms) |
| --- | --- | --- |
| YOLOv8 | 91.98 | 30.851 |
| YOLO-WL | 92.3 | 23.134 |
| YOLO-WL-RT | 92.1 | 2.443 |
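The TensorRT speed-up summarized in Table 9 can, in principle, be reproduced by exporting the trained detector to a TensorRT engine and running video inference with the engine. The sketch below uses the Ultralytics export interface; the weight and video file names and the FP16 flag are assumptions for illustration, not the authors' exact deployment pipeline.

```python
# Sketch of TensorRT deployment via the Ultralytics export interface.
# File names and the half-precision flag are assumptions; the authors' exact
# conversion settings are not specified here.
from ultralytics import YOLO

model = YOLO("yolo_wl_best.pt")                          # hypothetical trained YOLO-WL weights
engine_path = model.export(format="engine", half=True)   # build a TensorRT engine (FP16)

trt_model = YOLO(engine_path)                            # reload the engine for inference
for result in trt_model.predict(source="cotton_field.mp4", stream=True):
    pass  # each result holds the detections for one video frame
```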