1. Introduction
Cotton is a vital crop in China, closely tied to the daily life of the nation; China’s seed cotton production reached 18.122 million tons in 2022, ranking first in the world [1]. Within the cotton plant, top buds play a crucial role in shaping the growth trajectory and yield [2]. Cotton topping is critical for reducing the growth of ineffective branches, regulating nutrient distribution, and promoting early and increased boll setting while minimizing shedding.
Cotton topping is typically performed through three methods: manual topping [3], chemical topping [4], and mechanical topping [5]. Manual topping is labor-intensive and inefficient; chemical topping is prone to environmental pollution; and mechanical topping is prone to missed and misplaced topping. Domestic research institutes have investigated mechanical equipment for cotton topping [6,7], focusing largely on automating control of the operations. In the complex environment of a cotton field, factors such as variations in lighting conditions, differences in cotton plant growth, weed growth, and changes in the size and color of cotton top buds can significantly affect the accuracy of detection and spatial positioning.
The ongoing advancements in artificial intelligence and deep learning have led to substantial progress in object detection with convolutional neural networks (CNNs). Mainstream object detection techniques fall into two categories: two-stage and one-stage methods. Two-stage methods mainly include R-CNN [8], Faster R-CNN [9], and Cascade R-CNN [10]; these algorithms offer higher detection accuracy but slower detection speeds. One-stage methods mainly include SSD [11], RetinaNet [12], and the YOLO series [13]; these algorithms offer moderate accuracy but fast detection speeds and are widely used in many fields.
The YOLO algorithm has seen numerous and diverse applications [14,15]. However, keeping a model lightweight while ensuring high detection accuracy remains a research challenge. To design a lightweight target detector for vehicle-mounted applications, Chen Xue et al. [16] proposed a Sparsely Connected Asymptotic Feature Pyramid Network (SCAFPN). Jin Gao et al. [17] used a cross-layer feature fusion network to keep a model lightweight for detecting cherry tomatoes in unstructured environments. While these methods can effectively decrease the number of model parameters, significant modifications to the model structure may lead to performance degradation or require extensive tuning.
In recent years, researchers have proposed various crop detection methods [18,19,20]. Traditional approaches rely on the color, shape, texture, and other appearance cues of the crops, extracting features with hand-crafted algorithms for crop recognition. For example, Longsheng Fu et al. [21] employed RGB and depth features in an R-CNN-based approach to detect apples in densely foliated fruit wall trees, facilitating robotic harvesting. Guichao Lin et al. [22] developed a reliable algorithm based on Red-Green-Blue-Depth (RGB-D) images for detecting and localizing citrus in real outdoor orchard environments for robotic picking. However, the small size of cotton top buds, along with leaf shading, light conditions, and uneven growth, poses challenges for cotton top bud detection.
Numerous researchers have conducted extensive studies on the detection and identification of small targets in complex environments [23,24,25], making significant progress. Yifan Bai et al. [26] proposed a real-time recognition algorithm (Improved YOLO) to accurately identify strawberry seedlings, addressing the small size of flowers and fruits, their similar colors, and overlapping occlusion. Yanxu Wu et al. [27] developed an enhanced end-to-end RGB-D multimodal object detection network for tea bud detection based on YOLOv7, which achieves an AP50 of 91.12% on complex outdoor tea images. For cotton topping, object detection must maintain accuracy on small objects, improve detection speed, and keep the model lightweight to support its migration to and deployment on cotton topping machinery.
Over the past few years, numerous improved object detection models have been introduced to tackle the difficulties associated with detecting cotton top buds [28,29]. To address the problem of cotton top bud detection, Peng Song et al. [30] proposed an improved Cascade R-CNN to detect cotton top bud regions in RGB images, and the three-dimensional (3D) coordinates of objects were obtained by combining the color and depth images from RGB-D cameras. C. Wang et al. [31] drastically reduced the parameter count of the YOLOv3 model by applying depthwise separable convolution and enhanced the model’s ability to learn multi-scale features through a hierarchical multi-scale approach. Xuan Peng et al. [32] added an object detection layer to the YOLOv5s structure, incorporating the CPP-CBAM attention mechanism with the SIoU bounding box regression loss function to improve cotton top bud detection accuracy. While these methods enhance detection accuracy to some extent, they suffer from slow detection speeds and do not account for the shooting angles of the cotton top buds.
To address the limitations and shortcomings of existing research, this paper focuses on cotton top bud recognition in complex environments and proposes an accurate, real-time recognition algorithm for natural environments based on the YOLOv8n object detection algorithm. This approach effectively handles variations in angle, shape, and occlusion. The research improves the detection accuracy of small objects by modifying the YOLOv8n network structure, including an improved loss function, lightweight convolution, and a fused feature pyramid network. Ultimately, the proposed object detection model not only enhances detection accuracy but also lays a robust foundation for future research.
The key contributions of this study are as follows:
- (1)
We propose replacing the C2f module of the YOLOv8n backbone network with the Cross-Stage Partial Networks and Partial Convolution (CSPPC) lightweight module to reduce redundant computations and optimize memory access.
- (2)
The neck network employs the Efficient Reparameterized Generalized-FPN (Efficient RepGFPN) to achieve high-precision detection without significantly increasing computational cost.
- (3)
The study introduces the Inner CIoU loss function to compute regression loss, regulating the generation of auxiliary bounding boxes with a scale factor ratio to expedite convergence. The enhanced model’s effectiveness in detecting cotton top buds under natural conditions has been validated through experiments, offering technical support for the advancement of intelligent cotton topping machinery.
The method was assessed and benchmarked against existing techniques using a cotton top bud dataset. The results indicate that the proposed method attains higher precision, recall, AP50, and F1 scores while maintaining real-time processing speed, significantly enhancing detection performance compared to existing methods.
The structure of the remaining sections of the paper is as follows:
Section 2 elaborates on each module of the proposed Bud-YOLO model.
Section 3 presents the experimental setup, results, and discussion.
Section 4 summarizes the paper and highlights its main contributions.
2. Materials and Methods
2.1. Dataset Sample Collection
The cotton top bud dataset used in this paper was obtained from a cotton field in the 10th Regiment of Alaer City, Xinjiang, China. The images were collected with a HUAWEI P50E smartphone (manufactured by Huawei, Shenzhen, Guangdong Province, China) from mid-June to mid-July 2022 and had a resolution of 4069 × 3072 pixels.
During image data acquisition, two shooting angles were used, namely a top shot and a side shot, and top buds with less than 30% occlusion were selected for photography. Beyond the objective factors at acquisition time, the shape of the cotton top bud itself is complex: its morphology varies across developmental stages, as shown in Figure 1. The cotton top bud is smaller and lighter in color in the early stage of development and plumper and darker in color in the later stage.
The collected images of cotton top buds encompass different angles, occlusion situations, and morphologies to ensure data diversity and enhance the model’s robustness. Together, these samples constitute the dataset, with a total of 800 raw images collected. To verify the effectiveness of the model training, additional cotton top bud images were collected from mid-June to late June 2023; however, these images were not included in the training, validation, or test sets and were used only for prediction.
2.2. Annotation Alteration of the Dataset
We focused on cotton top bud detection under natural conditions. Manual labeling is necessary before training on the cotton top bud image data, for which we used the LabelImg tool [33]. Each cotton top bud was boxed and labeled as “bud”, with the annotation recording the location and category of each bud. The labeled files were saved in PASCAL VOC format as XML files. Following annotation, the annotations were converted to the YOLO dataset format.
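For reference, the VOC (XML) to YOLO (txt) conversion step can be sketched as follows; the directory layout, file names, and single-class mapping are illustrative assumptions rather than the exact scripts used in this study.

```python
# Minimal sketch of converting a PASCAL VOC XML annotation to a YOLO label file.
import xml.etree.ElementTree as ET
from pathlib import Path

CLASS_MAP = {"bud": 0}  # single "bud" class, as in this dataset

def voc_to_yolo(xml_path: str, out_dir: str) -> None:
    root = ET.parse(xml_path).getroot()
    img_w = float(root.find("size/width").text)
    img_h = float(root.find("size/height").text)
    lines = []
    for obj in root.iter("object"):
        cls_id = CLASS_MAP[obj.find("name").text]
        box = obj.find("bndbox")
        xmin, ymin = float(box.find("xmin").text), float(box.find("ymin").text)
        xmax, ymax = float(box.find("xmax").text), float(box.find("ymax").text)
        # YOLO format: class x_center y_center width height, normalized to [0, 1]
        xc, yc = (xmin + xmax) / (2 * img_w), (ymin + ymax) / (2 * img_h)
        bw, bh = (xmax - xmin) / img_w, (ymax - ymin) / img_h
        lines.append(f"{cls_id} {xc:.6f} {yc:.6f} {bw:.6f} {bh:.6f}")
    out_file = Path(out_dir) / (Path(xml_path).stem + ".txt")
    out_file.write_text("\n".join(lines))
```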
2.3. Dataset Augmentation
The sample size of the cotton top bud dataset was insufficient for the model to converge during training. To improve the generalization of the model and prevent overfitting caused by a lack of training data, we applied data augmentation techniques, including brightness and contrast adjustments, as shown in Figure 2, resulting in a total of 4000 image samples.
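As an illustration, a brightness/contrast adjustment of this kind can be implemented with OpenCV as in the sketch below; the gain and offset ranges are assumed values for the example, not the settings used to build the dataset.

```python
# Sketch of random brightness/contrast augmentation with OpenCV.
import random
import cv2
import numpy as np

def adjust_brightness_contrast(image: np.ndarray) -> np.ndarray:
    alpha = random.uniform(0.6, 1.4)   # contrast gain (assumed range)
    beta = random.uniform(-40, 40)     # brightness offset (assumed range)
    # Computes clip(alpha * image + beta) in the valid 8-bit range
    return cv2.convertScaleAbs(image, alpha=alpha, beta=beta)

# Example: generate several augmented copies of one raw image
# img = cv2.imread("raw/bud_0001.jpg")
# for i in range(4):
#     cv2.imwrite(f"aug/bud_0001_{i}.jpg", adjust_brightness_contrast(img))
```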
2.4. Dividing the Dataset
The labeled dataset was partitioned into training, validation, and test subsets with a ratio of 8:1:1. This distribution yielded 3200 images for training, 400 images for validation, and 400 images for testing.
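A simple way to realize this 8:1:1 split is to shuffle the image list once and slice it, as in the following sketch; the paths and fixed random seed are illustrative assumptions, not the exact procedure used in the study.

```python
# Sketch of an 8:1:1 random split into train/val/test file lists.
import random
from pathlib import Path

images = sorted(Path("dataset/images").glob("*.jpg"))
random.seed(0)
random.shuffle(images)

n = len(images)                       # 4000 images in this study
n_train, n_val = int(0.8 * n), int(0.1 * n)
splits = {
    "train": images[:n_train],                    # 3200 images
    "val":   images[n_train:n_train + n_val],     # 400 images
    "test":  images[n_train + n_val:],            # 400 images
}
for name, files in splits.items():
    Path(f"{name}.txt").write_text("\n".join(str(f) for f in files))
```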
2.5. Bud-YOLO Model for the Detection of Cotton Top Buds
2.5.1. Selection of the YOLOv8 Model
Ultralytics released the YOLOv8 algorithm in January 2023 [34], representing a significant technological improvement over the YOLOv5 object detection model. There are five versions of YOLOv8: YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x, which progressively increase in width and depth. Considering the model size and its deployability on mobile platforms, the YOLOv8n model was chosen as the base for the experiments.
The YOLOv8 model consists of three main components: a backbone, a neck, and a head. The backbone utilizes the C2f and SPPF modules for feature extraction, adjusting the number of channels to improve the efficiency of this process. The neck network retains the PAN-FPN architecture from the YOLOv5 model to achieve bidirectional fusion of low- and high-level features, thereby improving target detection across multiple scales. The head network comprises three detection layers that detect features of varying scales generated by the neck network, employing an anchor-free approach to improve detection accuracy and flexibility across multiple scales. The network structure of YOLOv8 is illustrated in Figure 3.
2.5.2. Bud-YOLO Network Structure
For the cotton top bud dataset, we propose a lightweight detection algorithm: Bud-YOLO. The algorithm reduces the model size and improves the computational efficiency while increasing AP50 in object detection and reducing false detections and omissions, thereby enhancing model robustness.
The backbone network uses the CSPPC module, reducing redundant computations and memory access with minimal impact on detection accuracy. The neck network employs an Efficient Reparameterized Generalized-FPN (Efficient RepGFPN) to ensure high-accuracy detection without significantly increasing computational cost. Finally, the Inner CIoU function is introduced to compute the regression loss, with auxiliary bounding boxes generated based on the scale factor ratio to compute the loss and accelerate convergence. The network structure of Bud-YOLO is illustrated in Figure 4.
2.5.3. CSPPC Module
In this study, the CSPPC module proposed by Liu [35], which places two PConv [36] layers in series on its output path, was used to replace conventional convolution and reduce the number of parameters. This module replaces the conventional C2f module and is incorporated into the algorithm’s backbone network. This integration removes redundant channel features, minimizes computational redundancy and memory accesses, accelerates detection, and enables more efficient extraction of spatial features. Suppose the input feature map has height $h$, width $w$, and $c$ channels, the convolution kernel is of size $k \times k$, and the output retains the same spatial size and number of channels. Then, the FLOPs of a regular convolution are given by Equation (1) and its memory accesses by Equation (2):

$$\mathrm{FLOPs}_{\mathrm{Conv}} = h \times w \times k^{2} \times c^{2} \quad (1)$$

$$\mathrm{MAC}_{\mathrm{Conv}} = h \times w \times 2c + k^{2} \times c^{2} \approx h \times w \times 2c \quad (2)$$

To maintain contiguous memory access, PConv convolves only $c_p$ consecutive channels taken from the front or back segment of the feature map, treating them as representative of the entire feature map and leaving the remaining channels untouched. The FLOPs of PConv are then given by Equation (3), and its memory accesses by Equation (4):

$$\mathrm{FLOPs}_{\mathrm{PConv}} = h \times w \times k^{2} \times c_{p}^{2} \quad (3)$$

$$\mathrm{MAC}_{\mathrm{PConv}} = h \times w \times 2c_{p} + k^{2} \times c_{p}^{2} \approx h \times w \times 2c_{p} \quad (4)$$
The architecture of the CSPPC module is depicted in Figure 5. The CSPPC module significantly reduces the model size, facilitating seamless deployment on mobile devices and lowering the associated device development costs.
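The partial convolution at the core of the CSPPC module can be summarized by the following PyTorch sketch, which convolves only the first c/n_div channels and passes the remaining channels through unchanged; it illustrates the PConv idea from [36] rather than the exact CSPPC implementation.

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    """Partial convolution: apply a k x k convolution to only c_p = dim / n_div
    consecutive channels and leave the remaining channels untouched."""
    def __init__(self, dim: int, n_div: int = 4, kernel_size: int = 3):
        super().__init__()
        self.dim_conv = dim // n_div          # c_p channels that are convolved
        self.dim_untouched = dim - self.dim_conv
        self.conv = nn.Conv2d(self.dim_conv, self.dim_conv, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.split(x, [self.dim_conv, self.dim_untouched], dim=1)
        return torch.cat((self.conv(x1), x2), dim=1)

# Example: a 64-channel feature map keeps its shape, but only 16 channels are convolved
# x = torch.randn(1, 64, 80, 80); y = PConv(64)(x)  # y.shape == (1, 64, 80, 80)
```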
2.5.4. Efficient RepGFPN Module
In a Feature Pyramid Network (FPN), the purpose of multi-scale feature fusion is to combine features from various layers of the backbone network, enhancing the expressiveness of the output features and improving model performance. Traditional FPNs introduce a top-down path to fuse multi-scale features.
In this paper, we utilize the Efficient RepGFPN proposed by Xianzhe Xu et al. [37]. The Efficient RepGFPN enhances the FPN concept for object detection by fusing multi-scale features more efficiently, capturing both high-level semantics and low-level spatial details. Its main improvements include the following:
- (1)
Adopting different channel dimensions for feature maps at different scales, optimizing performance within computational resource constraints.
- (2)
Reducing latency by eliminating the additional up-sampling operation in Queen-Fusion.
- (3)
Combining CSPNet with an Efficient Layer Aggregation Network (ELAN) and reparameterization to improve feature fusion without significantly increasing computational requirements (the reparameterization idea is illustrated in the sketch below).
The architecture of the Efficient RepGFPN network is shown in Figure 6.
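The reparameterization mentioned in point (3) can be illustrated with a minimal two-branch block: during training, a 3 × 3 and a 1 × 1 convolution run in parallel, and at inference their kernels are merged into a single 3 × 3 convolution. The sketch below omits batch normalization and is not the full Efficient RepGFPN fusion block.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepBranch(nn.Module):
    """Training-time block with parallel 3x3 and 1x1 convolutions."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1, bias=True)
        self.conv1 = nn.Conv2d(channels, channels, 1, bias=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv3(x) + self.conv1(x)

    def reparameterize(self) -> nn.Conv2d:
        """Merge both branches into one 3x3 convolution for inference."""
        fused = nn.Conv2d(self.conv3.in_channels, self.conv3.out_channels,
                          3, padding=1, bias=True)
        # Pad the 1x1 kernel to 3x3 and add it to the 3x3 kernel
        kernel = self.conv3.weight.data + F.pad(self.conv1.weight.data, [1, 1, 1, 1])
        bias = self.conv3.bias.data + self.conv1.bias.data
        fused.weight.data.copy_(kernel)
        fused.bias.data.copy_(bias)
        return fused

# The fused convolution produces the same output with a single kernel:
# block = RepBranch(32); x = torch.randn(1, 32, 40, 40)
# assert torch.allclose(block(x), block.reparameterize()(x), atol=1e-5)
```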
2.5.5. Improved Loss Function
The CIoU loss function in the YOLOv8n model effectively captures geometric differences in bounding boxes, thereby enhancing the model’s localization accuracy. However, it exhibits slower convergence and higher loss values when applied to the cotton top bud dataset, primarily because bud shapes vary considerably. The CIoU loss function is defined as shown in Equation (8), with the aspect-ratio term and its weight given by Equations (6) and (7):

$$v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2} \quad (6)$$

$$\alpha = \frac{v}{\left(1 - IoU\right) + v} \quad (7)$$

$$L_{CIoU} = 1 - IoU + \frac{\rho^{2}\left(b, b^{gt}\right)}{c^{2}} + \alpha v \quad (8)$$

where the actual bounding box and the anchor box are denoted as $B^{gt}$ and $B$, respectively, and $IoU$ is their intersection over union. In Equations (6)–(8), the width and height of the actual bounding box are denoted as $w^{gt}$ and $h^{gt}$, respectively, and those of the anchor box as $w$ and $h$; $v$ measures the consistency of the aspect ratio; $\alpha$ denotes the equilibrium parameter; $b$ denotes the prediction box; $b^{gt}$ denotes the labeled box; $c$ is the diagonal length of the smallest box enclosing both the predicted and true boxes; $\rho$ denotes the Euclidean distance between the centroids of the prediction box and the labeled box; and $L_{CIoU}$ denotes the CIoU loss function. From Equation (6), it is evident that when the aspect ratios of the predicted and labeled boxes are identical, $v$ equals 0. At this point, the effectiveness of the CIoU loss function is limited, leading to varying sensitivity to objects of different scales, which is particularly unfavorable for small-target localization. Because cotton top bud images contain many small targets, this loss function easily leads to missed detections.
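For concreteness, Equations (6)–(8) can be implemented as follows; this is a self-contained sketch for boxes in center format (x_c, y_c, w, h), not the loss implementation inside YOLOv8.

```python
import math
import torch

def ciou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7):
    """CIoU loss of Equations (6)-(8) for boxes in (x_c, y_c, w, h) format.
    Returns the loss and the plain IoU (the latter is reused by Inner CIoU)."""
    px, py, pw, ph = pred.unbind(-1)
    tx, ty, tw, th = target.unbind(-1)
    # Corner coordinates of both boxes
    p_l, p_r, p_t, p_b = px - pw / 2, px + pw / 2, py - ph / 2, py + ph / 2
    t_l, t_r, t_t, t_b = tx - tw / 2, tx + tw / 2, ty - th / 2, ty + th / 2
    # Intersection over union
    inter = (torch.min(p_r, t_r) - torch.max(p_l, t_l)).clamp(0) * \
            (torch.min(p_b, t_b) - torch.max(p_t, t_t)).clamp(0)
    union = pw * ph + tw * th - inter + eps
    iou = inter / union
    # Squared center distance over squared diagonal of the enclosing box
    cw = torch.max(p_r, t_r) - torch.min(p_l, t_l)
    ch = torch.max(p_b, t_b) - torch.min(p_t, t_t)
    rho2 = (px - tx) ** 2 + (py - ty) ** 2
    c2 = cw ** 2 + ch ** 2 + eps
    # Aspect-ratio term v (Eq. 6) and its weight alpha (Eq. 7)
    v = (4 / math.pi ** 2) * (torch.atan(tw / th) - torch.atan(pw / ph)) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v, iou
```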
To address the aforementioned issues, this paper incorporates the Inner IoU loss function proposed by Hao Zhang et al. [38]. This method accelerates convergence by utilizing an auxiliary bounding box without introducing any additional loss terms. By distinguishing different regression samples and using various ratios of auxiliary bounding boxes to calculate the loss, the bounding box regression process can be effectively accelerated.
As shown in Figure 7, $(x_{c}^{gt}, y_{c}^{gt})$ denotes the centroid of the actual bounding box and of the Inner actual bounding box, and $(x_{c}, y_{c})$ denotes the centroid of the anchor box and of the Inner anchor box. The variable $ratio$ is the scale factor, which typically ranges from 0.5 to 1.5.
The Inner CIoU loss replaces the standard CIoU loss in the original loss function and is defined by Equations (9)–(13):

$$b_{l}^{gt} = x_{c}^{gt} - \frac{w^{gt} \cdot ratio}{2}, \quad b_{r}^{gt} = x_{c}^{gt} + \frac{w^{gt} \cdot ratio}{2}, \quad b_{t}^{gt} = y_{c}^{gt} - \frac{h^{gt} \cdot ratio}{2}, \quad b_{b}^{gt} = y_{c}^{gt} + \frac{h^{gt} \cdot ratio}{2} \quad (9)$$

$$b_{l} = x_{c} - \frac{w \cdot ratio}{2}, \quad b_{r} = x_{c} + \frac{w \cdot ratio}{2}, \quad b_{t} = y_{c} - \frac{h \cdot ratio}{2}, \quad b_{b} = y_{c} + \frac{h \cdot ratio}{2} \quad (10)$$

$$inter = \left(\min\left(b_{r}^{gt}, b_{r}\right) - \max\left(b_{l}^{gt}, b_{l}\right)\right) \times \left(\min\left(b_{b}^{gt}, b_{b}\right) - \max\left(b_{t}^{gt}, b_{t}\right)\right) \quad (11)$$

$$union = w^{gt} \times h^{gt} \times ratio^{2} + w \times h \times ratio^{2} - inter \quad (12)$$

$$IoU^{inner} = \frac{inter}{union}, \qquad L_{Inner\text{-}CIoU} = L_{CIoU} + IoU - IoU^{inner} \quad (13)$$

In Equations (9)–(13), $b_{l}^{gt}$ represents the transverse coordinate of the auxiliary bounding box’s left boundary, while $b_{r}^{gt}$ denotes its right boundary. The scaling factor $ratio$ controls the size of the auxiliary boxes. The longitudinal coordinates of the auxiliary bounding box’s lower and upper boundaries are represented by $b_{b}^{gt}$ and $b_{t}^{gt}$, respectively. Similarly, $b_{l}$ and $b_{r}$ represent the transverse coordinates of the left and right boundaries of the auxiliary anchor box, while $b_{b}$ and $b_{t}$ correspond to the longitudinal coordinates of its lower and upper boundaries. The term $inter$ refers to the area where the auxiliary anchor box intersects the auxiliary bounding box, and $union$ describes the merged area of these two regions. The IoU of Inner IoU is denoted by $IoU^{inner}$, and $L_{Inner\text{-}CIoU}$ represents the Inner CIoU loss function.
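Equations (9)–(13) translate into a few extra tensor operations on top of the CIoU sketch above; again, this is an illustrative sketch for center-format boxes, not the training code of Bud-YOLO.

```python
import torch

def inner_iou(pred: torch.Tensor, target: torch.Tensor,
              ratio: float = 1.0, eps: float = 1e-7) -> torch.Tensor:
    """Inner IoU of Equations (9)-(12): IoU of the auxiliary boxes scaled by `ratio`."""
    px, py, pw, ph = pred.unbind(-1)
    tx, ty, tw, th = target.unbind(-1)
    # Auxiliary anchor box and auxiliary ground-truth box boundaries (Eqs. 9-10)
    p_l, p_r = px - pw * ratio / 2, px + pw * ratio / 2
    p_t, p_b = py - ph * ratio / 2, py + ph * ratio / 2
    t_l, t_r = tx - tw * ratio / 2, tx + tw * ratio / 2
    t_t, t_b = ty - th * ratio / 2, ty + th * ratio / 2
    # Intersection and union of the auxiliary boxes (Eqs. 11-12)
    inter = (torch.min(p_r, t_r) - torch.max(p_l, t_l)).clamp(0) * \
            (torch.min(p_b, t_b) - torch.max(p_t, t_t)).clamp(0)
    union = pw * ph * ratio ** 2 + tw * th * ratio ** 2 - inter + eps
    return inter / union

def inner_ciou_loss(pred, target, ratio: float = 1.0):
    """Inner CIoU loss (Eq. 13), reusing the ciou_loss sketch shown earlier."""
    l_ciou, iou = ciou_loss(pred, target)
    return l_ciou + iou - inner_iou(pred, target, ratio)
```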
2.6. Performance Evaluation Indicators
Model performance is typically evaluated based on three key factors: accuracy, real-time processing capability, and complexity. The commonly used accuracy metrics for target detection models include precision (P), recall (R), F1 score (F1), average precision (AP), and mean average precision (mAP), which are defined as follows:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times P \times R}{P + R}$$

$$AP = \int_{0}^{1} P(R)\,dR, \qquad mAP = \frac{1}{n}\sum_{i=1}^{n} AP_{i}$$
TP represents the number of actual positive samples correctly predicted as positive, while FP denotes the number of actual negative samples predicted as positive. FN indicates the number of actual positive samples predicted as negative. P is the proportion of predicted positive samples that are actually positive. R represents the proportion of actual positive samples that are correctly predicted by the model. To balance the trade-off between precision and recall, the F1 score was introduced. AP represents the average precision for a specific class of targets across various recall points, corresponding to the area under the Precision–Recall (PR) curve. mAP is the average of the AP values across n target classes. In this study, n was set to 1. The average precision is expressed in terms of AP, specifically AP50 when the IoU threshold is set to 0.5.
Real-time performance was evaluated using Frames Per Second (FPS). Higher FPS values indicate better real-time detection performance. These metrics were used to evaluate the model’s performance in detecting cotton top buds. Complexity metrics include FLOPs and model size, the latter referring to the size of the best model after training.
3. Results and Discussion
3.1. Experimental Platform
The hardware environment of the server platform comprised an Intel(R) Xeon(R) Gold 6152 CPU (manufactured by Intel, Santa Clara, CA, USA) and an NVIDIA GeForce RTX 3090 (24 GB) GPU (manufactured by NVIDIA, Santa Clara, CA, USA). The software environment consisted of the Linux operating system, CUDA 12.1, the Python 3.10 programming language, and the PyTorch 2.3.1 deep learning framework.
3.2. Experimental Parameters
The model received images with dimensions of 640 × 640 pixels as inputs. To optimize performance while considering the parameters, computational requirements, and memory usage associated with networks of varying depths and widths, the hyperparameters were set as follows: the number of epochs was 150, the batch size was 32, the number of workers was 8, the initial learning rate was 0.01, the weight decay coefficient was 0.0005, the momentum parameter was 0.937, and the optimizer was Adam. To mitigate overfitting, we implemented an early stopping mechanism that terminated training if AP50 failed to show significant improvement over 30 consecutive iterations.
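For reference, a training run with these hyperparameters could be launched through the Ultralytics Python API roughly as follows. The dataset YAML and model configuration names are placeholders, and Ultralytics’ built-in `patience` option stops on stagnating validation fitness rather than on AP50 alone, so it only approximates the early-stopping rule described above.

```python
from ultralytics import YOLO

# Placeholder files: "bud-yolo.yaml" would describe the modified architecture,
# "cotton_buds.yaml" the dataset splits; neither name comes from the paper.
model = YOLO("bud-yolo.yaml")

model.train(
    data="cotton_buds.yaml",
    imgsz=640,           # 640 x 640 input resolution
    epochs=150,
    batch=32,
    workers=8,
    optimizer="Adam",
    lr0=0.01,            # initial learning rate
    weight_decay=0.0005,
    momentum=0.937,
    patience=30,         # stop if validation metrics stop improving
)
```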
3.3. Comparative Performance Analysis against Alternative Models
Based on the cotton top bud dataset, the Bud-YOLO model was compared with six target detection algorithms: YOLOv5s, YOLOv7-tiny, YOLOv8n, YOLOv9T, YOLOv10n, and Faster R-CNN. All models underwent training and validation in a controlled experimental environment. The detection results are presented in Table 1.
The comparison experiments indicate that the AP50 of the Bud-YOLO model was 2.3%, 13.9%, 0.1%, 1.9%, 0.5%, and 2.5% higher than those of YOLOv5s, YOLOv7-tiny, YOLOv8n, YOLOv9T, YOLOv10n, and Faster R-CNN, respectively. The recall of the Bud-YOLO model exceeded these models by 6.5%, 21.2%, 0.1%, 9.3%, 4.9%, and 37.5%, respectively. Similarly, the F1 score of the Bud-YOLO model was 5.6%, 17.5%, 0.9%, 5.5%, 3.1%, and 23.2% higher than those of YOLOv5s, YOLOv7-tiny, YOLOv8n, YOLOv9T, YOLOv10n, and Faster R-CNN. The precision of the Bud-YOLO model was 0.977, with a recall of 0.99, an AP50 of 0.992, an F1 score of 0.983, and an FPS of 69.3. Although its FPS was lower than that of YOLOv8n and YOLOv10n, the Bud-YOLO model outperformed them in all other evaluation metrics, making it more suitable for detecting cotton top buds.
Figure 8 illustrates examples of the detection results obtained using the Bud-YOLO model. It demonstrates that the model can effectively detect cotton top buds under varying brightness levels, shading conditions, and shapes.
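For illustration, detections like those in Figure 8 can be produced with the Ultralytics prediction API as follows; the weight file and image name are placeholders rather than artifacts released with this work.

```python
from ultralytics import YOLO

# "bud_yolo_best.pt" stands in for the trained Bud-YOLO weights.
model = YOLO("bud_yolo_best.pt")
results = model.predict("cotton_field_sample.jpg", imgsz=640, conf=0.25)
for r in results:
    for box in r.boxes:
        # Pixel coordinates (x1, y1, x2, y2) and confidence of each detected top bud
        print(box.xyxy[0].tolist(), float(box.conf[0]))
```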
3.4. Effect of Inner CIoU Loss on the Model
To evaluate the impact of the Inner CIoU loss function on cotton top bud detection in Bud-YOLO, we trained the Bud-YOLO model with both the CIoU and Inner CIoU loss functions. The optimal weight files were then obtained for comparative analysis, and the results are shown in Table 2. Compared with the CIoU loss function, the model using the Inner CIoU (ratio = 1.0) loss function showed improvements of 0.1%, 2.4%, 0.3%, and 1.3% in P, R, AP50, and F1 score, respectively. The combined benefit of incorporating the Inner CIoU loss function in model training is therefore significant.
Additionally, this study verifies the convergence of the Bud-YOLO loss function.
Figure 9 illustrates the bounding box loss curves over the training iterations. The four curves represent the loss obtained with CIoU and with Inner CIoU at different scale factors.
As observed in Figure 9, all four loss curves eventually converge as the number of iterations increases. However, compared with CIoU, Inner CIoU (ratio = 1.0) yields a smaller loss value and exhibits greater stability. Therefore, selecting Inner CIoU (ratio = 1.0) as the bounding box loss function in this study could improve the model’s detection performance for cotton top buds.
3.5. Ablation Experiments
The Bud-YOLO model proposed in this study is based on the YOLOv8n framework and is divided into three parts for improvement. To verify the validity of each improvement stage, ablation experiments were conducted using the experimental dataset.
As shown in Table 3, all performance metrics of the model changed following the modifications. The “√” in the table indicates that the corresponding method was used in the improvement based on the YOLOv8n model. The FLOPs and model size of the CSPPC-only model were significantly reduced. Although the model size increased by 0.6 MB with the addition of the Efficient RepGFPN compared to the CSPPC-only model, P, R, and F1 score increased by 0.2%, 1.5%, and 0.9%, respectively. Compared to YOLOv8n, the Bud-YOLO model showed a 1.3% decrease in P, a 2.25% decrease in FPS, a 3% increase in R, a 0.1% increase in AP50, and a 0.9% increase in F1 score. Despite the reduction in inference speed, the Bud-YOLO model maintained a high frame rate of 69.3 FPS.
3.6. Detection Performance in Complex Scenarios in Cotton Fields
Detection performance evaluation experiments in complex cotton field scenes were conducted to assess the models’ effectiveness for target detection by considering three factors: shooting angle, occlusion, and varying morphologies. Images taken from mid-June to late June 2023 were randomly selected for comparison and analysis. The detection effectiveness of the YOLOv8n and Bud-YOLO models on cotton top bud images under varying conditions was compared.
Figure 10 displays representative detection results, where red boxes indicate correct recognition and blue boxes indicate missed detections. Compared to the YOLOv8n model, the Bud-YOLO model can detect cotton top buds of various morphologies, including those viewed from different angles, under occlusion, and with different shapes. These results demonstrate the robust performance of the Bud-YOLO model.
3.7. Discussion
Comparison, ablation, and detection experiments were conducted to verify the performance of the improved Bud-YOLO model for detecting cotton top buds in natural scenes. The comparison experiments demonstrated that the Bud-YOLO model achieved the highest mAP value. Although the FPS of the Bud-YOLO model is lower than that of YOLOv8n and YOLOv10n, the speed loss is acceptable, and its inference speed meets real-time requirements. The ablation experiments indicate that the CSPPC module improves the inference speed of the model and reduces its size without significantly affecting the AP50 of the algorithm. Additionally, the inclusion of the Efficient RepGFPN module enhances the recall of the model without adding more parameters, mitigating missed detections of cotton top buds with varying shapes and under occlusion conditions. The Inner CIoU loss function enhances the P, R, AP50, and F1 score, and simultaneously accelerates the convergence of the model, stabilizing the loss value at 0.15. The model accounts for leaf shading, different angles, and various shapes of cotton top buds, exhibiting a high detection rate and strong resistance to external environmental conditions. Therefore, the model is more robust and effective in detecting cotton top buds under complex natural scenarios.
4. Conclusions
This study proposes a Bud-YOLO detection algorithm capable of accurately identifying cotton top buds in real-time. Initially, a dataset of cotton top bud images in complex natural scenes was constructed, labeled using LabelImg (version 1.8.1). A total of 800 labeled images were selected, and through data expansion, a dataset containing 4000 images was generated. A network architecture for the accurate real-time detection of cotton top buds was proposed, utilizing the CSPPC lightweight convolution module to replace the C2f module in the backbone network, thereby reducing redundant computations and memory access with minimal impact on detection accuracy. Incorporating an Efficient RepGFPN in the neck network maintains high accuracy in cotton top bud detection without significantly increasing computational costs. Finally, the Inner CIoU loss function was introduced to compute the regression loss, with the generation of auxiliary bounding boxes controlled by a proportional factor ratio to compute the loss and accelerate convergence. Comparative experimental results indicate that the Bud-YOLO model achieved a precision of 0.977, a recall of 0.99, an AP50 of 0.992, an F1 score of 0.983, and an FPS of 69.3, meeting the real-time detection requirements. The performance evaluation experiments demonstrate that the Bud-YOLO model achieves a high detection rate in complex natural scenes, including varying shooting angles, occlusion, and different morphologies.
In future work, we plan to extend the cotton top bud image dataset by incorporating images taken under various weather conditions (e.g., sunny, cloudy, and rainy) and different lighting scenarios. This will enhance the model’s generalization across different environmental conditions, improving the reliability of cotton topping machines in field operations. Although the Bud-YOLO model shows promising results, further optimization, such as model pruning, should be explored to reduce the model size and complexity, aiming for more efficient deployment in real-world applications.