Article

Recognition of Maize Tassels Based on Improved YOLOv8 and Unmanned Aerial Vehicles RGB Images

Jiahao Wei, Ruirui Wang, Shi Wei, Xiaoyan Wang and Shicheng Xu
1 College of Forestry, Beijing Forestry University, Beijing 100083, China
2 Beijing Key Laboratory of Precision Forestry, Beijing Forestry University, Beijing 100083, China
3 Beijing Ocean Forestry Technology Co., Ltd., Beijing 100083, China
* Author to whom correspondence should be addressed.
Drones 2024, 8(11), 691; https://doi.org/10.3390/drones8110691
Submission received: 10 October 2024 / Revised: 8 November 2024 / Accepted: 18 November 2024 / Published: 19 November 2024
(This article belongs to the Special Issue Advances of UAV in Precision Agriculture)

Abstract

The tasseling stage of maize, a critical period of maize cultivation, is essential for predicting maize yield and understanding the condition of maize growth. During seedling growth, the branches of neighboring plants overlap one another and cannot be used as an identifying feature; at the tasseling stage, however, the apical tassel blooms and exhibits distinctive features that can serve for identification. Even so, maize tassels are small, the background is complex, and existing networks make obvious recognition errors. Therefore, in this paper, unmanned aerial vehicle (UAV) RGB images and an improved YOLOv8 target detection network are used to enhance the recognition accuracy of maize tassels. In the new network, a microscale target detection head is added to increase the ability to perceive small-sized maize tassels; in the backbone, Spatial Pyramid Pooling—Fast (SPPF) is replaced by the Spatial Pyramid Pooling with Efficient Layer Aggregation Network (SPPELAN) to connect detailed features and semantic information across levels; and a dual-attention module combining the GAM and CBAM is added to the neck to reduce the loss of maize tassel features, thus improving the network's detection ability. We also labeled a new maize tassels dataset in VOC format for training and validating the network model. In the final model tests, the new network reached a precision of 93.6% and a recall of 92.5%, and its mAP50 and F1-score exceeded those of the other models by 2.8–12.6 and 3.6–15.2 percentage points, respectively. The experimental results show that the improved YOLOv8 network, with high performance and robustness in small-sized maize tassel recognition, can accurately recognize maize tassels in UAV images, providing technical support for automated counting, accurate cultivation, and large-scale intelligent cultivation of maize.

1. Introduction

Maize is a globally important food crop, widely planted under various climatic conditions, and occupies a significant position in agricultural production [1]. The growth process of maize is complex, and its growth cycle is divided into several stages. Among these, the tasseling period is crucial: the appearance of the tassel reflects the health status and growth of the maize and has a direct impact on the final harvest [2]. To ensure high-quality and abundant maize production, it is essential to monitor and manage maize at the tasseling stage efficiently and accurately.
Traditionally, maize monitoring has relied mainly on manual inspection. Although manual inspection can provide detailed observations, its high cost and labor intensity make it not only time-consuming but also susceptible to human error, so it struggles to meet the accuracy and efficiency requirements of large-scale production [3]. It is therefore of great practical significance to develop a highly automated and accurate maize inspection model. With advances in science and technology, satellite remote sensing has become an important tool for agricultural monitoring, providing wide and efficient coverage that makes large-area crop monitoring more feasible [4,5,6]. Many researchers have applied remote sensing technology to monitoring maize phenology. For example, Longchamps et al. [7] monitored maize phenology over the full season using a two-dimensional normalized difference space (ND-space) approach applied to GHISA spectral data collected by the Hyperion hyperspectral imaging sensor aboard the EO-1 spacecraft. Niu et al. [8] used Landsat imagery to generate the first Chinese annual maize phenology dataset with fine spatial resolution (30 m) over a long period (1985–2020). However, although satellite remote sensing can provide crop growth information at a large scale, its long acquisition and revisit cycles preclude real-time monitoring, so the data may lag behind the actual situation. In addition, the relatively low spatial resolution of satellite remote sensing makes it difficult to capture the fine features of small targets such as maize tassels.
To overcome the drawbacks of traditional techniques, the unmanned aerial vehicle (UAV) platform for low-altitude remote sensing has been proposed as an effective solution. UAVs can carry ultra-high-resolution cameras and acquire high-definition images in real time from the air, with strong flexibility and timeliness [9,10,11]. Compared with traditional satellite remote sensing, UAVs can cover large crop areas in a shorter time and provide higher image resolution, thus capturing more detailed information [12,13]. Moreover, UAVs are easy to deploy and highly flexible in data collection, meeting the data collection needs of most scenarios. These advantages have driven the rapid development of UAV remote sensing in agriculture, making it one of the current hotspots in agricultural research. Lu et al. [14] used UAV RGB imagery to detect maize seedlings. Zhao et al. [15] used canopy features extracted from UAV images, such as canopy cover, canopy height, canopy volume, and the super green index, to achieve good results in cotton yield prediction. These studies show that the UAV remote sensing platform is well suited to agricultural applications and provide a reference for using UAV data as a data source for detecting maize tassels.
Meanwhile, progress in computer vision, especially in target detection algorithms, provides powerful support for image analysis [16,17]. The YOLO (You Only Look Once) algorithm series is extensively used in computer vision for its efficient and accurate detection performance [18]. YOLOv8, the mainstream version of the series, offers significant advantages over other versions in both detection speed and accuracy. Many scholars have applied it to different research areas: Karim et al. [19] achieved real-time weed detection on edge devices by improving the YOLOv8n model, and Liu et al. [20] proposed the novel lightweight Faster-YOLO-AP detection algorithm, which achieved an mAP50–95 of 84.12% for orchard apple detection. However, for small targets such as maize tassels, the existing YOLOv8 model still has shortcomings, including a limited ability to recognize fine details and poor robustness in complex environments, so it fails to identify maize tassels accurately.
To tackle the identified problems and shortcomings, this study proposes an approach based on an upgraded YOLOv8 algorithm combined with UAV RGB images to recognize small maize tassel targets against complex backgrounds. The method adds a microscale target detection head to the standard version of YOLOv8, replaces SPPF with the Spatial Pyramid Pooling with Efficient Layer Aggregation Network (SPPELAN), forms new connections in the feature extraction and fusion parts, and introduces into the neck network an attention module that combines the Convolutional Block Attention Module (CBAM) and the Global Attention Mechanism (GAM). The main objectives of this paper are as follows:
1. Create a new maize tassel UAV image dataset in JPG format and label the data in VOC format, naming it the maize tassels dataset (MTD).
2. Assess the performance, reliability, and efficiency of the improved YOLOv8 model in recognizing small-target maize tassels against complex backgrounds.
3. Compare the improved YOLOv8 model with current well-performing target detection models (Faster R-CNN [21], RT-DETR [22], YOLOv5 [23], YOLOv9 [24], and YOLOv10 [25]).

2. Materials and Methods

2.1. The Study Area

The study area (116°47′ E, 40°11′ N) is located in maize farmland in Shunyi District, Beijing, China, where the maize cultivar Jingke 633 is planted at 30 cm intervals over an area of about 11,200 m² (Figure 1). Shunyi District has a temperate continental semi-humid monsoon climate, with hot, humid summers, cold, dry winters, and four distinct seasons. The average annual temperature is 11.5 °C, the average annual rainfall is 610 mm, annual sunshine totals 2746 h, and the frost-free season lasts about 195 days.

2.2. Data Collection and Dataset Construction

On 9 August 2022, a DJI Phantom 4 UAV (DJI Technology Co., Ltd., Shenzhen, China) was used to capture visible-light images of maize tassels at the tasseling stage in the study area (Figure 2). The DJI Phantom 4 carries a 1-inch RGB sensor with 20 effective megapixels and an equivalent focal length of 32 mm. Path planning was performed with the DJI Pilot v2.5.1.15 software (solid yellow arrow in Figure 2). To minimize disturbances such as maize lodging caused by the UAV, the flight altitude was set to 15 m, giving a ground resolution of about 1.5 cm, and both the side and heading overlap rates were set to 65%. To prevent motion blur while the UAV was in flight, the shutter speed was set to 1/2000 s and images were saved in JPEG format. A total of 660 images were captured and then stitched with Pix4D v4.4.9 software to generate an orthophoto, which was roughly divided into four regions.
Because the visible-light data collected by the UAV were very large and computing resources were limited, the orthophoto was first cropped, discarding tiles that contained no maize or in which maize occupied only a small part of the image. After cropping, the four regions yielded image tiles of 640 × 640 pixels. The cropped images were then manually annotated with rectangular boxes using the LabelImg v1.8.6 labeling software in a PyTorch 1.12.0 environment; to limit complex background inside the boxes, the minimum-enclosing-rectangle principle was followed during labeling. Finally, the labeled data were split into training, validation, and test sets at a ratio of 8:1:1 to create the maize tassels dataset (MTD) in VOC format, as sketched below.
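For readers wishing to reproduce the 8:1:1 partition, the following is a minimal sketch; the directory layout, file names, and fixed random seed are illustrative assumptions, not the authors' actual script.

```python
import random
import shutil
from pathlib import Path

random.seed(0)  # fixed seed so the split is reproducible (assumption)
images = sorted(Path("MTD/JPEGImages").glob("*.jpg"))  # hypothetical VOC-style layout
random.shuffle(images)

n = len(images)
splits = {
    "train": images[: int(0.8 * n)],                # 80% training
    "val":   images[int(0.8 * n): int(0.9 * n)],    # 10% validation
    "test":  images[int(0.9 * n):],                 # 10% test
}

for name, files in splits.items():
    out = Path("MTD") / name
    out.mkdir(parents=True, exist_ok=True)
    for img in files:
        shutil.copy(img, out / img.name)            # copy the image tile
        xml = img.with_suffix(".xml")               # VOC annotation assumed alongside
        if xml.exists():
            shutil.copy(xml, out / xml.name)
```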

2.3. Data Augmentation

To enhance the robustness and generalization of the network model under diverse environmental conditions, the labeled maize tassels dataset was augmented. Mainstream augmentation strategies take two forms: photometric and geometric distortion. Photometric distortion randomly adjusts the image brightness, saturation, contrast, hue, and so on; geometric distortion includes rotation, cropping, random zooming, flipping, and similar operations. In recent years, multi-image data fusion has become a focus of data augmentation. The mosaic method, a new type of augmentation introduced by Bochkovskiy et al. [26] in 2020, randomly crops, scales, and arranges four images into one while retaining their annotation information. This greatly enriches the dataset, and the random scaling adds background context around small targets, improving model performance in complex scenes and making the model more robust. In this study, a combination of traditional augmentation and mosaic augmentation was used, as illustrated in Figure 3 and sketched in the code below.
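To make the mosaic idea concrete, the sketch below composes four images into one canvas around a random center point. It handles only the pixels; a full implementation, such as the one in [26], must also crop, scale, and merge the bounding-box annotations. The 640-pixel output size follows the tile size described above.

```python
import random
from PIL import Image

def mosaic4(image_paths, out_size=640):
    """Compose four images into one mosaic around a random center point.

    Annotation handling is omitted for brevity; a real implementation must
    also transform each image's bounding boxes into mosaic coordinates.
    """
    assert len(image_paths) == 4
    canvas = Image.new("RGB", (out_size, out_size))
    # Random mosaic center, kept away from the borders.
    cx = random.randint(out_size // 4, 3 * out_size // 4)
    cy = random.randint(out_size // 4, 3 * out_size // 4)
    quadrants = [
        (0, 0, cx, cy),                # top-left
        (cx, 0, out_size, cy),         # top-right
        (0, cy, cx, out_size),         # bottom-left
        (cx, cy, out_size, out_size),  # bottom-right
    ]
    for path, (x1, y1, x2, y2) in zip(image_paths, quadrants):
        tile = Image.open(path).convert("RGB").resize((x2 - x1, y2 - y1))
        canvas.paste(tile, (x1, y1))
    return canvas
```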

2.4. SPPF and SPPELAN

Spatial Pyramid Pooling—Fast (SPPF) is an improved version of Spatial Pyramid Pooling (SPP) [27]; its structure is shown in Figure 4. It retains SPP's ability to handle inputs of different sizes while being more favorable in computational performance: by optimizing the pooling operations and feature concatenation, SPPF reduces computational complexity and memory usage, improving efficiency and speed. However, for smaller targets, the repeated pooling lowers the effective resolution of the feature map, losing some fine-grained local information of small-sized targets and thus reducing recognition and detection accuracy.
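For reference, a minimal PyTorch sketch of SPPF follows, matching the common YOLOv5/YOLOv8 formulation: one 5 × 5 max-pool applied three times in series, with all intermediate maps concatenated. The Conv helper (Conv2d + BatchNorm + SiLU) is a simplified stand-in for the framework's own convolution block.

```python
import torch
import torch.nn as nn

class Conv(nn.Module):
    """Conv2d + BatchNorm + SiLU, a simplified stand-in for the YOLO Conv block."""
    def __init__(self, c1, c2, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class SPPF(nn.Module):
    """Spatial Pyramid Pooling - Fast: three chained 5x5 max-pools, then fuse."""
    def __init__(self, c1, c2, k=5):
        super().__init__()
        c_ = c1 // 2
        self.cv1 = Conv(c1, c_, 1, 1)
        self.cv2 = Conv(c_ * 4, c2, 1, 1)
        self.m = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.m(x)
        y2 = self.m(y1)
        return self.cv2(torch.cat((x, y1, y2, self.m(y2)), dim=1))
```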
To address these drawbacks of SPPF, this study adopts SPPELAN (Spatial Pyramid Pooling with Efficient Layer Aggregation Network) from YOLOv9 [24]; its structure is shown in Figure 5. SPPELAN aggregates feature maps from different levels, combining low-level detailed features with high-level semantic features. Through efficient layer aggregation, SPPELAN provides a richer representation of feature information, helping the model capture the details and background context of small targets more accurately and thus improving small-target detection.
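A corresponding sketch, following the structure of the YOLOv9 reference implementation [24] and reusing the Conv helper from the SPPF sketch above, is shown below; the hidden channel width c3 is a free parameter. Where SPPF fuses only the pooled maps of one branch, SPPELAN also keeps the un-pooled input projection and concatenates every stage, ELAN-style.

```python
class SPPELAN(nn.Module):
    """SPP with Efficient Layer Aggregation: keep and concatenate every stage."""
    def __init__(self, c1, c2, c3):  # in channels, out channels, hidden channels
        super().__init__()
        self.cv1 = Conv(c1, c3, 1, 1)  # project input to c3 channels
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=5, stride=1, padding=2) for _ in range(3)
        )
        self.cv2 = Conv(4 * c3, c2, 1, 1)  # fuse all four aggregated maps

    def forward(self, x):
        y = [self.cv1(x)]
        for pool in self.pools:  # chain the pools, keeping each stage's output
            y.append(pool(y[-1]))
        return self.cv2(torch.cat(y, dim=1))
```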

2.5. Attentional Mechanisms

2.5.1. CBAM Attentional Module

The Convolutional Block Attention Module (CBAM), a simple and efficient attention mechanism, was developed by Woo et al. [28] in 2018. It consists of two separate sub-modules, the Channel Attention Module (CAM) and the Spatial Attention Module (SAM), which perform attention weighting on channel and space, respectively. The structure of the CBAM attention module is shown in Figure 6.
The CBAM attention mechanism works as follows. First, the input feature map F (H × W × C) is passed through global max pooling and global average pooling, and the two resulting 1 × 1 × C feature vectors are fed to a shared multi-layer perceptron (MLP). The two MLP outputs are summed and passed through a sigmoid activation function to generate the channel attention map Mc, which is multiplied with the input feature map F to obtain the channel-weighted feature map F′. F′ is then fed into the spatial attention module, where global max pooling and global average pooling along the channel dimension produce two H × W × 1 maps; these are concatenated and passed through a 7 × 7 convolution to obtain an H × W × 1 feature map, and a sigmoid activation function generates the spatial attention map Ms. Finally, Ms is multiplied with F′ to obtain the feature map F″, attention-weighted in both the channel and spatial dimensions.
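The PyTorch sketch below restates this pipeline; the reduction ratio of 16 and the 7 × 7 spatial kernel follow the defaults in Woo et al. [28].

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CBAM channel attention: shared MLP over max- and average-pooled descriptors."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # global average pooling branch
        mx = self.mlp(x.amax(dim=(2, 3)))    # global max pooling branch
        mc = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * mc                         # F' = Mc(F) x F

class SpatialAttention(nn.Module):
    """CBAM spatial attention: 7x7 convolution over channel-pooled maps."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)    # channel-wise average pooling
        mx = x.amax(dim=1, keepdim=True)     # channel-wise max pooling
        ms = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * ms                         # F'' = Ms(F') x F'

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention()

    def forward(self, x):
        return self.sa(self.ca(x))
```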

2.5.2. Global Attentional Mechanism

The global attention mechanism (GAM), proposed by Liu et al. [29] in 2021, aims to solve the traditional attention mechanisms' insufficient retention of feature information in both the channel and spatial dimensions. The structure of GAM attention is shown in Figure 7.
The GAM uses a sequential channel-spatial attention arrangement similar to the CBAM, but with both sub-modules redesigned. Instead of pooling in the channel attention, feature information is retained through a 3D permutation of the input tensor; a two-layer multilayer perceptron (MLP), followed by the inverse permutation and a sigmoid activation function, outputs the channel attention map Mc, which is multiplied with the input feature map F to obtain the channel-attended feature map F′ (Equation (1)). In the spatial attention sub-module, to increase the attention to spatial information, features are fused with two 7 × 7 convolutions, and the pooling operation, which can discard information, is dropped. Finally, as in the channel attention, a sigmoid activation function generates the spatial attention map Ms, and multiplying Ms with F′ yields the globally weighted feature map F″ (Equation (2)).
$$ F' = M_C(F) \times F \tag{1} $$

$$ F'' = M_S(F') \times F' \tag{2} $$
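A minimal sketch of the two GAM sub-modules, consistent with Equations (1) and (2) and with Liu et al. [29], is given below; it continues the imports of the CBAM sketch above, and the reduction ratio of 4 follows the GAM paper's default.

```python
class GAMChannelAttention(nn.Module):
    """GAM channel attention: 3D permutation + MLP instead of pooling."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        # Permute (B, C, H, W) -> (B, H, W, C), apply the MLP per position,
        # then inverse-permute back; no spatial information is pooled away.
        y = self.mlp(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        return x * torch.sigmoid(y)            # Eq. (1): F' = Mc(F) x F

class GAMSpatialAttention(nn.Module):
    """GAM spatial attention: two 7x7 convolutions, no pooling."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        c_ = channels // reduction
        self.conv = nn.Sequential(
            nn.Conv2d(channels, c_, 7, padding=3),
            nn.BatchNorm2d(c_),
            nn.ReLU(inplace=True),
            nn.Conv2d(c_, channels, 7, padding=3),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x * torch.sigmoid(self.conv(x))  # Eq. (2): F'' = Ms(F') x F'
```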

2.6. Target Detection Network Structure of Improved YOLOv8

YOLOv8, one of the most popular deep learning models in target detection, is characterized by high accuracy and flexibility. Its network architecture is divided into three parts: the backbone for feature extraction, the neck for feature fusion, and the head for target prediction (Figure 8). The backbone retains the Cross Stage Partial network (CSPNet) and Spatial Pyramid Pooling—Fast (SPPF) from YOLOv5, but the C3 feature extraction module is replaced by the lighter C2f module, which supports cross-stage feature fusion. The neck still uses a path aggregation network (PANet) but removes the convolution before upsampling used in YOLOv5. The head uses a decoupled design that computes classification and regression separately and switches from the anchor-based scheme of YOLOv5 to an anchor-free one, which does not rely on predefined anchor boxes and adapts better to complex backgrounds and multi-scale targets.
YOLOv8 currently provides five network structures with different sizes (YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x) for different users in various scenarios. In this study, the YOLOv8n network with less depth and width was chosen so as not to waste computational resources. Although the YOLOv8 network demonstrates good detection and inference results, it faces challenges when applied to detecting small targets like maize tassels. In order to more precisely identify maize tassels in UAV RGB images, this study enhances the YOLOv8 network, with the modified architecture illustrated in Figure 9.
This article proposes several modifications to enhance the underlying YOLOv8 architecture:
(1)
A Microscale Detection Head (MDH) was added to the head section of the YOLOv8 architecture. It operates on a feature map obtained with only 4-fold downsampling, so it produces larger prediction maps (160 × 160) than the other detection heads. This enables the extraction of higher-resolution shallow feature maps and better captures the fine details of smaller maize tassels.
(2)
We replaced the SPPF module with the SPPELAN module in the feature extraction part of the backbone network, which can combine low-level detail features and high-level semantic features to obtain richer target feature information.
(3)
Drawing on the idea of residual networks, an additional connection was introduced within the backbone network (the gray dotted line in Figure 9). The new connection passes shallow feature information to the deeper layers, which strengthens the backpropagation of the network gradient, avoids losing the feature information of small-sized targets, and reduces gradient attenuation.
(4)
To further highlight the feature information of the maize tassels, an attention module was added to the neck (feature fusion) part. The new module combines the GAM and the CBAM; its structure is shown in Figure 10, and a code sketch follows this list. It uses the global channel attention of the GAM together with the spatial attention module of the CBAM. First, the channel weights are computed with the GAM's 3D permutation followed by a multi-layer perceptron (MLP) and multiplied with the input feature maps to obtain new channel-weighted feature maps; this avoids the spatial information loss caused by the average and max pooling in the CBAM's channel attention. The spatial attention part retains the SAM module of the CBAM: taking the GAM channel-attended feature map as input, the two pooling results are concatenated and convolved, and a sigmoid activation function generates the new spatial attention weights. Compared with the spatial attention of the GAM, this effectively learns the channel attention information while reducing the module's computation. Finally, the spatial weights are multiplied with the channel-weighted input feature map to obtain the dual-attention feature map. Integrating the GAM–CBAM attention module into the neck of YOLOv8 effectively mitigates spatial information loss and keeps the model focused on the maize tassel regions, increasing robustness and improving recognition ability.
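Combining the pieces, a sketch of the hybrid module described in (4) is given below: GAM-style channel attention (3D permutation plus MLP, no pooling) followed by CBAM-style 7 × 7 spatial attention. It reuses the GAMChannelAttention and SpatialAttention modules from the sketches in Sections 2.5.1 and 2.5.2; the exact placement in the neck follows Figure 10 and is not reproduced here, so this is a sketch under stated assumptions rather than the authors' exact implementation.

```python
class GAMCBAM(nn.Module):
    """Hybrid dual attention: GAM channel attention, then CBAM spatial attention.

    Reuses GAMChannelAttention (Section 2.5.2 sketch) and SpatialAttention
    (Section 2.5.1 sketch).
    """
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.channel = GAMChannelAttention(channels, reduction)
        self.spatial = SpatialAttention(kernel_size=7)

    def forward(self, x):
        x = self.channel(x)      # channel weighting without pooling (from GAM)
        return self.spatial(x)   # lightweight 7x7 spatial weighting (from CBAM)
```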
The working framework of the improved YOLOv8 model is shown in Figure 11. In the data processing stage, the cropped RGB images undergo geometric distortion, photometric distortion, and mosaic augmentation, after which the augmented dataset is fed into the improved YOLOv8 model for training. In the prediction stage, images from the test set are fed into the improved model, maize tassels are predicted using the trained optimal weights, and the final prediction results are output.

2.7. Accuracy Evaluation

In this study, five metrics (precision, recall, mean average precision (mAP), F1-score, and frames per second (FPS)) were used to evaluate model performance. If the intersection over union (IoU) between a manually labeled box and a predicted maize tassel box is greater than 0.5, the prediction is counted as a true positive (TP); otherwise, it is a false positive (FP). A manually labeled tassel with no matching prediction is counted as a false negative (FN). Precision (P) measures the proportion of predicted tassels that are correct; recall (R) measures the proportion of real tassels that are correctly predicted; mAP is the average precision averaged over all categories; and the F1-score is the harmonic mean of precision and recall. The higher these values, the better the model's performance and robustness. The metrics are calculated as follows:
$$ P = \frac{TP}{TP + FP} \times 100\% \tag{3} $$

$$ R = \frac{TP}{TP + FN} \times 100\% \tag{4} $$

$$ \mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} AP_i \tag{5} $$

$$ \mathrm{F1\text{-}score} = \frac{2 \times P \times R}{P + R} \times 100\% \tag{6} $$
where $N$ is the total number of categories and $AP_i$ is the average precision of the $i$-th category.
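As a concrete reading of Equations (3)-(6), the small helper below turns matched-detection counts into the reported percentages; the IoU matching at the 0.5 threshold is assumed to have been done upstream.

```python
def detection_metrics(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 (in %) from counts matched at IoU > 0.5."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return 100 * precision, 100 * recall, 100 * f1

# With a single category (maize tassels), mAP reduces to the AP of that one
# class, i.e. the area under its precision-recall curve.
```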

3. Results

3.1. Model Training and Validation

The hardware and software environments for this experiment are shown in Table 1. Because the VisDrone2019 UAV image dataset contains many targets of different sizes, the improved YOLOv8 model was first pre-trained on the large VisDrone2019 dataset and then fine-tuned on the maize tassels dataset by transfer learning, with the number of epochs set to 200. Training used the Adam optimizer with a momentum of 0.937 and an initial learning rate of 0.001, reduced by cosine annealing. To make full use of the computational resources, the batch size was set to 16.
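Expressed as an Ultralytics training call, this configuration looks roughly like the sketch below; the checkpoint and dataset YAML file names are placeholders, and the authors' actual training script is not published.

```python
from ultralytics import YOLO

# Start from weights pre-trained on VisDrone2019 (placeholder checkpoint name).
model = YOLO("improved_yolov8n_visdrone.pt")

model.train(
    data="mtd.yaml",   # hypothetical dataset config for the maize tassels dataset
    epochs=200,        # number of iterations set in the paper
    batch=16,
    optimizer="Adam",
    lr0=0.001,         # initial learning rate
    momentum=0.937,
    cos_lr=True,       # cosine-annealing learning-rate schedule
    imgsz=640,
)
```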
The trends of the precision metrics during training are shown in Figure 12A. During the first 50 epochs, the precision and recall of the model rise sharply while the loss falls; after about 100 epochs, the precision metrics stabilize, signaling that the model is approaching its optimal performance; after 200 epochs, the slopes of the accuracy curves are close to 0 and the loss values approach their minimum, indicating that the model has converged. Training was stopped at this point to prevent overfitting.
From the precision-confidence curve (Figure 12B), the precision is close to 0.9 when the confidence is greater than 0.2; the recall stays high when the confidence is below 0.75; and the F1-score exceeds 0.8 for confidence levels between 0.1 and 0.8. This indicates that the improved YOLOv8 model maintains high accuracy and stability over a wide confidence interval, demonstrating good prediction performance.

3.2. Ablation Experiment

In this study, several improvements were made to the base version of YOLOv8, and to assess the effectiveness of the improvement modules and their interactions with each other, ablation experiments were used to analyze each improvement component [30]. The ablation results are shown in Table 2.
From the results in Table 2, after adding the microscale detection head, the A1 model increased recall by 2.6 and precision by 2.5 percentage points over the base YOLOv8 model, indicating that the microscale detection head effectively avoids missed detections of smaller maize tassels. A2 adds the new network connection to A1, improving both precision and recall. Different attention mechanisms were then evaluated for their ability to highlight tassel features and reduce background interference. Comparing the A3 and A4 models, CBAM attention lags clearly behind GAM attention, and the precision of A4 is even slightly lower than that of A2, suggesting that CBAM attention conflicts with the added microscale detection head and the new connection, while GAM attention performs better. Combining GAM and CBAM attention in the A5 model yields a high precision of 91.1% and recall of 90.3%, showing that the combination compensates for the shortcomings of the single CBAM attention mechanism and further improves precision, making it an ideal pairing. Finally, the proposed spatial pyramid structure was examined: the A6 model improves precision and recall over A5 by 2.5 and 2.2 percentage points, respectively, and outperforms the other variants by larger margins. This indicates that the A6 model recognizes and detects maize tassels accurately and stably.
To show the feature extraction ability of the model more intuitively and improve the interpretability of the network, two convolutional neural network visualization methods, Grad-CAM [31] and HiResCAM [32], were used to visualize the features of YOLOv8 before and after the improvement, as shown in Figure 13. Comparing the heat maps, without the microscale detection head, YOLOv8 detects only larger maize tassels and pays little attention to smaller ones. The microscale detection head of the A1 model compensates for this shortcoming but still suffers from background noise. The final improved A6 model suppresses the useless information introduced by background noise, keeping the model focused on the maize tassel regions and improving recognition and detection accuracy.
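Such heat maps can be produced with the open-source pytorch-grad-cam package, which implements both methods. The sketch below assumes a plain nn.Module and a hand-picked neck layer; applying CAMs to a full detector in practice needs detector-specific target wrappers, and the paper does not state which implementation was used.

```python
from pytorch_grad_cam import GradCAM, HiResCAM
from pytorch_grad_cam.utils.image import show_cam_on_image

def cam_overlays(model, chosen_layer, image_tensor, image_float):
    """Return Grad-CAM and HiResCAM overlays for one preprocessed image.

    `model`, `chosen_layer` (a neck block picked by hand), `image_tensor`
    (a normalized 1xCxHxW tensor), and `image_float` (the HxWx3 image scaled
    to [0, 1]) are all placeholders, not names from the paper.
    """
    overlays = []
    for cam_class in (GradCAM, HiResCAM):
        cam = cam_class(model=model, target_layers=[chosen_layer])
        heatmap = cam(input_tensor=image_tensor)[0]  # (H, W) map in [0, 1]
        overlays.append(show_cam_on_image(image_float, heatmap, use_rgb=True))
    return overlays
```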

3.3. Comparative Analysis of Various Deep Learning Networks

In this study, comparative experiments were used to examine the generalization and robustness of the enhanced YOLOv8 model against the current mainstream target detection algorithms (Faster R-CNN, RT-DETR, YOLOv5, YOLOv9, and YOLOv10); the results are shown in Table 3. Faster R-CNN performs worst on maize tassels with complex backgrounds and small target sizes, with a large gap to the other models, and because of its huge number of network parameters it takes the longest to validate, at only 2.28 FPS. Compared with Faster R-CNN, RT-DETR has clear advantages, with precision, recall, and F1-score higher by 10.1, 12.0, and 11.2 percentage points, respectively, and a detection speed of 62.5 FPS. YOLOv5 has the fastest detection speed at 83.3 FPS, but its precision, recall, and average precision fall short of RT-DETR. YOLOv9 and YOLOv10 perform at a similar level across the precision metrics, with YOLOv10 detecting faster. However, none of these models meets the requirements for detecting maize tassels well. Compared with YOLOv9 and YOLOv10, the improved YOLOv8 model clearly improves precision and recall and achieves the highest F1-score of all models, 93.1%, although its detection speed is reduced by the added network computation. Nevertheless, the improved YOLOv8 still balances accuracy and detection speed, with 93.6% precision and 92.5% recall at 58.8 FPS, and it surpasses the other models on the remaining metrics, showing that it can effectively recognize maize tassels.
To demonstrate the detection ability of the enhanced YOLOv8 model more clearly, five images from the test set were selected; the detection results are shown in Figure 14. Faster R-CNN recognizes small maize tassels poorly under complex backgrounds and high density, detecting only larger targets with obvious features. RT-DETR and YOLOv5 improve considerably on Faster R-CNN but still miss detections. YOLOv9 and YOLOv10 recognize small maize tassels better, a major improvement over the earlier algorithms, but still misdetect and miss targets in the edge regions. The improved YOLOv8 proposed in this paper not only detects the maize tassels in the middle of the image but also recognizes tassels left incomplete by cropping at the edges. In addition, thanks to the incorporated dual attention mechanism, tassels that adhere to one another, are occluded, or are locally incomplete are also detected. This shows that the improved YOLOv8 algorithm has significant advantages over the other algorithms.

4. Discussion

Determining how to accurately and quickly count maize tassels under severe shading and complex backgrounds is an important issue in maize cultivation. During growth and development, accurately identifying individual maize seedlings is challenging because of the high planting density, but at the tassel stage the tassel grows at the top of the plant and exhibits distinct features, making it a suitable target for identification and detection. This study therefore optimizes and improves the mainstream YOLOv8 deep learning model to achieve accurate recognition and detection of maize tassels. The experimental results show that the improved YOLOv8 network has clear advantages over other mainstream target detection models, indicating that detecting maize tassels with high accuracy using the improved YOLOv8 network is feasible.
The improved accuracy of the model is due first to the microscale detection head, since maize tassels occupy only a small fraction of the area in UAV visible-light images. The feature maps used by the three detection heads in the standard version of YOLOv8 are downsampled at large rates (32×, 16×, and 8×), so the original network struggles to recognize and predict small targets [33]; these heads lose a large amount of the tassels' feature information, which degrades recognition and detection. Introducing the microscale prediction head (4-fold downsampling) lets the model better exploit the feature information retained in the feature map. However, as the feature map grows, image noise from the complex background and the mutual occlusion of lateral leaves complicate the tassel information in the background. Therefore, an attention module combining the GAM and CBAM was added to the neck feature fusion part to focus the model on the tassel feature regions. Notably, the channel attention module of the CBAM reduces the dimensionality of the input feature maps with pooling operations [34], which hinders increased focus on the key information of small targets and explains the model's poor performance when only CBAM attention is used. The global channel attention of the GAM was therefore used to replace the channel attention of the CBAM: it avoids pooling and instead rearranges the tensor across dimensions, grasping the feature information efficiently without dimensionality reduction and realizing cross-dimensional information interaction, which significantly improves model performance. Furthermore, the spatial attention of the GAM contains multiple convolutions and thus many parameters; combining it with the spatial attention of the CBAM effectively reduces the computational complexity, so the newly constructed GAM–CBAM attention mechanism strikes a balance between accuracy improvement and computational cost.
While the enhanced YOLOv8 algorithm greatly boosts detection accuracy, its detection speed drops to 58.8 FPS, which still lags some other target detection models. The network parameters can therefore be further optimized in the future, for example through pruning and similar operations, to obtain a lightweight model with a smaller parameter count and higher computational efficiency, enabling real-time maize tassel detection on mobile and embedded devices. In addition, improving the detection accuracy of small targets may cause more false alarms in similar backgrounds, because in complex scenes the features of target objects can resemble background noise or other objects and interfere with the model's judgment. Such false recognitions could be filtered out in subsequent research by learning contextual information or by expanding the dataset to cover more background features. Meanwhile, images collected by UAVs may be misaligned, distorted, or ghosted during stitching and acquisition, degrading the distinguishability of maize tassel regions and hence the annotation quality of the image data. Future research should therefore focus on lightweight feature extraction algorithms and on improving image data quality.

5. Conclusions

In this study, we propose an improved YOLOv8 algorithm for small-target detection that adds a microscale target detection head to the basic version of YOLOv8, enabling it to capture the characteristics of smaller maize tassels. At the same time, drawing on the idea of residual networks, higher-resolution shallow features from the backbone are fed into the feature fusion layer through a new network link, reducing the information loss of smaller maize tassels. In addition, to minimize the impact of complex backgrounds, an attention mechanism combining the GAM and CBAM is integrated into the feature fusion part, further improving the detection of dense maize tassels in drone-captured images. The results show that the improved YOLOv8 network improves recognition accuracy for smaller maize tassels considerably compared with the base version. Compared with the prevailing widely used target detection algorithms (Faster R-CNN, RT-DETR, YOLOv5, YOLOv9, and YOLOv10), the improved YOLOv8 improves precision, recall, average precision, and F1-score to varying degrees. However, the complexity of the optimized YOLOv8 network is considerable, and the image quality of the dataset remains to be improved. In the future, lightweight networks deployable on edge devices should be explored through operations such as pruning and distillation, and image quality should be improved, to achieve real-time detection. In summary, the improved YOLOv8 network can accurately recognize maize tassels against complex backgrounds with heavy occlusion. It can provide application technology support for consumer-level UAVs in precision agriculture breeding, phenotype detection, and yield prediction of maize and other crops, giving it excellent application value.

Author Contributions

Conceptualization, J.W. and R.W.; methodology, J.W.; software, J.W. and R.W.; validation, J.W., S.W., and X.W.; formal analysis, J.W.; investigation, J.W.; resources, R.W.; data curation, J.W.; writing—original draft preparation, J.W.; writing—review and editing, J.W., X.W., and S.X.; visualization, J.W. and S.X.; supervision, J.W.; project administration, R.W.; funding acquisition, R.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China project 'Biomass precision estimation model research for large-scale region based on multi-view heterogeneous stereographic image pair of forest' (grant no. 41971376).

Data Availability Statement

All the data can be made available upon request to the corresponding author.

Conflicts of Interest

The author Wei Shi is employed by Beijing Ocean Forestry Technology Co., Ltd. The remaining authors declare no conflicts of interest. The Beijing Ocean Forestry Technology Co., Ltd. had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

References

1. Bantchina, B.B.; Qaswar, M.; Arslan, S.; Ulusoy, Y.; Gündoğdu, K.S.; Tekin, Y.; Mouazen, A.M. Corn yield prediction in site-specific management zones using proximal soil sensing, remote sensing, and machine learning approach. Comput. Electron. Agric. 2024, 225, 109329.
2. Chen, J.; Fu, Y.H.; Guo, Y.; Xu, Y.; Zhang, X.; Hao, F. An improved deep learning approach for detection of maize tassels using UAV-based RGB images. Int. J. Appl. Earth Obs. Geoinf. 2024, 130, 103922.
3. Alzadjali, A.; Alali, M.H.; Sivakumar, A.N.V.; Deogun, J.S.; Scott, S.; Schnable, J.C.; Shi, Y. Maize Tassel Detection from UAV Imagery Using Deep Learning. Front. Robot. AI 2021, 8, 600410.
4. Wang, Y.; Feng, L.; Sun, W.; Wang, L.; Yang, G.; Chen, B. A lightweight CNN-Transformer network for pixel-based crop mapping using time-series Sentinel-2 imagery. Comput. Electron. Agric. 2024, 226, 109370.
5. Ye, Z.; Guo, Q.; Wei, J.; Zhang, J.; Zhang, H.; Bian, L.; Guo, S.; Zheng, X.; Cao, S. Recognition of terminal buds of densely-planted Chinese fir seedlings using improved YOLOv5 by integrating attention mechanism. Front. Plant Sci. 2022, 13, 991929.
6. Sun, M.; Gong, A.; Zhao, X.; Liu, N.; Si, L.; Zhao, S. Reconstruction of a Monthly 1 km NDVI Time Series Product in China Using Random Forest Methodology. Remote Sens. 2023, 15, 3353.
7. Longchamps, L.; Philpot, W. Full-Season Crop Phenology Monitoring Using Two-Dimensional Normalized Difference Pairs. Remote Sens. 2023, 15, 5565.
8. Niu, Q.; Li, X.; Huang, J.; Huang, H.; Huang, X.; Su, W.; Yuan, W. A 30-m annual maize phenology dataset from 1985 to 2020 in China. Earth Syst. Sci. Data 2022, 14, 2851–2864.
9. Yi, H.; Liu, B.; Zhao, B.; Liu, E. Small Object Detection Algorithm Based on Improved YOLOv8 for Remote Sensing. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 1734–1747.
10. Lee, D.-H.; Park, J.-H. Development of a UAS-Based Multi-Sensor Deep Learning Model for Predicting Napa Cabbage Fresh Weight and Determining Optimal Harvest Time. Remote Sens. 2024, 16, 3455.
11. Zhao, X.; Zhang, W.; Xia, Y.; Zhang, H.; Zheng, C.; Ma, J.; Zhang, Z. G-YOLO: A Lightweight Infrared Aerial Remote Sensing Target Detection Model for UAVs Based on YOLOv8. Drones 2024, 8, 495.
12. Zhang, H.; Sun, W.; Sun, C.; He, R.; Zhang, Y. HSP-YOLOv8: UAV Aerial Photography Small Target Detection Algorithm. Drones 2024, 8, 453.
13. Wang, G.; Chen, Y.; An, P.; Hong, H.; Hu, J.; Huang, T. UAV-YOLOv8: A Small-Object-Detection Model Based on Improved YOLOv8 for UAV Aerial Photography Scenarios. Sensors 2023, 23, 7190.
14. Lu, C.; Nnadozie, E.C.; Camenzind, M.; Hu, Y.; Yu, K. Maize plant detection using UAV-based RGB imaging and YOLOv5. Front. Plant Sci. 2024, 14, 1274813.
15. Zhao, L.; Um, D.; Nowka, K.; Landivar-Scott, J.L.; Landivar, J.; Bhandari, M. Cotton yield prediction utilizing unmanned aerial vehicles (UAV) and Bayesian neural networks. Comput. Electron. Agric. 2024, 226, 109415.
16. Sun, H.; Shen, Q.; Ke, H.; Duan, Z.; Tang, X. Power Transmission Lines Foreign Object Intrusion Detection Method for Drone Aerial Images Based on Improved YOLOv8 Network. Drones 2024, 8, 346.
17. Ferreira, D.; Basiri, M. Dynamic Target Tracking and Following with UAVs Using Multi-Target Information: Leveraging YOLOv8 and MOT Algorithms. Drones 2024, 8, 488.
18. Huang, M.; Mi, W.; Wang, Y. EDGS-YOLOv8: An Improved YOLOv8 Lightweight UAV Detection Model. Drones 2024, 8, 337.
19. Karim, M.J.; Nahiduzzaman, M.; Ahsan, M.; Haider, J. Development of an early detection and automatic targeting system for cotton weeds using an improved lightweight YOLOv8 architecture on an edge device. Knowl. Based Syst. 2024, 300, 112204.
20. Liu, Z.; Abeyrathna, R.M.R.D.; Sampurno, R.M.; Nakaguchi, V.M.; Ahamed, T. Faster-YOLO-AP: A lightweight apple detection algorithm based on improved YOLOv8 with a new efficient PDWConv in orchard. Comput. Electron. Agric. 2024, 223, 109118.
21. Ren, S.; He, K.; Girshick, R.B.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 39, 1137–1149.
22. Lv, W.; Xu, S.; Zhao, Y.; Wang, G.; Wei, J.; Cui, C.; Du, Y.; Dang, Q.; Liu, Y. DETRs Beat YOLOs on Real-time Object Detection. arXiv 2023, arXiv:2304.08069.
23. Zhang, Z.; Ao, D.; Zhou, L.; Yuan, X.; Luo, M. Laboratory Behavior Detection Method Based on Improved Yolov5 Model. In Proceedings of the 2021 International Conference on Cyber-Physical Social Intelligence (ICCSI), Beijing, China, 18–20 December 2021; pp. 1–6.
24. Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616.
25. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458.
26. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934.
27. Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-captured Scenarios. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Virtual, 11–17 October 2021; pp. 2778–2788.
28. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.-S. CBAM: Convolutional Block Attention Module. arXiv 2018, arXiv:1807.06521.
29. Liu, Y.; Shao, Z.; Hoffmann, N. Global Attention Mechanism: Retain Information to Enhance Channel-Spatial Interactions. arXiv 2021, arXiv:2112.05561.
30. Zhu, L.; Geng, X.; Li, Z.; Liu, C. Improving YOLOv5 with Attention Mechanism for Detecting Boulders from Planetary Images. Remote Sens. 2021, 13, 3776.
31. Selvaraju, R.R.; Das, A.; Vedantam, R.; Cogswell, M.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. Int. J. Comput. Vis. 2016, 128, 336–359.
32. Draelos, R.L.; Carin, L. Use HiResCAM instead of Grad-CAM for faithful explanations of convolutional neural networks. arXiv 2020, arXiv:2011.08891.
33. Luan, T.; Zhou, S.; Liu, L.; Pan, W. Tiny-Object Detection Based on Optimized YOLO-CSQ for Accurate Drone Detection in Wildfire Scenarios. Drones 2024, 8, 454.
34. Su, H.; Wang, X.; Han, T.; Wang, Z.; Zhao, Z.; Zhang, P. Research on a U-Net Bridge Crack Identification and Feature-Calculation Methods Based on a CBAM Attention Mechanism. Buildings 2022, 12, 1561.
Figure 1. Study area.
Figure 2. Image data collection process.
Figure 3. Data augmentation: (A) luminance and geometric distortion and (B) mosaic.
Figure 4. Spatial Pyramid Pooling—Fast (SPPF) schematic diagram.
Figure 5. Spatial Pyramid Pooling with Efficient Layer Aggregation Network (SPPELAN) schematic diagram.
Figure 6. A diagram illustrating the structure of the convolutional block attention module (CBAM).
Figure 7. The structural diagram of the global attention mechanism (GAM).
Figure 8. Basic YOLOv8 network architecture diagram.
Figure 9. Improved YOLOv8 network architecture diagram.
Figure 10. Improved structure of the attention module.
Figure 11. Improved YOLOv8 workflow.
Figure 12. (A) Trends in various accuracy metrics throughout the model training process. (B) Performance evaluation results of the improved YOLOv8 with various IoU thresholds.
Figure 13. Visualization results.
Figure 14. Comparison of different deep learning algorithms for maize tassels detection and recognition performance.
Table 1. The experimental setup involving software and hardware.

| Name | Parameters and Versions |
|---|---|
| Central Processing Unit (CPU) | Intel Core i7-14700K @ 3.40 GHz |
| Random Access Memory (RAM) | 32 GB |
| Solid-State Drive (SSD) | SHPP41-1000GM (1 TB) |
| Graphics Card (GPU) | NVIDIA GeForce RTX 4070 Ti Super (16 GB) |
| Operating System (OS) | Microsoft Windows 11 Professional |
| Programming Environment (ENVS) | PyTorch 1.12.0 + Python 3.8.10 |
Table 2. Ablation experiment results.

| Method | Model Number | MDH | New Connections | GAM | CBAM | GAM + CBAM | SPPELAN | Precision (%) | Recall (%) |
|---|---|---|---|---|---|---|---|---|---|
| YOLOv8 | - | - | - | - | - | - | - | 86.6 | 85.2 |
| Improved model | A1 | √ | - | - | - | - | - | 89.1 | 87.8 |
| Improved model | A2 | √ | √ | - | - | - | - | 89.9 | 88.1 |
| Improved model | A3 | √ | √ | √ | - | - | - | 90.6 | 89.1 |
| Improved model | A4 | √ | √ | - | √ | - | - | 89.5 | 88.5 |
| Improved model | A5 | √ | √ | - | - | √ | - | 91.1 | 90.3 |
| Improved model | A6 | √ | √ | - | - | √ | √ | 93.6 | 92.5 |
Table 3. Comparison experiment results for different target detection models.

| Model | Precision (%) | Recall (%) | mAP50 (%) | F1-Score (%) | FPS |
|---|---|---|---|---|---|
| Faster R-CNN | 80.2 | 75.8 | 84.2 | 77.9 | 2.28 |
| RT-DETR | 90.3 | 87.8 | 89.4 | 89.1 | 62.5 |
| YOLOv5 | 85.4 | 86.1 | 85.1 | 85.7 | 83.3 |
| YOLOv9 | 91.2 | 85.1 | 91.4 | 88.1 | 33.5 |
| YOLOv10 | 91.0 | 88.2 | 93.9 | 89.5 | 66.7 |
| Improved YOLOv8 | 93.6 | 92.5 | 96.8 | 93.1 | 58.8 |