1. Introduction
Corn is a crucial staple food worldwide [1], and its myriad products and derivatives find applications across various domains. Breeding is the key technology for increasing corn yield, and plant phenomics plays a pivotal role in modern breeding [2]. Scrutinizing phenotypic parameters among individual corn plants facilitates targeted crop improvement, aligns breeding with specific demands, and augments overall yield. During breeding, rapid acquisition of phenotypic parameters is a crucial task [3]: it not only improves the efficiency of phenotypic parameter extraction but also enables monitoring of plant growth [4]. Traditionally, extracting phenotypic parameters for corn involved manual measurements with rulers and protractors, leading to inefficiencies, substantial measurement errors, and potential harm to plants. One advanced method employs high-precision point cloud generation and 3D reconstruction of the plant [5]. Existing studies on phenotype extraction primarily focus on 3D reconstruction. Zermas [6] utilized high-resolution RGB imagery from UAVs and handheld cameras for corn 3D reconstruction, extracting parameters such as plant count, leaf area index, individual and average plant height, individual leaf length, leaf location, and angles relative to the stem. Zhao [7] used a single RGB image for the 3D reconstruction of plants and extracted parameters such as height, canopy size, and trunk diameter. Zhu [8] utilized an improved skeleton extraction algorithm to reconstruct a 3D model of a tomato plant. Li [9] utilized LiDAR and an RGB camera to obtain the heights of corn plants in the field. However, 3D point cloud approaches face several challenges: they require complex equipment, impose venue restrictions, involve intricate processing pipelines, and lack unified algorithms for point cloud data processing. This study addresses these issues by exploring binocular images for 3D keypoint extraction, efficiently extracting 3D phenotypic parameters from corn plants using inexpensive and widely available binocular cameras.
Traditional 3D keypoint detection methods for binocular data are mainly applied to the KITTI dataset [10], employing a cost volume to regress only one 3D center point coordinate for each bounding box. For example, Stereo CenterNet [11] focuses on regressing the 3D center, utilizing binocular images for regression, and subsequently obtaining the 3D bounding box. In addition, Keypoint3D [12] can use 3D information and monocular images to directly regress the 3D center of the object.
To overcome the above limitations and take advantage of prior information from 2D keypoint detection models, the 3D keypoint detection method proposed in this work, based on YOLOv7-Pose [13], utilizes binocular images: a modified 2D detector first jointly outputs 2D bounding boxes and keypoints in the left images. The same detector outputs 2D bounding boxes in the right images, allowing the creation of a union bounding box. By feeding the left and right bounding boxes to a stereo-matching network, depth maps are obtained. The depth map combined with the keypoints facilitates the extraction of 3D keypoints and subsequent phenotypic parameter extraction.
2D keypoint detection technology originated in human pose detection [14] and has since been applied to agriculture: Du [15] used keypoint detection and point clouds to detect the 3D poses of tomatoes, and Zheng [16] utilized the pixel positions and depths of keypoints to calculate the size of vegetables. In this work, corn plant keypoint detection is based on the YOLOv7-Pose method and stereo matching, leveraging coordinate point regression for faster detection speed. This study contributes by describing a binocular corn keypoint dataset and proposing the stereo corn phenotype extraction algorithm (SCPE) for corn plant phenotypic parameter extraction. SCPE comprises two modules: the YOLOv7-SlimPose model and the phenotypic parameter extraction module, the latter incorporating stereo matching, skeleton construction, and parameter calculation. The key contributions of this study are as follows:
- (1)
This work proposes a novel approach named SCPE for extracting phenotypic parameters for corn plants with binocular images. A binocular image dataset of corn plants has been constructed, with keypoints and bounding boxes accurately marked for each object.
- (2)
This work designed the YOLOv7-SlimPose model through a comprehensive analysis of the YOLOv7-Pose. Structure optimization, loss function adjustment, and model pruning were executed within the core architecture of YOLOv7-Pose. This model is designed to achieve precise detection of corn-bounding boxes and keypoints using much fewer parameters.
- (3)
This work proposes the phenotypic parameter extraction module to utilize the output of YOLOv7-SlimPose. Leveraging the model’s output, this module constructs skeletons of leaves and stems and extracts phenotypic parameters for corn plants. Moreover, the module facilitates corn plant growth monitoring functions [17], such as detecting lodging, monitoring the number of leaves, and assessing the overall normalcy of growth.
The remainder of the study is organized as follows: Section 2 provides an overview of the dataset construction, details the data augmentation methods employed during training, and presents the specifics of the proposed method, SCPE. Section 3 presents the performance of the SCPE algorithm in terms of keypoint detection and the results of phenotypic parameter extraction. Section 4 discusses the findings, and Section 5 provides a brief conclusion of the proposed SCPE.
2. Materials and Methods
2.1. Materials
2.1.1. Data Acquisition
The main task of this work is extracting phenotypic parameters of corn plants using binocular images, and the images utilized were obtained from the experimental field at Yangzhou University in Yangzhou, Jiangsu Province, China. The data collection took place from 15 July to 1 August 2023. The specific corn variety under investigation was SuYuNuo.1 (Jiangsu Zhonghe Seed Industry Co., Ltd., Nanjing, China).
To capture the entire developmental process of corn, images of the selected corn plants were taken daily from the heading to the maturity stage. During this period, a number of image pairs of corn plants were obtained that play an important role in phenotypic parameter extraction. Additionally, images were captured at different times and under various weather conditions to enhance the diversity of data under different lighting conditions, preventing overfitting of the model.
To ensure keypoint detection precision, a blue curtain was positioned behind the corn plants to minimize the impact of background clutter caused by the presence of similar corn plants during image capture. The ZED2I binocular depth camera was employed for capturing binocular photographs and depth maps. This work collected 1000 sets of raw data and expanded them to 4000 pairs through data augmentation. Each set included the camera’s internal parameters, binocular images, and depth maps. The RGB images and depth maps both had a resolution of 1080 × 720 pixels, and the images were saved in PNG format. These data were divided into training and test sets at a ratio of 9:1. Finally, 20 sets from the raw data were selected for phenotypic parameter extraction, and the extracted phenotypic parameters were compared with manual measurement values. The composition and use of the data are shown in Table 1.
2.1.2. Acquisition of Labels
In the process of data acquisition, the LabelMe annotation tool was used to label both bounding boxes and keypoints. The annotation file comprehensively captures the center coordinates, width, and height of each bounding box, along with the 2D coordinates of the keypoints, as shown in Figure 1. For each corn plant, two types of objects were marked, leaves and stems, with seven types of keypoints. As depicted in Figure 2, the seven types of keypoints are detailed as follows:
- (1)
Root point: The root point refers to the point at which the root of the corn plant connects to the ground.
- (2)
Top point: The top point refers to the highest (relative to the ground) point of the corn plant.
- (3)
Leaf connection point: The leaf connection point refers to the point at which a leaf is attached to the main stem.
- (4)
Leaf highest point: The highest point refers to the uppermost point of the leaf.
- (5)
Leaf angle point: The leaf angle point refers to the point one-quarter of the distance from the leaf connection point to the highest point of the leaf.
- (6)
Leaf tip point: The leaf tip point refers to the tip of the leaf.
- (7)
Stalk point: The stalk point refers to the point at which the corncob connects to the stem of the corn plant.
Figure 1. Example of keypoints labeled in images; bounding boxes of stems and leaves are not shown.
Figure 2. Diagram of the keypoints of the plant. The stem object contains three types of keypoints, and the leaf object contains four types of keypoints.
Based on the coordinates of the keypoints in the left image, the depth value z is extracted from the depth map, as illustrated in Figure 3. Unlike the KITTI dataset, where the left image label is not intrinsically linked to the right image label, this work employed the internal parameters of the camera to project the bounding box and keypoints from the left to the right image. The 3D coordinates $(X, Y, Z)$ of pixel $(u, v)$ were obtained using Equation (1), providing annotation information for the right image:

$$X = \frac{(u - c_x)\,z}{f_x}, \quad Y = \frac{(v - c_y)\,z}{f_y}, \quad Z = z \tag{1}$$

where $(u - c_x, v - c_y)$ is the location of the pixel relative to the camera center $(c_x, c_y)$, and $f_x$ and $f_y$ are the horizontal and vertical focal lengths, respectively.
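As a concrete illustration of Equation (1), the following minimal Python sketch back-projects a detected 2D keypoint to camera coordinates using the depth map and the intrinsic parameters; the variable names (fx, fy, cx, cy) and the numeric values are illustrative assumptions for a standard pinhole model without lens distortion, not the exact implementation used in this work.

```python
import numpy as np

def backproject_keypoint(u, v, depth_map, fx, fy, cx, cy):
    """Convert a 2D keypoint (u, v) plus its depth into 3D camera coordinates.

    depth_map : HxW array of metric depth values (same resolution as the image)
    fx, fy    : horizontal and vertical focal lengths in pixels
    cx, cy    : principal point (camera center) in pixels
    """
    z = float(depth_map[int(round(v)), int(round(u))])  # depth at the keypoint
    x = (u - cx) * z / fx                                # Equation (1)
    y = (v - cy) * z / fy
    return np.array([x, y, z])

# Example with illustrative intrinsics for a 1080 x 720 image
depth = np.full((720, 1080), 2.5)                        # dummy 2.5 m depth map
point_3d = backproject_keypoint(540.0, 360.0, depth, fx=700.0, fy=700.0, cx=540.0, cy=360.0)
print(point_3d)  # -> [0.0, 0.0, 2.5]
```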
2.1.3. Data Augmentation
To enrich the training dataset and enhance the model’s generalization, this work implemented various data augmentation techniques. These methods include mosaic [18], random flip [19], scaling, and color-space conversion. Horizontal mirroring of images annotated with 2D boxes and keypoints was employed to effectively double the dataset size. Additionally, for improved color representation, the images were converted from RGB to HSV. The HSV color space is particularly advantageous for capturing primary color tones, brightness levels, and the contrast between lightness and darkness. In corn plant images, distinct color characteristics are evident for the background, corn, stems, and leaves. By incorporating data augmentation during training, the model’s generalization is significantly enhanced, especially in diverse lighting conditions. This approach contributes to a more robust and general model.
Figure 4 includes examples of augmented images and raw images.
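To make these operations concrete, the following Python/OpenCV sketch illustrates two of the augmentations described above, horizontal flipping with matching keypoint mirroring and RGB-to-HSV conversion; the keypoint layout (an N x 2 array of pixel coordinates) is an assumption made for illustration rather than the exact data format used in this work.

```python
import cv2
import numpy as np

def flip_horizontal(image, keypoints):
    """Mirror an image and its 2D keypoints about the vertical axis.

    image     : HxWx3 BGR array
    keypoints : Nx2 array of (x, y) pixel coordinates
    """
    h, w = image.shape[:2]
    flipped = cv2.flip(image, 1)                  # 1 = horizontal flip
    kps = keypoints.copy().astype(np.float32)
    kps[:, 0] = (w - 1) - kps[:, 0]               # mirror x coordinates
    return flipped, kps

def to_hsv(image):
    """Convert a BGR image to the HSV color space used for color augmentation."""
    return cv2.cvtColor(image, cv2.COLOR_BGR2HSV)

img = np.zeros((720, 1080, 3), dtype=np.uint8)    # dummy image
kps = np.array([[100.0, 200.0], [500.0, 650.0]])  # dummy keypoints
flipped_img, flipped_kps = flip_horizontal(img, kps)
hsv_img = to_hsv(img)
```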
2.1.4. Phenotypic Parameter of Corn Plant
Considering the importance of phenotypic parameters in plant phenotyping and the ease of identifying the corresponding keypoints in the plant, this work attempted to extract four key phenotypic parameters from corn plants: plant height, the angle between leaf and stem, leaf length, and ear position. By leveraging the marked keypoints and depth information extracted from the depth map, the 3D coordinates of these keypoints can be obtained. By utilizing these 3D coordinates and associating them with the respective components, this work constructed the skeleton of the corn plant and extracted the phenotypic parameters. The calculation process was based on the following rules (a code sketch illustrating them follows the list):
- (1)
Height of Plant:
The height of the plant was defined as the distance from the root to the top of the plant. To calculate plant height, this work used the root, stalk, top, and leaf connection points. The Euclidean distances between adjacent points, ordered by their Y values, were calculated and summed to obtain the plant height.
- (2)
Angles between Leaf and Stem:
The angle between each leaf and the stem [20] varies from leaf to leaf, so the angles need to be calculated individually. Each angle was calculated as the angle between the line connecting the averaged leaf connection point to the angle point of the same leaf and the line connecting the leaf connection point to the point above it on the stem.
- (3)
Length of Leaf:
The length of each leaf on a corn plant varies. For the four keypoints on one leaf, the Euclidean distances were calculated between the leaf connection point and the angle point, between the angle point and the leaf’s highest point, and between the leaf’s highest point and the leaf tip point. Summing these Euclidean distances gives the length of the leaf.
- (4)
Ear Position:
The ear position [21] refers to the location of the corn cob on the entire corn stem. The Euclidean distance between the root point and the stalk point on the stem can be calculated; combined with the previously calculated plant height, the ear position can be obtained.
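The sketch below illustrates these rules with NumPy under the assumption that each keypoint is already available as a 3D coordinate; the function layout and the interpretation of ear position as the ratio of the root-to-stalk distance to plant height are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def dist(a, b):
    """Euclidean distance between two 3D points."""
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

def plant_height(stem_points):
    """Sum of distances between adjacent stem points ordered along the vertical axis."""
    ordered = sorted(stem_points, key=lambda p: p[1])      # order by Y (height)
    return sum(dist(a, b) for a, b in zip(ordered, ordered[1:]))

def leaf_length(connection, angle_pt, highest, tip):
    """Polyline length along the four leaf keypoints."""
    return dist(connection, angle_pt) + dist(angle_pt, highest) + dist(highest, tip)

def leaf_stem_angle(connection, angle_pt, stem_above):
    """Angle (degrees) between the leaf direction and the stem direction at the connection point."""
    v_leaf = np.asarray(angle_pt) - np.asarray(connection)
    v_stem = np.asarray(stem_above) - np.asarray(connection)
    cos_a = np.dot(v_leaf, v_stem) / (np.linalg.norm(v_leaf) * np.linalg.norm(v_stem))
    return float(np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0))))

def ear_position(root, stalk, height):
    """Location of the ear along the stem, expressed relative to plant height (assumed ratio)."""
    return dist(root, stalk) / height
```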
2.2. Overall Technical Route
To address the challenge of extracting the phenotypic parameters of corn plants, this study introduces SCPE, which comprises the YOLOv7-SlimPose model and the phenotypic parameter extraction module. The overall flowchart of the proposed method is depicted in Figure 5.
The first step is to create a corn plant keypoint dataset to train the YOLOv7-SlimPose model. In the keypoint detection stage, the YOLOv7-SlimPose model performs both bounding box and keypoint detection within a single model: it outputs bounding boxes and keypoints for the left image and bounding boxes for the right image. Then, in the phenotypic parameter extraction stage, PSMNet is used to generate the depth map from the bounding boxes in the left and right images. Using the Z values obtained for the keypoints from the depth map, the 3D coordinates are calculated with the camera’s internal parameters as described in Section 2.1.2. From these 3D coordinates, the skeleton of the corn plant is constructed, leading to the extraction of phenotypic parameters for the corn plant. This comprehensive approach integrates both object detection and depth information to provide a robust solution for extracting phenotypic parameters. The pseudocode is shown in Algorithm 1.
Algorithm 1 Stereo Corn Phenotype Extraction Algorithm (SCPE)
Input: Left and right images
Output: Phenotypic parameters
Step 1: Keypoint detection by YOLOv7-SlimPose;
Step 2: if keypoints detected then
            pyramid stereo matching network
        else
            go back to Step 1
        end
Step 3: Skeleton construction;
Step 4: Phenotypic parameter computation;
2.3. Standard YOLOv7-Pose Model
YOLOv7-Pose is the kernel of the first stage of SCPE and the basic framework for the keypoint detection stage. YOLOv7-Pose, a keypoint detection network based on the YOLO structure [22], differs from approaches that encode raw images into heat maps, such as CenterNet [23]. Instead, it directly outputs the results end-to-end, leading to a significant enhancement in training speed. YOLOv7-Pose employs two multi-scale feature fusion paths, bottom-to-top and top-to-bottom, within the same framework. This design predicts all keypoints with an anchor for detection, effectively accomplishing the keypoint detection task without introducing excessive computational overhead.
The primary architecture of YOLOv7-Pose comprises the backbone network, neck layer, and head layer. The backbone network is responsible for extracting image features at multiple scales. The neck layer fuses the features from the backbone network at each scale. The head layer utilizes four feature maps and two decoupled heads to predict objects of different sizes and their keypoints.
As shown in Figure 6, the CBS convolution module, which constitutes the efficient layer aggregation network (ELAN) structure, is a pivotal component. It consists of two-dimensional convolution kernels of different sizes, a batch normalization function, and a SiLU activation function. Multiple basic CBS convolutions form an ELAN structure. The input information from the backbone layer is fused with the features in the neck layer using bottom-to-top and top-to-bottom strategies. Finally, the output of each head layer is connected to two decoupled heads to predict the bounding box and keypoints of the corn plant.
In the head layer, each object bounding box is characterized by six data elements: the anchor horizontal coordinate $C_x$, the anchor vertical coordinate $C_y$, the predicted box width $W$, the predicted box height $H$, the detection box confidence $b_{conf}$, and the class confidence $c_{conf}$. Each keypoint consists of three data elements: the horizontal coordinate $K_x$, the vertical coordinate $K_y$, and the confidence $K_{conf}$. Therefore, for each object, the network predicts 6 elements for the bounding box detection head and 21 elements for the keypoint detection head (7 keypoints with 3 elements each), totaling 27 elements, as shown in Equation (2):

$$\mathrm{Output} = \left[C_x, C_y, W, H, b_{conf}, c_{conf}, K_x^{1}, K_y^{1}, K_{conf}^{1}, \ldots, K_x^{7}, K_y^{7}, K_{conf}^{7}\right] \tag{2}$$
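A minimal sketch of how such a 27-element prediction vector could be split back into a bounding box and seven keypoints is given below; the element ordering follows Equation (2), and the field names are illustrative assumptions rather than the exact YOLOv7-Pose tensor layout.

```python
import numpy as np

def decode_prediction(vec):
    """Split a 27-element prediction vector into box and keypoint fields.

    vec layout (assumed): [cx, cy, w, h, box_conf, cls_conf,
                           kx1, ky1, kconf1, ..., kx7, ky7, kconf7]
    """
    vec = np.asarray(vec, dtype=np.float32)
    assert vec.shape == (27,)
    box = {
        "center": (vec[0], vec[1]),
        "size": (vec[2], vec[3]),
        "box_conf": vec[4],
        "cls_conf": vec[5],
    }
    keypoints = vec[6:].reshape(7, 3)   # rows: (x, y, confidence) for each of 7 keypoints
    return box, keypoints

box, kps = decode_prediction(np.arange(27, dtype=np.float32))
print(box["center"], kps.shape)         # -> (0.0, 1.0) (7, 3)
```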
2.4. YOLOv7-SlimPose
This work optimized the structure and bounding box loss function of YOLOv7-Pose in the keypoint detection stage of SCPE, resulting in a more streamlined model with fewer parameters, named YOLOv7-SlimPose, dedicated to keypoint detection in corn plants.
Figure 6 illustrates the architecture of YOLOv7-SlimPose compared with the original YOLOv7-Pose. The bounding box loss function in YOLOv7-SlimPose has been changed from the complete intersection over union (CIoU) to the minimum point distance-based IoU (MPDIoU) [24]. This change addresses issues encountered when the predicted box shares the same aspect ratio as the real labeled box but differs significantly in width and height values. To reduce the model’s computational demands and size, the neck component was optimized using GSConv and GSIN: GSConv replaces the convolutional layers with large numbers of parameters in the neck, whereas GSIN, inspired by Inception and GSConv, replaces the ELAN-H module of YOLOv7-Pose. After training, the model size was further reduced through pruning techniques.
2.4.1. Bounding Box Loss Function
The YOLOv7 bounding box employs the complete intersection over union (CIoU) loss [25], which, compared with the original IoU loss, takes into account the Euclidean distance between the box centers and considers cases of overlapping centroids and different aspect ratios. Equations (3)–(6) illustrate the calculation:

$$L_{CIoU} = 1 - \mathrm{IoU} + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v \tag{3}$$

$$\mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|} \tag{4}$$

$$v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2 \tag{5}$$

$$\alpha = \frac{v}{(1 - \mathrm{IoU}) + v} \tag{6}$$

where $A$ and $B$ are the real and predicted boxes, IoU is the ratio of the intersection area to the union area of the real and predicted boxes, and $b$ and $b^{gt}$ are the coordinate positions of the centroids of the predicted and real boxes, respectively. $w$ and $h$ are the width and height of a bounding box, $c$ is the diagonal length of the smallest box enclosing both boxes, $\alpha$ is a weighting factor, $v$ is a penalty factor for the ratio of the width and height of the predicted box to that of the real box, and $\rho^2(\cdot)$ is the square of the Euclidean distance between the two centroids.
However, the CIoU loss encounters challenges when the predicted box has the same aspect ratio as the real labeled box but significantly different width and height values. To address this issue, this work selects the MPDIoU loss as a suitable alternative to the CIoU loss. MPDIoU incorporates three key factors: overlapping or non-overlapping areas, center-point distance, and deviations in width and height. The loss calculation is simplified by minimizing the point distance between the predicted bounding box and the ground-truth bounding box. This replacement aims to address the difficulties in the optimization process. For MPDIoU, Equations (7)–(10) present the calculation:

$$d_1^2 = (x_1^{B} - x_1^{A})^2 + (y_1^{B} - y_1^{A})^2 \tag{7}$$

$$d_2^2 = (x_2^{B} - x_2^{A})^2 + (y_2^{B} - y_2^{A})^2 \tag{8}$$

$$\mathrm{MPDIoU} = \mathrm{IoU} - \frac{d_1^2}{w^2 + h^2} - \frac{d_2^2}{w^2 + h^2} \tag{9}$$

$$L_{MPDIoU} = 1 - \mathrm{MPDIoU} \tag{10}$$

where $A$ and $B$ are the real box and predicted box, respectively; IoU is the ratio of the intersection area of the real and predicted boxes to their union area, as in Equation (4); $(x_1^{A}, y_1^{A})$ and $(x_2^{A}, y_2^{A})$ denote the top-left and bottom-right point coordinates of $A$; $(x_1^{B}, y_1^{B})$ and $(x_2^{B}, y_2^{B})$ denote the top-left and bottom-right point coordinates of $B$; and $w$ and $h$ are the width and height of the input image.
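As a hedged illustration of Equations (7)–(10), the following Python sketch computes the MPDIoU loss for axis-aligned boxes given as (x1, y1, x2, y2); it is a straightforward transcription of the equations above, not the authors' training code.

```python
def mpdiou_loss(box_a, box_b, img_w, img_h):
    """MPDIoU loss for two boxes in (x1, y1, x2, y2) format.

    box_a : ground-truth box, box_b : predicted box
    img_w, img_h : width and height of the input image
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection over union (Equation 4)
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union if union > 0 else 0.0

    # Squared distances between corresponding corners (Equations 7 and 8)
    d1_sq = (bx1 - ax1) ** 2 + (by1 - ay1) ** 2
    d2_sq = (bx2 - ax2) ** 2 + (by2 - ay2) ** 2

    # MPDIoU and its loss (Equations 9 and 10)
    norm = img_w ** 2 + img_h ** 2
    mpdiou = iou - d1_sq / norm - d2_sq / norm
    return 1.0 - mpdiou

print(mpdiou_loss((10, 10, 50, 80), (12, 14, 48, 90), img_w=1080, img_h=720))
```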
2.4.2. Slim Neck
In the age of equipment miniaturization, this work addresses the challenge of the model size of YOLOv7-Pose, which has almost twice the number of parameters of YOLOv7. The model size was reduced while maintaining precision; the strategy involved updating the neck part of the model and conducting pruning training. In the neck layer of YOLOv7-Pose, GSConv [26] was introduced at the positions of multi-channel convolutions, replacing the basic original convolution. GSConv builds on depthwise convolution [27], optimizing convolution operations through a multi-step process. Initially, it downsamples the channels with a standard convolution layer to reduce computational complexity. Then, it employs depthwise convolution for efficient feature map processing. The outcomes of these two operations are concatenated, leveraging their respective capabilities while preserving the network’s ability to capture essential features. Lastly, GSConv utilizes a shuffle operation to rearrange the channel order of the feature maps, enhancing information flow and facilitating more efficient computations.
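A minimal PyTorch sketch of a GSConv-style block, following the description above (standard convolution to half the output channels, depthwise convolution, concatenation, and channel shuffle), is shown below; it is a simplified reimplementation of the published GSConv idea [26], not the exact module used in this work.

```python
import torch
import torch.nn as nn

class GSConvSketch(nn.Module):
    """Simplified GSConv-style block: standard conv -> depthwise conv -> concat -> shuffle."""

    def __init__(self, in_ch, out_ch, kernel=3, stride=1):
        super().__init__()
        half = out_ch // 2
        self.conv = nn.Sequential(                       # standard conv to half the channels
            nn.Conv2d(in_ch, half, kernel, stride, kernel // 2, bias=False),
            nn.BatchNorm2d(half),
            nn.SiLU(),
        )
        self.dwconv = nn.Sequential(                     # depthwise conv on the reduced features
            nn.Conv2d(half, half, 5, 1, 2, groups=half, bias=False),
            nn.BatchNorm2d(half),
            nn.SiLU(),
        )

    def forward(self, x):
        a = self.conv(x)
        b = self.dwconv(a)
        y = torch.cat([a, b], dim=1)                     # concatenate both branches
        n, c, h, w = y.shape
        y = y.view(n, 2, c // 2, h, w).transpose(1, 2)   # channel shuffle across the two groups
        return y.reshape(n, c, h, w)

out = GSConvSketch(64, 128)(torch.randn(1, 64, 40, 40))
print(out.shape)   # -> torch.Size([1, 128, 40, 40])
```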
Additionally, this work introduces the GSIN module, based on the Inception [28] and GSConv concepts. The GSIN module extracts multi-dimensional features through four different parallel convolution branches and concatenates them. The outputs of these operations are then fused and convolved through the GSConv and CBS modules. GSIN extracts features better with fewer parameters and is employed to replace the ELAN-H modules with large channel counts in the neck layer. The structures of GSIN and GSConv are shown in Figure 6.
2.4.3. Pruning Training
To further reduce the parameters of the model and accelerate detection, structural pruning [29] was applied. The structural pruning process, executed on the model with the best training results, involved three main steps:
- (1)
Sparse Training:
In this step, the importance of each convolution kernel in the deep model was evaluated. By applying sparse training, crucial convolution kernels were identified.
- (2)
Model Pruning:
In this step, unimportant convolution kernels were removed during the model pruning stage.
- (3)
Fine-Tuning:
In this step, the pruned model was fine-tuned to achieve precision comparable to that of a normally trained network. This step ensures that, despite the reduction in parameters, the model maintains or even improves its performance.
In structural pruning, this work compared three mainstream pruning techniques: Slim, Group Slim, and Lamp. This comparison evaluates the effectiveness and trade-offs of each pruning approach in terms of model size and precision.
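As an illustration of the channel-importance idea underlying Slim-style structural pruning, the sketch below scores the channels of each BatchNorm layer by the magnitude of their scale factors (gamma) and selects the channels to keep under a given pruning ratio; it is a conceptual sketch only, since the actual pruning in this work also removes the corresponding filters and is followed by fine-tuning.

```python
import torch
import torch.nn as nn

def bn_channel_importance(model):
    """Collect |gamma| of every BatchNorm2d layer as a per-channel importance score."""
    scores = {}
    for name, module in model.named_modules():
        if isinstance(module, nn.BatchNorm2d):
            scores[name] = module.weight.detach().abs()
    return scores

def channels_to_keep(scores, prune_ratio=0.13):
    """For each BN layer, return the indices of the highest-importance channels to keep."""
    keep = {}
    for name, gamma in scores.items():
        n_keep = max(1, int(round(gamma.numel() * (1.0 - prune_ratio))))
        keep[name] = torch.topk(gamma, n_keep).indices.sort().values
    return keep

# Tiny demo model standing in for the trained detector
demo = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.SiLU())
keep = channels_to_keep(bn_channel_importance(demo), prune_ratio=0.13)
print({k: v.numel() for k, v in keep.items()})   # -> number of channels kept per BN layer
```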
2.5. Phenotypic Parameter Extraction Module
In this work, the phenotypic parameter extraction module of SCPE uses the output of the keypoint detection stage to produce the phenotypic parameters. The module involves three main steps:
- (1)
Stereo Matching Network:
After obtaining the bounding boxes of the corn plants from the left and right images, the boxes were combined to create a union box. The depth map was then generated through the stereo-matching network within the union box. PSMNet, which introduces a spatial pyramid pooling structure, was used for this purpose; it obtains a depth map of the object region by constructing a 4D cost volume and applying 3D convolutions. From the depth map, the 3D coordinates of the keypoints were determined based on the keypoints in the left image.
- (2)
Construction of the Skeleton:
Using these 3D coordinates, the skeleton of each leaf and the stem can be constructed. For one corn plant, skeleton construction is divided into the construction of the leaves and of the stem. The skeleton of each leaf was constructed from the leaf object detection result and its related keypoints: four keypoints are detected per leaf, and the leaf skeleton was constructed in the order leaf connection point, leaf angle point, leaf highest point, and leaf tip point. The skeleton of the stem was constructed from the stem object detection result, its related keypoints (root, stalk, and top points), and the leaf connection points: these points were first averaged and then ordered according to their Z values, and the skeleton was constructed in the resulting order.
- (3)
Phenotypic Parameter Computation:
After constructing the skeleton of the corn plant, the positions of the leaves and the stem, with their points in 3D space, can be obtained. Following the calculation method outlined in Section 2.1.4, the phenotypic parameters can be extracted.
Through these steps, the phenotypic parameter extraction module extracts the phenotypic parameters of the corn plant. The process of the module is shown in Figure 7.
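The grouping and ordering logic of the skeleton construction step can be sketched as follows; the keypoint dictionary layout is an illustrative assumption, and sorting the stem points by their Z values mirrors the ordering rule described above.

```python
import numpy as np

def build_leaf_skeleton(leaf_kps):
    """Order the four leaf keypoints into a polyline: connection -> angle -> highest -> tip."""
    order = ["connection", "angle", "highest", "tip"]
    return [np.asarray(leaf_kps[name]) for name in order]

def build_stem_skeleton(stem_kps, leaf_connection_points):
    """Collect root/stalk/top points plus leaf connection points and order them by Z value."""
    points = [np.asarray(stem_kps[name]) for name in ("root", "stalk", "top")]
    points += [np.asarray(p) for p in leaf_connection_points]
    return sorted(points, key=lambda p: p[2])          # order along the Z axis

leaf = {"connection": (0.0, 0.4, 2.1), "angle": (0.1, 0.5, 2.0),
        "highest": (0.2, 0.6, 1.9), "tip": (0.3, 0.5, 1.8)}
stem = {"root": (0.0, 0.0, 2.2), "stalk": (0.0, 0.6, 2.1), "top": (0.0, 1.5, 2.0)}
stem_skeleton = build_stem_skeleton(stem, [leaf["connection"]])
leaf_skeleton = build_leaf_skeleton(leaf)
```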
2.6. Experimental Setting
The software and hardware settings for model training and testing in this study are listed in Table 2. Training was run for 1100 epochs with a batch size of 4. Optimization was performed using the Adam optimizer, and the learning rate was adjusted every 500 epochs.
2.7. Model Evaluation Metrics
In this work, mAP was used to measure the accuracy of keypoint detection, and the error was used to measure the accuracy of phenotypic parameter extraction. For the keypoint detection task, evaluating the precision of the detection involves more than simply measuring the Euclidean distance between true and predicted points: the assessment depends on the type of keypoint, and different weights are assigned when calculating the similarity between the actual and predicted points. In this study, the same evaluation metrics as YOLOv7-Pose were adopted, utilizing the average precision (AP) and the mAP based on the keypoint similarity metric Object Keypoint Similarity (OKS). OKS is calculated using Equation (11):

$$\mathrm{OKS}_p = \frac{\sum_{i} \exp\!\left(-\dfrac{d_{pi}^2}{2\,s_p^2\,\sigma_i^2}\right)\,\delta(v_{pi} > 0)}{\sum_{i} \delta(v_{pi} > 0)} \tag{11}$$

where $p$ indexes the corn plants in the detection image ($p = 1, \ldots, N$, with $N$ the number of corn plants); $i$ denotes the keypoint number of the plant; $v_{pi}$ denotes the visibility of the keypoint in the image; $d_{pi}$ denotes the Euclidean distance between the true and predicted points, where a smaller value of $d_{pi}$ indicates a better prediction at that point; $s_p$ indicates the square root of the area occupied by the object detection bounding box that identifies the corn plant; and $\sigma_i$ measures the standard deviation of different keypoints. In this way, the Euclidean distance $d_{pi}$ of the keypoints, the area $s_p$ of the detected corn plant, the labeling bias $\sigma_i$, and the OKS are normalized to maintain consistency with the above analysis.
AP (average precision) is the average precision of the predicted results on the test set. As shown in Equation (12), a threshold $s$ is set for the OKS: if the OKS value of a detection is greater than $s$, the keypoint is considered correctly detected; otherwise, it is considered incorrect. The mAP (mean average precision) is the mean value of AP over different thresholds. As shown in Equation (13), the threshold was varied from 0.5 to 0.95 in steps of 0.05:

$$\mathrm{AP}^{s} = \frac{\sum \delta(\mathrm{OKS} > s)}{\sum 1} \tag{12}$$

$$\mathrm{mAP} = \frac{1}{10} \sum_{s \in \{0.50, 0.55, \ldots, 0.95\}} \mathrm{AP}^{s} \tag{13}$$
Error (%) is used to measure whether the extracted phenotypic parameters are accurate. The calculation formula is given in Equation (14):

$$\mathrm{Error}(\%) = \frac{|F - F_m|}{F_m} \times 100 \tag{14}$$

where $F$ denotes the extracted phenotypic parameters and $F_m$ denotes the phenotypic parameters obtained by manual measurement.
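The two metrics can be sketched in a few lines of NumPy as follows; the per-keypoint standard deviations (sigmas) are dataset-specific constants, and the values used here are placeholders rather than the ones used in this work.

```python
import numpy as np

def oks(pred, gt, visible, area, sigmas):
    """Object Keypoint Similarity (Equation 11) for one corn plant.

    pred, gt : Nx2 arrays of predicted / true keypoint coordinates
    visible  : length-N array of 0/1 visibility flags
    area     : bounding box area of the plant (s^2 in Equation 11)
    sigmas   : length-N per-keypoint standard deviations
    """
    d_sq = np.sum((pred - gt) ** 2, axis=1)
    e = np.exp(-d_sq / (2.0 * area * sigmas ** 2))
    mask = visible > 0
    return float(e[mask].sum() / max(mask.sum(), 1))

def error_percent(extracted, measured):
    """Relative error of an extracted phenotypic parameter (Equation 14)."""
    return abs(extracted - measured) / measured * 100.0

sigmas = np.full(7, 0.05)                       # placeholder per-keypoint constants
pred = np.random.rand(7, 2) * 100
oks_value = oks(pred, pred + 1.0, np.ones(7), area=900.0, sigmas=sigmas)
print(oks_value, error_percent(182.0, 175.0))   # OKS and a sample plant height error in %
```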
3. Results
In this section, the experiments follow the order of the proposed improvements.
Table 3 explores the impact of three different bounding box loss functions on the YOLOv7-SlimPose model. To address the inherent challenges of CIoU, this work proposes using MPDIoU. For comparative analysis, SIoU [30] was also included in the experiment. The results indicate that the highest mAP, 96.8%, was achieved with MPDIoU, surpassing the CIoU loss by 0.5% and the SIoU loss by 0.3%. This experiment suggests that employing MPDIoU in YOLOv7-SlimPose leads to superior bounding box regression compared with SIoU and CIoU, as the MPDIoU loss function accounts for the overlapping and non-overlapping areas, the center-point distance, and the width and height deviations.
Building on the MPDIoU improvement, this work simplified the neck component of the model by using GSConv and GSIN to replace the original convolution and ELAN-H modules in the neck. Table 4 shows that the replacement had no negative effects: a 0.1% improvement in mAP was obtained with 6 million fewer parameters.
Table 5 presents the results of further reducing the parameters through structural pruning. Three different pruning algorithms (Slim, Group Slim, and Lamp) were compared. The results show that none of the three pruning algorithms adversely impacted mAP. The Slim algorithm reduced the number of parameters by approximately 13%, outperforming Group Slim and Lamp.
Based on the above work, this study presents YOLOv7-SlimPose, an adaptation of YOLOv7-Pose obtained through the improvement of the IoU loss function, the replacement of components in the neck layer, and structural pruning. Table 6 compares YOLOv7-Pose and YOLOv7-SlimPose. YOLOv7-Pose exhibits an mAP of 95.7% with 80 million parameters. In contrast, YOLOv7-SlimPose demonstrates an improved mAP (96.8%) and fewer parameters (67 million). YOLOv7-SlimPose excels in predicting bounding boxes and keypoints, offers enhanced speed and efficiency, and is suitable for deployment on embedded devices.
Figure 8 shows the comparison of loss during training between YOLOv7-SlimPose and YOLOv7-Pose.
Figure 9 shows the P-R curve of the best result of mAP90.
Figure 10 shows the processing stages and the extracted phenotypic parameters of a corn plant from an image using the SCPE algorithm: the stem and leaves are first detected in the image, then the relevant keypoints are detected, the 3D skeleton of the plant is constructed from the related points of the leaves and stem, and finally the phenotypic parameters of the plant are extracted. The phenotypic parameters of the samples in Figure 10 are summarized in Table 7.
Twenty images were selected from the test set as the experimental data, and the phenotypic parameters extracted using the SCPE algorithm were compared with the true values manually measured from the 3D point cloud.
Table 8 presents the results of the SCPE algorithm on the experimental data. The SCPE algorithm was used to detect 20 samples with 440 keypoints, of which 430 keypoints were correctly detected, a precision of 98%. The YOLOv7-SlimPose model took an average of 0.09 s to detect a corn plant in one left or right image, and the PSMNet processing module took an average of 0.2 s to obtain the 3D coordinates. The construction of the skeleton and the phenotypic parameter extraction required almost no time. In total, the SCPE algorithm used 0.38 s to extract the phenotypic parameters. The errors for the experimental data are listed in Table 9.
Table 10 presents a comparison between the SCPE algorithm and other algorithms. The table shows that SCPE uses a keypoint detection-based method to extract the phenotypic parameters. SCPE was much faster than the methods based on 3D reconstruction, and its accuracy was also better. Although SCPE is difficult to use in a multi-plant environment, it is currently the fastest phenotypic parameter extraction algorithm for specific plants with a high degree of precision.
4. Discussion
This work proposes the SCPE method, which utilizes stereo images for the phenotypic parameter extraction of corn plants through a keypoint detection model. The cost-effectiveness of binocular cameras allows SCPE to strike an optimal balance between precision and cost. The SCPE method comprises two modules: the YOLOv7-SlimPose model for keypoint detection and the phenotypic parameter extraction module for constructing the skeleton and obtaining phenotypic parameters. The SCPE method demonstrates notable advancements over the original YOLOv7-Pose. To cater to miniaturized devices, the original model was extensively optimized to use fewer parameters and run at a higher speed.
In the first module, YOLOv7-SlimPose, used for keypoint detection, seven kinds of keypoints were identified, encompassing the various nodes of the corn plant. Building on YOLOv7-Pose, this work replaced the original bounding box loss function, CIoU loss, with MPDIoU loss. The GSConv and GSIN modules were further utilized to reduce the size of the neck in the original model, and the trained model was pruned with the Slim pruning algorithm, achieving a bounding box mAP of 96.8% while reducing the model size to 81.4% of its original size. In comparison to YOLOv7-Pose, YOLOv7-SlimPose excels in speed, model size, and keypoint detection precision. Owing to its easy-to-train characteristics, keypoint definitions, and data structure, the YOLOv7-SlimPose method can be seamlessly adapted to other crops or plants. The left and right images are passed through the model to obtain bounding boxes with keypoints.
In the second module, the phenotypic parameter extraction module for constructing the skeleton and obtaining phenotypic parameters, the depth map is acquired through PSMNet, which uses the object's bounding boxes from both the left and right images. The depth map, combined with the keypoints, yields the 3D coordinates used for extracting phenotypic parameters. Leveraging these 3D coordinates for the various keypoints allows the construction of the corn plant's skeleton. Subsequently, the phenotypic parameters are extracted by calculating Euclidean distances according to the predefined computational method. Additionally, the constructed skeleton provides ways to monitor the growth of corn plants, such as detecting lodging, monitoring the number of leaves, and assessing the overall normalcy of growth. SCPE requires only binocular images for the phenotypic parameter extraction of corn plants, achieving faster running speeds than traditional methods that rely on point cloud data. It operates in real time on the equipment, effectively meeting the demand for swift phenotypic parameter extraction. Despite using a small evaluation set of 20 samples, the errors were approximately 10% compared with the manual measurements. The YOLOv7-SlimPose model was trained using data from a laboratory setting; it can also be utilized in real outdoor farmland, but accuracy may slightly decrease.
Compared with previous studies on phenotypic parameter extraction, SCPE is not the most accurate, but it is the fastest and most convenient while still guaranteeing high precision (about 90%). Compared with most phenotypic parameter extraction work using 3D reconstruction, SCPE is more concise and efficient. It is therefore better suited for the phenotypic parameter extraction of large numbers of plants of the same category during the breeding stage.
It must be pointed out that, according to the experiments, the error in the results is mainly caused by errors in the depth map. Therefore, using better equipment, better stereo-matching models, or manual review to improve the quality of the depth maps can improve the accuracy of SCPE.
To enhance model generality, future research will involve building corn plant datasets for different varieties in real outdoor farmland and implementing filters to reduce background interference. Additionally, future work will explore the possibility of directly outputting phenotypic parameters using an end-to-end model with binocular data, eliminating the need for depth data. We believe that, through this future work, the SCPE algorithm can be applied to miniaturized devices, such as mobile phones and vehicle robots, to automatically extract the phenotypic parameters of dense crops in farmland environments without manual intervention.
5. Conclusions
This work introduces a novel SCPE algorithm designed to extract the phenotypic parameters of corn plants using stereo images. The SCPE method comprises the YOLOv7-SlimPose model and a phenotypic parameter extraction module. Building upon YOLOv7-Pose, the YOLOv7-SlimPose model replaces the original bounding box loss function, CIoU loss, with MPDIoU loss. To reduce the model size, this work adjusted the neck and pruned the trained model while maintaining precision.
The YOLOv7-SlimPose model successfully achieves bounding box and 2D keypoint detection in left and right images. The keypoint mAP of the YOLOv7-SlimPose model is 96.8%, with 61.5 million parameters and a detection speed of 0.38 s per corn plant image. Compared with the original model, the mAP shows a 0.3% increase, and the parameter count is reduced by 15%. Experimental results showcase the YOLOv7-SlimPose model’s effectiveness in keypoint detection tasks, boasting higher precision and a more compact model size compared to the original YOLOv7-Pose model.
The phenotypic parameter extraction module involves stereo matching, skeleton construction, and phenotypic parameter calculation. The stereo matching procedure efficiently yields a depth map from the bounding boxes in the left and right images in 0.2 s and provides the 3D coordinates of the keypoints; the skeleton is constructed based on these 3D coordinates, and the phenotypic parameters are extracted from the skeleton and the 3D coordinates.
The SCPE algorithm achieved an accuracy of about 90% for the phenotypic parameter extraction of corn plants, with an extraction speed of 0.38 s per corn plant. This makes it the fastest method currently available for phenotypic parameter extraction. The SCPE algorithm serves as a technical foundation for phenotypic parameter extraction in plant phenotyping, can also be used to monitor the growth of corn plants, and can be applied to other crops or plants, such as sorghum, rice, and wheat.
Author Contributions
Conceptualization, Y.G. and Z.L.; methodology, Z.L.; software, Y.G.; validation, Y.G.; formal analysis, Y.G.; investigation, Y.G. and Z.L.; writing, original draft preparation, Y.G.; writing, review and editing, Y.G.; visualization, Y.G.; supervision, B.L. and L.Z.; funding acquisition, Y.G. All authors read and agreed to the published version of the manuscript.
Funding
This work was supported by JST, the Establishment of University Fellowships Towards the Creation of Science Technology Innovation (grant number JPMJFS2133).
Data Availability Statement
The datasets analyzed during the current study are available from the corresponding author upon reasonable request.
Conflicts of Interest
The authors declare no conflict of interest.
Abbreviations
CIoU: complete intersection over union; ELAN: efficient layer aggregation network; mAP: mean average precision; MPDIoU: minimum point distance-based IoU; OKS: object keypoint similarity; SCPE: stereo corn phenotype extraction algorithm.
References
- García-Lara, S.; Serna-Saldivar, S.O. Corn history and culture. In Corn; Elsevier: Amsterdam, The Netherlands, 2019; pp. 1–18.
- Raju, S.K.K.; Thompson, A.M.; Schnable, J.C. Advances in plant phenomics: From data and algorithms to biological insights. Appl. Plant Sci. 2020, 8, e11386.
- Liu, X.; Li, N.; Huang, Y.; Lin, X.; Ren, Z. A comprehensive review on acquisition of phenotypic information of Prunoideae fruits: Image technology. Front. Plant Sci. 2023, 13, 1084847.
- Shang, Y.; Hasan, M.K.; Ahammed, G.J.; Li, M.; Yin, H.; Zhou, J. Applications of nanotechnology in plant growth and crop protection: A review. Molecules 2019, 24, 2558.
- Ma, Z.; Liu, S. A review of 3D reconstruction techniques in civil engineering and their applications. Adv. Eng. Inform. 2018, 37, 163–174.
- Zermas, D.; Morellas, V.; Mulla, D.; Papanikolopoulos, N. 3D model processing for high throughput phenotype extraction–the case of corn. Comput. Electron. Agric. 2020, 172, 105047.
- Zhao, G.; Cai, W.; Wang, Z.; Wu, H.; Peng, Y.; Cheng, L. Phenotypic parameters estimation of plants using deep learning-based 3-D reconstruction from single RGB image. IEEE Geosci. Remote Sens. Lett. 2022, 19, 2506705.
- Zhu, T.; Ma, X.; Guan, H.; Wu, X.; Wang, F.; Yang, C.; Jiang, Q. A method for detecting tomato canopies’ phenotypic traits based on improved skeleton extraction algorithm. Comput. Electron. Agric. 2023, 214, 108285.
- Li, Y.; Wen, W.; Fan, J.; Gou, W.; Gu, S.; Lu, X.; Yu, Z.; Wang, X.; Guo, X. Multi-source data fusion improves time-series phenotype accuracy in maize under a field high-throughput phenotyping platform. Plant Phenomics 2023, 5, 0043.
- Qian, R.; Lai, X.; Li, X. 3D object detection for autonomous driving: A survey. Pattern Recognit. 2022, 130, 108796.
- Shi, Y.; Guo, Y.; Mi, Z.; Li, X. Stereo CenterNet-based 3D object detection for autonomous driving. Neurocomputing 2022, 471, 219–229.
- Li, Z.; Gao, Y.; Hong, Q.; Du, Y.; Serikawa, S.; Zhang, L. Keypoint3D: Keypoint-based and anchor-free 3D object detection for autonomous driving with monocular vision. Remote Sens. 2023, 15, 1210.
- Nguyen, H.X.; Hoang, D.N.; Bui, H.V.; Dang, T.M. Development of a human daily action recognition system for smart-building applications. In Proceedings of the International Conference on Intelligent Systems & Networks, Hanoi, Vietnam, 18–19 March 2023; Springer Nature: Singapore, 2023; pp. 366–373.
- Fu, H.; Gao, J.; Liu, H. Human pose estimation and action recognition for fitness movements. Comput. Graph. 2023, 116, 418–426.
- Du, X.; Meng, Z.; Ma, Z.; Lu, W.; Cheng, H. Tomato 3D pose detection algorithm based on keypoint detection and point cloud processing. Comput. Electron. Agric. 2023, 212, 108056.
- Zheng, B.; Sun, G.; Meng, Z.; Nan, R. Vegetable size measurement based on stereo camera and keypoints detection. Sensors 2022, 22, 1617.
- Xiao, J.; Suab, S.A.; Chen, X.; Singh, C.K.; Singh, D.; Aggarwal, A.K.; Korom, A.; Widyatmanti, W.; Mollah, T.H.; Minh, H.V.T.; et al. Enhancing assessment of corn growth performance using unmanned aerial vehicles (UAVs) and deep learning. Measurement 2023, 214, 112764.
- Dulal, R.; Zheng, L.; Kabir, M.A.; McGrath, S.; Medway, J.; Swain, D.; Swain, W. Automatic cattle identification using YOLOv5 and mosaic augmentation: A comparative analysis. In Proceedings of the 2022 International Conference on Digital Image Computing: Techniques and Applications (DICTA), Sydney, Australia, 30 November–2 December 2022; pp. 1–8.
- Li, P.; Chen, X.; Shen, S. Stereo R-CNN based 3D object detection for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7644–7652.
- Atefi, A.; Ge, Y.; Pitla, S.; Schnable, J. Robotic detection and grasp of maize and sorghum: Stem measurement with contact. Robotics 2020, 9, 58.
- Ortez, O.A.; McMechan, A.J.; Hoegemeyer, T.; Rees, J.; Jackson-Ziems, T.; Elmore, R.W. Abnormal ear development in corn: A field survey. Agrosyst. Geosci. Environ. 2022, 5, e20242.
- Jiang, P.; Ergu, D.; Liu, F.; Cai, Y.; Ma, B. A review of YOLO algorithm developments. Procedia Comput. Sci. 2022, 199, 1066–1073.
- Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. CenterNet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6569–6578.
- Siliang, M.; Yong, X. MPDIoU: A loss for efficient and accurate bounding box regression. arXiv 2023, arXiv:2307.07662.
- Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. Proc. AAAI Conf. Artif. Intell. 2020, 34, 12993–13000.
- Zhao, X.; Song, Y. Improved ship detection with YOLOv8 enhanced with MobileViT and GSConv. Electronics 2023, 12, 4666.
- Guo, Y.; Li, Y.; Wang, L.; Rosing, T. Depthwise convolution is all you need for learning multiple visual domains. Proc. AAAI Conf. Artif. Intell. 2019, 33, 8368–8375.
- Shah, S.R.; Qadri, S.; Bibi, H.; Shah, S.M.W.; Sharif, M.I.; Marinello, F. Comparing Inception V3, VGG 16, VGG 19, CNN, and ResNet 50: A case study on early detection of a rice disease. Agronomy 2023, 13, 1633.
- Vadera, S.; Ameen, S. Methods for pruning deep neural networks. IEEE Access 2022, 10, 63280–63300.
- Gevorgyan, Z. SIoU loss: More powerful learning for bounding box regression. arXiv 2022, arXiv:2205.12740.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).