1. Introduction
The automatic phenotypic analysis of plants based on computer vision technology has become a crucial component in facilitating high-throughput experimentation in both botany and agricultural research. Compared to traditional manual measurement methods, computer vision-based phenotyping methodologies offer advantages in reducing manual labor, promoting non-invasiveness, and efficiently managing large volumes of data [1]. As a result, these methodologies have been widely recognized and applied over the past several years. Depending on the data dimensionality used, this morphological analysis technique can be divided into 2D image-based and 3D point cloud-based phenotyping methods. While 2D phenotyping technology has been actively pursued by many researchers owing to its convenient data acquisition and high-throughput capabilities, plant occlusion remains a persistent challenge, primarily because 2D images lack spatial structural information. To overcome these challenges, the research community has turned to three-dimensional data processing techniques in phenotypic studies, with a notable focus on employing 3D point cloud data for detailed phenotypic analysis [2,3,4,5]. Unlike 2D imaging, 3D imaging technologies inherently capture and convey spatial information in three dimensions. These methodologies can significantly reduce the information loss caused by occlusion and overlap, offering a promising outlook for phenotypic data acquisition. Consequently, the integration of 3D vision technologies into plant phenotyping research has become more prevalent, enabling more detailed and comprehensive phenotypic evaluations.
Methods for obtaining plant point clouds can be systematically categorized as either active or passive. Active measurement techniques involve the direct use of sensors to capture the 3D structure of plants [6]. Common 3D sensor measurement techniques include laser triangulation [7], time of flight [8], terrestrial laser scanning (TLS) [9], structured light [10], and tomographic methods [11,12]. The application of UAV laser scanning (UAV-LS) technology in the agricultural field has steadily increased with recent advancements in UAV technology [13]. In addition, full-waveform LiDAR data can store the entire echo scattered by an illuminated object at different temporal resolutions [14,15,16,17,18,19]. In [20], DJI Zenmuse L1 multi-temporal point clouds were utilized to simulate UAV-LS, and the vertical stratification of various crops was evaluated. Passive techniques, on the other hand, principally revolve around reconstructing 3D point clouds from 2D images collected from various vantage points. Within the realm of passive measurement, the primary technologies and algorithms include Structure from Motion (SfM) [21,22], Multi-View Stereo (MVS) [23], and space carving [24,25].
Discerning various attributes associated with plant phenotypes often necessitates an initial and accurate segmentation of the plant structures. Precisely segmenting plant organs is essential for high-throughput and high-precision phenotype analysis. However, due to the intrinsic complexity of plant structures and the morphological variability that occurs throughout the plant life cycle, achieving precise segmentation is challenging. Over the last decade, numerous scientists have utilized machine learning algorithms and constructed an array of models to intensively research plant organ segmentation. Standard practice involves pre-computing local features of plant point clouds, including surface normals, curvature, and smoothness [26,27,28], local covariance matrices [29], tensors [30], and point feature histograms [31,32,33]. The foundation of these methods lies in the noticeable heterogeneity in the morphological structures of different plant organs.
Alternatively, another potent strategy involves fitting the point cloud data to specific geometric structures, such as ellipsoids [34], tubular structures [23], cylinders [35], or rings [36]. These techniques are particularly relevant for organs exhibiting markedly different morphological structures. Apart from these direct methodologies that capitalize on morphological differences, an effective approach involves the use of spectral clustering to aggregate points of the same organ by calculating inter-point similarities [37]; however, this approach often incurs considerable computational costs. The combination of 2D imagery and 3D point clouds has also demonstrated commendable results in some plant studies [38,39]. Mature plants often display organ overlap due to dense canopies, prompting noteworthy attempts to adopt skeletonization algorithms that extract the “skeleton” of the plant to simplify its morphological complexity [40]. Despite the success of these methodologies in some studies, they frequently necessitate the pre-computation of complex handcrafted features and are typically relevant only to specific morphological structures or closely related plant species, thereby limiting their widespread applicability.
In contrast to traditional processing methods, deep learning approaches enable data-driven feature extraction from the input data. These techniques eliminate the constraints of prior plant knowledge and facilitate organ segmentation of crops at varying growth stages without the need for intricate parameter tuning. Over recent years, a multitude of deep learning-based point cloud processing solutions have been broadly applied within the plant phenotyping domain. According to the data input method, these can be categorized into projection-based, voxel-based, and point-based methods. Projection-based methods project 3D point clouds onto 2D planes to form images, analyze these 2D images with Convolutional Neural Networks (CNNs), and finally map the segmentation results back into 3D point space [41]. Some strategies establish correlations between 2D images and 3D point clouds via projection to enhance understanding of point clouds [42,43]. However, these methods can be impeded by projection angles and occlusions, potentially leading to a loss of crucial structural information.
An alternative approach involves mapping points into a 3D voxel grid and processing the voxelized data with 3D CNNs. For example, Jin et al. presented a voxel-based convolutional neural network (VCNN) specifically designed for the semantic segmentation of maize stems and leaves across various growth stages [44]. To address potential information loss during the mapping process, some researchers have proposed the concept of dynamic voxels [45], applying it to plant point cloud segmentation [46]. However, even with this approach, some point cloud information may be lost, and the voxel size significantly impacts network performance: larger voxels lose more detailed information, while smaller voxels incur considerable computational costs.
Point-based methods, on the other hand, extract features directly from individual points for segmentation. Outstanding advancements in point cloud data processing have been marked by frameworks such as PointNet [47] and PointNet++ [48], both of which enable end-to-end learning. These frameworks learn features directly from raw point cloud data without manual intervention. PointNet and PointNet++ have seen substantial application in the field of plant phenotyping [49,50,51,52,53], with a host of scholars refining them further to enhance their performance. The main directions for improvement include adding local feature extraction modules [54], constructing point connection graphs [55,56], applying kernel point convolution [57,58], implementing down-sampling strategies [56,59,60], adding residual layers [61], and promoting multimodal feature fusion [62,63]. Despite these advancements, these methods still rely on feature extractors that, although effective at capturing local features, fail to consider information from more distant points, thus hampering performance. In response, Xiang et al. [64] proposed CurveNet, a model built around curves that continually extends its trajectory to connect with more distant points and, as such, has demonstrated remarkable results.
Acknowledging the escalating importance of attention mechanisms, certain researchers have adopted these strategies to foster richer interactions within point cloud data. For instance, Li et al. advocated the PSegNet approach [59], which capitalizes on channel and spatial attention to amplify the model’s feature extraction capabilities. Furthermore, the advent of self-attention mechanisms and transformers has paved the way for exciting opportunities in point cloud data processing. Transformers inherently capture long-range relations via self-attention mechanisms [65]. Models such as the Point Transformer (PT) [66], Voxel Transformer (VT) [67], and Stratified Transformer (ST) [68] have demonstrated exceptional performance on numerous datasets, thereby drawing considerable interest in plant phenotyping research.
Moreover, significant strides have been made by researchers such as Guo et al., who integrated an Adaptive Self-Attention module into the PointNet framework, achieving substantial improvements in performance [69]. Some scholars have also employed novel position encoding strategies [70] and stratified feature extraction architectures [71] within the classic transformer framework to strengthen the model’s feature extraction ability. Undoubtedly, the emergence of self-attention mechanisms and transformer frameworks has significantly improved the performance of point cloud segmentation models. However, this enhancement also incurs an appreciable computational cost, and models built on the transformer framework usually require a substantial volume of training data. To address this predicament, certain researchers have turned to weakly supervised segmentation techniques such as Eff-3DPSeg [72]. These weakly supervised approaches not only reduce the reliance on extensive data but also provide fresh perspectives for improving network segmentation performance.
Despite the relatively comprehensive application of various point cloud segmentation algorithms in the realm of plant phenotyping, there are numerous issues necessitating further exploration. The intricate morphological structure of plants poses significant challenges; while some models demonstrate commendable segmentation performance on simpler plant entities, they struggle with segmenting complex mature crops or smaller organ categories. This predicament significantly impedes the progression of plant phenomics research. Therefore, whether these novel network models are suited to a variety of crop structures remains a worthwhile topic for exploration. Additionally, the majority of studies on plant organ segmentation methods are conducted on crop objects in controlled environments, with few extending to field environments. In plant cultivation and crop breeding, the ultimate aim of phenotyping is to promote vigorous plant growth in fields. Thus, technologies suitable for field phenotyping are needed. Finally, the quality of point clouds generated by various point cloud acquisition techniques differs; one cannot simply test model performance on point clouds acquired from a single platform. To some extent, data disparities also influence model performance. Therefore, a comprehensive evaluation of models on point cloud datasets acquired from various platforms is necessary to meet the requirements of current practical applications.
In our view, it is essential to explore the application of classical point cloud data processing models on point cloud datasets of various plants collected in diverse environments and on numerous platforms. By analyzing these methods, we hope to inspire the design of superior models and point cloud processing strategies and to promote the practical application of 3D point cloud technology in agricultural production.
4. Results
For the task of organ-level point cloud segmentation, we conducted a comprehensive assessment of the segmentation performance of the aforementioned nine models on point cloud data acquired under different scenarios and from various sensor platforms. The segmentation experiments were classified into three categories according to the method of obtaining point clouds: (1) experiments using point clouds acquired through laser triangulation, (2) experiments utilizing TLS point clouds, and (3) experiments on image-generated point clouds. Furthermore, in this section, the performance of the Mask3D model in performing instance segmentation on plot point clouds was also validated.
All experiments were conducted on a server with a 24-core, 48-thread CPU, 256 GB memory, and 4 NVIDIA Tesla V100 SXM3 GPUs, running on the Ubuntu operating system. PyTorch was used as the training framework. Each network model underwent single-GPU training, with 150 epochs for Stratified Transformer and Point Transformer training, 4000 epochs for Mask3D, and 250 epochs for the remaining models.
For the task of organ-level point cloud segmentation, to ensure consistent evaluation, all crop point clouds were down-sampled to 10,000 points using the Farthest Point Sampling (FPS) method. Before the data were fed into the network, shuffling and normalization were performed. In doing so, we aimed to validate the efficacy of these classical deep learning models on point clouds collected from various environments and platforms. For the segmentation of individual plants, feeding the entire plot point cloud into the network would require a large sub-sampling rate, resulting in a significant loss of geometric information. Therefore, the plot maize point cloud was partitioned into fixed-size voxels, and the voxels were traversed so that features could be extracted from all points within each voxel.
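To make this preprocessing pipeline concrete, the following is a minimal NumPy sketch of farthest point sampling followed by shuffling and unit-sphere normalization; the function names are ours and the snippet is illustrative rather than the exact implementation used in the experiments.

```python
import numpy as np

def farthest_point_sampling(points: np.ndarray, n_samples: int = 10000) -> np.ndarray:
    """Greedy FPS: iteratively pick the point farthest from all picks so far."""
    n = points.shape[0]
    selected = np.zeros(n_samples, dtype=np.int64)
    min_dist = np.full(n, np.inf)
    selected[0] = np.random.randint(n)           # random seed point
    for i in range(1, n_samples):
        diff = points - points[selected[i - 1]]
        min_dist = np.minimum(min_dist, np.einsum("ij,ij->i", diff, diff))
        selected[i] = np.argmax(min_dist)        # farthest from the current set
    return points[selected]

def normalize_to_unit_sphere(points: np.ndarray) -> np.ndarray:
    """Center the cloud at the origin and scale it into the unit sphere."""
    centered = points - points.mean(axis=0)
    return centered / np.linalg.norm(centered, axis=1).max()

# Example: down-sample a raw crop scan, then shuffle and normalize it.
raw = np.random.rand(50000, 3).astype(np.float32)   # placeholder for a real scan
cloud = farthest_point_sampling(raw, 10000)
np.random.shuffle(cloud)
cloud = normalize_to_unit_sphere(cloud)
```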
4.1. Organ-Level Segmentation
4.1.1. Experiments Using Point Clouds Acquired through Laser Triangulation
The Pheno4D dataset [86] was utilized as the source of laser scanning data under controlled indoor conditions to evaluate the performance of various semantic segmentation models.
Table 3 presents the segmentation results of these models on the Pheno4D dataset, where OA denotes the overall accuracy, mACC denotes the mean per-class accuracy, and mIoU denotes the mean per-class IoU. The majority of the models achieved an mIoU of over 80%, with the exception of PointNet and DGCNN. The ST and PT models exhibited the best segmentation performance, with overall accuracies surpassing 95%. The segmentation outcomes are depicted in Figure 4.
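For reference, all three metrics can be derived from a per-class confusion matrix. The sketch below uses our own notation and assumes conf[i, j] counts the points of true class i predicted as class j.

```python
import numpy as np

def segmentation_metrics(conf: np.ndarray):
    """OA, mACC, and mIoU from a per-class confusion matrix.

    conf[i, j] = number of points of true class i predicted as class j.
    """
    tp = np.diag(conf).astype(np.float64)
    oa = tp.sum() / conf.sum()                          # overall accuracy
    macc = np.mean(tp / conf.sum(axis=1))               # mean per-class accuracy
    union = conf.sum(axis=1) + conf.sum(axis=0) - tp    # |pred ∪ true| per class
    miou = np.mean(tp / union)                          # mean IoU
    return oa, macc, miou
```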
PointNet and DGCNN exhibited subpar performance on the maize and tomato crops, with notable inaccuracies, especially in segmenting the upper stems and leaves of tomato canopies. While the other models successfully segmented maize stems and leaves, only PointCNN, PT, and ST managed to effectively segment the structurally intricate tomato plants; the remaining networks still struggled to accurately segment the upper stems of tomato plants.
4.1.2. Experiments Utilizing TLS Point Clouds
The datasets of maize, cotton, rapeseed, and potatoes were utilized to validate the models’ application potential for point cloud data acquired using TLS technology.
Table 4 presents the segmentation results of these models on TLS point clouds. Focusing on maize and cotton as representative examples, the segmentation results are illustrated in Figure 5. Compared to the Pheno4D dataset, there was a general decline in performance metrics. Point cloud data collected in real growing scenarios are typically affected by various environmental factors, such as background clutter and wind, and often contain noise points that are difficult to filter as well as occlusions caused by dense planting. These datasets therefore present greater challenges for segmentation than Pheno4D. ST achieved the best segmentation results across all subjects, while PT secured sub-optimal results in most cases. PointCNN and PAConv yielded comparable segmentation outcomes for both crops, effectively separating the stems and leaves; however, more detailed areas, such as maize cobs and occluded sections of cotton plants, showed varying levels of misclassification.
4.1.3. Experiments on Image-Generated Point Clouds
The models’ segmentation performance was validated using point clouds of reconstructed tomato plants as the dataset. This dataset was obtained under greenhouse conditions by capturing RGB images with DSLR cameras, from which the point clouds were then reconstructed.
Table 5 presents the segmentation results of the models on the dataset of mature tomatoes, with the visualization results displayed in Figure 6. The morphological structure of the tomato plants is highly complex, compounded by a few persistent noise points that increase the difficulty of organ segmentation. The mean Intersection over Union (mIoU) for all models was below 70%. ST and PT achieved the best and second-best results, respectively, but their segmentation outcomes were still unsatisfactory: the models continue to struggle with the segmentation of tomato fruits, and the segmentation of stems and leaves in densely leafed areas also poses significant challenges. The other models even failed to identify the majority of the tomato stems.
Additionally, to verify the potential of the models in segmenting point clouds derived from UAV-based RGB images, we utilized reconstructed maize point clouds as an example dataset. The model performance is summarized in Table 6, with the visualization results displayed in Figure 7. All models achieved an accuracy above 0.8. CurveNet exhibited the poorest performance, with an mIoU of 0.608, indicating significant degradation compared to data collected from other platforms. Moreover, PointMLP’s performance was notably impacted, falling below that of its predecessor, PointNet. In contrast, DGCNN showed relative performance improvements over the previous datasets. Both ST and PT maintained their positions as the top-performing and second-best models, respectively.
4.1.4. Comprehensive Evaluation
This evaluation comprehensively measures the models’ performance on both the ideal Pheno4D data and the datasets collected under real-world conditions, allowing us to assess segmentation performance across various types of datasets. As shown in Table 7, ST and PT achieved the best and second-best results, respectively. PointCNN and PAConv reached an mIoU of over 75%, followed by PointNet++ and DGCNN. CurveNet and PointMLP exhibited similar segmentation outcomes, performing poorly on the crop subjects, with insufficient precision in detailed segmentation. PointNet was unable to handle the organ segmentation tasks for the aforementioned subjects adequately.
PointNet, a pivotal model in point cloud data processing, utilizes a unique point-wise approach and architecture, as detailed in Section 3.2. Despite its innovation, PointNet does not consider inter-point relationships and depends primarily on global feature extraction through max pooling at the terminal stage. Consequently, PointNet lacks power in tasks requiring the integration of both local and global features for point cloud segmentation. Its inability to simultaneously capture detailed local structures and overarching global patterns makes it less effective for segmenting plant organs, where intricate local–global feature extraction is crucial.
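As a schematic illustration of this design, the following PyTorch sketch (our own simplification, with the input and feature transform networks omitted) shows the per-point shared MLPs, the single global max pooling, and the concatenation of the global vector back onto each point for segmentation.

```python
import torch
import torch.nn as nn

class MiniPointNetSeg(nn.Module):
    """Schematic PointNet segmentation head (input/feature transforms omitted)."""
    def __init__(self, num_classes: int):
        super().__init__()
        self.local_mlp = nn.Sequential(nn.Conv1d(3, 64, 1), nn.ReLU())      # per-point
        self.global_mlp = nn.Sequential(nn.Conv1d(64, 1024, 1), nn.ReLU())  # per-point
        self.head = nn.Sequential(                                           # classifier
            nn.Conv1d(1024 + 64, 256, 1), nn.ReLU(),
            nn.Conv1d(256, num_classes, 1))

    def forward(self, xyz):                               # xyz: (B, 3, N)
        local = self.local_mlp(xyz)                       # (B, 64, N)
        feat = self.global_mlp(local)                     # (B, 1024, N)
        glob = feat.max(dim=2, keepdim=True).values       # single global max pool
        glob = glob.expand(-1, -1, xyz.shape[2])          # broadcast to every point
        return self.head(torch.cat([local, glob], dim=1))  # (B, num_classes, N)
```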
PointNet++ significantly enhances local feature extraction compared to its predecessor, PointNet. It adopts an encoder–decoder framework that leverages FPS to identify key central points. These points anchor a ball query that divides the point cloud into overlapping local regions via a grouping process. Within these regions, a mini-PointNet is used to encode local patterns into feature vectors. This design enables PointNet++ to efficiently gather a more comprehensive set of local information, improving its utility in detailed segmentation tasks. PointMLP is noted for its efficient design, incorporating residual point MLP modules for local feature extraction and max pooling for feature aggregation. Inspired by residual networks, this model progressively expands its receptive field, similar to PointNet++, and likewise uses k-NN for local grouping and MLPs for feature processing. Although PointMLP successfully enlarges its receptive field by stacking MLP modules to enhance global feature capture, it falls short in modeling the interconnections between local groups. DGCNN utilizes a graph-based framework to delve into complex point relationships, incorporating k-NN aggregation in its EdgeConv module to capture local details. After the final EdgeConv operation, it aggregates global features via max pooling. Despite its strengths, DGCNN’s local information capture is limited to the neighboring points it considers, omitting more distal points.
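The set abstraction step described above can be sketched as follows; this is a simplified single-scale version with our own naming, not the reference implementation, and for brevity balls containing fewer than k points simply include farther neighbors.

```python
import torch
import torch.nn as nn

def ball_group(xyz: torch.Tensor, centers: torch.Tensor, radius: float, k: int):
    """For each center, gather the k nearest points, preferring those inside the ball."""
    d = torch.cdist(centers, xyz)                    # (M, N) pairwise distances
    d = d.masked_fill(d > radius, float("inf"))      # push out-of-ball points last
    idx = d.topk(k, largest=False).indices           # (M, k) neighbor indices
    return xyz[idx] - centers.unsqueeze(1)           # (M, k, 3), local coordinates

class SetAbstraction(nn.Module):
    """One PointNet++ set-abstraction level: FPS centers -> ball query -> mini-PointNet."""
    def __init__(self, out_dim: int = 64):
        super().__init__()
        self.mini_pointnet = nn.Sequential(          # shared MLP over local coordinates
            nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, out_dim), nn.ReLU())

    def forward(self, xyz, centers, radius=0.2, k=32):
        groups = ball_group(xyz, centers, radius, k)   # (M, k, 3)
        feats = self.mini_pointnet(groups)             # (M, k, out_dim) per point
        return feats.max(dim=1).values                 # (M, out_dim) region codes
```

Here `centers` would be produced by FPS, as in the earlier preprocessing sketch; stacking several such levels yields the encoder hierarchy described above.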
CurveNet starts from the farthest point, employing a k-NN method and an MLP-based scoring system to select the starting point for curve analysis. It then uses a walking policy for curve construction and an MLP to extract and integrate curve features. CurveNet’s curve aggregation technique and walking strategy effectively incorporate information from distant points, enhancing connectivity across extended regions of the point cloud. Our comparative experiments indicate that CurveNet matches the performance of PointCNN and PAConv on the clean Pheno4D datasets, but its effectiveness declines on the crop types collected outdoors. Xiang et al. have noted that CurveNet’s curve detection is particularly vulnerable to noise, which can alter the initial curve point and cause inaccuracies [64]. In real-world conditions, the crop point clouds we collected contain noise and partial occlusions, significantly affecting CurveNet’s ability to accurately group data. We conjecture that CurveNet’s performance is strongly dependent on dataset quality, with noise and missing data notably degrading its ability to segment plant objects.
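To make the described walking procedure concrete, the toy sketch below greedily grows a curve by scoring the k nearest unvisited neighbors with a small MLP; the scoring network and all names are ours, and CurveNet’s actual learned walking policy and curve aggregation are considerably more elaborate.

```python
import torch
import torch.nn as nn

def walk_curve(xyz: torch.Tensor, score_mlp: nn.Module, start: int, steps: int, k: int = 8):
    """Greedily extend a curve: at each step, move to the neighbor the MLP scores highest."""
    curve, visited = [start], {start}
    for _ in range(steps):
        cur = xyz[curve[-1]]
        d = torch.cdist(cur.unsqueeze(0), xyz).squeeze(0)   # distances to all points
        nbrs = d.topk(k + 1, largest=False).indices.tolist()
        nbrs = [i for i in nbrs if i not in visited][:k]    # drop self / visited points
        if not nbrs:
            break
        cand = xyz[nbrs] - cur                              # candidate offsets
        scores = score_mlp(cand).squeeze(-1)                # learned desirability
        nxt = nbrs[int(scores.argmax())]
        curve.append(nxt)
        visited.add(nxt)
    return curve                                            # ordered point indices

# Toy usage: a 2-layer scoring MLP over 3-D offsets.
score_mlp = nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 1))
pts = torch.rand(1024, 3)
indices = walk_curve(pts, score_mlp, start=0, steps=32)
```

The sketch also illustrates the fragility discussed above: a noise point near the walk can win the score and divert the whole curve.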
PointCNN and PAConv demonstrate acceptable segmentation abilities for various plants, including rapeseed, potatoes, and cotton, with similar performance outcomes. PointCNN features a unique convolutional operator called X-Conv, which utilizes k-NN to collect nearby points for feature aggregation, adapting the principles of traditional CNNs to 3D point cloud data. The X-Conv layers gather and centralize features from adjacent points, and through their sequential arrangement, they progressively widen their receptive field to cover the entire point cloud. While this method effectively captures and combines local features, it struggles to integrate attributes from more distant points. This shortcoming is notably problematic when analyzing crops with complex morphologies, where the model’s reliance on stacking convolutional layers might fail to capture detailed feature interactions, possibly resulting in less accurate segmentation. The Position-Adaptive Convolution (PAConv) method optimizes point cloud processing by employing k-NN to select neighboring points and dynamically creating a weight matrix using a network named ScoreNet. This adaptability allows the operator to conform effectively to diverse point cloud geometries. Integrating PAConv into the DGCNN framework by replacing the EdgeConv operator significantly enhances the model’s ability to capture local features. This integration reduces the computational load of measuring feature distances and facilitates the inclusion of additional neighboring points, thereby improving local feature representation. As a result, there is a notable increase in the average Intersection over Union (IoU) across various crops, from 0.733 to 0.767. The application of PAConv within the DGCNN framework enables superior segmentation, particularly in relatively complex crops like cotton, marking a substantial improvement in the detailed capture of local structures within point clouds.
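The ScoreNet idea can be sketched as follows: a small network maps each neighbor’s relative position to mixing coefficients over a bank of candidate weight matrices, and the per-neighbor kernel is assembled as their weighted sum. The names and dimensions below are our own simplification of the published design.

```python
import torch
import torch.nn as nn

class PAConvSketch(nn.Module):
    """Simplified position-adaptive convolution: ScoreNet mixes a weight bank."""
    def __init__(self, in_dim=3, out_dim=64, n_kernels=8):
        super().__init__()
        # Bank of candidate weight matrices, mixed per neighbor.
        self.bank = nn.Parameter(torch.randn(n_kernels, in_dim, out_dim) * 0.1)
        self.score_net = nn.Sequential(          # relative position -> mixing scores
            nn.Linear(3, 16), nn.ReLU(),
            nn.Linear(16, n_kernels), nn.Softmax(dim=-1))

    def forward(self, rel_pos, nbr_feats):
        # rel_pos: (M, k, 3) neighbor offsets; nbr_feats: (M, k, in_dim).
        scores = self.score_net(rel_pos)                             # (M, k, n_kernels)
        kernels = torch.einsum("mks,sio->mkio", scores, self.bank)   # adaptive weights
        out = torch.einsum("mki,mkio->mko", nbr_feats, kernels)      # apply per neighbor
        return out.max(dim=1).values                                 # (M, out_dim)
```

Because the kernel is assembled from scores rather than recomputed from feature distances, adding more neighbors mainly costs one extra ScoreNet pass per neighbor, which is consistent with the efficiency gain noted above.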
The Point Transformer employs a self-attention mechanism to effectively extract features from point clouds, using k-NN during its Transition Down phase to collect neighboring points and max pooling to integrate this local information. Building on this, the Stratified Transformer enhances the design by incorporating the KPConv method for point embedding and adopting stratified sampling to capture both dense and sparse points. This methodology ensures a thorough integration of both local and global information. The Stratified Transformer, in contrast to the Point Transformer, features a broader receptive field that allows it to collect data from more distant points. This is particularly demonstrated by its robust segmentation capabilities across a variety of crop types. Excelling in segmenting plant organs within seven distinct crops, the Stratified Transformer proves highly effective and versatile in handling complex point cloud data for agricultural applications. The excellent segmentation performance of both the Point Transformer and Stratified Transformer demonstrates the significant potential of self-attention mechanisms in the task of plant organ segmentation.
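A minimal form of the local self-attention used by such models can be sketched as shown below, where each point attends over its k nearest neighbors with a learned relative position encoding; this is a simplified scalar-attention stand-in for the vector attention used in the actual Point Transformer, with names of our own choosing.

```python
import torch
import torch.nn as nn

class LocalPointAttention(nn.Module):
    """Each point attends to its k nearest neighbors (simplified scalar attention)."""
    def __init__(self, dim=64, k=16):
        super().__init__()
        self.k = k
        self.q, self.kv = nn.Linear(dim, dim), nn.Linear(dim, 2 * dim)
        self.pos = nn.Linear(3, dim)                 # relative position encoding

    def forward(self, xyz, feats):
        # xyz: (N, 3); feats: (N, dim).
        idx = torch.cdist(xyz, xyz).topk(self.k, largest=False).indices    # (N, k)
        rel = self.pos(xyz[idx] - xyz.unsqueeze(1))                        # (N, k, dim)
        q = self.q(feats).unsqueeze(1)                                     # (N, 1, dim)
        k_, v = self.kv(feats[idx]).chunk(2, dim=-1)                       # (N, k, dim) each
        attn = ((q * (k_ + rel)).sum(-1) / q.shape[-1] ** 0.5).softmax(-1) # (N, k)
        return (attn.unsqueeze(-1) * (v + rel)).sum(dim=1)                 # (N, dim)
```

The Stratified Transformer’s key change, in these terms, is in how `idx` is built: mixing dense nearby neighbors with sparsely sampled distant ones, which widens the receptive field without attending to every point.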
4.1.5. Evaluation of Computational Costs
During our performance evaluation, the Stratified Transformer emerged as the most accurate model for segmentation. To thoroughly assess each model’s effectiveness, we recorded their training times, which reflect computational costs. Each model underwent training on a single GPU, optimizing the batch size to match the GPU’s capacity.
Table 8 details the time efficiency and segmentation accuracy of each model. Notably, the Stratified Transformer, while delivering superior segmentation results, required the most extensive training time at 88.16 h. Conversely, the PointMLP model, with its efficient architecture, logged the shortest training period of just 2.86 h.
Figure 8 provides a comprehensive comparative analysis of various segmentation models, illustrating their performance and computational efficiency. Additionally, the figure summarizes key performance metrics—overall accuracy (OA) and mean IoU (mIoU)—for all evaluated crop types. The comparison highlights the high computational demands of the Stratified Transformer. In contrast, models like PointCNN and PAConv offer impressive accuracy with much shorter training times, making them suitable for simpler crops.
The field of point cloud data processing has become a focal point in both academic and industrial circles, fueled by innovative models that have significantly enhanced segmentation technologies. These models excel at segmentation, utilizing a range of techniques including convolutional operations, graph-based methods, and attention mechanisms to extract local features effectively.
However, despite these advancements, the task of accurately segmenting point clouds, particularly for plant species with complex morphologies, continues to be a pressing challenge. Models like the Stratified Transformer demonstrate exceptional segmentation abilities but are hindered by high computational demands. There is an evident need for models that optimally balance computational efficiency and segmentation accuracy. This balance is crucial and represents a key direction for future research in point cloud segmentation. The ongoing efforts to improve the precision and efficiency of these models are expected to catalyze the next generation of innovations in this field.
4.2. Individual Plant-Level Segmentation
The dataset of plot maize was utilized to validate Mask3D’s application potential for point cloud data acquired using UAV-LS technology.
Table 9 presents the segmentation results of Mask3D on the plot maize dataset, and Figure 9 illustrates the visualization of its instance segmentation results. Overall, Mask3D achieved satisfactory segmentation performance on the plot maize, with a maize accuracy of 0.817 and a mean Average Precision (mAP) of 0.909. The visualization demonstrates that Mask3D effectively achieves instance segmentation of plot maize point clouds: apart from one plant that was not distinguished, remarkable segmentation results were achieved on all other individual maize plants. More importantly, compared to traditional segmentation algorithms, Mask3D does not require the configuration of complex parameters, showcasing significant application potential in point cloud segmentation at the individual plant level.
6. Conclusions
We conducted in-depth comparative experimental analyses of several typical deep learning models. To evaluate the models’ segmentation performance across diverse environments and data generated by different platforms, we summarized the existing public datasets, provided detailed descriptions, and identified their shortcomings: limited plant species, small sample sizes, single data collection methods, and overly idealized collection environments. We conducted data acquisition using terrestrial laser scanning and UAV laser scanning, and reconstructed point clouds from RGB images. The datasets comprised point cloud data from various crops, including maize, cotton, potato, and rapeseed, and were subsequently utilized to assess the models’ practical application effectiveness.
For organ-level segmentation, point cloud data acquisition and generation can be divided into five categories: indoor laser scanning, indoor 3D reconstruction from images, outdoor potted terrestrial laser scanning, field terrestrial laser scanning, and 3D reconstruction from UAV RGB images. The experiments show that ST and PT achieved the best and second-best results, respectively, but require significant computational resources. PAConv and PointCNN show great potential on crops with simpler morphological structures. Based on the experimental results of the nine typical models, we conducted in-depth comparative analyses of model performance, focusing on three key steps of model composition: local operations, grouping methods, and feature aggregation methods. In this way, we explored the reasons behind the models’ strengths and weaknesses and identified the inherent challenges of plant organ segmentation tasks: imbalanced class data, data loss caused by occlusion, and unavoidable noise.
Additionally, the potential application of the classic scene instance segmentation model Mask3D on maize point clouds collected by UAV-LS was validated. Mask3D achieved commendable performance, with a mean Average Precision (mAP) of 0.909. This finding supports the development of automated segmentation methods from plot point clouds to individual plant point clouds. Unlike traditional point cloud instance segmentation methods, Mask3D does not require the pre-computation of complex features, which greatly facilitates the segmentation of field maize. Research on end-to-end algorithms for segmenting individual point clouds from plot point clouds remains relatively scarce and warrants further exploration.
In future research, several key points warrant attention. First, the lack of automated methods for segmenting individual plant point clouds from plot point clouds limits high-throughput phenotyping in field scenarios. Second, robust point cloud segmentation models with strong feature extraction capabilities and the ability to adapt to multi-scale information are required. Additionally, in the realm of deep learning models, the down-sampling of point clouds presents a critical issue. The number of input points is linearly related to the model’s computational cost. Therefore, determining an appropriate threshold is a topic that merits investigation.