1. Introduction
Mushrooms are a common type of fungus and are globally recognized for their culinary and medicinal applications. They are valued for their distinctive taste, rich nutritional content, anti-tumor properties, cholesterol-lowering effects, and significant commercial importance [1]. According to Expert Market Research, the global mushroom market was valued at approximately USD 68.03 billion in 2023, with a projected compound annual growth rate of 7.18% from 2024 to 2032 [2]. Shiitake mushrooms (Lentinula edodes), native to China, are a dual-purpose crop used in both food and medicine and have become one of China's most competitive agricultural exports [3]. In 2022, China produced 12.95 million tons of shiitake mushrooms, accounting for 98.3% of global production (Ministry of Agriculture and Rural Affairs of China). Phenotypic traits of shiitake mushrooms are vital for selecting high-quality strains and grading products. However, current phenotypic extraction technologies have limited research on the phenotype–genotype relationship, which is critical for variety improvement and product grading. Therefore, developing a high-throughput, precise, and non-destructive method for extracting phenotypic parameters of edible fungi is essential for advancing mycological research and industrial applications.
Current methods for obtaining phenotypic parameters include manual measurements, 2D imaging, and 3D point cloud extraction. Manual measurements are slow, repetitive, subjective, and prone to damaging the specimen during the process [4]. Two-dimensional imaging methods are limited to measuring phenotypic parameters, such as cap diameter, in two dimensions and cannot capture spatial information such as thickness or orientation [5,6]. The 3D point cloud extraction method employs techniques such as laser scanning, depth cameras, and multi-view image-based 3D reconstruction [7]. Laser tools, including laser scanners [8,9] and LiDAR systems [9,10,11], provide high-resolution, accurate point clouds with real-world dimensions but are expensive, offer low point cloud density, and have slow data collection speeds. Depth cameras, such as Kinect and Intel RealSense [12,13,14,15], facilitate rapid 3D data reconstruction but produce lower-quality point clouds because of their limited resolution and high sensitivity to environmental conditions.
The 3D reconstruction method based on multi-view images offers low cost, high point cloud density and accuracy, and is not easily affected by the environment. Structure from Motion (SfM) and Multi-View Stereo (MVS) are usually used together to obtain more comprehensive and accurate 3D reconstruction results and are widely used in the field of computer vision. For instance, He et al. [16] utilized a low-cost SfM-MVS system to obtain seven phenotypic parameters for strawberries. Hao et al. [17] developed MVS-Pheno V2 to collect point cloud data for cotton crops, employing the PointSegAt deep learning network to segment overlapping cotton leaves and extract phenotypic parameters. Xiao et al. [18] applied multi-view image reconstruction to extract ten phenotypic traits from 3D point clouds of sugar beet roots, while Xiao et al. [19,20] extended SfM-MVS technology to drones, designing a cross-surround photography method for the high-throughput phenotypic extraction of maize, sugar beet, cotton, and cotton bolls. Additionally, Sun et al. [21] created an ensemble learning framework that leveraged this technology to predict soybean yield, demonstrating that incorporating 3D structural features enhanced prediction accuracy. In summary, 3D reconstruction of point clouds from multi-view images is a cost-effective, precise, and robust approach that provides a valuable tool for crop breeding and agricultural production.
During the collection of multi-view images, significant irrelevant background content is often captured, reducing the efficiency of 3D reconstruction. He et al. [22] mitigated this issue by cropping multi-view images to reduce their size and enhance processing efficiency, aiding subsequent point cloud preprocessing. However, this method does not fully eliminate background information. Segmenting multi-view images before 3D reconstruction has been proven to completely remove background content [18]. Conversely, traditional segmentation algorithms face challenges in accurately preserving complex crop edges and handling background shadows. Threshold-based and watershed algorithms are sensitive to noise and rely on brightness variations, limiting their effectiveness in processing complex images [23]. Region-growing algorithms are capable of identifying continuous areas, but they are computationally intensive and highly sensitive to seed point selection. These methods generally lack robustness and fail to capture high-level semantic information in complex scenarios [24]. In contrast, deep learning has demonstrated exceptional performance in image segmentation, accurately identifying targets, preserving boundaries, and extracting regions of interest holistically. For example, Yang et al. [25] applied the U2-Net model to remove irrelevant backgrounds from multi-view images of leafy vegetable plants, effectively eliminating large-scale background noise in reconstructed point clouds. This approach enhanced the accuracy of 3D reconstruction and improved the efficiency of subsequent point cloud preprocessing.
However, research on multi-view 3D reconstruction algorithms for edible fungi (including shiitake mushrooms) is still limited [26,27]. Existing methods suffer from problems in point cloud segmentation, 3D reconstruction, and phenotypic analysis, such as incomplete background elimination, expensive equipment, heavy computational loads, and limited accuracy. In view of these limitations, this study used the YOLOv8 algorithm to segment multi-view images of shiitake mushrooms and the SfM-MVS algorithm for 3D reconstruction. Combined with the improved CP-PointNet++ model and clustering algorithms, an automatic point cloud segmentation pipeline was developed to achieve high-throughput, non-destructive, and fast extraction of shiitake mushroom phenotypic parameters. Finally, the extracted parameters were input into a generalized regression neural network (GRNN) for yield estimation, providing a robust tool for fungal breeding applications.
2. Materials and Methods
This study introduced an automated pipeline for 3D reconstruction, point cloud segmentation, phenotypic parameter extraction, and yield prediction of shiitake mushrooms using the SfM 3D reconstruction method. The pipeline consisted of three main stages.
First, the YOLOv8 model segmented regions of interest (ROI) from RGB images, and the segmented images were utilized for subsequent 3D point cloud reconstruction.
Subsequently, the obtained point cloud underwent a series of preprocessing steps, and an improved PointNet++ model, incorporating the CBAM module and Partial Convolution (PConv), was utilized for segmentation, generating point clouds for the pileus, stipe, and shiitake mushroom sawdust substrate. The region-growing algorithm and fast Euclidean clustering algorithm were then applied to segment the individual mushroom point clouds, enabling the calculation of phenotypic parameters.
Finally, the calculated phenotypic parameters were input into machine learning algorithms for yield estimation, as outlined in the detailed workflow shown in Figure 1.
Figure 1. Overall workflow for 3D reconstruction, point cloud segmentation, phenotypic calculation, and yield prediction of shiitake mushrooms. A novel CP-PointNet++ point cloud segmentation network was proposed based on CBAM and PConv.
2.1. Shiitake Mushroom Samples and Data Collection
Shiitake mushroom mycelia of the Shanghai Academy of Agricultural Sciences 509 strain were inoculated onto a shiitake mushroom sawdust substrate measuring 10 cm in diameter and 40 cm in length. The substrate was incubated in a growth chamber maintained at 20–25 °C and 85% humidity. After the appearance of brownish button primordia, thinning was performed, and the substrate was left to develop until most mushrooms reached maturity.
After the mushrooms matured, the substrates were removed for multi-view image capture. A smartphone (iPhone 13 Pro Max) with an aperture of f/1.5, an exposure time of 10 ms, and a resolution of 3024 × 4032 was fixed on a tripod positioned 30–50 cm from the culture substrate and markers. The markers were 30, 10, and 5 cm in length, width, and height, respectively. Horizontal images were captured at the same height as the substrate, along with two sets of downward images. For downward shots, the camera was placed 25 and 50 cm above the substrate at viewing angles of 30° and 60°, respectively.
The image capture process is illustrated in Figure 2a, with approximately 150 images from multiple angles (Figure 2b). After image collection, 163 mushrooms were harvested from the substrate. Phenotypic parameters, including pileus transverse and longitudinal diameters, pileus thickness, stipe diameter, stipe height, and mushroom mass, were manually measured using a soft ruler, Vernier caliper, and balance.
2.2. Semantic Segmentation and 3D Reconstruction of Multi-View Images
The collected multi-view RGB images contained substantial background information, which significantly affected the accuracy and efficiency of the subsequent 3D reconstruction; removing this background information was therefore essential. As an end-to-end network architecture, YOLOv8 demonstrates excellent performance in semantic segmentation and is well suited for identifying subtle semantic differences in complex environments. In this study, YOLOv8 was applied to segment the digital images and extract the ROI for the mushrooms. Image annotation was performed using LabelMe to classify the images into ROI and irrelevant areas. The mushroom pixels were assigned a value of 255 and labeled white, whereas the irrelevant areas were assigned a value of 0 and labeled black. An example of the annotation results is shown in Figure 2c. The dataset comprised 1482 images divided into training, validation, and testing sets in an 8:1:1 ratio.
- (1)
Image Segmentation Based on YOLOv8
The YOLOv8 [28] network architecture consists of three main parts.
Backbone: this component is responsible for feature extraction, using convolutional and deconvolutional layers combined with residual connections and bottleneck structures to reduce network size and enhance performance.
Neck: this component performs multi-scale feature fusion by integrating feature maps from various stages of the Backbone, thereby enhancing feature representation capabilities.
Head: this component executes the final segmentation task by iteratively training on the samples, minimizing the loss function, and enhancing the segmentation performance on the mushroom images.
During training, the image resolution was adjusted to 640 × 640, and YOLOv8x-Seg applied online image augmentation to introduce slight variations in each epoch. As a key data augmentation technique, several images were selected and combined into a mosaic, followed by basic augmentations including flipping, scaling, and color gamut adjustments, enabling the model to detect partially occluded objects and objects at varying positions. The AdamW optimizer was employed with a learning rate of 0.002, momentum of 0.9, a batch size of 32, and 300 epochs to ensure convergence of the loss function. During testing, the network generated a 640 × 480 mask image that was resized to the original input image size using bilinear interpolation. The ROI regions were then extracted based on the mask, with irrelevant areas filled in white.
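For orientation, the training configuration above can be expressed with the Ultralytics API. This is a minimal sketch under stated assumptions, not the authors' original script; the dataset YAML name, weight file, and image path are placeholders.

```python
from ultralytics import YOLO

# Start from pretrained YOLOv8x-Seg weights (hypothetical local path).
model = YOLO("yolov8x-seg.pt")

# Hyperparameters as reported: AdamW, lr 0.002, momentum 0.9, batch 32,
# 300 epochs, 640x640 input; Ultralytics applies online augmentation
# (mosaic, flips, scaling, HSV shifts) by default during training.
model.train(
    data="shiitake_seg.yaml",  # assumed dataset config
    imgsz=640,
    epochs=300,
    batch=32,
    optimizer="AdamW",
    lr0=0.002,
    momentum=0.9,
)

# Inference: the predicted masks are used to white out the background.
results = model.predict("mushroom_view_001.jpg")
```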
- (2)
Shiitake Mushroom Point Cloud 3D Reconstruction
The 3D reconstruction of shiitake mushrooms was performed using COLMAP (Ver. 3.9.1, https://colmap.github.io/, accessed on 18 August 2024). Initially, the Scale-Invariant Feature Transform (SIFT) algorithm extracted key points from the multi-view RGB images and generated feature descriptors for them. The key points from different viewpoints were then matched to establish correspondences between the images. The SfM algorithm estimated the camera position and orientation for each image based on the matched feature points, producing a sparse 3D point cloud that represented the preliminary geometric structure of the scene. Finally, the MVS algorithm refined the reconstruction by generating a dense 3D point cloud on the sparse point cloud foundation.
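The steps described above map onto COLMAP's command-line interface roughly as follows. This is a sketch with placeholder paths, not the exact commands used in the study.

```python
import os
import subprocess

def run(cmd):
    """Run one COLMAP CLI step and fail loudly on error."""
    subprocess.run(cmd, check=True)

db, imgs, out = "colmap.db", "images_segmented/", "output/"
os.makedirs(out + "sparse", exist_ok=True)

# 1. SIFT keypoint detection and descriptor extraction.
run(["colmap", "feature_extractor", "--database_path", db, "--image_path", imgs])
# 2. Pairwise feature matching across viewpoints.
run(["colmap", "exhaustive_matcher", "--database_path", db])
# 3. SfM: camera pose estimation and sparse reconstruction.
run(["colmap", "mapper", "--database_path", db, "--image_path", imgs,
     "--output_path", out + "sparse"])
# 4. MVS: undistort images, compute depth maps, fuse a dense point cloud.
run(["colmap", "image_undistorter", "--image_path", imgs,
     "--input_path", out + "sparse/0", "--output_path", out + "dense"])
run(["colmap", "patch_match_stereo", "--workspace_path", out + "dense"])
run(["colmap", "stereo_fusion", "--workspace_path", out + "dense",
     "--output_path", out + "dense/fused.ply"])
```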
2.3. Point Cloud Data Preprocessing
2.3.1. Point Cloud Downsampling and Scale Restoration
The dense point cloud reconstructed using the SfM-MVS approach had a large data volume, requiring a downsampling method to reduce the processing time of subsequent algorithms. A 3D voxel grid method was employed to generate a voxel grid from the input point cloud data. Within each voxel, the centroid of all points was used to approximate and replace them, reducing the data volume while preserving the structure. This process minimized computational time and enhanced the efficiency of subsequent algorithms. The voxel size was set to 0.008, removing an average of 82% of the points without altering the outer contours of the point cloud, ensuring accurate phenotypic parameter calculations.
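Voxel-grid downsampling of this kind is available in Open3D; a minimal sketch follows (the 0.008 voxel size matches the value reported above, in point-cloud units; file names are placeholders).

```python
import open3d as o3d

pcd = o3d.io.read_point_cloud("dense/fused.ply")

# Replace all points inside each 0.008-unit voxel with their centroid,
# thinning the cloud (~82% of points removed in this study) while
# preserving the outer contour.
down = pcd.voxel_down_sample(voxel_size=0.008)
o3d.io.write_point_cloud("fused_down.ply", down)
```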
To determine the true dimensions of the shiitake mushroom spawn, a coordinate scaling correction was applied to the reconstructed 3D point cloud. Using a reference marker, a scaling factor was calculated to adjust the point cloud coordinates to accurately represent the real-world dimensions. The calculation formula is as follows:

$$s = \frac{1}{3}\left(\frac{L_a}{L_r} + \frac{W_a}{W_r} + \frac{H_a}{H_r}\right)$$

where $L_a$, $W_a$, and $H_a$ represent the actual length, width, and height of the marker, respectively; $L_r$, $W_r$, and $H_r$ denote the corresponding length, width, and height in the reconstructed 3D point cloud, respectively. The original coordinates $(x, y, z)$ were transformed into new coordinates $(x', y', z') = (sx, sy, sz)$ based on the calculated scaling factor $s$.
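Under this reading of the formula (averaging the three per-axis ratios into a single isotropic factor), scale restoration amounts to the following sketch; the marker dimensions match those reported in Section 2.1.

```python
import numpy as np

def restore_scale(points, marker_actual, marker_cloud):
    """Scale point-cloud coordinates to real-world units.

    points:        (N, 3) array of reconstructed coordinates
    marker_actual: (L, W, H) of the physical marker, e.g. (30, 10, 5) cm
    marker_cloud:  (L, W, H) of the marker measured in the point cloud
    """
    ratios = np.asarray(marker_actual, float) / np.asarray(marker_cloud, float)
    s = ratios.mean()  # single isotropic scaling factor
    return points * s

# Hypothetical usage with a 30 x 10 x 5 cm marker:
# scaled = restore_scale(points, (30, 10, 5), measured_marker_dims)
```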
2.3.2. Point Cloud Filtering and Coordinate Correction
In the SfM algorithm, the camera position in the first image is often selected as the coordinate origin for the point cloud. However, with multiple datasets, this approach can produce disorganized coordinate systems, complicating subsequent point cloud processing. To address this, a centroid-shifting operation was performed. The centroid of the point cloud was calculated, and all points were translated to align the centroid with the origin (0,0,0).
The 3D point cloud data reconstructed from multi-view images segmented by YOLOv8 may contain outlier noise points owing to hardware limitations and human operations. To address this, a Statistical Outlier Removal (SOR) filter was applied, which is a widely used method in point cloud processing to remove outliers based on the local neighborhood statistics for each point. For this filtering, the parameters were set to K = 15 and n = 0.75, effectively reducing noise while preserving the main structure of the point cloud.
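Both preprocessing steps are standard in Open3D; in the sketch below, the reported K = 15 and 0.75 map onto `nb_neighbors` and `std_ratio`, and file names are placeholders.

```python
import numpy as np
import open3d as o3d

pcd = o3d.io.read_point_cloud("fused_down.ply")

# Centroid shift: translate the cloud so its centroid sits at (0, 0, 0).
pts = np.asarray(pcd.points)
pcd.translate(-pts.mean(axis=0))

# Statistical Outlier Removal: drop points whose mean distance to their
# K = 15 nearest neighbors exceeds 0.75 standard deviations.
filtered, kept_idx = pcd.remove_statistical_outlier(nb_neighbors=15,
                                                    std_ratio=0.75)
```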
The Random Sample Consensus (RANSAC) algorithm was employed to compute the normal vector $\mathbf{m} = (m_x, m_y, m_z)$ of the whiteboard, where $m_x$, $m_y$, and $m_z$ are the components of the plane's normal vector along the $x$, $y$, and $z$ axes, respectively. Using $\mathbf{m}$ and the z-axis normal vector $\mathbf{n} = (0, 0, 1)$, the rotation axis $\mathbf{u}$ and rotation angle $\theta$ were determined:

$$\mathbf{u} = \frac{\mathbf{m} \times \mathbf{n}}{\lVert \mathbf{m} \times \mathbf{n} \rVert}, \qquad \theta = \arccos\left(\frac{\mathbf{m} \cdot \mathbf{n}}{\lVert \mathbf{m} \rVert \, \lVert \mathbf{n} \rVert}\right)$$

The rotation matrix $R$ was then generated using Rodrigues' rotation formula and applied to each original point $\mathbf{p}$ to obtain the corrected point $\mathbf{p}'$, as represented by the following equations:

$$R = I + \sin\theta \, [\mathbf{u}]_\times + (1 - \cos\theta) \, [\mathbf{u}]_\times^2, \qquad \mathbf{p}' = R\,\mathbf{p}$$

where $I$ is the 3 × 3 identity matrix, and $[\mathbf{u}]_\times$ is the skew-symmetric matrix of the normalized rotation axis $\mathbf{u}$:

$$[\mathbf{u}]_\times = \begin{pmatrix} 0 & -u_z & u_y \\ u_z & 0 & -u_x \\ -u_y & u_x & 0 \end{pmatrix}$$
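A numpy sketch of this alignment, assuming the plane normal m has already been estimated (e.g., with Open3D's `segment_plane` RANSAC):

```python
import numpy as np

def rotation_to_z(m):
    """Rodrigues rotation matrix aligning plane normal m with the z-axis."""
    n = np.array([0.0, 0.0, 1.0])
    m = m / np.linalg.norm(m)
    axis = np.cross(m, n)
    axis_norm = np.linalg.norm(axis)
    if axis_norm < 1e-9:                  # already aligned with z
        return np.eye(3)
    u = axis / axis_norm                  # normalized rotation axis
    theta = np.arccos(np.clip(np.dot(m, n), -1.0, 1.0))
    K = np.array([[0.0, -u[2], u[1]],     # skew-symmetric [u]x
                  [u[2], 0.0, -u[0]],
                  [-u[1], u[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

# corrected_points = points @ rotation_to_z(m).T
```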
Pass-through filtering was applied to remove the marker and whiteboard beneath the aligned point cloud, yielding a clean point cloud of the shiitake mushroom spawn, as illustrated in Figure 3.
2.4. Shiitake Mushroom Spawn Point Cloud Segmentation Model
2.4.1. PointNet++ and Its Improvements
- (1)
CP-PointNet++
This study employed the CP-PointNet++ model as an enhanced version of PointNet++ [29] for point cloud segmentation. PointNet++ utilized a hierarchical feature extraction method to improve the model's understanding of local structures by sampling and grouping point cloud data at various scales. Through layer-wise feature transformation and pooling, rich local feature representations were extracted while preserving global structural information. The segmentation network of PointNet++ featured a multi-level architecture with a core structure based on Set Abstraction (SA) modules, each comprising three layers: Sampling Layer, Grouping Layer, and MLP Layer.
The PointNet++ segmentation network employed an Encoder–Decoder structure, with the features passed to subsequent modules via skip link concatenation. Feature propagation was facilitated by the interpolation function and unit PointNet. The interpolation function restored the features omitted during the down sampling process in the SA module, whereas the unit PointNet consisting of MLP and ReLU continued to extract the features from the data.
To overcome the limitations of the original PointNet++ model, such as insufficient feature extraction, low segmentation accuracy, and high memory usage, this study proposed the CP-PointNet++ model. CP-PointNet++ enhanced feature extraction by integrating the Convolutional Block Attention Module (CBAM) into the MLP layers of the original PointNet++ network. Additionally, Partial Convolution (PConv) replaced the standard convolutions, effectively reducing memory usage during training. The model structure is shown in Figure 4.
- (2)
Introduction of CBAM Module
In the shiitake mushroom point cloud, similar features across different categories can reduce segmentation accuracy. To address this, the CBAM [30] was incorporated into the MLP layers to enhance feature extraction capabilities. CBAM comprised two modules: the Channel Attention Module (CAM) and Spatial Attention Module (SAM). CAM processed the information of each channel using global average pooling and global max pooling. The outputs were combined through a shared fully connected layer and passed through a sigmoid function to generate the channel-level attention weights. These weights adjusted the feature responses, emphasizing the most relevant channels. The formula is as follows:

$$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big)$$
Unlike CAM, SAM focused on the significance of spatial locations. It extracted the spatial dimension information through global pooling and generated a spatial attention map using convolutional operations to adjust the importance of each spatial location. The formula is as follows:

$$M_s(F) = \sigma\big(f^{7 \times 7}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)])\big)$$
By integrating the two attention mechanisms, CBAM inferred the attention maps and generated enhanced features. The combined formula for the two modules is as follows:

$$F' = M_c(F) \otimes F, \qquad F'' = M_s(F') \otimes F'$$

where $F$ represents the intermediate features in the network, $\otimes$ denotes element-wise multiplication, and $F''$ is the refined feature map. The module structure is shown in Figure 5.
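A compact PyTorch rendering of CBAM as described, laid out for (batch, channel, point) features to match per-point MLP layers; the channel reduction ratio of 16 is an assumption, not a value from the paper.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Channel + spatial attention over (B, C, N) point features."""
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        # CAM: shared MLP applied to global average- and max-pooled vectors.
        self.mlp = nn.Sequential(
            nn.Conv1d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels // reduction, channels, 1, bias=False),
        )
        # SAM: convolution over concatenated channel-wise avg/max maps.
        self.spatial = nn.Conv1d(2, 1, spatial_kernel,
                                 padding=spatial_kernel // 2)

    def forward(self, x):                      # x: (B, C, N)
        avg = x.mean(dim=2, keepdim=True)      # global average pooling
        mx, _ = x.max(dim=2, keepdim=True)     # global max pooling
        ca = torch.sigmoid(self.mlp(avg) + self.mlp(mx))
        x = x * ca                             # F' = Mc(F) (*) F
        avg_s = x.mean(dim=1, keepdim=True)
        max_s, _ = x.max(dim=1, keepdim=True)
        sa = torch.sigmoid(self.spatial(torch.cat([avg_s, max_s], dim=1)))
        return x * sa                          # F'' = Ms(F') (*) F'
```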
- (3)
Replacement of Conv with PConv
PConv [31] replaced standard convolutions with lightweight convolutions to reduce the excessive similarity between channels in standard convolution layers. This was achieved by selectively utilizing a subset of channels for feature extraction, which was then concatenated with the remaining channels. Finally, point-wise convolutions were applied to strengthen the inter-channel correlations. The module structure is shown in Figure 6.
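A sketch of Partial Convolution in the FasterNet sense [31]: convolve only a fraction of the channels and pass the rest through untouched. The 1/4 split ratio is an assumption.

```python
import torch
import torch.nn as nn

class PConv1d(nn.Module):
    """Partial convolution: convolve the first 1/n_div of the channels
    and concatenate the untouched remainder (FasterNet-style)."""
    def __init__(self, channels, n_div=4, kernel_size=3):
        super().__init__()
        self.dim_conv = channels // n_div
        self.conv = nn.Conv1d(self.dim_conv, self.dim_conv, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x):                      # x: (B, C, N)
        x1, x2 = torch.split(
            x, [self.dim_conv, x.size(1) - self.dim_conv], dim=1)
        return torch.cat([self.conv(x1), x2], dim=1)

# A following 1x1 point-wise convolution restores inter-channel mixing,
# as described above.
```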
2.4.2. Pileus and Stipe Segmentation and Phenotypic Parameter Calculation
The CP-PointNet++ algorithm segmented the point cloud into three categories: pileus, stipe, and spawn. Further segmentation of the pileus and stipe collections was necessary. The pileus typically has a curved or circular shape, with a smooth surface and distinct curvature characteristics. Additionally, the normal directions of adjacent points exhibited a relatively consistent variation trend. A region-growing algorithm was therefore adopted to segment individual pilei.
The stipe exhibited spatial separability, but occlusion during data acquisition resulted in varying point cloud densities in the 3D reconstruction. Fast Euclidean clustering is a distance-based algorithm that assumes points within the same class are spatially proximate. This efficient and computationally simple algorithm was applied to segment the stipe collection based on the spatial characteristics of the stipe point cloud.
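Open3D does not ship the fast Euclidean clustering variant used here, but DBSCAN with a small `min_points` behaves like plain Euclidean distance clustering and illustrates the idea; `eps` is a placeholder that would need tuning to the stipe point spacing.

```python
import numpy as np
import open3d as o3d

stipe_pcd = o3d.io.read_point_cloud("stipes.ply")  # hypothetical file

# Distance-based clustering: points closer than eps join one cluster,
# separating spatially disjoint stipes. Label -1 marks noise.
labels = np.array(stipe_pcd.cluster_dbscan(eps=0.01, min_points=10))

for k in range(labels.max() + 1):
    single = stipe_pcd.select_by_index(np.where(labels == k)[0])
    # ... compute per-stipe phenotypic parameters on `single`
```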
The phenotypic parameters of shiitake mushrooms to be calculated included pileus transverse diameter, pileus longitudinal diameter, pileus height, stipe diameter, stipe height, and minimum bounding box (OBB) volume, which was the sum of the pileus and stipe bounding box volumes. The calculation methods are illustrated in Figure 7.
The PCA algorithm was applied to the pileus to determine the primary orientation of the point cloud and perform rotation. The transformed pileus was then projected onto the XOY plane, where the Euclidean distance between the two farthest points defined the transverse diameter. The longest diameter perpendicular to the transverse diameter was identified as the longitudinal diameter. The pileus thickness was calculated as the absolute difference between the maximum and minimum values along the z-axis.
After applying the PCA algorithm to rotate the stipe, slices were extracted at 25%, 50%, and 75% positions along the principal direction. The slice thickness parallel to the YOZ plane was set at 10% of the stipe height. For each slice, the least-squares method was used to fit a circle and calculate the diameter. The average of these three diameters was considered as the stipe diameter. Stipe height was calculated as the absolute difference between the maximum and minimum values along the x-axis.
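The slice-and-fit procedure can be sketched in numpy; the Kasa least-squares circle fit below is one standard choice for the unspecified fitting method, and the principal-axis convention follows the description above (PCA via SVD, with the stipe height along the first principal axis).

```python
import numpy as np

def fit_circle_kasa(xy):
    """Least-squares (Kasa) circle fit; returns (cx, cy, r)."""
    A = np.column_stack([2 * xy[:, 0], 2 * xy[:, 1], np.ones(len(xy))])
    b = (xy ** 2).sum(axis=1)
    (cx, cy, c), *_ = np.linalg.lstsq(A, b, rcond=None)
    return cx, cy, np.sqrt(c + cx ** 2 + cy ** 2)

def stipe_diameter(stipe):
    """Average diameter of circles fitted to slices at 25/50/75% height."""
    pts = stipe - stipe.mean(axis=0)
    _, _, vt = np.linalg.svd(pts, full_matrices=False)
    aligned = pts @ vt.T              # axis 0 = principal (height) axis
    h = aligned[:, 0]
    span = h.max() - h.min()          # stipe height
    diameters = []
    for frac in (0.25, 0.50, 0.75):
        center = h.min() + frac * span
        mask = np.abs(h - center) < 0.05 * span   # 10%-thick slice
        _, _, r = fit_circle_kasa(aligned[mask, 1:])
        diameters.append(2 * r)
    return np.mean(diameters)
```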
2.5. Yield Estimation
The task of yield estimation is to build a yield estimation model for shiitake mushrooms. The model takes as input the six extracted phenotypic parameters (pileus transverse diameter, pileus longitudinal diameter, pileus thickness, stipe height, stipe diameter, and bounding box volume) and outputs the yield value, framing the task as a non-linear regression problem. Although this task is not complex for deep learning models, agricultural data are difficult to obtain, sample sizes are small, and models are prone to overfitting, so selecting an appropriate algorithm is particularly important for shiitake mushroom yield estimation. A literature review indicated that machine learning algorithms such as Partial Least Squares Regression (PLSR), Support Vector Regression (SVR), Random Forest (RF), and Generalized Regression Neural Network (GRNN) are well suited to such non-linear regression problems.
PLSR combined principal component analysis with multivariate regression to effectively address multicollinearity among input variables. It identified latent variables (principal components) to explain the input data variance and maximized the correlation between these variables and the output variable. SVR constructed an optimal hyperplane to minimize regression errors, excelled in handling high-dimensional and non-linear problems, and mapped data to higher-dimensional spaces using kernel functions for regression analysis. As an ensemble learning algorithm based on decision trees, RF made predictions by training multiple decision trees; its feature selection and random sampling make it robust against overfitting and effective at handling non-linear features. GRNN is a neural network model based on radial basis functions that performs weighted averaging based on the local similarity of input data. It fitted the data quickly, delivered predictions efficiently, handled noisy data effectively, and converged rapidly.
In this study, the four models were used to estimate the yield of shiitake mushrooms. The dataset was randomly divided, with 80% used for training and 20% for testing.
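PLSR, SVR, and RF are available directly in scikit-learn; GRNN is not, but it reduces to Nadaraya-Watson kernel regression with a Gaussian kernel, which can be sketched directly. The smoothing factor sigma is an assumed hyperparameter to be tuned on validation data.

```python
import numpy as np

class GRNN:
    """Generalized Regression Neural Network (Nadaraya-Watson form)."""
    def __init__(self, sigma=0.5):
        self.sigma = sigma

    def fit(self, X, y):
        self.X = np.asarray(X, float)
        self.y = np.asarray(y, float)
        return self

    def predict(self, Xq):
        Xq = np.asarray(Xq, float)
        # Squared distances between each query and every training sample.
        d2 = ((Xq[:, None, :] - self.X[None, :, :]) ** 2).sum(axis=-1)
        w = np.exp(-d2 / (2 * self.sigma ** 2))  # Gaussian pattern weights
        return (w @ self.y) / w.sum(axis=1)      # weighted-average output

# Hypothetical usage: X = six phenotypic parameters, y = mass in grams.
# yield_pred = GRNN(sigma=0.5).fit(X_train, y_train).predict(X_test)
```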
2.6. Model Training and Performance Evaluation
- (1)
Hardware and software environment for model training
The hardware environment for 3D reconstruction and model training included an Intel(R) Xeon(R) Gold 6246R CPU, an NVIDIA Quadro RTX 8000 GPU with 48 GB of memory, and 128 GB of RAM. The software environment ran on the Windows 10 operating system, with the deep learning models developed using PyTorch 1.13 and CUDA 11.7. The point cloud data were divided into training, validation, and test sets at an 8:1:1 ratio. The hyperparameters were configured as follows: a batch size of 16, 300 epochs, a learning rate of 0.001, the Adam optimizer, and a weight decay coefficient of 0.07.
- (2)
Semantic segmentation evaluation
The model was evaluated using Precision ($P$), Recall ($R$), $F1$ Score, and Average Precision ($AP$). The definitions of these metrics are as follows:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$

$$F1 = \frac{2 \times P \times R}{P + R}, \qquad AP = \int_0^1 P(R)\,\mathrm{d}R$$
In this context, True Positives (TPs) were the samples correctly predicted as positive by the model, whereas True Negatives (TNs) were the samples correctly predicted as negative. False Positives (FPs) were samples incorrectly predicted as positive, and False Negatives (FNs) were samples incorrectly predicted as negative.
- (3)
Point cloud segmentation evaluation
The performance of the trained PointNet++ model for point cloud segmentation was evaluated using Overall Accuracy ($OA$) and Mean Intersection over Union ($mIoU$). The corresponding formulas are as follows:

$$OA = \frac{\sum_{i=1}^{k} TP_i}{\sum_{i=1}^{k} (TP_i + FN_i)}, \qquad mIoU = \frac{1}{k} \sum_{i=1}^{k} \frac{TP_i}{TP_i + FP_i + FN_i}$$

where $k$ represents the number of classes, and $TP_i$, $FP_i$, and $FN_i$ represent the true positives, false positives, and false negatives for the $i$-th class, respectively.
- (4)
Phenotypic parameter calculation and yield estimation evaluation
The evaluation metrics selected were Mean Absolute Percentage Error ($MAPE$), Root Mean Squared Error ($RMSE$), Normalized Root Mean Squared Error ($nRMSE$), and Coefficient of Determination ($R^2$). Among these, $RMSE$, $nRMSE$, and $R^2$ were also used as evaluation metrics for the yield estimation model.

$$MAPE = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|, \qquad RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

$$nRMSE = \frac{RMSE}{\bar{y}} \times 100\%, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

where $n$ represents the number of samples, $y_i$ is the observed value, $\hat{y}_i$ is the predicted value, and $\bar{y}$ is the mean of the observed values.
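These metrics translate directly into numpy, matching the formulas above; a small sketch for completeness:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """MAPE, RMSE, nRMSE (%), and R^2 for observed vs. predicted values."""
    y_true = np.asarray(y_true, float)
    y_pred = np.asarray(y_pred, float)
    err = y_true - y_pred
    rmse = np.sqrt(np.mean(err ** 2))
    return {
        "MAPE":  100.0 * np.mean(np.abs(err / y_true)),
        "RMSE":  rmse,
        "nRMSE": 100.0 * rmse / y_true.mean(),
        "R2":    1.0 - (err ** 2).sum()
                     / ((y_true - y_true.mean()) ** 2).sum(),
    }
```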
5. Conclusions
This study proposed an automated algorithm for the three-dimensional reconstruction and segmentation of shiitake mushrooms based on multi-view images. The workflow consisted of five main components: image acquisition and segmentation, 3D reconstruction, point cloud preprocessing, point cloud segmentation, and phenotypic parameter calculation. The YOLOv8 model achieved excellent performance in segmenting ROI from multi-view images, with an accuracy of 99.96%. The PointNet++ model enhanced with CBAM and PConv modules excelled in point cloud segmentation, achieving an OA of 97.45%. For the phenotypic parameter calculation, the nRMSE for the pileus transverse and longitudinal diameters and the stipe diameter was below 10%, whereas the errors for pileus height and stipe height were higher, with nRMSE values of 17% and 15%, respectively. Using the GRNN model, the shiitake mushroom yield was estimated from the extracted phenotypic parameters, achieving an RMSE of 2.276 g. This method could extend beyond shiitake mushrooms to other fungi, including white mushrooms, straw mushrooms, reishi mushrooms, lion's mane mushrooms, and oyster mushrooms. Future research will focus on integrating 3D reconstruction technologies with deep learning to enhance phenotypic parameter extraction, supporting applications in mushroom grading, phenotype–genotype analysis, and related fields.