1. Introduction
Lettuce, Lactuca sativa L., is a commercially important leafy vegetable rich in vitamins, carotenoids, dietary fiber, and other trace elements [1]. Global lettuce consumption has increased rapidly in recent years due to its high nutritional and medicinal value [2,3]. Although lettuce grows quickly [4] and can be harvested multiple times, it is sensitive to its growth environment. For example, it adapts poorly to saline–alkali soil [5], and different light environments affect its growth morphology and nutrient content [6,7]. Careful monitoring of the crop during critical growth stages is therefore vital for maintaining consistent supply and quality.
Plant phenotypic analysis is an interdisciplinary research field. Plant phenotypic information reflects traits across the whole life cycle, such as growth form, development process, and physiological response. These traits result from interactions between plant genotypes and environmental conditions [8,9]. Linking phenotypic traits to genotypes can help select high-yield, stress-resistant varieties, thereby improving agricultural productivity to meet the demands of growing populations and climate change [10]. One of the major challenges in crop breeding is imperfect phenotypic detection technology [11]. Traditional phenotypic monitoring relies on visual observation and manual measurement, which is time-consuming, error-prone, and insufficiently accurate for evaluating trait diversity among varieties. Automated phenotyping technologies are therefore essential for more efficient and accurate plant trait detection.
Recent advancements in computer vision, algorithms, and sensors have significantly advanced plant phenotypic analysis. Many imaging techniques can now capture complex traits associated with growth, yield, and adaptation to biotic or abiotic stresses such as disease, insect infestation, water stress, and nutrient deficiencies. These techniques include digital imaging spectroscopy, chlorophyll fluorescence, thermal infrared, RGB, and 3D imaging [12,13]. Spectral images can be used to analyze the physiological characteristics of lettuce, such as the leaf’s overall physiological condition, water content, pigments, and structural composition related to biomass [14]. Eshkabilov et al. [15] employed hyperspectral data and an artificial neural network (ANN) to predict the fresh weight, chlorophyll, sugar, vitamins, and nutrients of lettuce, achieving R values ranging from 0.85 to 0.99. Yu et al. [16] used hyperspectral data and time-series phenotypes as input, combined with RNN and CNN models, to detect the SSC, pH, nitrate, calcium, and water stress levels of lettuce. Based on hyperspectral images, Ye et al. [17] estimated the total chlorophyll of greenhouse lettuce, with an average R² of 0.746 and an RMSE of 2.018. Hyperspectral images contain more information than multispectral images, but their data processing is more complex and the equipment is expensive.
Chlorophyll fluorescence imaging has been used to estimate canopy area and thereby predict fresh weight, although heavier lettuce and anthocyanins affected the results [18,19]. Thermal infrared imaging can obtain the temperature of the plant or leaf and is generally used as supplementary data. Concepcion et al. [20] combined thermal imaging and RGB images to estimate the moisture content and equivalent water thickness of lettuce, with R² scores of 0.9233 and 0.8155, respectively.
RGB imaging is the most commonly used method for crop phenotype studies due to its low cost, ease of use, and simple data processing [21,22,23,24,25]. Yu et al. [21] collected multi-view images of lettuce under water and nitrogen stress and used ConvLSTM to predict lettuce images; the RMSE, SSIM, and PSNR were 0.0180, 0.9951, and 35.4641, respectively, and the average error of the phenotypic geometric indexes derived from the predicted images was less than 0.55%. Zhang et al. [22] employed a CNN with RGB images to estimate the fresh weight, dry weight, and leaf area of three lettuce types, achieving R² values of 0.8938, 0.8910, and 0.9156, respectively, with NRMSE values of 26.00%, 22.07%, and 19.94%. Three-dimensional imaging can provide more information than RGB imaging by capturing an object’s three-dimensional coordinates and generating a stereoscopic image [26,27,28]. Lou et al. [26] used a ToF camera to capture point cloud data from a top-down perspective and reconstructed the lettuce point cloud using geometric methods. The completed point cloud showed a high linear correlation with actual plant height (R² = 0.961), leaf area (R² = 0.964), and fresh weight (R² = 0.911).
As these studies show, phenotypic analysis based on a single modality has accumulated many research results. However, the information provided by a single sensor is limited. Multimodal information offers a certain degree of complementarity and consistency, allowing different modalities to compensate for each other’s shortcomings. Using multimodal data to improve model performance has therefore become popular in lettuce phenotype research.
The fusion methods for different modal information can be divided into three categories: data-layer fusion, feature-layer fusion, and decision-layer fusion. The data-layer fusion method treats multimodal data as indistinguishable multichannel data; it exploits the inherent complementarity between modalities to supplement incomplete information at the input stage [29]. Taha et al. [30] combined spectral vegetation indices and color vegetation indices to estimate the chlorophyll content of hydroponic lettuce; their AutoML model outperformed traditional models with an R² of 0.98.
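As a minimal illustration of the data-layer idea (an assumed sketch, not taken from the cited work), RGB and depth images can simply be stacked into one multichannel input before the network sees them:

```python
import torch

rgb = torch.randn(1, 3, 224, 224)      # RGB image tensor (hypothetical size)
depth = torch.randn(1, 1, 224, 224)    # aligned depth map tensor
rgbd = torch.cat([rgb, depth], dim=1)  # data-layer fusion: a single 4-channel input
print(rgbd.shape)                      # torch.Size([1, 4, 224, 224])
```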
The feature-layer fusion method feeds multimodal images into parallel branches, extracts independent features at different scales, and then fuses the features. Wu et al. [31] proposed a hybrid model based on dual transformers and convolutional neural networks to detect lettuce phenotypic parameters from RGB-D image data; the average R² of the phenotypic traits was 92.47.
The decision-layer fusion method fuses the detection results from the preceding stage [32,33,34,35]. In the study by Lin et al. [32], a U-Net model was used to segment lettuce and extract leaf boundary and geometric features, and fresh weight was estimated through a multi-branch regression network fusing RGB images, depth images, and geometric features. The experimental results showed that the multimodal fusion model significantly improved the accuracy of lettuce fresh weight estimation across different growth periods, with an RMSE of 25.3 g and an R² of 0.938 over the whole growth period.
Feature-layer fusion methods with multimodal data have rarely been used in the lettuce phenotyping field. To address this gap, a new multimodal fusion method based on the feature layer is proposed, which performs feature extraction and fusion on RGB and depth image data. The main contributions are as follows: (1) A feature correction module is proposed that filters and corrects the feature noise of each modality in the channel and spatial dimensions, based on the principle that the information and noise of different modalities are usually complementary. (2) A feature fusion module based on SE attention is proposed to integrate the features of the two modalities into a unified feature map. (3) The phenotypic trait head uses a residual structure, and linear interpolation in the feature pyramid network (FPN) is replaced with transposed convolution. The experimental results showed that the model improved the object detection and segmentation performance for lettuce and performed well in estimating phenotypic traits.
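To make contribution (2) concrete, the following is a minimal PyTorch sketch of an SE-attention-based fusion block for RGB and depth feature maps. The class name, channel sizes, and reduction ratio are illustrative assumptions and do not reproduce the exact implementation described in the Methods section.

```python
import torch
import torch.nn as nn

class SEFusion(nn.Module):
    """Illustrative SE-attention fusion of RGB and depth feature maps (assumed design)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Squeeze-and-Excitation weights computed on the concatenated features
        self.squeeze = nn.AdaptiveAvgPool2d(1)
        self.excite = nn.Sequential(
            nn.Linear(2 * channels, 2 * channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(2 * channels // reduction, 2 * channels),
            nn.Sigmoid(),
        )
        # 1x1 convolution merges the re-weighted channels into one unified feature map
        self.merge = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, rgb_feat: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
        x = torch.cat([rgb_feat, depth_feat], dim=1)               # (B, 2C, H, W)
        b, c, _, _ = x.shape
        w = self.excite(self.squeeze(x).view(b, c)).view(b, c, 1, 1)
        return self.merge(x * w)                                   # (B, C, H, W)

# Example: fuse two 256-channel feature maps from parallel RGB and depth branches
if __name__ == "__main__":
    fuse = SEFusion(channels=256)
    rgb = torch.randn(2, 256, 32, 32)
    depth = torch.randn(2, 256, 32, 32)
    print(fuse(rgb, depth).shape)  # torch.Size([2, 256, 32, 32])
```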
3. Results
3.1. Evaluation Index
In this study, COCO evaluation indexes were used for detection and segmentation. The COCO index is a mainstream evaluation criterion for object detection and segmentation, including AP (Average Precision) and AR (Average Recall). It uses ten IoU thresholds (0.5 to 0.95, in steps of 0.05) to assess how closely the detection box or segmentation mask matches the ground-truth annotation. The primary COCO metric is AP50:95, the AP averaged across all categories and all IoU thresholds. In addition, we also introduce F1, an essential indicator for evaluating the performance of binary classification models, especially in the case of class imbalance. It is the harmonic mean of Precision and Recall and therefore reflects both metrics. Equation (7) for calculating F1 is

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
Precision is the proportion of samples predicted as positive by the model that are actually positive, and Recall is the proportion of actual positive samples that the model correctly predicts as positive.
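As a simple illustration of these definitions, the sketch below computes Precision, Recall, and F1 from the counts of true positives (tp), false positives (fp), and false negatives (fn); the function name and example counts are hypothetical.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of Precision and Recall, as in Equation (7)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: 90 correct detections, 10 false alarms, 5 missed objects
print(f1_score(tp=90, fp=10, fn=5))  # ~0.923
```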
In the regression prediction of lettuce phenotypic traits, the R², MAPE, and NRMSE indexes were used. R² is the evaluation index of regression analysis, representing the proportion of the variation in the dependent variable that the independent variables explain through the regression relationship. Its range is [0, 1]: the closer R² is to 1, the better the model fit; the closer R² is to 0, the worse the fit. Equation (8) for calculating R² is

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$

In the formula, $\bar{y}$ is the average of the actual values, $\hat{y}_i$ is the i-th model predicted value, and $y_i$ is the i-th true value.
MAPE (Mean Absolute Percentage Error) represents the relative difference between predicted and actual values and provides a percentage indicating the accuracy of a model. MAPE ranges from 0 to infinity: a MAPE of 0% indicates a perfect model, while higher percentages indicate a less effective model. Equation (9) for calculating MAPE is as follows:

$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \times 100\%$$

In the formula, $\hat{y}_i$ is the i-th model predicted value, and $y_i$ is the i-th true value.
NRMSE (Normalized Root Mean Square Error) is obtained by normalizing the square root of the Mean Squared Error (MSE), where MSE is the average of the squared differences between predicted and actual values. The value of NRMSE ranges from 0 to 1. Unlike MAPE, NRMSE emphasizes the impact of larger errors. Equation (10) for NRMSE is as follows:

$$\mathrm{NRMSE} = \frac{\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}}{y_{\max} - y_{\min}}$$

In the formula, $\hat{y}_i$ is the i-th model predicted value, $y_i$ is the i-th true value, and $y_{\max}$ and $y_{\min}$ are the largest and smallest of the true values.
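For reference, a minimal NumPy sketch of Equations (8)–(10); the function and array names are illustrative, and the example values are hypothetical.

```python
import numpy as np

def regression_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """R^2, MAPE, and NRMSE as defined in Equations (8)-(10)."""
    residual = y_true - y_pred
    r2 = 1.0 - np.sum(residual ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    mape = np.mean(np.abs(residual / y_true))         # fraction; multiply by 100 for %
    rmse = np.sqrt(np.mean(residual ** 2))
    nrmse = rmse / (y_true.max() - y_true.min())      # normalized by the true-value range
    return {"R2": r2, "MAPE": mape, "NRMSE": nrmse}

# Example with hypothetical fresh-weight values (grams)
y_true = np.array([12.0, 55.0, 130.0, 240.0])
y_pred = np.array([14.0, 50.0, 128.0, 255.0])
print(regression_metrics(y_true, y_pred))
```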
3.2. Model Performance
The test set results are shown in Figure 7 and Table 2, Table 3 and Table 4. The three subplots in Figure 7 correspond to the evaluation scores of the R², MAPE, and NRMSE indicators for the fresh weight (fw), dry weight (dw), height (h), diameter (d), and leaf area (la) growth traits, and each subplot contains the results of the K-fold cross-validation. Bold fonts in all tables represent the best results.
Within the range 0–1, a larger R² value indicates a better estimation effect, and smaller MAPE and NRMSE values indicate better network performance. Our proposed method performs well on all indicators: the predicted R² values for fw, dw, h, d, and la are 0.9732, 0.9739, 0.9424, 0.9268, and 0.9689, respectively; the MAPE values are 0.1003, 0.1622, 0.0740, 0.0516, and 0.0864; and the NRMSE values are 0.0398, 0.0387, 0.0577, 0.0517, and 0.0409. Among the standard deviations, those of the R² of d and the MAPE of dw and la are greater than 0.01, while the rest are less than 0.01.
As can be seen from Figure 7, the R² results of the 5-fold cross-validation experiment are highest and most tightly clustered for fw and dw, while those for d are the lowest and most dispersed. Among the MAPE values, dw has the largest and most dispersed error, and d the smallest. For NRMSE, the largest error occurs for h and the smallest for dw. Although the R² values of fw and dw are the highest and closest to each other, the MAPE of dw (0.1622) is considerably higher than that of fw (0.1003), suggesting that while the model explains the overall variation in dw well, some predictions have large relative errors.
MAPE is sensitive to very small values and may perform poorly when true values are negative or near zero. The dry weight values in early growth are small, which may be one reason for the high MAPE of dw. The R² of h is relatively high, but its NRMSE is the largest among the five phenotypic traits. The R² of d fluctuates somewhat, indicating that the goodness of fit varies across folds, but its MAPE and NRMSE both show small and stable prediction errors. The R² of la is 0.9689, the MAPE is 0.0864, and the NRMSE is 0.0409, all showing high precision and low error; the model is most robust for la prediction, with high accuracy and consistency.
The model’s object detection and segmentation results on the lettuce dataset are shown in Table 5. For the 5-fold cross-validation experiment, the average AP50:95, AP50, and AP75 of the object detection results are 0.8881, 0.9979, and 0.9945, respectively, and the average AP50:95, AP50, and AP75 of the segmentation results are 0.9041, 0.9979, and 0.9969, respectively. Note that the fourth fold gives the best result for phenotypic trait prediction, whereas the second fold performs best for object detection and segmentation.
The inference results of the model on the four lettuce varieties are shown in Figure 8. For all varieties, segmentation is better at the early growth stages, and the varieties take on different forms as they grow. Both Lugano and Satine show leaf clumping and folding in the later growth stages, but Satine has more prominent small leaf folds than Lugano, and its segmentation is relatively poor at the leaf edges. The leaves of Salanova are more spread out than those of the other varieties, and careful observation shows that its late-stage segmentation is indeed the worst. For Aphylion and Salanova, the predicted masks extend slightly beyond the leaf edges and cover non-leaf regions, which lowers the segmentation results, as evidenced by the indicators in Section 4.2.