1. Introduction
Apple phenotype characteristics are an important aspect of agricultural research and production, including the apple shape index (the ratio of longitudinal to transverse diameters), apple size (based on the transverse diameter), color (surface coloration rate), and surface condition (such as freshness and damage) [1]. These characteristics are not only key indicators for evaluating apple quality but also crucial for guiding agricultural production, improving cultivation techniques, and enhancing the commercial value of apples [2]. In recent years, with the development of agricultural automation and intelligence, computer vision and deep learning-based object detection and semantic segmentation technologies have demonstrated great potential in extracting apple phenotype data [3]. Traditional methods for apple phenotype analysis typically rely on manual measurement and empirical judgment, which are not only inefficient but also subject to subjective biases [4]. In contrast, instance segmentation-based techniques can precisely locate the boundaries of apples and simultaneously perform automatic extraction of various phenotype features, enabling large-scale apple quality detection and analysis [5]. However, apple growth is often influenced by environmental factors (such as climate change and soil conditions), leading to growth anomalies such as deformed apple shapes and surface damage, which significantly impact an apple’s quality and market value [6].
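To make the phenotype definitions above concrete, the following minimal NumPy sketch computes a shape index and a surface coloration rate from a binary fruit mask. It assumes the fruit is oriented with its longitudinal axis vertical in the image. The function names and the `red_margin` threshold are illustrative assumptions, not values used in this study.

```python
import numpy as np

def shape_index(mask: np.ndarray) -> float:
    """Ratio of longitudinal (vertical) to transverse (horizontal) extent of a binary mask."""
    rows = np.flatnonzero(mask.any(axis=1))  # rows containing fruit pixels
    cols = np.flatnonzero(mask.any(axis=0))  # columns containing fruit pixels
    if rows.size == 0:
        raise ValueError("mask contains no foreground pixels")
    longitudinal = rows[-1] - rows[0] + 1
    transverse = cols[-1] - cols[0] + 1
    return float(longitudinal / transverse)

def coloration_rate(rgb: np.ndarray, mask: np.ndarray, red_margin: int = 30) -> float:
    """Fraction of fruit pixels whose red channel exceeds green and blue by red_margin.

    red_margin is a hypothetical threshold chosen only for illustration.
    """
    fruit = rgb[mask.astype(bool)].astype(np.int32)  # shape (n_pixels, 3)
    is_red = (fruit[:, 0] - fruit[:, 1] > red_margin) & (fruit[:, 0] - fruit[:, 2] > red_margin)
    return float(is_red.mean())
```

In practice the mask would come from the segmentation network, and pixel extents would be converted to physical diameters with a calibration factor.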
To achieve precise extraction of apple phenotype characteristics and efficient identification of growth anomalies, a comprehensive method combining instance segmentation and natural language processing (NLP) is proposed in this study [7,8]. On the one hand, the edge transformer segmentation network based on instance segmentation can precisely extract various phenotype features of apples [9]. On the other hand, the NLP module parses and analyzes agricultural text data (such as expert notes, planting records, and meteorological data) to reveal potential causes of and development trends in growth anomalies from multiple dimensions [10]. This integration of image analysis and text parsing not only enhances the comprehensiveness and accuracy of apple phenotype data extraction but also provides new insights for apple quality management and anomaly prediction [11]. In recent years, object detection technologies such as the YOLO series and Mask R-CNN have been widely applied in apple recognition and classification, demonstrating excellent performance in both real-time detection and accuracy [12]. Additionally, semantic segmentation techniques, such as UNet and DeepLabV3+, have been proven to be highly effective in apple surface feature extraction and disease detection tasks [13,14]. Complementing these technologies, the rapid development of NLP has provided new tools for automating the processing of agricultural text data. For instance, Anand et al. proposed a deep learning framework, AgriSegNet, for multi-scale, attention-based semantic segmentation using drone-acquired images to automatically detect agricultural field anomalies [15]. Zhang et al. proposed a segmentation method which outperforms traditional PSO clustering methods in terms of stability and accuracy. It can accurately and effectively segment agricultural product images in various complex environments, facilitating automated agricultural product picking robots [16]. Su et al. introduced a novel data augmentation framework based on random image cropping and patching (RICAP), which effectively improves segmentation accuracy. The proposed framework boosts the average accuracy of deep neural networks from 91.01% to 94.02% by enhancing the original RICAP approach [17]. Zhang et al. developed a pruning inference method which automatically deactivates part of the network structure based on different conditions, reducing network parameters and operations and significantly increasing the network speed. The proposed model achieved accuracy, recall, and mAP rates of 90.01%, 98.79%, and 97.43%, respectively, in detecting apple flowers [18]. These advancements demonstrate the promising potential of integrating image analysis with NLP technologies in agricultural production. Therefore, the method proposed in this paper aims to address the limitations of single-image analysis methods and provides technical support for improving agricultural production efficiency and economic benefits through multimodal data fusion. The contributions of this paper are as follows:
Integration of Instance Segmentation and Natural Language Processing: For the first time, instance segmentation technology is combined with natural language processing (NLP) to achieve multi-dimensional data fusion in apple phenotype feature extraction and anomaly identification. Instance segmentation ensures the precise extraction of critical phenotype features such as an apple’s size, color, and surface condition by accurately delineating an apple’s boundaries and surface features. Meanwhile, the NLP module analyzes agricultural text data (e.g., expert notes, planting records, and meteorological data) to reveal potential causes of growth anomalies, providing a comprehensive and accurate analysis which overcomes the limitations of traditional single-image analysis.
Innovative Application of the Edge Transformer Segmentation Network: The edge transformer segmentation network introduced in this paper integrates Transformer mechanisms with edge-aware modules to better handle complex boundary information of apples. This innovation improves segmentation precision and robustness, especially when dealing with damaged or irregular apple shapes. The method shows excellent performance in extracting key phenotype features from apple surface characteristics and provides reliable support for apple quality assessment and growth anomaly monitoring.
Multi-Modal Data Fusion for Anomaly Recognition: In contrast to traditional single-image analysis, this study proposes a method for multi-modal data fusion. By combining image data with agricultural text data (e.g., meteorological data and planting records), the NLP module conducts multi-dimensional analysis, leading to more accurate identification of growth anomalies (such as apple deformities and surface damage) and revealing potential causes from a broader context. This cross-modal data fusion offers new perspectives for anomaly prediction and apple quality management in agricultural production.
In the following sections, we provide a detailed overview of the proposed method and its components. Section 2 reviews the foundational methods in object detection, semantic segmentation, and related techniques, which serve as the building blocks for our approach. Section 3 introduces the materials and methods used in this study, including the dataset collection, preprocessing steps, and detailed architecture of the proposed model. Section 4 presents the experimental results and discusses the performance of our method in comparison to existing baseline models. Finally, Section 5 concludes this paper by summarizing the findings and discussing potential future research directions.
3. Materials and Methods
3.1. Image Construction
In this study, the collection and annotation of image datasets are key steps in fruit phenotypic analysis and anomaly recognition tasks. To ensure the diversity and representativeness of the data, a large number of apple images were collected from multiple regions, covering different growth environments and climatic conditions. The image data were primarily collected from apple orchards in Changping District, Beijing, and Qixia City, Yantai, Shandong Province, from March 2023 to August 2024, with some images also sourced from the internet, totaling 24,042 images, as shown in Table 1. The image acquisition method and samples are shown in Figure 1.
A Canon EOS 5D Mark IV camera, manufactured by Canon Inc., headquartered in Tokyo, Japan, was utilized due to its exceptional imaging quality and ability to capture fine details, meeting the requirements for precise acquisition of fruit details. This camera was paired with a Canon EF 100mm f/2.8L Macro IS USM macro lens, which is particularly suited for capturing high-precision images at close distances. To minimize the impact of shadows and reflections on the image quality, all images were captured under soft natural light conditions during early morning or late afternoon. However, in real-world agricultural settings, images may sometimes still contain shadows or overexposure due to fluctuating lighting conditions. To handle such cases, we employed image preprocessing techniques such as histogram equalization and contrast adjustment to reduce the effects of shadows and overexposure. Additionally, images with excessive overexposure or shadows which significantly obscured fruit features would be identified and excluded from the dataset during the quality control process. This ensured that only high-quality images suitable for phenotypic analysis were included in the final dataset. During image acquisition, particular attention was paid to capturing fruits from multiple angles to document their morphological characteristics and surface abnormalities. To ensure the dataset’s representativeness, images were collected covering various fruit types, maturity stages, varieties, shapes, colors, and symptoms of different diseases. The images were obtained from apple orchards in Changping District, Beijing, and Qixia City, Yantai, Shandong Province. These two regions differ in terms of climate and soil conditions, resulting in diverse image backgrounds and fruit states. For image annotation, a semi-automated approach was adopted using the LabelMe tool. Annotators first manually outlined the fruit positions and drew bounding boxes around them. 
Subsequently, detailed annotations were added regarding each fruit’s shape index, size, color grading, and surface conditions.
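The exposure handling described above (histogram equalization followed by exclusion of badly exposed images) can be sketched as follows. This is a minimal NumPy illustration for 8-bit grayscale images. The thresholds `low`, `high`, and `max_fraction` are hypothetical and are not the quality control criteria used in this study.

```python
import numpy as np

def equalize_histogram(gray: np.ndarray) -> np.ndarray:
    """Histogram-equalize an 8-bit grayscale image using its cumulative distribution."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]          # CDF value at the darkest occupied bin
    total = gray.size
    if total == cdf_min:               # constant image: nothing to spread
        return gray.copy()
    lut = np.round((cdf - cdf_min) / (total - cdf_min) * 255)
    lut = lut.clip(0, 255).astype(np.uint8)
    return lut[gray]                   # apply the lookup table per pixel

def is_badly_exposed(gray: np.ndarray, low: int = 10, high: int = 245,
                     max_fraction: float = 0.3) -> bool:
    """Flag images in which too many pixels are nearly black or nearly white.

    All three thresholds are illustrative assumptions.
    """
    extreme = (gray <= low) | (gray >= high)
    return bool(extreme.mean() > max_fraction)
```

A production pipeline would more likely call an existing library routine for equalization; the explicit version here only shows the underlying computation.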
Building on the large-scale image data acquisition, additional data were collected by scraping open-source datasets and agriculture-related websites. These sources included expert notes, cultivation records, and meteorological data which, combined with the image data, contributed to the construction of a multimodal dataset. This dataset serves as a valuable resource for fruit phenotypic analysis and anomaly detection while also establishing a foundation for agricultural economic analysis [42,43]. By integrating expert notes, cultivation records, and meteorological data, it is possible to analyze the influence of environmental factors, management practices, and production decisions on fruit quality and yield during the growth process. This provides scientific guidance for agricultural production, optimizing management strategies, and enhancing economic benefits. Expert notes represent a significant component of the dataset, encompassing 20,432 entries documenting common issues and practical experiences throughout the apple cultivation process. These entries span various stages from planting to harvesting and include information on climate, soil, pest management, and irrigation techniques. Cultivation records provide detailed data on apple cultivation processes, including planting times, fertilization practices, irrigation frequencies, and soil treatments, amounting to 19,267 entries. Additionally, meteorological data are another vital source, consisting of 22,803 entries. The growth of apples is closely associated with climatic conditions, as factors such as temperature, humidity, and precipitation directly impact growth cycles, fruit quality, and pest outbreaks. In-depth analysis of these data sources offers valuable references for agricultural economic studies [44,45].
3.2. Data Preprocessing
Data preprocessing is the process of cleaning and adjusting raw images to eliminate noise, correct color biases, and crop regions of interest (ROIs) to improve data effectiveness. Common preprocessing operations in image processing include image cropping, flipping, rotation, resizing, denoising, and white balance correction, as shown in Figure 2.
Image cropping involves selecting ROIs to remove irrelevant background information, thereby reducing computational redundancy and emphasizing the target object. This method is particularly important in fruit phenotype analysis as cropping eliminates background interference, allowing the model to focus more on the features of the fruit itself. Denoising is achieved through techniques such as filtering to reduce random noise in the image, thereby improving image quality. Gaussian filtering is a commonly used denoising method. White balance correction is used to correct color distortion in images and restore the true color information of the fruit. This process adjusts the mean values of the red, green, and blue (RGB) channels of the image such that they align with the target values.
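The white balance step described above, in which the per-channel means are scaled toward target values, can be illustrated with the short NumPy sketch below. When no target is supplied it falls back to the gray-world assumption (all channels share the overall mean). That default is an assumption made for this example rather than the setting used in this study.

```python
import numpy as np

def white_balance(rgb: np.ndarray, target=None) -> np.ndarray:
    """Scale each RGB channel so that its mean matches the target mean.

    If target is None, the gray-world assumption is used: every channel is
    pulled toward the average of the three channel means.
    """
    img = rgb.astype(np.float64)
    channel_means = img.reshape(-1, 3).mean(axis=0)
    if target is None:
        target = np.full(3, channel_means.mean())
    gains = np.asarray(target, dtype=np.float64) / np.maximum(channel_means, 1e-6)
    balanced = np.round(img * gains)           # broadcast gains over the channel axis
    return np.clip(balanced, 0, 255).astype(np.uint8)
```

For example, an image with a uniform blue cast of (100, 150, 200) is mapped to a neutral (150, 150, 150).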
3.3. Data Augmentation
Data augmentation is the process of applying various transformations to the original images to generate a more diverse set of training samples, thereby enhancing the robustness and generalization ability of the model. Common data augmentation techniques include Cutout, Mixup, and CutMix. Cutout involves randomly masking a rectangular region on an image to simulate scenarios where the target is partially occluded. This augmentation technique effectively improves the model’s prediction ability under occlusion. Let the size of the image $I$ be $H \times W$, and let a region of size $h \times w$ be randomly occluded at position $(x, y)$. The augmented image $A$ can be expressed as follows:
$$A(i, j) = \begin{cases} 0, & x \le i < x + h \ \text{and} \ y \le j < y + w \\ I(i, j), & \text{otherwise} \end{cases}$$
Mixup involves linearly mixing two images at a certain ratio, with the aim of improving the model’s smoothness and prediction ability for unseen samples. The augmentation formula for Mixup is
$$\tilde{x} = \lambda x_i + (1 - \lambda) x_j, \qquad \tilde{y} = \lambda y_i + (1 - \lambda) y_j$$
where $\lambda$ is the mixing coefficient sampled from a Beta distribution, $x_i$ and $x_j$ are the two original images, and $y_i$ and $y_j$ are the corresponding labels. CutMix combines the ideas of Cutout and Mixup by pasting a portion of one image onto another and adjusting the labels to reflect the proportion of the mixed region. Let $I_1$ and $I_2$ be the two images. The augmented image $A$ can be expressed as
$$A(i, j) = \begin{cases} I_1(i, j), & (i, j) \in \text{Region}_1 \\ I_2(i, j), & \text{otherwise} \end{cases}$$
The label adjustment formula is
$$\tilde{y} = \lambda y_1 + (1 - \lambda) y_2$$
where $y_1$ and $y_2$ are the labels of $I_1$ and $I_2$, and $\lambda$ is the proportion of $\text{Region}_1$.
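The three augmentations described in this subsection can be sketched in a few lines of NumPy. The function names, the `top`/`left` arguments, and the Beta parameter `alpha=1.0` are illustrative assumptions. In a training pipeline the patch position and mixing coefficient would be drawn at random for each sample.

```python
import numpy as np

def sample_lambda(rng: np.random.Generator, alpha: float = 1.0) -> float:
    """Draw a mixing coefficient from Beta(alpha, alpha); alpha is an assumed value."""
    return float(rng.beta(alpha, alpha))

def cutout(image, top, left, h, w):
    """Zero out an h-by-w rectangle whose top-left corner is (top, left)."""
    out = image.copy()
    out[top:top + h, left:left + w] = 0
    return out

def mixup(img_a, label_a, img_b, label_b, lam):
    """Blend two images and their label vectors with weight lam."""
    mixed = lam * img_a.astype(np.float64) + (1.0 - lam) * img_b.astype(np.float64)
    label = lam * np.asarray(label_a, float) + (1.0 - lam) * np.asarray(label_b, float)
    return mixed, label

def cutmix(img_a, label_a, img_b, label_b, top, left, h, w):
    """Paste a patch of img_b into img_a and mix the labels by area."""
    out = img_a.copy()
    height, width = img_a.shape[:2]
    bottom, right = min(top + h, height), min(left + w, width)
    out[top:bottom, left:right] = img_b[top:bottom, left:right]
    # lam is the fraction of pixels that still come from img_a
    lam = 1.0 - ((bottom - top) * (right - left)) / (height * width)
    label = lam * np.asarray(label_a, float) + (1.0 - lam) * np.asarray(label_b, float)
    return out, label
```

Note that CutMix derives its label weight from the pasted area rather than sampling it independently, which keeps the soft label consistent with what is visible in the image.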
3.4. Hyperparameters
Dataset partitioning is a key step in training and validating machine learning models. The goal is to appropriately allocate data for model training, validation, and testing to ensure the scientific and representative evaluation of the model’s performance. In this study, the dataset was divided into training and validation sets with a ratio of 8:2. Additionally, to enhance the stability and robustness of the model, $K$-fold cross-validation was employed. Specifically, the dataset was divided into $K$ non-overlapping subsets, with $K - 1$ subsets used for training and the remaining subset used for validation. After repeating this process $K$ times, the average performance was computed. The formula for calculating the average performance of $K$-fold cross-validation is
$$\bar{P} = \frac{1}{K} \sum_{k=1}^{K} P_k$$
where $P_k$ represents the performance metric (such as the accuracy or $F_1$ score) for the $k$th validation. In fruit phenotype analysis, through proper data preprocessing, augmentation, and partitioning, a high-quality dataset was built to provide a reliable foundation for subsequent model training and performance evaluation. This method not only enhances the model’s robustness but also provides important support for practical deployment.
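The cross-validation procedure described above can be sketched as follows. This is a minimal NumPy illustration. The `evaluate` callback stands in for training a model on the training indices and scoring it on the validation indices, and both it and the fixed `seed` are placeholders rather than part of the study’s pipeline.

```python
import numpy as np

def k_fold_indices(n_samples: int, k: int, seed: int = 0):
    """Yield (train_idx, val_idx) pairs for k non-overlapping folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

def cross_validate(n_samples: int, k: int, evaluate):
    """Average the per-fold scores returned by evaluate(train_idx, val_idx)."""
    scores = [evaluate(train_idx, val_idx)
              for train_idx, val_idx in k_fold_indices(n_samples, k)]
    return float(np.mean(scores)), scores
```

With `k = 5`, each fold holds out one fifth of the data, which matches the 8:2 training-to-validation ratio on every iteration.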
3.5. Proposed Method
The method presented in this study, from its overall design to specific implementation, aims to extract fruit phenotype features and identify growth anomalies through a multimodal data fusion approach, combining instance segmentation and NLP techniques. The overall framework of the model is a continuous flow from input data to final output predictions, involving the collaborative work of multiple modules, as shown in Figure 3.
First, after the input data were preprocessed, they entered the instance segmentation network for feature extraction from the images. The extracted image features were then processed by subsequent modules, such as the feature transformation and alignment stages, to refine and integrate the information. Concurrently, the Agricultural Knowledge Data block, which includes textual data such as planting records, climate data, and expert notes, was parsed by the NLP module. This NLP process extracted meaningful insights from the textual data to complement the visual features extracted from the images. The integration of both image and agricultural knowledge data allowed for more accurate fruit phenotype recognition and the identification of growth anomalies. Finally, the combined results from both the image-based and text-based analyses were used to make precise predictions regarding the fruit phenotype and detect potential growth anomalies.
3.5.1. Edge Transformer Segmentation Network
In this study, the proposed edge transformer segmentation network is a key component for fruit phenotype analysis and growth anomaly identification, combining agricultural images with agricultural knowledge (text data).
As shown in Figure 4, the network design incorporates a deep fusion of instance segmentation and NLP techniques, aiming to precisely extract fruit phenotype features through image segmentation and text analysis, while also integrating agricultural text data (such as planting records and meteorological data) for anomaly detection. The network input consists of two parts: (1) agricultural images, which provide visual information about the fruits, and (2) agricultural knowledge (text data), which supplies multi-dimensional information regarding the fruit’s growth environment, climate change, and management practices. The fusion of this multimodal data helps enhance the prediction accuracy of fruit growth anomalies and provides robust support for fruit quality management and agricultural production decision making.
The network implementation explicitly leverages the Transformer architecture to model cross-modal interactions rather than applying simple rules to a 2D matrix. Specifically, both image and text features are projected into a shared feature space, where multi-head self-attention (MHSA) is employed to dynamically learn the dependencies between different modalities. First, the network extracts basic features from the image using convolutional layers, generating an initial feature map. This feature map then passes through the edge-aware module, which is designed to enhance the network’s focus on the fruit’s edge areas, especially when the fruit has a complex shape or surface damage. The edge-aware module further strengthens the edge features in the image by combining traditional edge detection algorithms (such as Sobel or Canny) with convolutional operations. The module computes the edge feature map of an image and merges it with the feature map generated by the convolution layers, enhancing the sensitivity to fruit contours, cracks, and other detailed regions and thereby improving the segmentation accuracy.
Next, the enhanced feature map enters the Transformer module for global information modeling. Unlike conventional convolution-based approaches, which rely on local receptive fields, the Transformer module effectively models long-range dependencies within an image through self-attention mechanisms. Given an input feature representation $F$, the self-attention operation is computed as follows:
$$Q = F W_Q, \quad K = F W_K, \quad V = F W_V, \qquad \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$
where $W_Q$, $W_K$, and $W_V$ are the query, key, and value projection matrices, respectively, and $d_k$ is the dimensionality of the key vectors. By leveraging this global attention mechanism, the network learns contextual relationships across different regions of an image, particularly addressing cases where fruit morphology spans multiple spatial regions.
In addition to image data, the network’s input also includes textual data related to fruit growth. The NLP module parses and analyzes agricultural text data (such as climate records, planting records, and expert notes), providing additional support for predicting fruit growth anomalies. To ensure a seamless fusion of text and image data, textual information is encoded using a Transformer-based embedding model, such as BERT or a domain-specific language model. This process converts textual descriptions into dense feature vectors, which are then aligned with the visual embeddings via cross-attention layers. The NLP module employs a Transformer structure to process the text data, extracting key information related to fruit growth such as climate changes, fertilization management, and environmental factors. This information, combined with the image data, provides more comprehensive background knowledge for the network, helping the model better understand anomalies in the fruit’s growth process while identifying its phenotype features. To further refine multimodal interactions, a cross-attention mechanism is incorporated, where text features serve as queries and image features serve as keys and values, thereby guiding the visual representation learning process based on domain knowledge. Ultimately, the edge transformer segmentation network, through the joint processing of image and text data, is capable of providing high-precision and robust segmentation results for fruit phenotype analysis and growth anomaly detection. Compared with traditional methods, the network design presented in this study fully accounts for the fruit’s morphological features and surface conditions and the influence of external environments.
This design, particularly when dealing with fruits with fuzzy edges, damage, or irregular shapes, demonstrates stronger accuracy and robustness. The multimodal data fusion approach not only enhances the accuracy of fruit quality assessment but also provides powerful technical support for intelligent decision making in agricultural production.
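The self-attention and cross-attention steps described in this subsection can be illustrated with a single-head NumPy sketch. It is a simplification of the multi-head design: the random projection matrices, the token counts, and `d_model = 8` are placeholder assumptions, and in the actual network the projections would be learned.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """Return softmax(q k^T / sqrt(d_k)) v together with the attention weights."""
    d_k = k.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d_k), axis=-1)
    return weights @ v, weights

def self_attention(features, w_q, w_k, w_v):
    """Queries, keys, and values all come from the same feature tokens."""
    return scaled_dot_product_attention(features @ w_q, features @ w_k, features @ w_v)

def cross_attention(text_feats, image_feats, w_q, w_k, w_v):
    """Text tokens act as queries; image tokens supply keys and values."""
    return scaled_dot_product_attention(text_feats @ w_q, image_feats @ w_k, image_feats @ w_v)

rng = np.random.default_rng(0)
d_model = 8                                      # assumed shared embedding width
image_tokens = rng.normal(size=(16, d_model))    # e.g. a flattened 4x4 feature map
text_tokens = rng.normal(size=(3, d_model))      # e.g. three encoded text tokens
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
fused, attn = cross_attention(text_tokens, image_tokens, w_q, w_k, w_v)
print(fused.shape, attn.shape)  # (3, 8) (3, 16)
```

Each row of `attn` shows how strongly one text token attends to each image location, which is the quantity that lets textual context steer the visual representation.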
3.5.2. Edge Attention Mechanism
The proposed edge attention mechanism module is an extension of the traditional self-attention mechanism, with a particular focus on edge information in the image to enhance the network’s sensitivity to the boundary regions of an apple. The goal of this module is to introduce an edge-aware mechanism which increases the network’s attention to the fruit’s boundary regions, thereby optimizing segmentation results, particularly when fruit shapes are complex, the surface is damaged, or the boundaries are blurred. Compared with the traditional self-attention mechanism, the edge attention mechanism enables the network to prioritize the learning and enhancement of edge features during training, ensuring accurate segmentation in complex fruit shapes and damaged regions, as shown in Figure 5.
In the design of the edge attention mechanism, the core idea of the network architecture is to combine the advantages of the standard self-attention mechanism with the edge-aware module. Specifically, the network first uses traditional convolutional layers to extract feature maps and then employs edge detection algorithms to generate an edge map, followed by a weighted self-attention mechanism to increase the network’s focus on the edge regions. The edge attention mechanism stacks multiple self-attention layers, with each layer containing a multi-head self-attention module and a feedforward neural network. The width and height of each layer remain consistent, ensuring spatial consistency of the feature maps across different layers. The specific design of the network is as follows. The input image size is $H \times W \times C$, where $H$ is the image height, $W$ is the image width, and $C$ is the number of input channels. In the self-attention layers, the output feature map remains $H \times W \times C$ and is further enhanced in the edge-aware module, improving the edge regions of the feature map. To enhance the self-attention mechanism with edge awareness, an edge-weighting function $w(E)$ is introduced, dynamically adjusting the attention computation such that pixels in the edge regions receive higher weights. The edge-enhanced attention is formulated as follows:
$$\mathrm{EdgeAttention}(Q, K, V, E) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}} + w(E)\right) V$$
where $E$ represents the edge feature map generated by the edge detection module, and its influence is adjusted through learnable weights. This formulation ensures that the network focuses more on fruit contours, even when dealing with irregular shapes or blurred edges, thereby improving segmentation precision. The parameters for each layer are as follows. The dimension of each self-attention head is $d_k = C / N$, where $N$ is the number of heads, typically set to eight. The output from each layer is processed through a feedforward neural network, with the width and height of the feature map maintained at $H \times W$ and the number of channels remaining $C$. To incorporate edge information, a weighting mechanism is designed by combining the edge feature map $E$ with the feature map $F$ to enhance the attention weights on the edge regions. The specific weighted calculation is given by
$$F' = F \odot \left(1 + F_{\text{edge}}(E)\right)$$
where $F_{\text{edge}}(E)$ refers to the edge features generated by the edge detection module and $E$ represents the edge regions of the image. Through this mechanism, the edge attention mechanism prioritizes modeling the fruit’s edges, ensuring segmentation precision, particularly in regions where the fruit shape is complex or damaged.
The edge attention mechanism enhances the network’s attention to the boundaries and surface damage of the fruit, while the edge transformer segmentation network handles extraction of the overall phenotype features from an image. These two modules work collaboratively in the segmentation process, improving segmentation precision. By enhancing the network’s sensitivity to edge regions, the edge attention mechanism provides higher segmentation accuracy in detailed image parts, particularly in edges and damage areas. Meanwhile, the edge transformer segmentation network focuses on processing global information and fruit morphological features. This design ensures that when handling complex scenarios and fruit with detailed features, the model can perform segmentation with higher precision, especially when fruit surfaces exhibit cracks, spots, or rot, while still maintaining high recognition accuracy. Therefore, the edge attention mechanism not only improves the model’s performance in segmentation tasks, but also, by combining edge-aware mechanisms with global context modeling, greatly enhances the model’s adaptability and robustness in fruit phenotype analysis.
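As a rough illustration of how an edge map can steer attention toward boundary pixels, the NumPy sketch below computes a Sobel edge map and adds it as a bias to single-head self-attention logits. This is one plausible realization, not the implementation used in this study. The additive bias, the fixed scalar `alpha` (standing in for learnable weights), and the use of the raw features as queries, keys, and values are all simplifying assumptions.

```python
import numpy as np

def sobel_edge_map(gray: np.ndarray) -> np.ndarray:
    """Sobel gradient magnitude of a 2D image, normalized to [0, 1]."""
    g = np.pad(gray.astype(np.float64), 1, mode="edge")
    # Horizontal gradient: right column minus left column of each 3x3 window
    gx = (g[:-2, 2:] + 2 * g[1:-1, 2:] + g[2:, 2:]) - (g[:-2, :-2] + 2 * g[1:-1, :-2] + g[2:, :-2])
    # Vertical gradient: bottom row minus top row of each 3x3 window
    gy = (g[2:, :-2] + 2 * g[2:, 1:-1] + g[2:, 2:]) - (g[:-2, :-2] + 2 * g[:-2, 1:-1] + g[:-2, 2:])
    magnitude = np.hypot(gx, gy)
    peak = magnitude.max()
    return magnitude / peak if peak > 0 else magnitude

def edge_biased_attention(features: np.ndarray, edge_map: np.ndarray, alpha: float = 1.0):
    """Self-attention over H*W tokens with key logits raised by alpha * edge strength.

    features has shape (H, W, C); edge_map has shape (H, W). The full
    (H*W) x (H*W) weight matrix is materialized, so this suits small maps only.
    """
    h, w, c = features.shape
    tokens = features.reshape(h * w, c)
    logits = tokens @ tokens.T / np.sqrt(c)
    logits = logits + alpha * edge_map.reshape(1, h * w)  # bias every query toward edge keys
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return (weights @ tokens).reshape(h, w, c), weights
```

Because the bias is added before the softmax, a larger `alpha` shifts more attention mass onto contour pixels without changing the shape of the output feature map.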
3.5.3. Edge Loss Function
The proposed edge loss function in this study is a novel loss function designed to optimize edge accuracy in image segmentation tasks, especially for fruit phenotype analysis. Traditional loss functions, such as cross-entropy loss and Dice loss, are effective in optimizing models in most cases. However, they typically overlook the details of the edge regions, especially when dealing with complex shapes, blurred boundaries, or damaged areas, leading to unclear segmentation results at the boundaries. To address this issue, the edge loss function introduces an edge-aware mechanism which assigns more weight to the edge regions of an image, enabling finer segmentation. The core idea of the edge loss function is to incorporate specific optimization for the edge regions on top of traditional loss functions. Traditional loss functions are usually optimized at the pixel level across the entire image, neglecting the importance of edge regions in image segmentation. In fruit phenotype analysis, accurate segmentation of the fruit’s contours, surface damage, and deformed areas is crucial for the quality of the results. The edge loss function, by weighting the loss of the edge regions, ensures that the model focuses more on the boundary areas during training, thereby improving segmentation precision. The mathematical formula for edge loss function can be expressed as follows:
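$$L_{\text{edge}} = -\frac{1}{N} \sum_{i=1}^{N} \mathbb{1}_{\text{edge}}(i) \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$$
where $\hat{y}_i$ is the predicted value, $y_i$ is the ground truth value, $N$ is the total number of pixels, and $\mathbb{1}_{\text{edge}}(i)$ is the indicator function. When pixel $i$ belongs to the edge region, $\mathbb{1}_{\text{edge}}(i) = 1$; otherwise, it is zero. This weighted loss allows the network to place more learning emphasis on the fruit’s edge regions while reducing overemphasis on non-edge areas, resulting in more precise boundary segmentation. In traditional loss functions, such as cross-entropy loss, the formula is
$$L_{\text{CE}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$$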
These loss functions typically optimize all pixels with equal weight, ignoring the uniqueness of edge regions in segmentation. The edge loss function, on the other hand, addresses this by adding specialized weighting for the edge areas, allowing the network to focus more on precise segmentation of the edges, especially when handling complex fruit shapes and surface damage. This design significantly improves the recognition accuracy of edge parts, particularly when fruit surfaces exhibit cracks, spots, or rot, preventing mis-segmentation due to blurred edges. The integration of the edge attention mechanism and edge loss function within the edge transformer segmentation network plays a critical role in fruit phenotype analysis and anomaly detection. The Transformer-based segmentation model captures both global context and fine-grained local details, while the edge-specific enhancements ensure that boundary information is accurately preserved and learned. Compared with traditional segmentation methods which rely solely on pixel-wise classification, this approach effectively models complex shape variations and enables precise contour extraction, making it particularly useful in agricultural applications where fruit shape and surface characteristics are crucial quality indicators. By incorporating both attention-based edge enhancement and an edge-sensitive loss function, the proposed approach not only improves the segmentation accuracy but also enhances robustness in real-world agricultural scenarios. The ability to distinguish subtle fruit defects and deformations makes this method valuable for automated quality assessment, early disease detection, and optimized resource management in smart agriculture.
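The idea of weighting boundary pixels more heavily in the loss can be sketched in NumPy as follows. This is an illustrative variant rather than the exact loss used in this study. It keeps a base weight of 1 for non-edge pixels so that the fruit interior still contributes, and the `edge_weight` value and the 4-connected boundary definition are assumptions.

```python
import numpy as np

def mask_boundary(mask: np.ndarray) -> np.ndarray:
    """Foreground pixels that have at least one 4-connected background neighbour."""
    m = mask.astype(bool)
    p = np.pad(m, 1, mode="constant", constant_values=False)
    interior = p[:-2, 1:-1] & p[2:, 1:-1] & p[1:-1, :-2] & p[1:-1, 2:]
    return m & ~interior

def edge_weighted_bce(pred: np.ndarray, target: np.ndarray, edge: np.ndarray,
                      edge_weight: float = 4.0, eps: float = 1e-7) -> float:
    """Binary cross-entropy in which edge pixels count (1 + edge_weight) times.

    pred holds probabilities in [0, 1]; target holds 0/1 labels; edge is a
    boolean map of boundary pixels. edge_weight is an assumed hyperparameter.
    """
    p = np.clip(pred, eps, 1.0 - eps)
    bce = -(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))
    weights = 1.0 + edge_weight * edge.astype(np.float64)
    return float((weights * bce).sum() / weights.sum())
```

With this weighting, the same prediction error costs more on the fruit contour than in the interior, which is the behaviour the edge loss is designed to encourage.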
3.6. Experimental Design
3.6.1. Hardware and Software Platforms
In this study, the choice and configuration of the hardware and software platform played a critical role in ensuring efficient execution of the experiments and the reliability of the results. On the hardware side, an NVIDIA A100 GPU was used, which is designed specifically for artificial intelligence and high-performance computing. Based on the Ampere architecture, the A100 supports multi-precision computations (including FP64, FP32, TF32, and FP16), with up to 6912 cores and 40 GB or 80 GB of memory, providing exceptional data processing capabilities. During large-scale data training and deep learning model execution, the A100 GPU significantly accelerates computation and supports parallel processing of multiple tasks. The experimental platform also includes a high-performance CPU, more than 256 GB of memory, and high-speed NVMe solid-state drives, ensuring efficient data loading, processing, and storage. For the software platform, the experiment was run on a Linux operating system, specifically Ubuntu 20.04 LTS, which is widely used for its stability and good support for deep learning frameworks. The deep learning models were developed and trained using the PyTorch framework version 1.12.0, with CUDA 11.6 and cuDNN 8.3 installed to fully leverage GPU acceleration. The Adam optimizer was chosen for its fast convergence and adaptability, making it one of the mainstream optimization algorithms in deep learning. The learning rate was set to 0.001, which was determined through several experimental adjustments to be the optimal value, ensuring quick convergence without oscillation. Additionally, the OpenCV and Albumentations libraries were used for efficient data preprocessing and augmentation, while model performance evaluation and visualization were performed using tools such as Matplotlib and Seaborn. 
The entire experimental environment was deployed through Docker containerization, which not only improved the reproducibility of the experiments but also facilitated cross-platform migration.
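To make the role of the learning rate concrete, the following minimal sketch implements a single scalar Adam update in plain Python. Only the learning rate of 0.001 comes from the text; the values `beta1=0.9`, `beta2=0.999`, and `eps=1e-8` are the commonly used defaults and are assumed here, and the quadratic toy objective is purely illustrative.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; returns (param, m, v)."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Toy objective f(w) = (w - 3)^2 with gradient 2 * (w - 3).
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 6):
    grad = 2 * (w - 3)
    w, m, v = adam_step(w, grad, m, v, t)
print(w)
```

Because the update is normalized by the running gradient magnitude, each early step moves the parameter by roughly the learning rate regardless of the gradient scale, which is one reason a moderate value such as 0.001 tends to converge without oscillation.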
3.6.2. Baseline Models
To comprehensively assess the performance of the proposed method, several classic deep learning models were chosen as baseline models, including Tiny-Segformer [23], Mask R-CNN [33], UNet [30], UNet++ [46], and DeepLabV3+ [31]. These models represent different technological directions and architectural characteristics in the field of image segmentation. Tiny-Segformer is a lightweight Transformer architecture which combines an efficient self-attention mechanism with convolution operations, maintaining computational efficiency while offering strong feature extraction capabilities, making it particularly suitable for resource-constrained scenarios. Mask R-CNN is a dual-task model for object detection and segmentation, capable of generating pixel-level segmentation masks in addition to bounding box detection. Its loss function includes the classification loss, bounding box regression loss, and segmentation loss. Both UNet and its improved version, UNet++, use an encoder-decoder structure at their core, integrating multi-scale features through skip connections. These architectures are particularly suitable for fine-grained segmentation tasks in medical and agricultural imaging, with UNet++ further enhancing the network’s expressive power by redesigning the skip connection modules. DeepLabV3+ uses dilated convolution (atrous convolution) and the Atrous Spatial Pyramid Pooling (ASPP) module to effectively capture multi-scale contextual information, with the loss function typically based on pixel-level cross-entropy.
To ensure a fair comparison, all baseline models were trained and tested on the same dataset, using the same image data for phenotype feature extraction and growth anomaly detection. However, since most of these models (such as UNet, UNet++, Mask R-CNN, and DeepLabV3+) are designed primarily for image-based segmentation tasks, they were not originally built to process multimodal information. Therefore, for these models, only the image input was utilized, without directly integrating meteorological data or other agricultural textual information. In contrast, the proposed method incorporates both image and textual data, leveraging a dedicated NLP module to process agricultural knowledge (such as planting records and meteorological data) and fusing it with visual features through an attention-based multimodal learning approach. To maintain a fair experimental set-up, Tiny-Segformer, which is a Transformer-based segmentation model, was extended with an NLP component similar to the one in the proposed system. However, due to its original lightweight design, its capacity for processing and integrating textual information remains more limited than the proposed method. These baseline models provide a reference standard for performance comparison in this study and help thoroughly verify the effectiveness of the proposed method.
3.6.3. Evaluation Metrics
In this study, several evaluation metrics were used to comprehensively assess the model’s performance, including the precision, recall, accuracy, and mean intersection over union (mIoU). These metrics measure the model’s performance in fruit phenotype analysis and anomaly detection from different dimensions. Precision measures the proportion of true positive samples among all samples predicted to be positive, focusing on the correctness of the predictions, which is particularly important in high-precision scenarios. Recall measures the proportion of actual positive samples which were correctly predicted to be positive, reflecting the model’s ability to capture positive samples and serving as an important metric for evaluating false negatives. Accuracy represents the proportion of correctly classified samples among all predictions, suitable for evaluating the overall performance of the model. The mIoU, commonly used in semantic segmentation tasks, calculates the ratio of the intersection to the union between the predicted and ground truth regions, averaging this ratio across all categories to measure the global consistency of the model’s segmentation results. The mathematical definitions of these evaluation metrics are as follows:
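$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{mIoU} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{IoU}_c, \qquad \mathrm{IoU}_c = \frac{|\mathrm{Prediction}_c \cap \mathrm{GroundTruth}_c|}{|\mathrm{Prediction}_c \cup \mathrm{GroundTruth}_c|}$$

where TP represents true positives, FP represents false positives, FN represents false negatives, TN represents true negatives, ∩ denotes the intersection, ∪ denotes the union, $C$ is the total number of classes, and $\mathrm{IoU}_c$ is the intersection over union for class $c$.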
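As an illustration of how these four metrics are computed in practice, the following minimal sketch evaluates them on a toy set of pixel labels. The function names and the toy labels are assumptions made for the example and are not taken from the experimental code.

```python
def precision_recall_accuracy(tp, fp, fn, tn):
    """Compute precision, recall, and accuracy from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

def mean_iou(pred, truth, num_classes):
    """Average the per-class IoU over flat lists of class labels.

    Classes absent from both prediction and ground truth are skipped,
    so the average runs over the classes that actually occur.
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, g in zip(pred, truth) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, truth) if p == c or g == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# Toy example: six pixels, class 1 = fruit (positive), class 0 = background.
truth = [0, 0, 1, 1, 1, 0]
pred = [0, 1, 1, 1, 0, 0]

tp = sum(1 for p, g in zip(pred, truth) if p == 1 and g == 1)
fp = sum(1 for p, g in zip(pred, truth) if p == 1 and g == 0)
fn = sum(1 for p, g in zip(pred, truth) if p == 0 and g == 1)
tn = sum(1 for p, g in zip(pred, truth) if p == 0 and g == 0)

precision, recall, accuracy = precision_recall_accuracy(tp, fp, fn, tn)
miou = mean_iou(pred, truth, num_classes=2)
print(precision, recall, accuracy, miou)
```

In this toy case one fruit pixel is missed and one background pixel is falsely marked as fruit, so the precision and recall are both 2/3 while the mIoU is 0.5, showing that the mIoU penalizes boundary errors more strongly than the pixel-wise accuracy does.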
4. Results and Discussion
4.1. Experimental Results of Phenotype Feature Extraction Models
The experimental design presented in this study aims to evaluate the performance of different deep learning models in the task of extracting fruit phenotype features. By comparing the precision, recall, accuracy, and mIoU metrics of various models, the study analyzed the advantages and shortcomings of each model in fine segmentation tasks, providing a theoretical foundation for subsequent model optimization. The models used in the experiments included UNet, Mask R-CNN, DeepLabV3+, UNet++, Tiny-Segformer, and the proposed method. These comparative experiments allow the influence of the model architecture, loss functions, and optimization strategies on the accuracy of phenotype feature extraction to be examined in depth, offering theoretical support for intelligent fruit analysis in agricultural production.
As shown in Table 2, the experimental results demonstrate different levels of performance across all models in terms of precision, recall, accuracy, and the mIoU. The UNet model achieved a precision of 0.84, recall of 0.82, accuracy of 0.83, and mIoU of 0.80, indicating good performance in basic segmentation tasks but with room for improvement in handling finer details. Compared with UNet, Mask R-CNN showed improvements in its precision and recall, achieving scores of 0.86 and 0.83, respectively, with an accuracy and mIoU of 0.85 and 0.83, respectively, indicating better handling of object boundaries and details in the instance segmentation task. DeepLabV3+ introduced atrous convolution and spatial pyramid pooling modules, which enhanced the model’s ability to capture multi-scale features, with precision, recall, and mIoU scores of 0.89, 0.86, and 0.86, respectively, demonstrating an advantage in processing multi-scale contextual information. UNet++ further improved the precision to 0.90, the recall to 0.88, and the mIoU to 0.87, showing that its enhanced skip connections improved detail recovery. Tiny-Segformer, a lightweight Transformer architecture, demonstrated strong feature extraction capabilities with efficient self-attention and convolution operations, achieving precision and recall scores of 0.92 and 0.89, respectively, and an mIoU of 0.89, indicating that the model could provide strong feature extraction while maintaining computational efficiency. The proposed method outperformed all other models, with a precision of 0.95, recall of 0.91, accuracy of 0.93, and mIoU of 0.92, demonstrating that the model, through combining edge perception mechanisms and global contextual modeling, provided higher precision and robustness in the complex task of fruit phenotype feature extraction. From a theoretical perspective, the architectures of the models significantly influenced the experimental results.
Both UNet and UNet++ employ encoder-decoder structures and fuse multi-scale features through skip connections, but UNet++ further enhances network expressiveness through improved skip connections, leading to better performance in detail recovery. Mask R-CNN, a dual-task model for object detection and segmentation, not only provides bounding box detection but also generates pixel-level segmentation masks, which contribute to better segmentation results when handling object boundaries and complex structures. DeepLabV3+ effectively captures multi-scale contextual information through atrous convolution and spatial pyramid pooling modules, which is particularly advantageous when dealing with complex backgrounds and large objects. The lightweight Transformer architecture of Tiny-Segformer, which combines efficient self-attention mechanisms with convolution operations, enables the model to extract strong global features while maintaining computational efficiency, which is why it achieves high precision. The proposed method combines the advantages of these models by incorporating edge perception mechanisms and global contextual information modeling, enabling more precise capture of fruit contours, surface damage, and other detailed features, which is why it outperformed the other models across all evaluation metrics. These results highlight the significant impact of the model structure, loss functions, and optimization strategies on segmentation performance, especially in complex scenarios. The models which incorporated edge information and a global context had greater robustness and accuracy.
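The size of the improvement can be made explicit with a short calculation on the mIoU values quoted above for phenotype feature extraction. The sketch below only rearranges the reported numbers; the variable names are illustrative.

```python
# mIoU values as quoted in the text for the phenotype feature extraction task.
reported_miou = {
    "UNet": 0.80,
    "Mask R-CNN": 0.83,
    "DeepLabV3+": 0.86,
    "UNet++": 0.87,
    "Tiny-Segformer": 0.89,
}
proposed_miou = 0.92

# Absolute gain (mIoU points) and relative gain (percent) of the proposed method.
gains = {
    name: (proposed_miou - value, (proposed_miou - value) / value * 100.0)
    for name, value in reported_miou.items()
}

for name, (absolute, relative) in gains.items():
    print(f"{name:15s} +{absolute:.2f} mIoU ({relative:.1f}% relative)")
```

The margin is largest against UNet and smallest against Tiny-Segformer, which is consistent with the observation that Transformer-based global modeling already closes much of the gap before edge-specific components are added.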
4.2. Experimental Results of Growth Anomaly Recognition Models
The design of this experiment aims to evaluate the performance of various deep learning models in the task of growth anomaly recognition, particularly in identifying abnormal conditions during fruit growth, such as pest damage and cracks. As shown in Table 3, the experiment compared the performance of different models based on various metrics to analyze their advantages and limitations in handling growth anomalies and to provide a theoretical basis for practical applications in fruit quality monitoring and pest warning systems.
From the experimental results, it is evident that all models exhibited varying degrees of performance in the growth anomaly recognition task. The UNet model achieved a precision of 0.82, recall of 0.80, accuracy of 0.81, and mIoU of 0.79, demonstrating basic performance. Compared with UNet, Mask R-CNN showed improvements in precision, recall, and mIoU, with values of 0.84, 0.82, and 0.81, respectively. This indicates that its dual-task structure (object detection and instance segmentation) played an active role in recognizing fruit growth anomalies. DeepLabV3+, with the incorporation of atrous convolution and spatial pyramid pooling modules, showed an advantage in handling multi-scale contextual information, achieving a precision of 0.87, recall of 0.84, and mIoU of 0.82. This suggests that DeepLabV3+ performs better than the previous models in complex scenarios. UNet++, with its improved skip connection module, demonstrated excellent performance in terms of both precision (0.89) and recall (0.87), with an mIoU of 0.84, confirming its advantage in detail recovery and multi-scale information fusion. Tiny-Segformer, a lightweight Transformer-based architecture, combined self-attention mechanisms and convolution operations, achieving a precision of 0.91, recall of 0.88, and mIoU of 0.86, indicating its powerful feature extraction ability and global information modeling capabilities. The proposed method outperformed all other models in all metrics, with a precision of 0.93, recall of 0.90, accuracy of 0.91, and mIoU of 0.89, demonstrating that combining edge perception mechanisms and global information modeling significantly improves recognition accuracy and robustness in handling complex growth anomaly scenarios. From a theoretical analysis perspective, the architectural characteristics of the models directly influenced the experimental results. Both UNet and UNet++ use an encoder-decoder structure and fuse multi-scale features through skip connections.
However, UNet++ further optimizes skip connections, improving its ability to recover details, which is why it performed better in growth anomaly recognition. Mask R-CNN, with its dual-task structure for object detection and instance segmentation, can simultaneously localize a fruit and precisely segment its area, making it superior to UNet in handling anomalies with clear boundaries. DeepLabV3+ benefits from atrous convolution and spatial pyramid pooling modules, enabling it to capture richer contextual information at multiple scales and thereby improving its ability to recognize complex backgrounds and irregular anomalies. Tiny-Segformer, through the combination of efficient self-attention mechanisms and convolution operations, excels in feature extraction and global information modeling, allowing it to better capture long-range dependencies, which contributed to its high precision and recall scores. The proposed method introduces an edge perception mechanism, enabling the network to focus more on fruit edges and anomaly regions. By combining this mechanism with global information modeling, the model effectively improves performance in complex growth anomaly scenarios. The edge perception mechanism strengthens the precise recognition of anomaly regions, while the Transformer architecture enhances the understanding of the global context, allowing the model to maintain high precision when dealing with complex fruit shapes and surface damage. These experimental results suggest that optimizing the model architecture, enhancing feature extraction capabilities, and introducing edge perception mechanisms are crucial for improving the accuracy and robustness of growth anomaly recognition.
4.3. Accuracy Results of Different Models for Various Phenotype Features
As shown in Table 4, the design of this experiment aimed to evaluate the performance of various deep learning models in extracting and recognizing fruit phenotype features such as the fruit shape index, fruit size, color, and surface state. By comparing the accuracy of different models on these phenotype features, the goal was to analyze the performance differences among models when handling various fruit features and to investigate the impact of different model architectures on feature extraction and recognition accuracy. The analysis of the experimental results provides theoretical support for selecting the optimal model in fruit phenotype analysis tasks and reveals the role of model architecture and optimization strategies in enhancing fruit feature extraction accuracy.
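To show how such phenotype features can be derived once a segmentation mask is available, the following minimal sketch computes the shape index (longitudinal over transverse diameter), the size (transverse diameter), and a coloration rate from a binary mask. It assumes that the stem axis of the apple is vertical in the image, that a pixel-to-millimetre calibration factor `mm_per_pixel` is known, and that a per-pixel flag for colored skin is supplied; these inputs and all function names are hypothetical and not part of the described pipeline.

```python
def shape_index_and_size(mask, mm_per_pixel=1.0):
    """Return (shape index, transverse diameter) from a binary mask.

    Assumes the longitudinal (stem) axis is vertical in the image.
    """
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for x in range(len(mask[0])) if any(row[x] for row in mask)]
    if not rows:
        raise ValueError("mask contains no foreground pixels")
    longitudinal = (rows[-1] - rows[0] + 1) * mm_per_pixel
    transverse = (cols[-1] - cols[0] + 1) * mm_per_pixel
    return longitudinal / transverse, transverse

def coloration_rate(mask, colored):
    """Fraction of fruit pixels that are flagged as colored skin."""
    fruit = sum(v for row in mask for v in row)
    hits = sum(1 for mrow, crow in zip(mask, colored)
               for m, c in zip(mrow, crow) if m and c)
    return hits / fruit if fruit else 0.0

# Toy mask: the fruit spans 4 rows and 5 columns.
mask = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
# Hypothetical color flags: only the left part of the image (columns 0 to 3) is colored.
colored = [[1 if x <= 3 else 0 for x in range(7)] for _ in range(6)]

index, size_mm = shape_index_and_size(mask, mm_per_pixel=2.0)
rate = coloration_rate(mask, colored)
print(index, size_mm, rate)
```

Because every quantity is read directly from the mask extents, any boundary error in the segmentation propagates into the shape index and size, which is why the accuracy on these features depends so strongly on edge quality.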
The experimental results show significant differences in the performance of all models for different phenotype features. The UNet model exhibited relatively lower accuracy scores across all phenotype features, with values of 0.81, 0.82, 0.83, and 0.84, indicating a basic ability in fruit feature extraction but difficulty in capturing complex feature relationships due to its relatively simple architecture. Mask R-CNN showed improvement over UNet, especially in recognizing fruit surface states, with an accuracy of 0.87 compared with 0.84 for UNet. This suggests that the model’s dual-task structure (object detection and instance segmentation) enhances boundary detail extraction in the instance segmentation task. DeepLabV3+ incorporates atrous convolution and spatial pyramid pooling modules, which better handle multi-scale features, leading to improved recognition accuracy for the fruit size and surface state, with values of 0.87 and 0.89, respectively. UNet++ improves upon the skip connection in the encoder-decoder structure, further enhancing the accuracy of all phenotype features, especially the fruit size and surface state, with accuracy values of 0.89 and 0.91, respectively, indicating its advantage in handling complex structural features. Tiny-Segformer, a lightweight Transformer-based model, significantly improved the accuracy across all features, especially fruit color and surface state, reaching 0.92 for both, demonstrating the advantage of the self-attention mechanism in global feature modeling. The proposed method outperformed all other models in feature extraction, with accuracy values of 0.92, 0.93, 0.94, and 0.95 for the various phenotype features, with the highest value obtained for surface state recognition. By combining edge perception mechanisms with global contextual information modeling, the proposed model significantly enhanced its detail capture, allowing it to more precisely identify the surface features of the fruit.
From a theoretical analysis perspective, the differences in the experimental results were directly influenced by the model architectures. Both UNet and UNet++ adopt an encoder-decoder structure and fuse multi-scale features through skip connections, but UNet++ further optimizes skip connections, enhancing its ability to integrate multi-scale features, which led to better performance in fruit phenotype feature extraction. Mask R-CNN, as a multi-task learning model, leverages object detection mechanisms to effectively extract object boundaries and generate accurate segmentation masks. This ability enabled the model to achieve better results when handling surface state and complex fruit shape features. DeepLabV3+ uses atrous convolution and spatial pyramid pooling modules, which expand the receptive field and capture multi-scale contextual information, giving it an advantage in handling large objects and complex backgrounds. Tiny-Segformer combines self-attention mechanisms with convolution operations, allowing it to better capture long-range dependencies and efficiently utilize global information during feature extraction, resulting in improved performance in terms of fruit color and shape recognition. The proposed method, with the introduction of an edge perception mechanism, not only improved the segmentation accuracy of the surface state but also effectively captured subtle changes in fruit shape, demonstrating strong robustness and high precision when identifying complex fruit shapes and surface damage. These results indicate that model architecture innovation, enhanced feature extraction capabilities, and effective utilization of edge information are crucial factors in improving the accuracy of phenotype feature extraction.
4.4. Ablation Experiment with Different Attention for Phenotype Features
As shown in Table 5, the design of this experiment aimed to evaluate the impact of different attention mechanisms on model performance, particularly in the precise extraction of features and recognition of anomalies in fruit phenotype analysis tasks. Specifically, the experiment compared the performance of the standard self-attention mechanism, the channel and spatial attention mechanism (CBAM), and the proposed improved attention mechanism across various metrics. The objective of these comparisons was to clarify the effectiveness of different attention mechanisms in capturing image features, thereby verifying whether the proposed method could effectively enhance model performance, particularly in the tasks of fruit phenotype feature extraction and growth anomaly recognition.
From the experimental results, it can be observed that the standard self-attention mechanism exhibited the weakest performance, with a precision of 0.76, recall of 0.72, accuracy of 0.74, and mIoU of 0.71. This suggests that while the standard self-attention mechanism can capture global information, its ability to focus on local features and details is limited. In comparison, the CBAM showed significant improvements across all metrics, with a precision of 0.85, recall of 0.81, accuracy of 0.83, and mIoU of 0.80. This indicates that the combination of channel and spatial attention mechanisms effectively enhanced the model’s focus on different regions and features of an image, especially improving sensitivity to fruit features and anomaly areas. The proposed method, which combines edge perception mechanisms and global contextual information modeling, achieved the best results in terms of precision (0.95), recall (0.91), accuracy (0.93), and mIoU (0.92), demonstrating that the method effectively enhances feature capture in edge and complex regions, improving the precision of fruit phenotype feature extraction and growth anomaly recognition. From a theoretical perspective, the standard self-attention mechanism models global information by calculating the similarity between image features, but it lacks specialized attention to key local regions. As a result, it may overlook edge information and details when processing complex image features. The CBAM improves upon the standard self-attention mechanism by introducing attention mechanisms in both the channel and spatial dimensions, enhancing the model’s attention to different channels and spatial regions. This improvement effectively boosts the model’s ability to extract features, particularly in handling the details of fruit shapes and surface states.
The proposed method further refines this approach by introducing an edge perception mechanism, which enables the network to focus on the edge regions of a fruit during training. This is especially crucial when dealing with surface damage or deformities, as it allows for precise identification of these complex regions. Mathematically, the standard self-attention mechanism typically models global information by calculating the relationships between all pairs of positions in the input feature map, but its expression of local details, particularly the surface details of a fruit, is weak. The CBAM, by introducing attention mechanisms in both the channel and spatial dimensions, not only allows the model to focus on global features but also enables it to weight the important regions of an image, enhancing its ability to recognize details. The proposed method improves this further by incorporating an edge perception mechanism, which focuses more attention on edge regions, maintaining high segmentation accuracy even in cases where a fruit shape is complex, the surface is damaged, or boundaries are unclear. This design illustrates the advantage of an edge-aware attention mechanism in combining both local and global features, significantly enhancing the model’s robustness and precision, especially when dealing with complex image tasks.
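One simple way to picture how attention can be steered toward edge regions is to add an edge-dependent bias to the attention scores before the softmax. The sketch below is only an illustration of that general idea under stated assumptions; it is not the edge attention module of the proposed network, and the function name `edge_biased_attention`, the additive form, and the value `bias=2.0` are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def edge_biased_attention(scores, edge_strength, bias=2.0):
    """Attention weights with an additive bias toward edge positions.

    scores:        raw similarity scores for each position
    edge_strength: values in [0, 1], larger at fruit boundaries
    bias:          hypothetical scale of the edge term
    """
    return softmax([s + bias * e for s, e in zip(scores, edge_strength)])

# Four positions with identical content similarity; the middle two lie on an edge.
scores = [0.0, 0.0, 0.0, 0.0]
edge_strength = [0.0, 1.0, 1.0, 0.0]

plain = edge_biased_attention(scores, edge_strength, bias=0.0)
biased = edge_biased_attention(scores, edge_strength, bias=2.0)
print(plain)
print(biased)
```

Without the bias, the four positions share the attention equally; with it, the two edge positions receive most of the weight while the weights still sum to one, so global context is re-weighted rather than discarded.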
4.5. Application in Agricultural Economics
The model proposed in this study, by integrating agricultural knowledge with deep learning techniques, contributes to the agricultural economy. Through the precise extraction of fruit phenotypic features and the effective identification of growth anomalies, the model not only enhances the ability to monitor fruit quality but also provides data support for quantifiable economic assessments in agricultural production. In the context of agricultural economics, the quality of fruit directly influences the market value and production efficiency. Therefore, accurate fruit quality assessment is crucial for improving agricultural productivity and economic benefits. By comprehensively analyzing the shape, surface condition, size, and color of a fruit, the model provides agricultural producers with detailed fruit quality information, enabling farmers to monitor crop growth in real time and take timely measures to optimize production processes and reduce losses. The model further contributes to economic optimization by reducing post-harvest losses through precise defect detection and classification, allowing for better sorting and market positioning of agricultural products.
Additionally, the incorporation of agricultural knowledge for anomaly detection allows the model to address the impact of environmental factors such as climate change and soil conditions on fruit growth, offering targeted warnings and recommendations. This not only helps improve the sustainability of agricultural production but also reduces the occurrence of pests and diseases, thus minimizing pesticide usage and promoting the development of green agriculture. By leveraging real-time phenotypic analysis, producers can adopt data-driven strategies to adjust cultivation practices dynamically, aligning resource investment with the predicted yield and market demand. Practical application of the model in agricultural economics has facilitated the optimization of resource allocation. By accurately identifying anomalies in fruit growth and pest occurrences, producers can precisely deploy fertilization, irrigation, and pest control measures, avoiding excessive resource use and further reducing production costs. The model’s ability to integrate phenotypic analysis with environmental and economic parameters provides a foundation for cost-benefit assessments, allowing for more informed decision making regarding investment in precision farming technologies. In large-scale agricultural production, this intelligent fruit monitoring technology significantly increases labor productivity, reduces labor costs, and enhances the digitalization and automation levels of the agricultural supply chain, promoting the modernization of agriculture.
4.6. Future Work in Smart Agriculture
In future work, the focus will be placed on the practical application of the proposed model in real-world agricultural production. To achieve this, efforts will be directed toward optimizing the model for deployment on edge computing devices such as the Jetson Nano, enabling real-time processing with reduced computational complexity, as shown in Table 6.
By designing a lightweight version of the proposed edge transformer segmentation network, the feasibility of implementing the model in resource-constrained environments, such as automated agricultural machinery and intelligent monitoring systems, will be explored. This will allow for real-time fruit phenotype analysis and anomaly detection in the field, providing immediate feedback for decision making in agricultural management. Additionally, the integration of the model into unmanned aerial vehicles (UAVs) and robotic harvesting systems will be investigated to enhance precision agriculture practices. These developments will further bridge the gap between theoretical advancements and practical applications, ensuring that the proposed methodology contributes to improving agricultural efficiency, reducing resource waste, and supporting intelligent agricultural decision making in real-world scenarios.
5. Conclusions
With the rapid development of intelligent agriculture, efficiently and accurately assessing fruit growth status and quality in apple production has become a critical factor in enhancing agricultural productivity and economic benefits. This study proposed a deep learning-based approach for apple phenotype analysis and growth anomaly recognition by integrating instance segmentation, NLP, and innovative attention mechanisms. The method addresses the limitations of traditional fruit quality detection and anomaly recognition techniques, providing robust technical support for intelligent agriculture and agricultural economic analysis.
The proposed approach introduces several innovations. First, a comprehensive method combining instance segmentation and NLP is presented. By integrating image analysis and text parsing, the method accurately extracts fruit phenotype features, such as the fruit shape, color, and surface condition, while simultaneously utilizing agricultural textual data, including meteorological information and cultivation records, to identify growth anomalies. This multimodal data fusion overcomes the limitations of traditional image-based methods, enabling a holistic improvement in the accuracy of fruit quality detection and anomaly prediction from multiple perspectives. Moreover, this study introduced innovative edge attention modules and edge loss mechanisms, which enhance the model’s focus on fruit edge regions and refine its handling of abnormal areas. These advancements significantly improve the model’s performance in scenarios involving complex fruit morphology, surface damage, and growth anomalies. Through these innovative designs, the proposed method not only enhances the precision of fruit phenotype feature extraction but also provides more accurate and reliable data support for agricultural production decision making. The experimental results demonstrate that the proposed approach achieved significant improvements in accuracy and offers new perspectives and technological support for agricultural economic analysis.