1. Introduction
Chemical herbicides have become a dominant method of agricultural weed control, widely adopted due to their effectiveness and ease of use. These chemicals are applied both pre-emergence and post-emergence to prevent or eliminate weed competition in crops, allowing for better yield and growth management [1]. However, the over-reliance on chemical herbicides has had detrimental impacts in and beyond the field. Widespread herbicide application can lead to biodiversity loss by affecting non-target plant species and reducing habitat availability for other organisms [2]. This loss of biodiversity adversely impacts soil microbiota by disrupting microbial community structure and function, which are essential for nutrient cycling, soil fertility, and plant health [3]. Reduced microbial diversity can damage soil resilience, making ecosystems more susceptible to diseases and less capable of recovering from disturbances [4]. Increased weed diversity can counteract some of the negative effects of dominant weed species, promoting a more balanced ecosystem while maintaining crop productivity [5]. Furthermore, the emergence of herbicide-resistant weed species is increasingly problematic, prompting the need for more sustainable weed management solutions [6,7].
In response to these challenges, site-specific weed management (SSWM) offers a promising alternative. This technique focuses on selectively targeting weed species that pose the most significant threat to crops while minimizing the impact on non-target plants. SSWM reduces the overall amount of herbicide applied, thus helping to preserve biodiversity in agricultural fields [8]. By concentrating treatment on specific areas, farmers can lower their reliance on herbicides and address the growing issue of herbicide-resistant weeds, while also reducing economic costs related to chemical applications. Moreover, SSWM aligns with broader sustainability initiatives by integrating advanced precision technologies to optimize herbicide application, reduce environmental impacts, and enhance resource use efficiency. This approach not only promotes the preservation of non-target plant species but also supports ecosystem health and long-term agricultural sustainability, offering an environmentally friendly and economically viable solution for weed management [9].
The rise of neural networks, particularly deep learning, has revolutionized image processing. Specifically, deep learning has advanced segmentation in agricultural scenes. For instance, satellite imagery is increasingly used for large-scale agricultural monitoring, allowing the assessment of crop vigor, detection of disease outbreaks, and determination of irrigation needs across expansive areas [10,11,12]. Additionally, drones and aerial platforms are now capable of monitoring high-density crops, enabling large-scale and real-time analysis of crop health and weed distribution [13,14,15]. Furthermore, ground-based autonomous robots equipped with computer vision systems can navigate crop fields to perform detailed soil and plant assessments, providing high-resolution data on plant health and soil conditions [16,17].
By integrating computer vision techniques with real-time data from these diverse platforms, deep learning allows for automated weed detection, reducing the reliance on traditional manual methods. However, a significant challenge remains: deep learning-based systems require large annotated datasets to perform effectively. These datasets, which must include numerous images labeled using expert agronomic knowledge, are time-consuming and labor-intensive to create [18]. Furthermore, applying deep learning to large-scale agricultural datasets inherently increases the number of model parameters, leading to higher demands on computational resources. These increased requirements not only complicate the training process but also pose challenges for inference and edge computing applications, where limited computational resources are often available. Consequently, the necessity for substantial computational power and optimized memory management strategies creates a bottleneck for scaling deep learning applications in precision weeding.
Data synthesis offers a viable solution to reduce the burden of annotating datasets. Synthetic data generation allows for the creation of artificial datasets that simulate real-world conditions, reducing the need for manual annotation. Techniques such as generative adversarial networks (GANs) [19] and diffusion-based methods [20,21] are increasingly used to generate high-quality synthetic images, enabling deep learning models to be trained more efficiently with fewer real-world samples. This approach not only mitigates the issue of data scarcity, but also improves the model’s ability to generalize across diverse field conditions, leading to more scalable and cost-effective applications of deep learning in SSWM.
Early efforts in data augmentation relied on traditional methods, such as Skovsen et al. [22], who simulated clover-grass field scenes by overlaying segmented clovers, grasses, and weeds on soil images. Similarly, Toda et al. [23] created synthetic datasets for crop seed phenotyping by randomly arranging barley seeds on virtual canvases. These approaches allowed for more accurate estimation of botanical composition and seed segmentation, but were limited by their inability to fully replicate natural variability, such as lighting conditions, textures, and plant–environment interactions. Sapkota et al. [24] addressed the data scarcity problem in training deep learning models for weed detection by generating synthetic images using plant instances clipped from UAV-borne real images, demonstrating the effectiveness of synthetic datasets for improving segmentation performance.
Expanding beyond early applications, GANs emerged as a transformative tool to enhance the fidelity and diversity of synthetic data. Valerio Giuffrida et al. [25] and Zhu et al. [26] demonstrated the potential of conditional GANs (cGANs) by generating Arabidopsis plant images with specified leaf counts, significantly reducing counting errors in phenotyping tasks. Madsen et al. [27,28] and Li et al. [29] further advanced this field by developing GAN architectures capable of producing high-fidelity images of multiple plant species, improving multiclass classification performance.
Building on these foundations, advances in generative techniques have focused on addressing more complex challenges in agricultural data synthesis. For example, Fawakherji et al. [30] proposed a cGAN framework to replace real plants with synthetic ones in agricultural scenes, reducing the reliance on manual annotations. Similarly, Fawakherji et al. [31] introduced Shape and Style GANs for multispectral crop and weed segmentation, focusing on generating synthetic images that replicate both the geometry and visual style of plants. Picon et al. [32] combined real field imagery with synthetic images to effectively distinguish multiple crop and weed species, thereby improving prediction performance. Meanwhile, Modak and Stein [33] integrated foundation models such as the Segment Anything Model [34] and Stable Diffusion [35] into a synthetic image generation pipeline, enhancing weed detection accuracy and enabling zero-shot transfer to new domains. Likewise, Chen et al. [36] leveraged both GANs and diffusion models to automatically expand the diversity of weed images, substantially improving classification and segmentation performance across various deep learning models.
Despite these strides in enhancing data fidelity and diversity, many approaches remain constrained in their ability to dynamically modify entire scenes, often improving foreground elements while keeping the background and layout static. Moreover, the effectiveness of data augmentation through synthetic data depends not only on fidelity and diversity but also on data volume. However, most studies on generative models in agriculture have primarily focused on improving the quality and diversity of synthetic data, with limited attention to scaling data volume.
Therefore, this study explores the use of synthetic datasets to improve semantic segmentation models in natural field scenes. The main contributions of this study are summarized as follows:
- The proposed patch-level synthetic data generation pipeline enhances semantic segmentation performance in natural agricultural scenes. This pipeline generates realistic synthetic field scenes by pasting patches of foreground (plants) directly onto background (soil), ensuring diverse and contextually accurate training samples for improved model generalization.
- A detailed investigation was conducted to quantify the impact of data augmentation by the proposed pipeline on segmentation performance. By varying the scale of synthetic data generated by the pipeline, we analyzed how increasing dataset size influences segmentation performance.
2. Materials and Methods
2.1. Original Baseline Dataset
The experiments in this study are based on the publicly available WE3DS dataset [37]. This dataset contains a total of 2568 RGB-D images captured under natural light conditions in Austria. It features seven crop species and ten weed species in their early growth stages. The images include high-resolution (1600 × 1140 pixels) ground-truth masks that segment soil, crop, and weed instances.
Figure 1 illustrates an example of the WE3DS dataset. The left subfigure shows an RGB field image captured under natural light conditions, depicting multiple plant instances. The right subfigure displays the corresponding semantic segmentation mask, where different colors represent different plant species or soil. These annotations enable precise evaluation of segmentation models by distinguishing between soil, crop, and weed instances.
Additionally, Table 1 provides detailed information on the crop and weed species included in the dataset, along with their EPPO codes and class labels (crop or weed). This dataset serves as the foundation for both baseline comparisons and the evaluation of synthetic dataset generation methods, ensuring consistency and robustness in the experimental design.
Initially, we adopted the same dataset partitioning strategy as outlined in the original WE3DS study, dividing the 2568 images into 1540 training images and 1028 testing images using the provided train and test txt files. All training and testing were performed using only the RGB channels of the images, while depth information was excluded to establish a fair comparison.
2.2. Traditional Augmentation Dataset
To provide a comprehensive comparison with the following synthetic dataset approach, we also explored the use of traditional data augmentation methods. Specifically, we selected three representative augmentation strategies: Random Horizontal Flip, Random Resized Crop, and Random Brightness Contrast. These strategies were chosen for their widespread use and effectiveness in enhancing the diversity of training data without the need for generating entirely new synthetic images.
We implemented three distinct combinations of these augmentation strategies to evaluate their impact on model performance:
A: Utilized all three augmentation techniques—Random Horizontal Flip, Random Resized Crop, and Random Brightness Contrast—to maximize data variability.
B: Excluded color-based augmentations, employing only Random Horizontal Flip and Random Resized Crop to focus on geometric transformations.
C: Applied only the Random Horizontal Flip strategy to assess the impact of minimal augmentation.
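A minimal sketch of these three configurations, assuming the Albumentations library is used for paired image/mask augmentation (the paper does not name a specific library; the parameter values shown are illustrative, and newer Albumentations versions use size=(h, w) in place of height/width):

```python
import albumentations as A

# Configuration A: geometric + photometric augmentations (illustrative parameters).
aug_a = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.RandomResizedCrop(height=480, width=640, scale=(0.5, 1.0), p=1.0),
    A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.2, p=0.5),
])

# Configuration B: geometric transformations only.
aug_b = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.RandomResizedCrop(height=480, width=640, scale=(0.5, 1.0), p=1.0),
])

# Configuration C: minimal augmentation.
aug_c = A.Compose([A.HorizontalFlip(p=0.5)])

# Albumentations applies the same spatial transform to image and mask:
# out = aug_a(image=image, mask=mask); image, mask = out["image"], out["mask"]
```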
2.3. Pixel-Level Synthetic Dataset
To generate a pixel-level synthetic dataset, we adopted a method based on the approach proposed by Dyrmann et al. [38]. In this method, individual plants are extracted at the pixel level using their precise polygon annotations, resulting in a detailed segmentation of the plant foreground. These pixel-level segmented foregrounds are then used in combination with soil backgrounds to create synthetic field scenes.
This approach relies on two primary components: the foreground pool and the background pool. The foreground pool consists of segmented plant instances, which were extracted from the baseline dataset at the pixel level using polygon annotations. These extracted objects serve as the main elements of the synthetic dataset. The background pool, on the other hand, comprises soil images cropped from the baseline dataset, which provide realistic context for the synthetic field scenes. Together, these pools form the basis for generating pixel-level synthetic images.
2.3.1. Foreground Pool Construction
To construct the foreground pool for pixel-level synthetic datasets, individual plant instances were segmented directly using the provided mask annotations from the original WE3DS baseline dataset. This process involved extracting plants with precise polygon annotations, ensuring accurate separation of the foreground from the background.
A summary of the number of instances for each species in the dataset is presented in Table 2, where, for convenience, the EPPO code for each plant species is provided as a reference. Additionally, in the crop/weed column, C denotes crop species and W denotes weed species.
2.3.2. Background Pool Construction
To construct the background pool for pixel-level synthetic datasets, the largest square cropping method was employed. In this method, the largest possible square patches of soil were cropped from the training dataset, resulting in 1540 background images. However, the sizes of these largest square patches varied greatly, and, in some cases, the aspect ratios were highly extreme, with the length greatly exceeding the width or vice versa. This variability posed challenges when processing these patches during network training, as resizing them to a fixed resolution (e.g., 640 × 480) introduced distortions in the object proportions.
To provide a basis for comparison, we also prepared backgrounds using the composite background construction method. In this approach, smaller cropped patches were randomly selected and combined into a single composite background that matched the original resolution (1600 × 1140). This method preserved the aspect ratios of the individual patches and increased variability in the background pool by introducing new combinations of smaller patches. The two methods were compared to evaluate their effectiveness in generating synthetic datasets for improving segmentation performance.
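A minimal sketch of the composite background construction, assuming the cropped soil patches are available as NumPy arrays; the tiling logic below is an illustrative assumption rather than the exact implementation:

```python
import random
import numpy as np

def composite_background(soil_patches, canvas_hw=(1140, 1600), rng=random):
    """Tile randomly chosen soil patches onto a canvas of the original image size.

    Patches are placed at their native size (aspect ratios preserved) and
    clipped at the canvas border, increasing background variability.
    """
    H, W = canvas_hw
    canvas = np.zeros((H, W, 3), dtype=np.uint8)
    y = 0
    while y < H:
        x, row_h = 0, 0
        while x < W:
            patch = rng.choice(soil_patches)
            h, w = min(patch.shape[0], H - y), min(patch.shape[1], W - x)
            canvas[y:y + h, x:x + w] = patch[:h, :w]
            x += w
            row_h = max(row_h, h)
        y += row_h
    return canvas
```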
2.3.3. Data Synthesis Strategy
The data synthesis strategy used in this study was inspired by the method proposed by Skovsen et al. [39], which involves combining foreground objects with background images to create synthetic datasets for segmentation tasks. In our approach, synthetic images were generated through the following steps:
1. A background image was randomly selected from the background pool. This background served as the canvas for placing foreground objects.
2. A random number of crop instances (1 to 3) and weed instances (1 to 5) were selected from the foreground pool and pasted onto the background at random coordinates. This ensured diversity in the synthetic images and reflected the natural variation observed in real-world scenes.
3. Pixel-level annotations corresponding to the placement of each object were generated to create paired segmentation data for training.
The choice of instance numbers for crops (1 to 3) and weeds (1 to 5) was based on the distribution observed in the original WE3DS dataset, ensuring that the synthetic data mirrored the proportions found in the real-world dataset.
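A minimal sketch of this synthesis strategy, assuming each foreground entry stores an RGB crop, its binary alpha mask, and a class id (the helper names and data layout are illustrative):

```python
import random
import numpy as np

def synthesize_scene(background, crop_pool, weed_pool, rng=random):
    """Paste 1-3 crop and 1-5 weed instances onto a soil background and
    return the image together with its paired semantic label mask."""
    img = background.copy()
    label = np.zeros(img.shape[:2], dtype=np.uint8)            # 0 = soil
    instances = (rng.sample(crop_pool, rng.randint(1, 3)) +
                 rng.sample(weed_pool, rng.randint(1, 5)))
    for rgb, alpha, class_id in instances:                      # alpha: HxW boolean mask
        h, w = alpha.shape
        y = rng.randint(0, img.shape[0] - h)
        x = rng.randint(0, img.shape[1] - w)
        region = img[y:y + h, x:x + w]
        region[alpha] = rgb[alpha]                               # pixel-level paste
        label[y:y + h, x:x + w][alpha] = class_id
    return img, label
```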
Five different scales—1×, 5×, 10×, 15×, and 20× the size of the real training dataset (1540 images)—were generated to provide a broad range for evaluating the impact of data volume on segmentation performance. (Here, the multiplier indicates how many times larger the synthetic set is than the original training dataset; for example, 1× corresponds to 1540 synthetic images, while 5× corresponds to 7700 synthetic images.)
Figure 2 illustrates an example of a pixel-level synthetic dataset generated using the largest possible square patches of soil method. The left image shows the generated synthetic field scene, while the right image displays its corresponding pixel-level semantic segmentation mask. Different colors in the mask represent different plant species.
2.4. Patch-Level Synthetic Dataset
Figure 3 shows our proposed patch-level data synthesis pipeline. The process begins with the original real dataset, from which foreground objects (e.g., crops and weeds) and background soil patches are extracted to form a foreground pool and a background pool, respectively. These patches are then combined according to a specific placement logic to generate a patch-based synthetic dataset, which can then be merged with the original real dataset to create a hybrid dataset. Our pipeline can improve species-level segmentation of crops and weeds in agricultural scenes by enriching the original real dataset with synthetic samples.
The patch-level synthetic dataset builds upon the pixel-level approach with a key modification to the foreground extraction process. Instead of segmenting plant instances at the pixel level using polygon annotations, this method extracts rectangular patches that include both the plant and a portion of its immediate background. These patches are defined by the bounding boxes of the polygon annotations, preserving contextual information around the plants and offering a simpler yet effective alternative to precise segmentation.
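A minimal sketch of the patch-level foreground extraction, assuming binary instance masks are available; the bounding box is taken over the instance mask so the cropped patch retains the surrounding soil context (at synthesis time the whole rectangle is pasted rather than only the masked pixels):

```python
import numpy as np

def extract_patch(image, instance_mask, label_mask):
    """Crop the bounding-box patch of one plant instance (plant plus nearby soil)
    together with the matching slice of the semantic label mask."""
    ys, xs = np.where(instance_mask)
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1
    return image[y0:y1, x0:x1], label_mask[y0:y1, x0:x1]
```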
The background pool construction for the patch-level dataset follows the same procedure as the pixel-level dataset. Backgrounds were prepared using the largest square cropping method and the composite background construction method to address challenges related to size and aspect ratio variability. Both methods provide a basis for creating synthetic field scenes.
The data synthesis strategy for the patch-level dataset is identical to the pixel-level approach. Backgrounds from the background pool were combined with a random number of crop instances (1 to 3) and weed instances (1 to 5) selected from the patch-level foreground pool. Placement coordinates were generated randomly, and pixel-level annotations corresponding to the placement of each object were created to ensure paired segmentation data. Datasets were generated at the same scales of 1×, 5×, 10×, 15×, and 20×, where the multiplier denotes the number of synthetic images relative to the real training dataset (e.g., 1× corresponds to 1540 images, while 5× corresponds to 7700 images).
Figure 4 illustrates an example of a patch-level synthetic dataset. The left subfigure (a) shows the generated synthetic field scene created using rectangular patches of plants, while the right subfigure (b) displays the corresponding pixel-level semantic segmentation mask. Different colors in the mask represent various plant species, and this example highlights the inclusion of contextual background information around the plants.
2.5. Hybrid Dataset
To determine the optimal combination of real and synthetic data for segmentation tasks, we created hybrid datasets by mixing the original baseline dataset (1540 images) with synthetic datasets.
Specifically, hybrid datasets were constructed by combining the full real training dataset with synthetic data at ratios ranging from 1:1 to 1:20. For example, a 1:1 hybrid included 1540 real images and 1540 synthetic images, while a 1:5 hybrid comprised 1540 real images and 7700 synthetic images. By systematically varying these ratios, we investigated how different hybrid configurations influence segmentation performance and computational efficiency.
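A minimal sketch of assembling a hybrid dataset, assuming the real and synthetic sets are already wrapped as PyTorch Dataset objects (real_ds and synth_sets are placeholder names):

```python
from torch.utils.data import ConcatDataset, DataLoader

def make_hybrid(real_ds, synth_sets, ratio):
    """Combine the full real training set with `ratio` x 1540 synthetic images.

    `synth_sets` is assumed to be a list of synthetic subsets of 1540 images each.
    """
    return ConcatDataset([real_ds] + synth_sets[:ratio])

# e.g., a 1:5 hybrid = 1540 real + 7700 synthetic images:
# loader = DataLoader(make_hybrid(real_ds, synth_sets, ratio=5),
#                     batch_size=8, shuffle=True)
```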
2.6. Fine-Tuning on Real Data
To provide a comprehensive comparison, we also implemented a fine-tuning strategy, a commonly adopted method where models are pretrained on synthetic data and subsequently fine-tuned on real data.
Specifically, models were first trained solely on synthetic datasets with data volumes varying from 1× to 20×, as introduced above. Following this, the models were fine-tuned on the original baseline dataset using a reduced learning rate to adapt to the real data distribution.
2.7. Evaluation
Our task aimed to compare the performance of plant species semantic segmentation models for three types of datasets: the WE3DS dataset (real data), fully synthetic datasets, and hybrid datasets (combinations of real and synthetic data).
For evaluation, we used the pre-defined test split of the WE3DS dataset, which accounts for 40% of the total dataset, comprising 1028 images. This ensures consistency across evaluations and provides a robust benchmark for assessing model generalization.
2.7.1. Data Preprocessing
To ensure a fair comparison with the original WE3DS results, we adopted the preprocessing strategy outlined in the WE3DS paper, applying it uniformly to real, synthetic, and hybrid datasets without any additional data augmentation. Specifically, all images were resized to 640 × 480 pixels following the WE3DS preprocessing guidelines. Pixel values were then normalized using ImageNet statistics (channel-wise mean and standard deviation) to match the input requirements of the ResNet-50 backbone [40].
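A sketch of this preprocessing pipeline, again assuming Albumentations; the normalization constants are the standard ImageNet channel statistics:

```python
import albumentations as A
from albumentations.pytorch import ToTensorV2

# Resize to 640 x 480 and normalize with ImageNet statistics; no augmentation is applied.
preprocess = A.Compose([
    A.Resize(height=480, width=640),
    A.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
    ToTensorV2(),
])
# sample = preprocess(image=rgb_image, mask=label_mask)
```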
2.7.2. Metrics
In semantic segmentation, the objective is to classify each pixel into a specific class, such as soil or, in the case of the WE3DS dataset, one of seven crop species or ten weed species. The output of this task is a color-coded semantic segmentation map that shows the composition and location of the various species within an image.
To evaluate model performance, we use the intersection over union (IoU) metric. IoU measures the overlap between the predicted segmentation and the ground truth for each class and is defined as

$$\mathrm{IoU}_c = \frac{TP_c}{TP_c + FP_c + FN_c},$$

where c represents a class, and TP_c, FP_c, and FN_c represent the true positives, false positives, and false negatives for class c, respectively.
To measure the overall segmentation performance across all classes, the mean intersection over union (mIoU) is calculated as

$$\mathrm{mIoU} = \frac{1}{|C|} \sum_{c \in C} \mathrm{IoU}_c,$$

where C is the set of all classes. For this study, C includes 1 soil class, 7 crop species, and 10 weed species.
However, since the IoU for the soil class is consistently high across models due to its large area and distinct features, it can disproportionately influence the mIoU calculation. To provide a more precise assessment of the models’ performance on plant species, we also calculate the mIoU excluding the soil class (denoted mIoU_ns):

$$\mathrm{mIoU}_{ns} = \frac{1}{|C_{ns}|} \sum_{c \in C_{ns}} \mathrm{IoU}_c,$$

where C_ns is the set of all classes excluding the soil class, comprising 7 crop species and 10 weed species (a total of 17 classes).
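The three metrics can be computed from a per-class confusion matrix as in the following sketch, assuming class index 0 is soil:

```python
import numpy as np

def iou_per_class(conf):
    """conf[i, j] counts pixels of true class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    return tp / (tp + fp + fn + 1e-12)

def miou(conf, exclude_soil=False, soil_index=0):
    iou = iou_per_class(conf)
    if exclude_soil:
        iou = np.delete(iou, soil_index)   # keep the 17 plant classes only
    return iou.mean()
```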
The time efficiency of synthetic dataset augmentation was evaluated using the performance efficiency index (PEI). This metric combines both performance improvement and computational cost into a single evaluation framework. Inspired by efficiency metrics used in various domains, PEI is defined as

$$\mathrm{PEI} = \frac{\Delta \mathrm{mIoU}_{ns} / \mathrm{mIoU}_{ns}^{\mathrm{base}}}{\Delta T / T_{\mathrm{base}}},$$

where ΔmIoU_ns represents the improvement in mIoU_ns over the baseline performance, mIoU_ns^base denotes the mIoU_ns achieved by the model trained on the baseline dataset, ΔT indicates the difference in training time compared to the baseline training time, and T_base refers to the training time required for the baseline model.
This proportional approach evaluates how efficiently a given augmentation strategy improves segmentation performance relative to the additional computational cost.
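Under the definition reconstructed above, PEI can be computed as follows (variable names are illustrative):

```python
def pei(miou_ns, miou_ns_base, train_time, train_time_base):
    """Relative mIoU_ns gain per unit of relative additional training time."""
    perf_gain = (miou_ns - miou_ns_base) / miou_ns_base
    time_cost = (train_time - train_time_base) / train_time_base
    return perf_gain / time_cost
```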
2.7.3. Semantic Segmentation Model Selection
Our research primarily focuses on improving model performance within the same architecture rather than tweaking the architecture to optimize performance metrics. We consider fine-tuning the model architecture to be a downstream step that is often domain specific. However, we believe that selecting a model with stronger baseline performance better demonstrates the potential of our proposed method. Thus, we conducted comparative experiments using several neural network-based segmentation models, including UNet [41], UNet++ [42], Xception-style UNet, and DeepLab v3+ [43]. Notably, all models employed the same ResNet-50 [40] backbone pretrained on ImageNet [44] to maintain consistency across experiments. Additionally, the output dimension of the final Conv2d layer was modified to 18, corresponding to the categories in this study: 1 soil class, 7 crop species, and 10 weed species.
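One way to instantiate these models with a shared ImageNet-pretrained ResNet-50 backbone and an 18-class output is the segmentation_models_pytorch package; the paper does not state which implementation was used, so this is an illustrative assumption (the Xception-style UNet would require a custom implementation and is omitted here):

```python
import segmentation_models_pytorch as smp

common = dict(encoder_name="resnet50", encoder_weights="imagenet",
              in_channels=3, classes=18)      # 1 soil + 7 crop + 10 weed classes

models = {
    "UNet": smp.Unet(**common),
    "UNet++": smp.UnetPlusPlus(**common),
    "DeepLab v3+": smp.DeepLabV3Plus(**common),
}
```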
Table 3 shows the performance comparison on the original baseline dataset, which revealed that DeepLab v3+ achieved the highest mIoU and mIoU_ns, outperforming the others. As a result, DeepLab v3+ was selected as the evaluation model for all subsequent experiments.
We chose to focus on neural network-based models rather than Transformer-based models such as Vision Transformers [45] and Swin Transformers [46]. Transformer-based models typically have a larger number of parameters and require more extensive training datasets, leading to higher data collection, annotation, and computational resource demands. These requirements make them less suitable for our current objectives, which prioritize computational efficiency with a minimal amount of labeled data. However, we believe that our methodology can be extended to enhance the performance of Transformer-based models in future research.
3. Results
All experiments were conducted using a single NVIDIA RTX 3090 GPU with 24 GB of memory. The models were trained with a batch size of 8 for 100 epochs using the Adam optimizer, with β1 and β2 set to 0.9 and 0.999, respectively. The initial learning rate was set to 0.001 and remained constant throughout the training process. This configuration ensured efficient training while maintaining consistency across all datasets.
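A minimal sketch of this training configuration, assuming a standard PyTorch loop (the model and data loader are supplied by the caller):

```python
import torch

def train(model, loader, epochs=100, lr=1e-3, device="cuda"):
    """Train with Adam (betas 0.9/0.999), a constant learning rate, and cross-entropy loss."""
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, betas=(0.9, 0.999))
    criterion = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, masks in loader:           # batch size 8 in this study
            images, masks = images.to(device), masks.to(device).long()
            optimizer.zero_grad()
            loss = criterion(model(images), masks)
            loss.backward()
            optimizer.step()
    return model
```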
For the fine-tuning experiment, models pretrained on synthetic datasets with varying data volumes were fine-tuned on the original baseline dataset using a reduced learning rate of 0.0005. The fine-tuning phase was limited to 20 epochs, while all other parameters, including batch size and optimizer settings, remained unchanged.
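The fine-tuning strategy then reuses the same loop with a reduced learning rate and fewer epochs; synthetic_loader and real_loader are placeholders for the corresponding data loaders:

```python
# Pretrain on synthetic data, then fine-tune on the real baseline dataset.
model = train(model, synthetic_loader, epochs=100, lr=1e-3)
model = train(model, real_loader, epochs=20, lr=5e-4)
```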
3.1. Baseline Model Performance on Original Baseline Dataset
To establish a baseline for comparison, we trained a DeepLab v3+ model using the unmodified real training dataset (1540 images) without any data augmentation. The model achieved an mIoU of approximately 0.62 on the real test dataset, providing a benchmark against which the synthetic datasets were evaluated.
3.2. Performance on Traditional Augmentation Dataset
To evaluate traditional data augmentation methods and compare them with the synthetic data-based methods, we implemented three different configurations to assess their impact on segmentation performance, as introduced in Section 2.2.
The results, shown in Table 4, indicate that the models trained without any augmentation achieved the highest scores of 0.646 and 0.626 for mIoU and mIoU_ns, respectively.
This counterintuitive result may be attributed to several factors. First, the introduced augmentations could have led to over-transformation of the data, causing the models to learn irrelevant features rather than the intrinsic characteristics of the plants. Additionally, the limited diversity introduced by the chosen augmentation strategies may not have been sufficient to generalize effectively across different real-world scenarios.
Moreover, finding the optimal combination and parameters for data augmentation can be time-consuming and computationally expensive. Improperly tuned augmentations may introduce noise or artifacts that negatively impact model learning.
Consequently, for the subsequent experiments, we adopted the version without augmentation as the baseline to ensure a fair and consistent comparison with our synthetic dataset approach.
3.3. Comparison of Foreground Construction Methods
To compare the effectiveness of pixel-level and patch-level synthetic datasets, we generated datasets with the same number of images as the original baseline dataset (1540 images). Using identical training protocols, we trained DeepLab v3+ models on
1. Pixel-level synthetic datasets: These were generated by pasting plant instances extracted at the pixel level onto randomly selected backgrounds.
2. Patch-level synthetic datasets: These were generated by pasting plant patches (foreground plus surrounding background) onto backgrounds.
Table 5 presents the mIoU performance of the DeepLab v3+ model trained on three different datasets: the original baseline dataset, pixel-level synthetic data, and patch-level synthetic data. The results are shown for both training and testing scenarios, with and without soil in the input data.
The baseline model trained on the original baseline dataset achieved the highest mIoU scores, with 0.79 and 0.65 for training and testing (with soil), respectively. In comparison, the model trained on pixel-level synthetic data exhibited significantly lower performance, particularly in the testing phase, with mIoU scores of 0.41 (with soil) and 0.38 (no soil). Notably, the patch-level synthetic data showed marked improvements over the pixel-level approach, achieving test mIoU scores of 0.50 (with soil) and 0.47 (no soil). This highlights the effectiveness of the patch-level generation method, which better preserves contextual information around objects, resulting in enhanced model generalization.
Overall, the results emphasize the limitations of pixel-level synthetic data and the importance of incorporating additional context through patch-level methods to improve segmentation performance. Pixel-level segmentation methods tend to introduce edge artifacts, leading to pixel errors along the boundaries of plant instances. These artifacts are closely related to the precision of the polygon annotations used during extraction. Additionally, pixel-level approaches often result in the loss of foreground shadows, which are critical for accurate feature representation. These issues can mislead the segmentation network, causing it to rely on these irrelevant or distorted features rather than the characteristics of the plants, thereby reducing model accuracy. In contrast, patch-level methods effectively preserve shadows and minimize edge artifacts by maintaining the contextual integrity of the plant instances within their surrounding environment. Although bounding box edges may introduce pixel-value discontinuities, the impact of these discontinuities is mitigated through the network’s pooling and convolution layers, which help smooth out these irregularities. Consequently, patch-level approaches enhance data diversity while maintaining high levels of realism, leading to more accurate and reliable segmentation performance.
3.4. Comparison of Background Construction Methods
To evaluate the effectiveness of different background construction methods for patch-level synthetic datasets, we compared the performance of DeepLab v3+ models trained on datasets generated using the largest square cropping method and the composite background construction method. For this comparison, synthetic datasets were generated at varying scales: 1×, 5×, 10×, 15×, and 20× the size of the real training dataset (1540 images per scale). This scaling provided a consistent basis for examining the impact of dataset construction on segmentation performance. The results, measured as mIoU on the real test dataset, are presented in Figure 5.
Figure 5 highlights the comparative performance of the two background construction methods across different dataset scales. The mIoU of the original baseline dataset (0.626) is included as a reference. The results demonstrate that the largest square cropping method consistently outperformed the composite background construction method across all dataset scales. At smaller scales, the largest square cropping method achieved notably higher mIoU values (0.472 vs. 0.440 at the 1× scale and 0.610 vs. 0.575 at the 5× scale). This trend continued at larger scales, with the largest square cropping method reaching 0.633 at the 20× scale compared to 0.594 for the composite background construction method.
Both methods showed an improvement in mIoU as dataset size increased, but the largest square cropping method consistently delivered better results. This may be attributed to the introduction of varying aspect ratios in the background regions. This variability, introduced through direct resizing of images with inconsistent dimensions, effectively acts as an implicit form of data augmentation. Such aspect ratio inconsistencies can help the model generalize better by exposing it to a broader range of image proportions, thereby contributing to the observed performance gains.
3.5. Determining the Optimal Ratio for Hybrid Datasets in Semantic Segmentation
The hybrid dataset approach was evaluated to identify the optimal combination of real and synthetic data for plant species semantic segmentation. Models were trained using various hybrid ratios, ranging from 1:1 (1540 real images + 1540 synthetic images) to 1:20 (1540 real images + 30,800 synthetic images). The mIoU and mIoU_ns metrics were used to assess performance, and the results are presented in Table 6.
The results in Figure 6 show that increasing the proportion of synthetic data improves segmentation performance, with mIoU_ns peaking at 0.719 for the 1:15 hybrid ratio. However, the improvement diminishes with larger ratios such as 1:20, where mIoU_ns decreases to 0.704.
Figure 7 provides a complementary perspective by presenting the PEI, which evaluates the balance between performance gains and the training time required. The results suggest that the 1:10 and 1:15 hybrid ratios provide an optimal trade-off, achieving high mIoU_ns with reasonable training times. Ratios larger than 1:15 show a reduced PEI, indicating efficiency loss.
Overall, the analysis highlights the importance of balancing dataset size, segmentation performance, and computational cost. For applications prioritizing efficiency, the 1:10 ratio is recommended, while the 1:15 ratio is more suitable for scenarios demanding peak segmentation performance.
3.6. Fine-Tuning Results and Comparison with Hybrid Dataset
Fine-tuning is another commonly used strategy to bridge the gap between synthetic and real-world data distributions. In this section, fine-tuning experiments were conducted for comparison with the hybrid dataset approach. Specifically, models were pretrained solely on synthetic datasets with varying data volumes and subsequently fine-tuned on the original baseline dataset.
The performance of the fine-tuning approach was evaluated using the mIoU and mIoU_ns metrics. The results are presented in Table 7, which also includes the results from the hybrid dataset method for comparison.
Results indicated that both the fine-tuning and hybrid dataset approaches enhance segmentation performance compared to training on the original baseline dataset alone. Notably, the fine-tuning method achieved a slightly higher mIoU of 0.738 and mIoU_ns of 0.728 at the 20× data volume, outperforming the 1:15 hybrid dataset, which achieved 0.734 and 0.719, respectively. Both methods demonstrated similar training times, indicating that incorporating synthetic data either through hybrid datasets or fine-tuning does not impose additional computational burdens beyond the initial training phase.
The hybrid dataset method offers a straightforward and efficient means of leveraging both real and synthetic data simultaneously during the training process. This approach simplifies the training workflow by eliminating the need for a separate fine-tuning phase, thereby reducing the complexity of the training pipeline. Additionally, the hybrid method effectively balances data diversity and computational efficiency, ensuring that the model benefits from the rich feature representations provided by synthetic data while maintaining robust performance on real-world data.
In contrast, the fine-tuning approach, while achieving slightly higher mIoU scores, requires an additional training phase that involves adjusting the model on real data after initial training on synthetic data. This two-step process may introduce complexities in the training workflow and requires careful management of training parameters to avoid overfitting during fine-tuning. Nevertheless, fine-tuning provides a flexible framework for further enhancing model performance, particularly when more real data becomes available.
Overall, both the hybrid dataset and fine-tuning methods demonstrate significant improvements in segmentation performance by incorporating synthetic data. The choice between these methods depends on the specific requirements and constraints of the application, such as the need for streamlined training processes versus the desire for incremental performance gains through specialized training phases.
4. Discussion
Compared to other studies in agricultural semantic segmentation, our patch-based synthetic data generation method offers distinct advantages. Unlike traditional approaches that rely on collecting extensive real-world datasets, our method leverages synthetic data to augment existing datasets, reducing the need for extensive field data collection. Unlike methods that also utilize synthetic data to enhance segmentation models but rely on additional deep learning models to generate it, our method employs a straightforward “copy–paste” technique. This not only reduces the computational complexity and potential instability introduced by training new models but also achieves comparable or even superior improvements in model performance.
While the proposed patch-based synthetic data generation method shows promise, limitations also exist. Pasting patches onto soil backgrounds does not account for variation in soil conditions, such as wetness, dryness, or changes in color, which can differ significantly across environmental conditions or seasons. Future work can address these limitations by developing methods that dynamically adapt patches to varying soil types, textures, moisture levels, and lighting conditions. One promising strategy is the integration of GANs to simulate soil variability. GANs can be trained to generate realistic soil backgrounds with diverse characteristics, such as different moisture levels, textures, and color variations, thereby enhancing the realism and diversity of the synthetic data.
Additionally, the approach assumes that the same crop or weed species appear consistent across different growth stages, which may not always reflect reality. These factors limit the generalization of synthetic datasets created with this method, making them most effective when applied to specific sites, growth stages, or environmental conditions. This limitation can be addressed by incorporating temporal growth stage data, which captures the progression of plant development over time, including changes in size, shape, color, and density. By integrating temporal growth stage data into the pipeline, synthetic datasets can more accurately represent the dynamic nature of plant growth. Practically, this can be achieved by creating stage-specific patch libraries that reflect different phenological stages and systematically updating these patches as plants mature.
Extending the proposed patch-based synthetic data generation method to other crops or datasets is promising but requires careful consideration of crop characteristics and dataset properties. When multiple instances of the same category overlap in the mask, semantic segmentation cannot distinguish individual plants, leading to patches that contain overlapping individuals or plants from other categories. While this issue has a minor impact on low-density data due to its rarity, it poses significant challenges for real-world scenarios with high planting densities. In high-density environments, patches are more likely to include multiple overlapping individuals or other category plants, which results in a reduced number of usable patches and decreased diversity. Additionally, the presence of multiple categories within a single patch can introduce annotation noise, misleading the model during training and degrading segmentation accuracy. These factors collectively limit the generalizability of synthetic datasets created with the current method when applied to densely planted agricultural fields. To enhance generalizability, future adaptations of the patch-based method should incorporate instance-level annotations and employ advanced techniques to accurately simulate plant overlaps and maintain annotation integrity in high-density settings.
Future work can focus on addressing several key challenges and expanding the potential of synthetic data generation for semantic segmentation. Enhancing the diversity of both foreground and background elements is essential, as current methods primarily focus on layout diversity. Techniques like GAN-based texture and shape augmentation, as well as methods that jointly synthesize foreground and background components, could improve visual realism and eliminate inconsistencies at the boundaries. Additionally, tackling class imbalance, particularly for underrepresented species, remains critical; future efforts could integrate advanced sampling strategies or semi-supervised learning techniques to better handle rare classes. Improving the fidelity and diversity of GAN-generated data through advanced architectures and loss functions will also be crucial. Finally, using emerging generative models, such as diffusion models or text-to-image frameworks, may open new possibilities for generating highly diverse and realistic synthetic datasets. Ensuring the robustness and applicability of these methods in real-world scenarios, with diverse environmental and crop conditions, will be vital for advancing the use of segmentation models in agricultural applications.