A Patch-Level Data Synthesis Pipeline Enhances Species-Level Crop and Weed Segmentation in Natural Agricultural Scenes

Li, Tang; Burridge, James; Blok, Pieter M.; Guo, Wei

doi:10.3390/agriculture15020138

Open AccessArticle

A Patch-Level Data Synthesis Pipeline Enhances Species-Level Crop and Weed Segmentation in Natural Agricultural Scenes

Laboratory of Field Phenomics, Institute for Sustainable Agro-Ecosystem Services, Graduate School of Agricultural and Life Sciences, The University of Tokyo, Tokyo 188-0002, Japan

^*

Author to whom correspondence should be addressed.

Agriculture 2025, 15(2), 138; https://doi.org/10.3390/agriculture15020138

Submission received: 11 December 2024 / Revised: 3 January 2025 / Accepted: 7 January 2025 / Published: 9 January 2025

(This article belongs to the Section Digital Agriculture)

Download

Browse Figures

Versions Notes

Abstract

:

Species-level crop and weed semantic segmentation in agricultural field images enables plant identification and enhanced precision weed management. However, the scarcity of labeled data poses significant challenges for model development. Here, we report a patch-level synthetic data generation pipeline that improves semantic segmentation performance in natural agriculture scenes by creating realistic training samples, achieved by pasting patches of segmented plants onto soil backgrounds. This pipeline effectively preserves foreground context and ensures diverse and accurate samples, thereby enhancing model generalization. The semantic segmentation performance of the baseline model was higher when trained solely on data synthesized by our proposed method compared to training solely on real data, with an approximate increase in the mean intersection over union (mIoU) by approximately 1.1% (from 0.626 to 0.633). Building on this, we created hybrid datasets by combining synthetic and real data and investigated the impact of synthetic data volume. By increasing the number of synthetic images in these hybrid datasets from 1× to 20×, we observed a substantially performance improvement, with mIoU increasing by 15% at 15×. However, the gains diminish beyond this point, with the optimal balance between accuracy and efficiency achieved at 10×. These findings highlight synthetic data as a scalable and effective augmentation strategy for addressing the challenges of limited labeled data in agriculture.

Keywords:

data synthesis; data augmentation; generative model; semantic segmentation; precision weed management

1. Introduction

Chemical herbicides have become a dominant method of agriculture weed control, widely adopted due to their effectiveness and ease of use. These chemicals are applied both pre-emergence and post-emergence to prevent or eliminate weed competition in crops, allowing for better yield and growth management [1]. However, the over-reliance on chemical herbicides has had detrimental impacts in and beyond the field. Widespread herbicide application can lead to biodiversity loss by affecting non-target plant species and reducing habitat availability for other organisms [2]. This loss of biodiversity adversely impacts soil microbiota by disrupting microbial community structure and function, which are essential for nutrient cycling, soil fertility, and plant health [3]. Reduced microbial diversity can damage soil resilience, making ecosystems more susceptible to diseases and less capable of recovering from disturbances [4]. Increased weed diversity can counteract some of the negative effects of dominant weed species, promoting a more balanced ecosystem while maintaining crop productivity [5]. Furthermore, the emergence of herbicide-resistant weed species is increasingly problematic, prompting the need for more sustainable weed management solutions [6,7].

In response to these challenges, site-specific weed management (SSWM) offers a promising alternative. This technique focuses on selectively targeting weed species that pose the most significant threat to crops while minimizing the impact on non-target plants. SSWM reduces the overall amount of herbicide applied, thus helping to preserve biodiversity in agricultural fields [8]. By concentrating treatment on specific areas, farmers can lower their reliance on herbicides and address the growing issue of herbicide-resistant weeds, while also reducing economic costs related to chemical applications. Moreover, SSWM aligns with broader sustainability initiatives by integrating advanced precision technologies to optimize herbicide application, reduce environmental impacts, and enhance resource use efficiency. This approach not only promotes the preservation of non-target plant species but also supports ecosystem health and long-term agricultural sustainability, offering an environmentally friendly and economically viable solution for weed management [9].

The rise of neural networks, particularly deep learning, has revolutionized image processing. Specifically, deep learning has advanced segmentation in agriculture scenes. For instance, satellite imagery is increasingly used for large-scale agricultural monitoring, allowing the assessment of crop vigor, detection of disease outbreaks, and determination of irrigation needs across expansive areas [10,11,12]. Additionally, drones and aerial platforms are now capable of monitoring high-density crops, enabling large-scale and real-time analysis of crop health and weed distribution [13,14,15]. Furthermore, ground-based autonomous robots equipped with computer vision systems can navigate crop fields to perform detailed soil and plant assessments, providing high-resolution data on plant health and soil conditions [16,17].

By integrating computer vision techniques with real-time data from these diverse platforms, deep learning allows for automated weed detection, reducing the reliance on traditional manual methods. However, a significant challenge remains: deep learning-based systems require large annotated datasets to perform effectively. These datasets, which must include numerous images labeled using expert agronomic knowledge, are time-consuming and labor-intensive to create [18]. Furthermore, applying deep learning to large-scale agricultural datasets inherently increases the number of model parameters, leading to higher demands on computational resources. These increased computational resource requirements not only complicate the training process but also pose challenges for inference and edge computing applications, where limited computational resources are often available. Consequently, the necessity for substantial computational power and optimized memory management strategies creates a bottleneck for scaling deep learning applications in precision weeding.

Data synthesis offers a viable solution to reduce the burden of annotating datasets. Synthetic data generation allows for the creation of artificial datasets that simulate real-world conditions, reducing the need for manual annotation. Techniques such as generative adversarial networks (GANs) [19] and diffusion-based methods [20,21] are increasingly used to generate high-quality synthetic images, enabling deep learning models to be trained more efficiently with fewer real-world samples. This approach not only mitigates the issue of data scarcity, but also improves the model’s ability to generalize across diverse field conditions, leading to more scalable and cost-effective applications of deep learning in SSWM.

Early efforts in data augmentation relied on traditional methods, such as Skovsen et al. [22], who simulated clover-grass field scenes by overlaying segmented clovers, grasses, and weeds on soil images. Similarly, Toda et al. [23] created synthetic datasets for crop seed phenotyping by randomly arranging barley seeds on virtual canvases. These approaches allowed for more accurate estimation of botanical composition and seed segmentation, but were limited by their inability to fully replicate natural variability, such as lighting conditions, textures, and plant–environment interactions. Sapkota et al. [24] addressed data scarcity problem in training deep learning models for weed detection by generating synthetic images using plant instances clipped from UAV-borne real images, demonstrating the effectiveness of using synthetic dataset for improving segmentation performance.

Expanding beyond early applications, GANs emerged as a transformative tool to enhance the fidelity and diversity of synthetic data. Valerio Giuffrida et al. [25] and Zhu et al. [26] demonstrated the potential of conditional GANs (cGANs) by generating Arabidopsis plant images with specified leaf counts, significantly reducing counting errors in phenotyping tasks. Madsen et al. [27], Madsen et al. [28], Li et al. [29] further advanced this field by developing GAN architectures capable of producing high-fidelity images of multiple plant species, improving multiclass classification performance.

Building on these foundations, advances in generative techniques have focused on addressing more complex challenges in agricultural data synthesis. For example, Fawakherji et al. [30] proposed a cGAN framework to replace real plants with synthetic ones in agricultural scenes, reducing the reliance on manual annotations. Similarly, Fawakherji et al. [31] introduced Shape and Style GANs for multispectral crop and weed segmentation, focusing on generating synthetic images that replicate both the geometry and visual style of plants. Picon et al. [32] combined real field imagery with synthetic images to effectively distinguish multiple crop and weed species, thereby improving prediction performance. Meanwhile, Modak and Stein [33] integrated foundation models such as the Segment Anything Model [34] and Stable Diffusion [35] into a synthetic image generation pipeline, enhancing weed detection accuracy and enabling zero-shot transfer to new domains. Likewise, Chen et al. [36] leveraged both GANs and diffusion models to automatically expand the diversity of weed images, substantially improving classification and segmentation performance across various deep learning models.

Despite these strides in enhancing data fidelity and diversity, many approaches remain constrained in their ability to dynamically modify entire scenes, often improving foreground elements while keeping background and layout static. Moreover, the performance of data agumentation through synthetic data depends not only on fidelity and diversity but also on data volume. However, most studies on generative models in agriculture have primarily focused on improving the quality and diversity of synthetic data, with limited attention to scaling data volume.

Therefore, this study explores the use of synthetic datasets to improve semantic segmentation models in natural field scenes. The main contributions of this study are summarized as follows:

The proposed patch-level synthetic data generation pipeline enhances semantic segmentation performance in natural agricultural scenes. This pipeline generates realistic synthetic field scenes by pasting patches of foreground (plants) directly onto background (soil), ensuring diverse and contextually accurate training samples for improved model generalization.
A detailed investigation was conducted to quantify the impact of data augmentation by the proposed pipeline on segmentation performance. By varying the scale of synthetic data generated by the pipeline, we analyzed how increasing dataset size influences segmentation performance.

2. Materials and Methods

2.1. Original Baseline Dataset

The experiments in this study are based on the publicly available WE3DS dataset [37]. This dataset contains a total of 2568 RGB-D images captured under natural light conditions in Austria. It features seven crop species and ten weed species in their early growth stages. The images include high-resolution (1600 × 1140 pixels) ground-truth masks that segment soil, crop, and weed instances.

Figure 1 illustrates an example of the WE3DS dataset. The left subfigure shows an RGB field image captured under natural light conditions, depicting multiple plant instances. The right subfigure displays the corresponding semantic segmentation mask, where different colors represent different plant species or soil. These annotations enable precise evaluation of segmentation models by distinguishing between soil, crop, and weed instances.

Additionally, Table 1 provides detailed information on the crop and weed species included in the dataset, along with their EPPO codes and class labels (crop or weed). This dataset serves as the foundation for both baseline comparisons and the evaluation of synthetic dataset generation methods, ensuring consistency and robustness in the experimental design.

Initially, we adopted the same dataset partitioning strategy as outlined in the original WE3DS study, dividing the 2568 images into 1540 training images and 1028 testing images using the provided train and test txt files. All training and testing were performed using only the RGB channels of the images, while depth information was excluded to establish a fair comparison.

2.2. Traditional Augmentation Dataset

To provide a comprehensive comparison with the following synthetic dataset approach, we also explored the use of traditional data augmentation methods. Specifically, we selected three representative augmentation strategies: Random Horizontal Flip, Random Resized Crop, and Random Brightness Contrast. These strategies were chosen for their widespread use and effectiveness in enhancing the diversity of training data without the need for generating entirely new synthetic images.

We implemented three distinct combinations of these augmentation strategies to evaluate their impact on model performance:

A: Utilized all three augmentation techniques—Random Horizontal Flip, Random Resized Crop, and Random Brightness Contrast—to maximize data variability.
B: Excluded color-based augmentations, employing only Random Horizontal Flip and Random Resized Crop to focus on geometric transformations.
C: Applied only the Random Horizontal Flip strategy to assess the impact of minimal augmentation.

2.3. Pixel-Level Synthetic Dataset

To generate a pixel-level synthetic dataset, we adopted a method based on the approach proposed by Dyrmann et al. [38]. In this method, individual plants are extracted at the pixel level using their precise polygon annotations, resulting in a detailed segmentation of the plant foreground. These pixel-level segmented foregrounds are then used in combination with soil backgrounds to create synthetic field scenes.

This approach relies on two primary components: the foreground pool and the background pool. The foreground pool consists of segmented plant instances, which were extracted from the baseline dataset at the pixel level using polygon annotations. These extracted objects serve as the main elements of the synthetic dataset. The background pool, on the other hand, comprises soil images cropped from the baseline dataset, which provide realistic context for the synthetic field scenes. Together, these pools form the basis for generating pixel-level synthetic images.

2.3.1. Foreground Pool Construction

To construct the foreground pool for pixel-level synthetic datasets, individual plant instances were segmented directly using the provided mask annotations from the original WE3DS baseline dataset. This process involved extracting plants with precise polygon annotations, ensuring accurate separation of the foreground from the background.

A summary of the number of instances for each species in the dataset is presented in Table 2, where, for convenience, the EPPO code for each plant species is provided as a reference. Additionally, in the crop/weed column, C denotes crop species, and W denotes weed species.

2.3.2. Background Pool Construction

To construct the background pool for pixel-level synthetic datasets, the largest square cropping method was employed. In this method, the largest possible square patches of soil were cropped from the training dataset, resulting in 1540 background images. However, the sizes of these largest square patches varied greatly, and, in some cases, the aspect ratios were highly extreme, with the length greatly exceeding the width or vice versa. This variability posed challenges when processing these patches during network training, as resizing them to a fixed resolution (e.g., 640 × 480) introduced distortions in the object proportions.

To provide a basis for comparison, we also prepared backgrounds using the composite background construction method. In this approach, smaller cropped patches were randomly selected and combined into a single composite background that matched the original resolution (1600 × 1140). This method preserved the aspect ratios of the individual patches and increased variability in the background pool by introducing new combinations of smaller patches. The two methods were compared to evaluate their effectiveness in generating synthetic datasets for improving segmentation performance.

2.3.3. Data Synthesis Strategy

The data synthesis strategy used in this study was inspired by the method proposed by Skovsen et al. [39], which involves combining foreground objects with background images to create synthetic datasets for segmentation tasks. In our approach, synthetic images were generated through the following steps:

1.: A background image was randomly selected from the background pool. This background served as the canvas for placing foreground objects.
2.: A random number of crop instances (1 to 3) and weed instances (1 to 5) were selected from the foreground pool and pasted onto the background at random coordinates. This ensured diversity in the synthetic images and reflected the natural variation observed in real-world scenes.
3.: Pixel-level annotations corresponding to the placement of each object were generated to create paired segmentation data for training.

The choice of instance numbers for crops (1 to 3) and weeds (1 to 5) was based on the distribution observed in the original WE3DS dataset, ensuring that the synthetic data mirrored the proportions found in the real-world dataset.

Five different scales—1×, 5×, 10×, 15×, and 20× the size of the real training dataset (1540 images)—were generated to provide a broad range for evaluating the impact of data volume on segmentation performance. (Here, x represents the scaling factor, with

x \times

indicating the number of synthetic images generated is x times the size of the original training dataset. For example, 1× corresponds to 1540 synthetic images, while 5× corresponds to 7700 synthetic images.) Figure 2 illustrates an example of a pixel-level synthetic dataset generated using the largest possible square patches of soil method. The left image shows the generated synthetic field scene, while the right image displays its corresponding pixel-level semantic segmentation mask. Different colors in the mask represent different plant species.

2.4. Patch-Level Synthetic Dataset

Figure 3 shows our proposed patch-level data synthesis pipeline. The process begins from the original real dataset, where foreground objects (e.g., crops and weeds) and background soil patches are extracted to form a foreground pool and background pool, respectively. These patches are then combined to generate a patch-based synthetic dataset according specific logic, which can then be merged with the original real dataset to create a hybrid dataset. Our pipeline can improve species-level segmentation of crops and weeds in agricultural scenes by enriching the original real dataset with synthetic samples.

The patch-level synthetic dataset builds upon the pixel-level approach with a key modification to the foreground extraction process. Instead of segmenting plant instances at the pixel level using polygon annotations, this method extracts rectangular patches that include both the plant and a portion of its immediate background. These patches are defined by the bounding boxes of the polygon annotations, preserving contextual information around the plants and offering a simpler yet effective alternative to precise segmentation.

The background pool construction for the patch-level dataset follows the same procedure as the pixel-level dataset. Backgrounds were prepared using the largest square cropping method and the composite background construction method to address challenges related to size and aspect ratio variability. Both methods provide a basis for creating synthetic field scenes.

The data synthesis strategy for the patch-level dataset is identical to the pixel-level approach. Backgrounds from the background pool were combined with a random number of crop instances (1 to 3) and weed instances (1 to 5) selected from the patch-level foreground pool. Placement coordinates were generated randomly, and pixel-level annotations corresponding to the placement of each object were created to ensure paired segmentation data. Datasets were generated at the same scales of

1 \times

,

5 \times

,

10 \times

,

15 \times

, and

20 \times

, where x denotes the number of synthetic images relative to the real training dataset (e.g.,

1 \times

corresponds to 1540 images, while

5 \times

corresponds to 7700 images).

Figure 4 illustrates an example of a patch-level synthetic dataset. The left subfigure (a) shows the generated synthetic field scene created using rectangular patches of plants, while the right subfigure (b) displays the corresponding pixel-level semantic segmentation mask. Different colors in the mask represent various plant species, and this example highlights the inclusion of contextual background information around the plants.

2.5. Hybrid Dataset

To determine the optimal combination of real and synthetic data for segmentation tasks, we created hybrid datasets by mixing the original baseline dataset (1540 images) with synthetic datasets.

To be detailed, hybrid datasets were constructed by combining the full real training dataset with synthetic data at different ratios from

1 \times

to

20 \times

. For example, a

1 \times

hybrid included 1540 real images and 1540 synthetic images, while a

5 \times

hybrid comprised 1540 real images and 7700 synthetic images. By systematically varying these ratios, we investigated how different hybrid configurations influence segmentation performance and computational efficiency.

2.6. Fine-Tuning on Real Data

To provide a comprehensive comparison, we also implemented a fine-tuning strategy, a commonly adopted method where models are pretrained on synthetic data and subsequently fine-tuned on real data.

Specifically, models were first trained on solely synthetic datasets with varying data volumes from

1 \times

to

20 \times

as introduced before. Following this, the models were fine-tuned on the original baseline dataset using a reduced learning rate to adapt to the real data distribution.

2.7. Evaluation

Our task aimed to compare the performance of plant species semantic segmentation models for three types of datasets: the WE3DS dataset (real data), fully synthetic datasets, and hybrid datasets (combinations of real and synthetic data).

For evaluation, we used the pre-defined test split of the WE3DS dataset, which accounts for 40% of the total dataset, comprising 1028 images. This ensures consistency across evaluations and provides a robust benchmark for assessing model generalization.

2.7.1. Data Preprocessing

To ensure a fair comparison with the original WE3DS results, we adopted a consistent preprocessing strategy as outlined in the WE3DS paper, applying it uniformly to real, synthetic, and hybrid datasets without any additional data augmentation. Specifically, all images were resized to 640 × 480 pixels following the WE3DS preprocessing guidelines. Pixel values were then normalized using ImageNet statistics (

m e a n = {0.485, 0.456, 0.406}

and

s t d = {0.229, 0.224, 0.225}

) to match the input requirements of the ResNet-50 backbone [40].

2.7.2. Metrics

In semantic segmentation, the objective is to classify each pixel into a specific class, such as soil, and in the case of WE3DS dataset, one of seven crop species or ten weed species. The output of this task is a color-coded semantic segmentation map that shows the composition and location of various species within an image.

To evaluate model performance, we use the intersection over union (IoU) metric. IoU measures the overlap between the predicted segmentation and the ground truth for each class and is defined as

{IoU}^{c} = \frac{{TP}^{c}}{{TP}^{c} + {FP}^{c} + {FN}^{c}}

(1)

where c represents a class, and

c \in {1, . . ., n,}

.

{TP}^{c}

,

{FP}^{c}

, and

{FN}^{c}

represent true positives, false positives, and false negatives for class c, respectively.

To measure the overall segmentation performance across all classes, the mean intersection over union (mIoU) is calculated as

mIoU = \frac{1}{| C |} \sum_{c \in C} {IoU}^{c}

(2)

where

C

is the set of all classes. For this study,

C

includes 1 soil class, 7 crop species, and 10 weed species.

However, since the IoU for the soil class (

{IoU}^{soil}

) is consistently high across models due to its large area and distinct features, it can disproportionately influence the mIoU calculation. To provide a more precise assessment of the models’ performance on plant species, we also calculate the mIoU excluding the soil class:

{mIoU}_{no - soil} = \frac{1}{| C^{'} |} \sum_{c \in C^{'}} {IoU}^{c}

(3)

where

C^{'}

is the set of all classes excluding the soil class, comprising 7 crop species and 10 weed species (a total of 17 classes).

The time efficiency of synthetic dataset augmentation was evaluated using the performance efficiency index (PEI). This metric combines both performance improvement and computational cost into a single evaluation framework. Inspired by the efficiency metrics used in various domains, PEI is defined as

PEI = \frac{Δ {mIoU}_{no - soil} / {mIoU}_{no - soil}^{baseline}}{Δ T / T^{baseline}}

(4)

Here,

Δ {mIoU}_{no - soil}

represents the improvement in

{mIoU}_{no - soil}

over the baseline performance, while

{Baseline mIoU}_{no - soil}

denotes the

{mIoU}_{no - soil}

achieved by the model trained on the baseline dataset. Similarly,

Δ T

indicates the difference in training time compared to baseline training time, and

Baseline Time

refers to the training time required for the baseline model.

This proportional approach evaluates how efficiently a given augmentation strategy improves segmentation performance relative to the additional computational cost.

2.7.3. Semantic Segmentation Model Selection

Our research primarily focuses on improving model performance within the same architecture rather than tweaking architecture to optimize performance metrics. We consider fine-tuning model architecture to be a downstream step that is often domain specific. However, we believe that selecting a model with stronger baseline performance better demonstrates the potential of our proposed method. Thus, we conducted comparative experiments using several neural network-based segmentation models, including UNet [41], UNet++ [42], Xception-style UNet, and DeepLab v3+ [43]. Notably, all models employed the same ResNet-50 [40] backbone pretrained on ImageNet [44] to maintain consistency across experiments. Additionally, the output dimension of the final Conv2d layer was modified to 18, corresponding to the categories in this study: 1 soil class, 7 crop species, and 10 weed species.

Table 3 shows the performance comparison on the original baseline dataset, which revealed that DeepLab v3+ achieved the highest

mIoU

and

{mIoU}_{no - soil}

, outperforming the others. As a result, DeepLab v3+ was selected as the evaluation model for all the following experiments.

We chose to focus on neural network-based models rather than Transformer-based models such as Vision Transformers [45] and Swin Transformers [46]. Transformer-based models typically have a larger number of parameters and require more extensive training datasets, leading to higher data collection, annotation and computational resource demands. These requirements make them less suitable for our current objectives, which prioritize computational efficiency with a minimum amount of labeled data. However, we believe that our methodology can be extended to enhance the performance of Transformer-based models in future research.

3. Results

All experiments were conducted using a single NVIDIA RTX 3090 GPU with 24 GB of RAM. The models were trained with a batch size of 8 for 100 epochs using the Adam optimizer, with

β_{1}

,

β_{2}

values set to 0.9 and 0.999, respectively. The initial learning rate was set to 0.001 and remained constant throughout the training process. This configuration ensured efficient training while maintaining consistency across all datasets.

For the fine-tuning experiment, models pretrained on synthetic datasets with varying data volumes were fine-tuned on the original baseline dataset using a reduced learning rate of 0.0005. The fine-tuning phase was limited to 20 epochs, while all other parameters, including batch size and optimizer settings, remained unchanged.

3.1. Baseline Model Performance on Original Baseline Dataset

To establish a baseline for comparison, we trained a DeepLab v3+ model using the unmodified real training dataset (1540 images) without any data augmentation. The model achieved an mIoU of approximately 0.62 on the real test dataset, providing a benchmark against which the synthetic datasets were evaluated.

3.2. Performance on Traditional Augmentation Dataset

To evaluate traditional data augmentation methods and compare them with the synthetic data-based methods, we implemented three different configurations to assess their impact on segmentation performance, as introduced in Section 2.2.

The results, illustrated in Table 4, indicate that the models trained without any augmentation achieved the highest mIoU scores of 0.646 and 0.626 for mIoU and

{mIoU}_{no - soil}

, respectively.

This counterintuitive result may be attributed to several factors. First, the introduced augmentations could have led to over-transformation of the data, causing the models to learn irrelevant features rather than the intrinsic characteristics of the plants. Additionally, the limited diversity introduced by the chosen augmentation strategies may not have been sufficient to generalize effectively across different real-world scenarios.

Moreover, finding the optimal combination and parameters for data augmentation can be time-consuming and computationally expensive. Improperly tuned augmentations may introduce noise or artifacts that negatively impact model learning.

Consequently, for the subsequent experiments, we adopted the version without augmentation as the baseline to ensure a fair and consistent comparison with our synthetic dataset approach.

3.3. Comparison of Foreground Construction Methods

To compare the effectiveness of pixel-level and patch-level synthetic datasets, we generated datasets with the same number of images as the original baseline dataset (1540 images). Using identical training protocols, we trained DeepLab v3+ models on

1.: Pixel-level synthetic datasets: These were generated by pasting plant instances extracted at the pixel level onto randomly selected backgrounds.
2.: Patch-level synthetic datasets: These were generated by pasting plant patches (foreground plus surrounding background) onto backgrounds.

Table 5 presents the mIoU performance of the DeepLab v3+ model trained on three different datasets: the original baseline dataset, pixel-level synthetic data, and patch-level synthetic data. The results are shown for both training and testing scenarios, with and without soil in the input data.

The baseline model trained on the original baseline dataset achieved the highest mIoU scores, with 0.79 and 0.65 for training and testing (with soil), respectively. In comparison, the model trained on pixel-level synthetic data exhibited significantly lower performance, particularly in the testing phase, with mIoU scores of 0.41 (with soil) and 0.38 (no soil). Notably, the patch-level synthetic data showed marked improvements over the pixel-level approach, achieving test mIoU scores of 0.50 (with soil) and 0.47 (no soil). This highlights the effectiveness of the patch-level generation method, which better preserves contextual information around objects, resulting in enhanced model generalization.

Overall, the results emphasize the limitations of pixel-level synthetic data and the importance of incorporating additional context through patch-level methods to improve segmentation performance. Pixel-level segmentation methods tend to introduce edge artifacts, leading to pixel errors along the boundaries of plant instances. These artifacts are closely related to the precision of the polygon annotations used during extraction. Additionally, pixel-level approaches often result in the loss of foreground shadows, which are critical for accurate feature representation. These issues can mislead the segmentation network, causing it to rely on these irrelevant or distorted features rather than the characteristics of the plants, thereby reducing model accuracy. In contrast, patch-level methods effectively preserve shadows and minimize edge artifacts by maintaining the contextual integrity of the plant instances within their surrounding environment. Although bounding box edges may introduce pixel-value discontinuities, the impact of these discontinuities is mitigated through the network’s pooling and convolution layers, which help smooth out these irregularities. Consequently, patch-level approaches enhance data diversity while maintaining high levels of realism, leading to more accurate and reliable segmentation performance.

3.4. Comparison of Background Construction Methods

To evaluate the effectiveness of different background construction methods for patch-level synthetic datasets, we compared the performance of DeepLab v3+ models trained on datasets generated using the largest square cropping method and the composite background construction method. For this comparison, synthetic datasets were generated at varying scales: 1×, 5×, 10×, 15×, and 20× the size of the real training dataset (1540 images per scale). This scaling provided a consistent basis for examining the impact of dataset construction on segmentation performance. The results, measured as mIoU on the real test dataset, are presented in Figure 5.

Figure 5 highlights the comparative performance of the two background construction methods across different dataset scales. The mIoU of the original baseline dataset (0.626) is included as a reference. The results demonstrate that the largest square cropping method consistently outperformed the composite background construction method across all dataset scales. At smaller scales (e.g.,

1 \times

and

5 \times

), the largest square cropping method achieved notably higher mIoU values, with 0.440 vs. 0.472 at the

1 \times

scale and 0.575 vs. 0.610 at the

5 \times

scale. This trend continued at larger scales, with the largest square cropping method achieving 0.594 at the 20× scale compared to 0.633 for the composite background construction method.

Both methods showed an improvement in mIoU as dataset size increased, but the largest square cropping method consistently delivered better results. This may be attributed to the introduction of varying aspect ratios in the background regions. This variability, introduced through direct resizing of images with inconsistent dimensions, effectively acts as an implicit form of data augmentation. Such aspect ratio inconsistencies can help the model generalize better by exposing it to a broader range of image proportions, thereby contributing to the observed performance gains.

3.5. Determining the Optimal Ratio for Hybrid Datasets in Semantic Segmentation

The hybrid dataset approach was evaluated to identify the optimal combination of real and synthetic data for plant species semantic segmentation. Models were trained using various hybrid ratios, ranging from 1:1 (1540 real images + 1540 synthetic images) to 1:20 (1540 real images + 30,800 synthetic images). The mIoU and mIoU excluding soil (

{mIoU}_{no - soil}

) metrics were used to assess performance, and the results are presented in Table 6.

The results in Figure 6 show that increasing the proportion of synthetic data improves segmentation performance, with the

{mIoU}_{no - soil}

peaking at 0.719 for the 1:15 hybrid ratio. However, the improvement diminishes with larger ratios such as 1:20, where the

{mIoU}_{no - soil}

decreases to 0.704.

Figure 7 provides a complementary perspective by presenting the PEI (performance efficiency index), which evaluates the balance between performance gains and the training time required. The results suggest that the 1:10 and 1:15 hybrid ratios provide an optimal trade-off, achieving high

{mIoU}_{no - soil}

with reasonable training times. Ratios larger than 1:15 show a reduced PEI, indicating efficiency loss.

Overall, the analysis highlights the importance of balancing dataset size, segmentation performance, and computational cost. For applications prioritizing efficiency, the 1:10 ratio is recommended, while the 1:15 ratio is more suitable for scenarios demanding peak segmentation performance.

3.6. Fine-Tuning Results and Comparison with Hybrid Dataset

The fine-tuning strategy is also a commonly used strategy to bridge the gap between synthetic and real-world data distributions. In this section, experiments of fine-tuning methods were also conducted to compare with the hybrid dataset. Specifically, models were pretrained on solely synthetic datasets with varying data volumes and subsequently fine-tuned on the original baseline dataset.

The performance of the fine-tuning approach was evaluated using

mIoU

and

{mIoU}_{no - soil}

metrics. The results are presented in Table 7, which also includes the results from the hybrid dataset method for comparison.

Results indicated that both the fine-tuning and hybrid dataset approaches enhance segmentation performance compared to training on the original baseline dataset alone. Notably, the fine-tuning method achieved a slightly higher mIoU of 0.738 and

{mIoU}_{no - soil}

of 0.728 at the 20× data volume, outperforming the 1:15 hybrid dataset, which achieved 0.734 and 0.719, respectively. Both methods demonstrated similar training times, indicating that incorporating synthetic data either through hybrid datasets or fine-tuning does not impose additional computational burdens beyond the initial training phase.

The hybrid dataset method offers a straightforward and efficient means of leveraging both real and synthetic data simultaneously during the training process. This approach simplifies the training workflow by eliminating the need for a separate fine-tuning phase, thereby reducing the complexity of the training pipeline. Additionally, the hybrid method effectively balances data diversity and computational efficiency, ensuring that the model benefits from the rich feature representations provided by synthetic data while maintaining robust performance on real-world data.

In contrast, the fine-tuning approach, while achieving slightly higher mIoU scores, requires an additional training phase that involves adjusting the model on real data after initial training on synthetic data. This two-step process may introduce complexities in the training workflow and requires careful management of training parameters to avoid overfitting during fine-tuning. Nevertheless, fine-tuning provides a flexible framework for further enhancing model performance, particularly when more real data becomes available.

Overall, both the hybrid dataset and fine-tuning methods demonstrate significant improvements in segmentation performance by incorporating synthetic data. The choice between these methods depends on the specific requirements and constraints of the application, such as the need for streamlined training processes versus the desire for incremental performance gains through specialized training phases.

4. Discussion

Compared to other studies in agricultural semantic segmentation, our patch-based synthetic data generation method offers distinct advantages. Compared to traditional approaches that rely on collecting extensive real-world datasets, our method leverages synthetic data to augment existing datasets, reducing the need for extensive field data collection. Compared to methods that also utilize synthetic data to enhance segmentation models, these methods often rely on additional deep learning models to generate synthetic data, while our method employs a straightforward “copy–paste” technique. This not only reduces the computational complexity and potential instability introduced by training new models but also achieves comparable or even superior improvements in model performance.

While the proposed patch-based synthetic data generation method shows promise, limitations also exist. Pasting patches onto soil backgrounds does not account for variation in soil conditions, such as wetness, dryness, or changes in color, which can differ significantly across environmental conditions or seasons. Future work can address these limitations by developing methods that dynamically adapt patches to varying soil types, textures, moisture levels, and lighting conditions. One promising strategy is the integration of GANs to simulate soil variability. GANs can be trained to generate realistic soil backgrounds with diverse characteristics, such as different moisture levels, textures, and color variations, thereby enhancing the realism and diversity of the synthetic data.

Additionally, the approach assumes that the same crop or weed species appear consistent across different growth stages, which may not always reflect reality. These factors limit the generalization of synthetic datasets created with this method, making them most effective when applied to specific sites, growth stages, or environmental conditions. This limitation can be addressed by incorporating temporal growth stage data, which captures the progression of plant development over time, including changes in size, shape, color, and density. By integrating temporal growth stage data into the pipeline, synthetic datasets can more accurately represent the dynamic nature of plant growth. Practically, this can be achieved by creating stage-specific patch libraries that reflect different phenological stages and systematically updating these patches as plants mature.

Extending the proposed patch-based synthetic data generation method to other crops or datasets is promising but requires careful consideration of crop characteristics and dataset properties. When multiple instances of the same category overlap in the mask, semantic segmentation cannot distinguish individual plants, leading to patches that contain overlapping individuals or plants from other categories. While this issue has a minor impact on low-density data due to its rarity, it poses significant challenges for real-world scenarios with high planting densities. In high-density environments, patches are more likely to include multiple overlapping individuals or other category plants, which results in a reduced number of usable patches and decreased diversity. Additionally, the presence of multiple categories within a single patch can introduce annotation noise, misleading the model during training and degrading segmentation accuracy. These factors collectively limit the generalizability of synthetic datasets created with the current method when applied to densely planted agricultural fields. To enhance generalizability, future adaptations of the patch-based method should incorporate instance-level annotations and employ advanced techniques to accurately simulate plant overlaps and maintain annotation integrity in high-density settings.

Future work can focus on addressing several key challenges and expanding the potential of synthetic data generation for semantic segmentation. Enhancing the diversity of both foreground and background elements is essential, as current methods primarily focus on layout diversity. Techniques like GAN-based texture and shape augmentation, as well as methods that jointly synthesize foreground and background components, could improve visual realism and eliminate inconsistencies at the boundaries. Additionally, tackling class imbalance, particularly for underrepresented species, remains critical; future efforts could integrate advanced sampling strategies or semi-supervised learning techniques to better handle rare classes. Improving the fidelity and diversity of GAN-generated data through advanced architectures and loss functions will also be crucial. Finally, using emerging generative models, such as diffusion models or text-to-image frameworks, may open new possibilities for generating highly diverse and realistic synthetic datasets. Ensuring the robustness and applicability of these methods in real-world scenarios, with diverse environmental and crop conditions, will be vital for advancing the use of segmentation models in agricultural applications.

5. Conclusions

In this study, we investigated the generation and integration of synthetic datasets to enhance semantic segmentation for agricultural applications. A patch-level synthetic dataset generation pipeline was proposed and outperformed the original baseline dataset in terms of segmentation accuracy. Furthermore, we explored hybrid datasets combining real and synthetic data at varying ratios and identified the optimal balance between performance and computational efficiency. To this end, we proposed the PEI, a novel efficiency metric to quantify the performance gains of adding synthetic data relative to the associated computational costs. The results highlight the potential of leveraging synthetic data, with the best-performing hybrid dataset achieving significant improvements in segmentation metrics compared to real data alone.

In conclusion, our study underscores the potential of synthetic data as a tool in agricultural research, enabling cost-effective and scalable solutions for semantic segmentation tasks. We contributed to advancing synthetic data methodologies by analyzing different foreground and background generation techniques, introducing an efficiency metric to evaluate performance relative to training cost, and investigating the limitations of GAN-based diversity augmentation. These contributions lay a foundation for further innovations in synthetic data generation in diverse environmental conditions and impact on precision weed management in real fields.

Author Contributions

Conceptualization, T.L., J.B., P.M.B. and W.G.; methodology, T.L.; software, T.L.; validation, T.L.; writing—original draft preparation, T.L.; writing—review and editing, T.L., J.B., P.M.B. and W.G.; supervision, W.G.; project administration, W.G.; funding acquisition, W.G. All authors have read and agreed to the published version of the manuscript.

Funding

This study was partially supported by the Kubota-UTokyoLab project and the Sarabetsu Village “Endowed Chair for Field Phenomics” project in Hokkaido, Japan.

Data Availability Statement

The Python code and the data that are necessary to reproduce the results section in this publication are openly available at https://github.com/UTokyo-FieldPhenomics-Lab/PatchSegSynth (accessed on 16 December 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

mIoU	Mean intersection over union
SSWM	Site-specific weed management
GANs	Generative adversarial networks
cGANs	Conditional generative adversarial networks
ASPP	Atrous spatial pyramid pooling
PEI	Performance efficiency index

References

Parven, A.; Meftaul, I.M.; Venkateswarlu, K.; Megharaj, M. Herbicides in modern sustainable agriculture: Environmental fate, ecological implications, and human health concerns. Int. J. Environ. Sci. Technol. 2024, 22, 1181–1202. [Google Scholar] [CrossRef]
Schütte, G.; Eckerstorfer, M.; Rastelli, V.; Reichenbecher, W.; Restrepo-Vassalli, S.; Ruohonen-Lehto, M.; Saucy, A.G.W.; Mertens, M. Herbicide resistance and biodiversity: Agronomic and environmental aspects of genetically modified herbicide-resistant plants. Environ. Sci. Eur. 2017, 29, 5. [Google Scholar] [CrossRef]
Hartmann, M.; Six, J. Soil structure and microbiome functions in agroecosystems. Nat. Rev. Earth Environ. 2023, 4, 4–18. [Google Scholar] [CrossRef]
Wagg, C.; Bender, S.F.; Widmer, F.; Van Der Heijden, M.G. Soil biodiversity and soil community composition determine ecosystem multifunctionality. Proc. Natl. Acad. Sci. USA 2014, 111, 5266–5270. [Google Scholar] [CrossRef] [PubMed]
Adeux, G.; Vieren, E.; Carlesi, S.; Bàrberi, P.; Munier-Jolain, N.; Cordeau, S. Mitigating crop yield losses through weed diversity. Nat. Sustain. 2019, 2, 1018–1026. [Google Scholar] [CrossRef]
Sharma, A.; Kumar, V.; Shahzad, B.; Tanveer, M.; Sidhu, G.P.S.; Handa, N.; Kohli, S.K.; Yadav, P.; Bali, A.S.; Parihar, R.D.; et al. Worldwide pesticide usage and its impacts on ecosystem. SN Appl. Sci. 2019, 1, 1446. [Google Scholar] [CrossRef]
Kraehmer, H.; Laber, B.; Rosinger, C.; Schulz, A. Herbicides as weed control agents: State of the art: I. Weed control research and safener technology: The path to modern agriculture. Plant Physiol. 2014, 166, 1119–1131. [Google Scholar] [CrossRef]
Wiles, L. Beyond patch spraying: Site-specific weed management with several herbicides. Precis. Agric. 2009, 10, 277–290. [Google Scholar] [CrossRef]
Lati, R.N.; Rasmussen, J.; Andujar, D.; Dorado, J.; Berge, T.W.; Wellhausen, C.; Pflanz, M.; Nordmeyer, H.; Schirrmann, M.; Eizenberg, H.; et al. Site-specific weed management—constraints and opportunities for the weed research community: Insights from a workshop. Weed Res. 2021, 61, 147–153. [Google Scholar] [CrossRef]
Nguyen, T.T.; Hoang, T.D.; Pham, M.T.; Vu, T.T.; Nguyen, T.H.; Huynh, Q.T.; Jo, J. Monitoring agriculture areas with satellite images and deep learning. Appl. Soft Comput. 2020, 95, 106565. [Google Scholar] [CrossRef]
Waldner, F.; Diakogiannis, F. Extracting field boundaries from satellite imagery with a convolutional neural network to enable smart farming at scale. In Proceedings of the EGU General Assembly Conference Abstracts, Vienna, Austria, 3–8 May 2020; p. 102. [Google Scholar]
Papadomanolaki, M.; Vakalopoulou, M.; Zagoruyko, S.; Karantzalos, K. Benchmarking deep learning frameworks for the classification of very high resolution satellite multispectral data. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2016, 3, 83–88. [Google Scholar] [CrossRef]
Feng, Q.; Yang, J.; Liu, Y.; Ou, C.; Zhu, D.; Niu, B.; Liu, J.; Li, B. Multi-temporal unmanned aerial vehicle remote sensing for vegetable mapping using an attention-based recurrent convolutional neural network. Remote Sens. 2020, 12, 1668. [Google Scholar] [CrossRef]
Ye, Z.; Yang, K.; Lin, Y.; Guo, S.; Sun, Y.; Chen, X.; Lai, R.; Zhang, H. A comparison between Pixel-based deep learning and Object-based image analysis (OBIA) for individual detection of cabbage plants based on UAV Visible-light images. Comput. Electron. Agric. 2023, 209, 107822. [Google Scholar] [CrossRef]
Reedha, R.; Dericquebourg, E.; Canals, R.; Hafiane, A. Transformer neural network for weed and crop classification of high resolution UAV images. Remote Sens. 2022, 14, 592. [Google Scholar] [CrossRef]
Sujatha, K.; Reddy, T.K.; Bhavani, N.; Ponmagal, R.; Srividhya, V.; Janaki, N. UGVs for Agri Spray with AI assisted Paddy Crop disease Identification. Procedia Comput. Sci. 2023, 230, 70–81. [Google Scholar] [CrossRef]
Xu, R.; Li, C. A review of high-throughput field phenotyping systems: Focusing on ground robots. Plant Phenomics 2022, 2022, 9760269. [Google Scholar] [CrossRef] [PubMed]
Silva, J.A.O.S.; de Siqueira, V.S.; Mesquita, M.; Vale, L.S.R.; da Silva, J.L.B.; da Silva, M.V.; Lemos, J.P.B.; Lacerda, L.N.; Ferrarezi, R.S.; de Oliveira, H.F.E. Artificial Intelligence Applied to Support Agronomic Decisions for the Automatic Aerial Analysis Images Captured by UAV: A Systematic Review. Agronomy 2024, 14, 2697. [Google Scholar] [CrossRef]
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar]
Nichol, A.Q.; Dhariwal, P. Improved denoising diffusion probabilistic models. In Proceedings of the International Conference on Machine Learning PMLR, Virtual, 18–24 July 2021; pp. 8162–8171. [Google Scholar]
Skovsen, S.; Dyrmann, M.; Mortensen, A.K.; Steen, K.A.; Green, O.; Eriksen, J.; Gislum, R.; Jørgensen, R.N.; Karstoft, H. Estimation of the botanical composition of clover-grass leys from RGB images using data simulation and fully convolutional neural networks. Sensors 2017, 17, 2930. [Google Scholar] [CrossRef]
Toda, Y.; Okura, F.; Ito, J.; Okada, S.; Kinoshita, T.; Tsuji, H.; Saisho, D. Training instance segmentation neural network with synthetic datasets for crop seed phenotyping. Commun. Biol. 2020, 3, 173. [Google Scholar] [CrossRef] [PubMed]
Sapkota, B.B.; Popescu, S.; Rajan, N.; Leon, R.G.; Reberg-Horton, C.; Mirsky, S.; Bagavathiannan, M.V. Use of synthetic images for training a deep learning model for weed detection and biomass estimation in cotton. Sci. Rep. 2022, 12, 19580. [Google Scholar] [CrossRef]
Valerio Giuffrida, M.; Scharr, H.; Tsaftaris, S.A. Arigan: Synthetic arabidopsis plants using generative adversarial network. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Paris, France, 2–6 October 2017; pp. 2064–2071. [Google Scholar]
Zhu, Y.; Aoun, M.; Krijn, M.; Vanschoren, J.; Campus, H.T. Data Augmentation using Conditional Generative Adversarial Networks for Leaf Counting in Arabidopsis Plants. In Proceedings of the BMVC 2018, Newcastle, UK, 3–6 September 2018; Volume 2018, pp. 121–125. [Google Scholar]
Madsen, S.L.; Dyrmann, M.; Jørgensen, R.N.; Karstoft, H. Generating artificial images of plant seedlings using generative adversarial networks. Biosyst. Eng. 2019, 187, 147–159. [Google Scholar] [CrossRef]
Madsen, S.L.; Mortensen, A.K.; Jorgensen, R.N.; Karstoft, H. Disentangling information in artificial images of plant seedlings using semi-supervised GAN. Remote Sens. 2019, 11, 2671. [Google Scholar] [CrossRef]
Li, T.; Asai, M.; Kato, Y.; Fukano, Y.; Guo, W. Channel Attention GAN-Based Synthetic Weed Generation for Precise Weed Identification. Plant Phenomics 2024, 6, 0122. [Google Scholar] [CrossRef] [PubMed]
Fawakherji, M.; Potena, C.; Pretto, A.; Bloisi, D.D.; Nardi, D. Multi-spectral image synthesis for crop/weed segmentation in precision farming. Robot. Auton. Syst. 2021, 146, 103861. [Google Scholar] [CrossRef]
Fawakherji, M.; Suriani, V.; Nardi, D.; Bloisi, D.D. Shape and style GAN-based multispectral data augmentation for crop/weed segmentation in precision farming. Crop Prot. 2024, 184, 106848. [Google Scholar] [CrossRef]
Picon, A.; San-Emeterio, M.G.; Bereciartua-Perez, A.; Klukas, C.; Eggers, T.; Navarra-Mestre, R. Deep learning-based segmentation of multiple species of weeds and corn crop using synthetic and real image datasets. Comput. Electron. Agric. 2022, 194, 106719. [Google Scholar] [CrossRef]
Modak, S.; Stein, A. Synthesizing training data for intelligent weed control systems using generative ai. In Proceedings of the International Conference on Architecture of Computing Systems; Springer: Berlin/Heidelberg, Germany, 2024; pp. 112–126. [Google Scholar]
Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 4015–4026. [Google Scholar]
Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 10684–10695. [Google Scholar]
Chen, D.; Qi, X.; Zheng, Y.; Lu, Y.; Huang, Y.; Li, Z. Synthetic data augmentation by diffusion probabilistic models to enhance weed recognition. Comput. Electron. Agric. 2024, 216, 108517. [Google Scholar] [CrossRef]
Kitzler, F.; Barta, N.; Neugschwandtner, R.W.; Gronauer, A.; Motsch, V. WE3DS: An RGB-D image dataset for semantic segmentation in agriculture. Sensors 2023, 23, 2713. [Google Scholar] [CrossRef]
Dyrmann, M.; Mortensen, A.K.; Midtiby, H.S.; Jørgensen, R.N. Pixel-wise classification of weeds and crops in images by using a fully convolutional neural network. In Proceedings of the International Conference on Agricultural Engineering, Aarhus, Denmark, 26–29 June 2016; pp. 26–29. [Google Scholar]
Skovsen, S.; Dyrmann, M.; Mortensen, A.K.; Laursen, M.S.; Gislum, R.; Eriksen, J.; Farkhani, S.; Karstoft, H.; Jorgensen, R.N. The GrassClover image dataset for semantic and hierarchical species understanding in agriculture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019. [Google Scholar]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
Zhou, Z.; Rahman Siddiquee, M.M.; Tajbakhsh, N.; Liang, J. Unet++: A nested u-net architecture for medical image segmentation. In Proceedings of the Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, 20 September 2018; Proceedings 4. Springer: Berlin/Heidelberg, Germany, 2018; pp. 3–11. [Google Scholar]
Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
Dosovitskiy, A. An image is worth 16×16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]

Figure 1. An example of the WE3DS dataset. The left image (a) shows the field scene, and the right image (b) represents the corresponding semantic segmentation mask, where different colors indicate different plant species.

Figure 2. An example of a pixel-level synthetic dataset generated using the largest possible square patches of soilmethod. The left image (a) shows the generated synthetic field scene, and the right image (b) represents the corresponding semantic segmentation mask, where different colors indicate different plant species.

Figure 3. Proposed patch-level data synthesis pipeline.

Figure 4. An example of a patch-level synthetic dataset. The left image (a) shows the generated synthetic field scene created using rectangular patches of plants, and the right image (b) displays the corresponding pixel-level semantic segmentation mask, where different colors indicate different plant species.

Figure 5. Effect of background construction methods on mIoU for patch-level synthetic datasets. The x-axis represents the data volume (scaled by factors of the original dataset), and the y-axis shows the

{mIoU}_{no - soil}

on the test data excluding soil regions. The green dashed line indicates the original baseline dataset, the red line represents the patch-level (largest square cropping) method, and the blue line represents the patch-level (composite background construction) method.

Figure 5. Effect of background construction methods on mIoU for patch-level synthetic datasets. The x-axis represents the data volume (scaled by factors of the original dataset), and the y-axis shows the

{mIoU}_{no - soil}

on the test data excluding soil regions. The green dashed line indicates the original baseline dataset, the red line represents the patch-level (largest square cropping) method, and the blue line represents the patch-level (composite background construction) method.

Figure 6. Performance vs. training time for hybrid datasets. The y-axis represents

{mIoU}_{no - soil}

, and the x-axis represents the training time in hours. From left to right, the points correspond to baseline, 1:1, 1:5, 1:10, 1:15, and 1:20 ratios of real to synthetic data.

Figure 6. Performance vs. training time for hybrid datasets. The y-axis represents

{mIoU}_{no - soil}

, and the x-axis represents the training time in hours. From left to right, the points correspond to baseline, 1:1, 1:5, 1:10, 1:15, and 1:20 ratios of real to synthetic data.

Figure 7. PEI across different hybrid ratios. The y-axis represents the PEI metric, with higher values indicating better efficiency in balancing performance gain and training time. The x-axis represents different hybrid ratio of real to synthetic data, showing the proportion of real data to synthetic data in the training process.

Table 1. Crop and weed species in the WE3DS dataset.

English Name	EPPO Code	Crop/Weed
Broad bean	VICFA	C
Corn spurry	SPEAR	W
Red-root amaranth	AMARE	W
Common buckwheat	FAGSA	C
Pea	PISSA	C
Red fingergrass	DIGSA	W
Common wild oat	AVEFA	W
Cornflower	CENCY	W
Corn cockle	AGOGI	W
Corn	ZEAMX	C
Milk thistle	SILMA	W
Rye brome	BROSE	W
Soybean	GLYMA	C
Sunflower	HELAN	C
Narrow-leaved plantain	PLLAN	W
Small-flower geranium	GRAPU	W
Sugar beet	BETVV	C

Table 2. Summary of Foreground pool instances by species.

English Name	EPPO Code	Crop/Weed	Number
Broad bean	VICFA	C	189
Corn spurry	SPEAR	W	32
Red-root amaranth	AMARE	W	31
Common buckwheat	FAGSA	C	156
Pea	PISSA	C	205
Red fingergrass	DIGSA	W	26
Common wild oat	AVEFA	W	67
Cornflower	CENCY	W	866
Corn cockle	AGOGI	W	382
Corn	ZEAMX	C	242
Milk thistle	SILMA	W	242
Rye brome	BROSE	W	205
Soybean	GLYMA	C	225
Sunflower	HELAN	C	113
Narrow-leaved plantain	PLLAN	W	43
Small-flower geranium	GRAPU	W	1036
Sugar beet	BETVV	C	431

Table 3. Performance comparison of different semantic segmentation models on the original baseline dataset.

Model	$mIoU$	${mIoU}_{no - soil}$
UNet	0.567	0.542
UNet++	0.465	0.434
Xception-style UNet	0.622	0.600
DeepLab v3+	0.646	0.626

Table 4. Performance comparison of different traditional augmentation configurations on the original baseline dataset.

Model	$mIoU$	${mIoU}_{no - soil}$
No Augmentation	0.646	0.626
Flip + Resized Crop + Brightness Contrast	0.478	0.448
Flip + Resized Crop	0.506	0.477
Flip	0.632	0.610

Table 5. Comparison of mIoU for different dataset types.

Dataset	Train mIoU		Test mIoU
	$mIoU$	${mIoU}_{no - soil}$	$mIoU$	${mIoU}_{no - soil}$
Original Baseline Dataset	0.79	0.78	0.65	0.63
Pixel-level Synthetic Dataset	0.46	0.43	0.41	0.38
Patch-level Synthetic Dataset	0.56	0.54	0.50	0.47

Table 6. Performance and training time for hybrid datasets at different ratios.

Hybrid Ratio	$mIoU$	${mIoU}_{no - soil}$	Training Time (Hours)
Original Baseline Dataset	0.646	0.626	3.04
1:1 Hybrid	0.683	0.664	5.78
1:5 Hybrid	0.722	0.706	16.63
1:10 Hybrid	0.732	0.716	29.87
1:15 Hybrid	0.734	0.719	39.88
1:20 Hybrid	0.720	0.704	57.12

Table 7. Performance comparison of fine-tuning and hybrid dataset methods.

Method	$mIoU$	${mIoU}_{no - soil}$
Original Baseline Dataset	0.646	0.626
1:1 Hybrid	0.683	0.664
1:5 Hybrid	0.722	0.706
1:10 Hybrid	0.732	0.716
1:15 Hybrid	0.734	0.719
1:20 Hybrid	0.720	0.704
Fine-Tuning 1×	0.673	0.654
Fine-Tuning 5×	0.720	0.704
Fine-Tuning 10×	0.720	0.704
Fine-Tuning 15×	0.722	0.706
Fine-Tuning 20×	0.738	0.728

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Li, T.; Burridge, J.; Blok, P.M.; Guo, W. A Patch-Level Data Synthesis Pipeline Enhances Species-Level Crop and Weed Segmentation in Natural Agricultural Scenes. Agriculture 2025, 15, 138. https://doi.org/10.3390/agriculture15020138

AMA Style

Li T, Burridge J, Blok PM, Guo W. A Patch-Level Data Synthesis Pipeline Enhances Species-Level Crop and Weed Segmentation in Natural Agricultural Scenes. Agriculture. 2025; 15(2):138. https://doi.org/10.3390/agriculture15020138

Chicago/Turabian Style

Li, Tang, James Burridge, Pieter M. Blok, and Wei Guo. 2025. "A Patch-Level Data Synthesis Pipeline Enhances Species-Level Crop and Weed Segmentation in Natural Agricultural Scenes" Agriculture 15, no. 2: 138. https://doi.org/10.3390/agriculture15020138

APA Style

Li, T., Burridge, J., Blok, P. M., & Guo, W. (2025). A Patch-Level Data Synthesis Pipeline Enhances Species-Level Crop and Weed Segmentation in Natural Agricultural Scenes. Agriculture, 15(2), 138. https://doi.org/10.3390/agriculture15020138

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

A Patch-Level Data Synthesis Pipeline Enhances Species-Level Crop and Weed Segmentation in Natural Agricultural Scenes

Abstract

1. Introduction

2. Materials and Methods

2.1. Original Baseline Dataset

2.2. Traditional Augmentation Dataset

2.3. Pixel-Level Synthetic Dataset

2.3.1. Foreground Pool Construction

2.3.2. Background Pool Construction

2.3.3. Data Synthesis Strategy

2.4. Patch-Level Synthetic Dataset

2.5. Hybrid Dataset

2.6. Fine-Tuning on Real Data

2.7. Evaluation

2.7.1. Data Preprocessing

2.7.2. Metrics

2.7.3. Semantic Segmentation Model Selection

3. Results

3.1. Baseline Model Performance on Original Baseline Dataset

3.2. Performance on Traditional Augmentation Dataset

3.3. Comparison of Foreground Construction Methods

3.4. Comparison of Background Construction Methods

3.5. Determining the Optimal Ratio for Hybrid Datasets in Semantic Segmentation

3.6. Fine-Tuning Results and Comparison with Hybrid Dataset

4. Discussion

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI