Article

Data Augmentation in Earth Observation: A Diffusion Model Approach

by Tiago Sousa *,†, Benoît Ries *,† and Nicolas Guelfi

Faculty of Science, Technology and Medicine, Department of Computer Science, University of Luxembourg, L-4364 Esch-sur-Alzette, Luxembourg

* Authors to whom correspondence should be addressed.
† These authors contributed equally to this work.
Information 2025, 16(2), 81; https://doi.org/10.3390/info16020081
Submission received: 23 December 2024 / Revised: 16 January 2025 / Accepted: 21 January 2025 / Published: 22 January 2025

Abstract

High-quality Earth Observation (EO) imagery is essential for accurate analysis and informed decision making across sectors. However, data scarcity caused by atmospheric conditions, seasonal variations, and limited geographical coverage hinders the effective application of Artificial Intelligence (AI) in EO. Traditional data augmentation techniques, which rely on basic parameterized image transformations, often fail to introduce sufficient diversity across key semantic axes. These axes include natural changes such as snow and floods, human impacts like urbanization and roads, and disasters such as wildfires and storms, which limits the accuracy of AI models in EO applications. To address this, we propose a four-stage data augmentation approach that integrates diffusion models to enhance semantic diversity. Our method employs meta-prompts for instruction generation, vision–language models for rich captioning, EO-specific diffusion model fine-tuning, and iterative data augmentation. Extensive experiments using four augmentation techniques demonstrate that our approach consistently outperforms established methods, generating semantically diverse EO images and improving AI model performance.

1. Introduction

Earth Observation (EO) plays an important role in environmental and geospatial sciences by leveraging remote sensing technologies to collect data on Earth’s ecosystems. This enables continuous observations across various spatial scales and timeframes [1], which is instrumental in monitoring climate change [2], tracking biodiversity [3], assessing Sustainable Development Goal (SDG) indicators [4], and evaluating ecosystem resilience [5], among other things. The integration of Artificial Intelligence (AI) within the EO domain marks a significant advancement, enhancing data processing, analysis, identification, and decision making, thereby advancing our understanding of Earth’s systems [6].
Despite these advancements, AI adoption in EO faces challenges, particularly the high cost and limited availability of diverse, high-quality satellite imagery [7]. Traditional data augmentation techniques [8] mitigate these issues by applying transformations such as rotations, flips, and scaling [9]. However, these methods fail to address meaningful semantic diversity, which is critical for capturing real-world variations in satellite imagery and ensuring model robustness.
In EO applications, it is essential to capture diversity along key semantic axes, the critical dimensions of variation in satellite imagery. These axes include the following:
  • Natural Changes: Gradual or abrupt changes caused by natural processes, such as snow cover, droughts, or floods.
  • Human Impacts: Anthropogenic activities, such as urbanization, road construction, and deforestation, that alter landscapes.
  • Disasters: Extreme events like wildfires, floods, and storms that lead to rapid and significant environmental transformations.
Existing techniques inadequately capture these semantic axes, limiting the generalization and effectiveness of AI models. To address this, we propose a novel data augmentation approach leveraging diffusion models [10] to generate semantically diverse synthetic EO images. By explicitly targeting key semantic axes, our method produces realistic variations that align with natural and anthropogenic processes, as illustrated in Figure 1.
Our approach uniquely enhances EO datasets by addressing data scarcity through the generation of realistic synthetic imagery and capturing semantic diversity that reflects natural changes, human impacts, and disasters, enabling better model generalization. By integrating domain-specific captions and fine-tuning diffusion models with EO data, we advance the quality and diversity of training datasets. This contribution tackles the limitations of traditional augmentation methods, enhancing the performance of AI-driven EO applications in environmental monitoring and analysis.

2. Related Work

Data augmentation is an important technique in AI, particularly in the context of machine learning and deep learning models. It is used to artificially expand and diversify a dataset by applying transformations to existing data, thus generating synthetic samples. This process can be described by a function $A: X \rightarrow X'$, where $X$ is the original dataset and $X'$ is the augmented dataset. The augmentation function $A$ applies a set of transformations $T = \{t_1, t_2, \ldots, t_n\}$ to each data point $x \in X$, resulting in augmented data points $x' \in X'$.
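As a concrete illustration of the mapping $A: X \rightarrow X'$, the following minimal Python sketch applies a small transformation set T to every image in a dataset. The use of torchvision and the specific transformations are illustrative assumptions, not the configuration used in this work.

from torchvision import transforms

# Illustrative transformation set T = {t_1, t_2, t_3} (assumed, not the paper's setup)
T = [
    transforms.RandomHorizontalFlip(p=1.0),
    transforms.RandomRotation(degrees=90),
    transforms.ColorJitter(brightness=0.2),
]

def augment(dataset):
    """Apply every transformation t in T to each data point x in X, producing X'."""
    augmented = []
    for image, label in dataset:                 # x in X
        for t in T:
            augmented.append((t(image), label))  # x' in X'
    return augmented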
Data augmentation techniques have been extensively researched for their potential to enhance the robustness and performance of machine learning models [9], especially in domains where data collection is challenging [11], such as EO [12,13]. This section provides an overview of the existing literature, focusing on traditional data augmentation methods and emerging techniques involving diffusion models.

2.1. Traditional Augmentation Techniques

Traditional data augmentation methods primarily focus on applying parameterized image transformations to enrich datasets. Common techniques include geometric transformations (e.g., rotation, scaling, flipping), color space augmentations, and noise injection [9]. These methods aim to improve model generalization by exposing models to a variety of altered inputs without changing the underlying data distribution.
Hendrycks et al. [14] introduced AugMix, which improves model robustness by mixing various random image transformations in a stochastic manner, resulting in more diverse augmented data. Cubuk et al. [15] presented AutoAugment, employing a search algorithm to find the optimal augmentation policies for specific datasets, significantly boosting model accuracy without manual tuning.
In the context of EO, data augmentation plays a critical role due to the high cost and difficulty of acquiring labeled satellite imagery. Abdelhack [16] compared several image augmentation techniques for satellite image classification, finding that horizontal and vertical flipping were most effective in achieving high accuracy rates.
Illarionova et al. [17] demonstrated an advanced technique where a spectral channel from the original image is replaced with the same channel from another image of the same location but taken on a different date. This method has been shown to help models generalize better to unseen data and outperforms state-of-the-art models in forest type classification problems. However, its application may not always be feasible in datasets with limited temporal data over the same locations.
Consequently, a significant gap persists in effectively diversifying EO datasets, as traditional data augmentation techniques often fall short in capturing and adequately diversifying the key semantic axes of EO data. These methods typically focus on pixel-level transformations, such as rotation, flipping, or scaling, which fail to address the broader contextual and semantic diversity present in EO imagery [18]. This limitation not only restricts the ability to simulate EO variability such as natural changes, human impacts, and disaster scenarios, but also affects the reliability and robustness of downstream EO applications. By focusing on enhancing the semantic richness of augmented datasets, our approach seeks to bridge this gap, enabling more effective representation of complex and dynamic EO phenomena, thereby improving the interpretability and generalization of EO-based models.

2.2. Augmentation Techniques Using Diffusion Models

In recent years, diffusion models [10] have emerged as powerful generative models capable of producing high-quality synthetic data. Unlike traditional augmentation methods, diffusion models learn the underlying data distribution and can generate entirely new samples that are not merely transformations of existing data [19]. This capability allows for the creation of synthetic datasets that capture the complex patterns and variations inherent in EO imagery.
Advancements such as the introduction of classifier-free guidance [20] have enabled text-to-image generation for data augmentation [21], expanding the range of applications where realistic and semantically rich data generation is crucial. Diffusion models have demonstrated superiority over previous generative models in terms of sample quality and diversity [22].
However, the potential for data augmentation in EO using diffusion models remains largely underexplored. Zhao et al. [23] presented a novel approach to address the challenge of obtaining high-cost, pixel-level annotations for remote sensing image semantic segmentation. The authors leveraged diffusion models to generate annotation–image pairs from scratch, achieving competitive accuracy compared to models trained on manually annotated data. This showcases the potential to significantly reduce the need for laborious annotation processes in remote sensing image analysis.
Moreover, Sebaq et al. [24] introduced a method for creating high-resolution satellite imagery from text prompts via a two-stage diffusion model. Initially, a diffusion model generates low-resolution images by mapping text to image embeddings in a shared space, ensuring the images capture the essence of the desired scenes. Subsequently, another diffusion model refines these images into high-resolution versions with improved detail and visual quality using the original text prompts. Although the authors did not position their method as a data augmentation technique, their approach demonstrated superior performance on the RSICD dataset [25], indicating the potential of diffusion models in EO data augmentation.
While prior work has begun exploring the use of diffusion models in EO [23,24], our approach uniquely combines diffusion models with EO semantic axes tailored to EO classes. By enriching data along these axes, we explicitly target semantic diversity, incorporating variations from human activities and natural events, distinguishing our method from existing work.

3. Materials and Methods

In this section, we detail the materials and methods employed in our work. We first introduce our novel data augmentation approach for Earth Observation images, which leverages diffusion models to generate semantically diverse synthetic data. We then describe the experimental setup used to evaluate the effectiveness of our approach, including the dataset, data augmentation techniques compared, model architectures, and evaluation metrics. This comprehensive methodology aims to demonstrate the impact of our proposed data augmentation method on enhancing the performance of AI models in EO applications.

3.1. Earth Observation Data Augmentation

Our approach aims to address the limitations of current data augmentation techniques in EO, particularly the insufficient diversity along key semantic axes in satellite or aerial imagery. We define key semantic axes as the meaningful interpretations attributed to the visual content of EO images. These axes typically encompass categories such as land features (e.g., forests, rivers, mountains), natural phenomena (e.g., wildfires, flooding), human-made structures (e.g., buildings, roads), and interpretations of environmental or urban landscapes, which might involve recognizing agricultural patterns or urban development. Augmenting EO image datasets with diversity along these axes is crucial for accurately capturing Earth’s varied surfaces. This enhancement is key to improving the robustness and generalizability of machine learning models used for tasks such as land cover classification or object detection, where accuracy is critical. One particularly important semantic axis is ecosystem resilience, which denotes the ability of a system or entity to adapt and recover from disturbances, often returning to or surpassing its previous state. Understanding ecosystem resilience is essential for assessing responses to challenges such as natural disasters, climate change, and human impacts. Incorporating ecosystem resilience into synthetic images enables better modeling of ecosystem impacts and recovery potentials. For example, in disaster management, accurately depicting ecosystem resilience can significantly influence predictions and management strategies for ecological shifts, determining whether an ecosystem quickly rebounds or undergoes long-term degradation.
As depicted in Figure 2, our data augmentation process consists of four stages. First, we generate class-parameterized prompts that take into account the key semantic axes of interest, derived from an EO training dataset. These prompts are then used with a vision–language model (VLM) to generate captions for the remote sensing images in the dataset. Subsequently, these captions aid in fine-tuning a text-to-image diffusion model specifically for remote sensing imagery, leveraging knowledge from the initial dataset. Once fine-tuned, this model serves to augment the dataset by producing additional EO images that are both varied and semantically aligned with the original dataset’s context, thereby improving the robustness and diversity of the data available for downstream applications. The overall process is summarized in Algorithm 1.
In the following subsections, we present each stage of our process in detail.
Algorithm 1: Proposed Four-Stage Data Augmentation Process

Require: Training dataset TD, class labels C, meta-prompt MP, vision–language model VLM, text-to-image diffusion model DM, prompt specification PS
Ensure: Augmented dataset AD
1:  Initialize temporary datasets: ND, P
2:  Stage 1: Instruction Generation
3:  for each class c ∈ C do
4:      Generate class-specific prompt p_c by substituting <class> in MP with c
5:      Add (c, p_c) to P
6:  end for
7:  Stage 2: Captioning
8:  for each (image, class) in TD do
9:      Find prompt p_class corresponding to class in P
10:     caption ← VLM(image, p_class)
11:     Add (image, class, caption) to ND
12: end for
13: Stage 3: Fine-tuning the Diffusion Model
14: Fine-tune DM using {(image, caption) : (image, class, caption) ∈ ND}
15: Stage 4: Data Augmentation
16: Initialize augmented dataset AD ← TD
17: for each class c ∈ C do
18:     Retrieve prompt set PS_c for class c from prompt specification PS
19:     for each prompt p ∈ PS_c do
20:         for i = 1 to N_{c,p} do
21:             synthetic_image ← DM(p)
22:             Add (synthetic_image, c) to AD
23:         end for
24:     end for
25: end for

3.1.1. Instruction Generation

The initial stage of our approach involves generating a prompt for each class in the EO training dataset, establishing a one-to-one correspondence between classes and prompts. To facilitate this, we define the following meta-prompt, which is used as an input artifact:
Generate a detailed and descriptive caption for the provided remote sensing image, focusing on the specified <class> class. In your description, clearly identify the key characteristics visible in the image. If the image suggests any impact of human activity, natural events, or environmental conditions, elaborate on these.
In this meta-prompt, <class> serves as a parameter to be substituted with the different classes from the training dataset. This parameterization enables the meta-prompt to be adapted to the wide range of classes encountered in EO datasets, offering customized guidance for each class. By defining distinctive prompts for each class, we aim to steer the subsequent stage of our approach to focus on the distinct features and implications associated with each <class>, intending to generate detailed, context-aware captions that align with the previously defined key semantic axes. At the end of this stage, a set of prompts is produced as an output artifact.
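The substitution itself is a simple string operation. The sketch below shows one possible implementation of Stage 1 using the meta-prompt quoted above; the class names are an illustrative subset and the dictionary layout is an assumption.

META_PROMPT = (
    "Generate a detailed and descriptive caption for the provided remote sensing "
    "image, focusing on the specified <class> class. In your description, clearly "
    "identify the key characteristics visible in the image. If the image suggests "
    "any impact of human activity, natural events, or environmental conditions, "
    "elaborate on these."
)

classes = ["Forest", "River", "Residential"]  # illustrative subset of EO classes

# One prompt per class: substitute <class> with the class label
prompts = {c: META_PROMPT.replace("<class>", c) for c in classes}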

3.1.2. Captioning

In this second stage, we generate detailed captions that describe the key semantic axes of EO images, capturing natural changes, human impacts, and disaster-related dynamics. EO images are inherently complex, encompassing diverse landforms, natural phenomena, and anthropogenic activities. They provide critical insights into dynamic events such as deforestation, urbanization, and floods. Accurately describing these features in a structured and interpretable way is essential for tasks such as resilience analysis, disaster response, and climate change monitoring.
To achieve this, we use as input the InstructBLIP [26] vision–language model (VLM), along with the prompts artifact from the previous stage. InstructBLIP was chosen for its proven zero-shot performance on large, domain-specific datasets and its instruction-based visual feature extraction capabilities. Each EO image from the dataset is paired with a class-specific prompt derived from the instruction generation stage, guiding the VLM to focus on key features such as land cover types, spatial relationships, environmental attributes, and temporal changes caused by natural or human-driven factors. Despite the domain differences between EO images and the datasets used to train the VLM, its effectiveness in generating captions with relevant semantic information has been demonstrated [27].
To ensure high-quality captions, we configure the VLM with specific parameters. Beam search is performed with a beam width of 5 to balance computational efficiency and output quality. To avoid overly brief captions, we enforce a minimum token length of 10, while a maximum length of 256 tokens prevents excessively long or incomplete captions. A length penalty of 1.0 is applied to encourage concise captions that align with human annotations. These settings result in a set of contextually relevant captions for the EO dataset, highlighting key semantic axes and providing comprehensive descriptions of the imagery, which are consolidated into the output artifact of this stage.
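A possible realization of this captioning stage is sketched below using the Hugging Face implementation of InstructBLIP. The checkpoint name, the use of a CUDA device, and the exact API calls are assumptions; the generation parameters (beam width 5, 10-256 tokens, length penalty 1.0) follow the configuration described above.

import torch
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-vicuna-7b")
model = InstructBlipForConditionalGeneration.from_pretrained(
    "Salesforce/instructblip-vicuna-7b", torch_dtype=torch.float16
).to("cuda")

def caption_image(image_path, class_prompt):
    """Generate a caption for one EO image, guided by its class-specific prompt."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, text=class_prompt, return_tensors="pt").to("cuda")
    output_ids = model.generate(
        **inputs,
        num_beams=5,         # beam search width
        min_length=10,       # avoid overly brief captions
        max_length=256,      # cap excessively long captions
        length_penalty=1.0,  # encourage concise, human-like captions
    )
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()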
Figure 3 illustrates an example where the class-specific prompt and configuration parameters guide the VLM to generate a detailed, context-aware caption for a “river” image. The generated caption captures essential visual aspects such as land features, natural phenomena, and the environmental landscape.

3.1.3. Model Fine-Tuning

In the third stage of our approach, we focus on fine-tuning a pre-trained Stable Diffusion V1.5 model [10], a latent text-to-image diffusion model, to adapt it to our domain of interest, namely EO. Diffusion models are a powerful class of generative AI models characterized by a forward diffusion process that incrementally corrupts data by adding Gaussian noise, and a reverse process that aims to denoise the data to recover the original or generate new samples.
The forward process is formulated as a Markov chain where, at each timestep $t$, the data $x_t$ is obtained by adding Gaussian noise to $x_{t-1}$:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right),$$

where $\beta_t$ is the variance schedule controlling the amount of noise added at each step, and $I$ is the identity matrix. The joint distribution over all timesteps is given by the following:

$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}).$$
The reverse process aims to reconstruct $x_0$ from $x_T$ by iteratively denoising the data. It is modeled by a neural network $p_\theta$, often implemented as a U-Net architecture [28], which approximates the reverse conditional distributions as Gaussian:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right).$$

Here, $\mu_\theta(x_t, t)$ and $\Sigma_\theta(x_t, t)$ are the mean and covariance predicted by the neural network at each timestep $t$. This formulation allows for the generation of new data by starting from pure noise $x_T \sim \mathcal{N}(0, I)$ and applying the learned denoising steps.
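To make the forward process concrete, the sketch below samples $x_t$ from $q(x_t \mid x_{t-1})$ for a single step; the linear variance schedule is a common default and an assumption here, not the schedule used by Stable Diffusion.

import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)  # assumed linear variance schedule beta_t

def forward_step(x_prev, t):
    """Sample x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    noise = torch.randn_like(x_prev)
    return torch.sqrt(1.0 - betas[t]) * x_prev + torch.sqrt(betas[t]) * noise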
While diffusion models have significantly advanced text-to-image generation, producing realistic images guided by natural language descriptions [10], adapting these models to specific domains like EO presents substantial computational challenges. Fine-tuning the entire set of model parameters is resource-intensive, especially for large, pre-trained models.
To overcome this challenge, our approach adopts the Low-Rank Adaptation (LoRA) technique [29] for efficient fine-tuning of diffusion models. LoRA reduces computational overhead by introducing low-rank updates to the weight matrices in the attention mechanisms within the diffusion model's U-Net architecture. Specifically, given a pre-trained weight matrix $W \in \mathbb{R}^{n \times m}$, LoRA introduces a low-rank update $\Delta W = A B^\top$, where $A \in \mathbb{R}^{n \times d}$ and $B \in \mathbb{R}^{m \times d}$, with $d \ll \min(n, m)$. The updated weight matrix becomes the following:

$$W' = W + \Delta W = W + A B^\top.$$
During fine-tuning, only the parameters A and B are updated via gradient descent while the original weights W remain fixed. This significantly reduces the number of trainable parameters, enabling efficient domain adaptation with minimal computational resources. The choice of LoRA is specifically motivated by its computational benefits over alternative approaches. Compared to full fine-tuning, such as Dreambooth [30] or naive fine-tuning, LoRA drastically reduces the training time and hardware requirements, making it feasible to adapt large diffusion models using standard GPUs. Additionally, the compactness of the resulting low-rank updates minimizes storage overhead, simplifying model management when handling multiple fine-tunings. Unlike methods like Hypernetworks [31], LoRA avoids additional inference latency by merging the updates into the original weights, ensuring efficient deployment. These advantages, combined with its demonstrated performance for domain adaptation tasks, make LoRA an ideal choice for our proposed approach.
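The sketch below illustrates the low-rank update on a single linear layer, as used conceptually in the attention projections of the U-Net. It is a didactic stand-in, not the PEFT or diffusers implementation, and the rank and scaling values are assumptions.

import torch
from torch import nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a trainable low-rank update Delta W = A B^T."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # original weights W stay fixed
            p.requires_grad = False
        n, m = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(n, rank) * 0.01)  # A in R^{n x d}
        self.B = nn.Parameter(torch.zeros(m, rank))         # B in R^{m x d}
        self.scale = alpha / rank

    def forward(self, x):
        # y = x W^T + b, plus the low-rank correction x (A B^T)^T = (x B) A^T
        return self.base(x) + self.scale * (x @ self.B) @ self.A.t()

# usage: an attention projection replaced by a LoRA-wrapped layer
layer = LoRALinear(nn.Linear(768, 768), rank=4)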
At this stage, we fine-tune the pre-trained Stable Diffusion V1.5 model using image–caption pairs, where the images are from the EO dataset and the captions are those generated in the previous captioning stage. To achieve computational efficiency, we employ the LoRA technique during fine-tuning.
Our training configuration includes mixed-precision computation using FP16, an image resolution of 512 × 512 pixels, random horizontal flipping for data augmentation, and a batch size of 1. To simulate a larger effective batch size and stabilize training updates, we apply gradient accumulation over 4 steps. The model is trained for 20 epochs, which we empirically found to be optimal in our experiments. We set the initial learning rate to $1 \times 10^{-4}$ to achieve gradual and stable convergence. Additionally, we apply gradient clipping with a maximum norm of 1 to ensure training stability and use a constant learning rate scheduler. The training process, implemented in PyTorch, takes 354 min on Google Cloud Platform using a machine configuration with 8 vCPUs, 52 GB of RAM, and a single NVIDIA V100 GPU.
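The condensed training loop below reflects the settings reported above (mixed precision, gradient accumulation over 4 steps, gradient clipping at norm 1, a constant learning rate of $1 \times 10^{-4}$, 20 epochs). The tiny stand-in model, loss, data, and the choice of AdamW are assumptions made only to keep the sketch self-contained.

import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(16, 16).to(device)                        # stand-in for the trainable LoRA parameters
loader = [(torch.randn(2, 16), torch.randn(2, 16)) for _ in range(8)]
loss_fn = nn.MSELoss()                                      # stand-in for the diffusion objective

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # constant learning rate
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))
accum_steps = 4                                             # gradient accumulation over 4 steps

for epoch in range(20):                                     # 20 epochs, as reported above
    for step, (inputs, targets) in enumerate(loader):
        with torch.autocast(device_type=device):            # mixed-precision computation
            loss = loss_fn(model(inputs.to(device)), targets.to(device)) / accum_steps
        scaler.scale(loss).backward()
        if (step + 1) % accum_steps == 0:
            scaler.unscale_(optimizer)
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip at norm 1
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()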
The fine-tuned model’s capability to generate 512 × 512 pixel images enriched along key semantic axes provides a strong foundation for EO applications, particularly in land use and land cover (LULC) classification tasks. By capturing diversity in human activities and natural events, the generated images offer finer granularity and enhanced semantic richness, which are crucial for accurately identifying different land cover types and monitoring changes over time. This incorporation of diverse semantic content into high-resolution imagery thus offers a solid basis for precise LULC assessments, enabling AI models to better generalize and perform more accurate classifications.

3.1.4. Data Augmentation

In the final stage, we utilize the fine-tuned diffusion model along with a detailed prompt specification (Prompt Spec) to augment the EO dataset by significantly expanding its volume and diversity. The Prompt Spec serves as a structured set of instructions, defining key semantic axes and providing a framework for generating diverse synthetic images that address critical gaps in the dataset. This stage is particularly motivated by the inherent limitations of EO datasets, which often suffer from imbalanced representation of classes, insufficient coverage of real-world variability, and limited scalability. By varying the prompts according to the Prompt Spec, we ensure that the generated images not only reflect diversity across semantic axes, such as natural changes, human impacts, and disaster scenarios, but also simulate realistic and challenging conditions. This enables downstream models to better generalize to unseen data and improves their robustness in operational EO applications.
The complete four-stage data augmentation process is formalized in Algorithm 1.
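Concretely, Stage 4 can be driven by a text-to-image pipeline loaded with the fine-tuned weights, as sketched below with the diffusers library. The base checkpoint identifier, the LoRA weight path, and the number of images per prompt are assumptions for illustration; the prompt prefix and the River description follow the Prompt Spec given in Section 3.2.2.

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("path/to/eo-lora-weights")  # fine-tuned EO adapter (assumed path)

PREFIX = "Generate a remote sensing satellite image capturing "
prompt_spec = {  # one entry per class; only "River" is shown here
    "River": PREFIX + "a winding river cutting through diverse landscapes, "
                      "with adjacent vegetation or urban areas.",
}

augmented = []
for cls, prompt in prompt_spec.items():
    for _ in range(4):  # N_{c,p}: synthetic images per prompt (assumed value)
        image = pipe(prompt, height=512, width=512).images[0]
        augmented.append((image, cls))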

3.2. Experiment

In this section, we present an experiment designed to compare various data augmentation techniques, including our own, in a classification problem within the EO domain. We aim to assess the effectiveness of our proposed approach by benchmarking it against established methods and providing visualizations to illustrate its impact on model performance.

3.2.1. EO Dataset

We utilize the EuroSAT dataset [32], which is derived from the Sentinel-2 satellite mission of the European Space Agency (ESA). The dataset comprises 27,000 geo-referenced images with a resolution of 64 × 64 pixels. The images are organized into ten distinct land cover classes: Annual Crop, Forest, Herbaceous Vegetation, Highway, Industrial, Pasture, Permanent Crop, Residential, River, and Sea/Lake. Each class contains approximately 2000 to 3000 images, ensuring a balanced representation across different land cover types.
The EuroSAT dataset captures a diverse range of geographical locations across Europe, providing variability in terms of climate zones, vegetation types, and human activities. The images are multispectral, originally consisting of 13 spectral bands from the Sentinel-2 sensor, covering visible, near-infrared, and shortwave infrared wavelengths. For our experiments, we use the RGB bands (bands 4, 3, and 2), which are sufficient for visual analysis and align with the capabilities of standard image processing models.
Descriptive statistics of the dataset reveal significant diversity across the semantic axes of interest. For instance, the Residential class includes images of urban areas with varying densities and architectural styles, reflecting human activities. The natural classes such as Forest and River capture different natural landscapes and environmental conditions.
The EuroSAT dataset supports our focus on semantic axes by providing a wide range of land cover types and associated features. The inclusion of both human-made structures (e.g., Residential, Industrial, Highway) and natural elements (e.g., Forest, River, Sea/Lake) allows for the exploration of human activities and natural events in EO imagery. This diversity is essential for training and validating AI models that can generalize across different environmental conditions and land cover types. For our experiment, the dataset is partitioned into training (70%), validation (20%), and testing (10%) subsets.
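The 70/20/10 partition can be reproduced along the following lines; the use of the torchvision EuroSAT wrapper and the fixed random seed are assumptions, not necessarily the split procedure used in our experiments.

import torch
from torchvision import transforms
from torchvision.datasets import EuroSAT

dataset = EuroSAT(root="data", download=True, transform=transforms.ToTensor())

n = len(dataset)                                    # 27,000 RGB images, 64 x 64 pixels
n_train, n_val = int(0.7 * n), int(0.2 * n)
train_set, val_set, test_set = torch.utils.data.random_split(
    dataset,
    [n_train, n_val, n - n_train - n_val],          # 70% / 20% / 10%
    generator=torch.Generator().manual_seed(42),    # assumed seed for reproducibility
)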

3.2.2. Data Augmentation Techniques

We present the data augmentation strategies used for comparison with our approach. These techniques were selected based on insights from related work, where traditional augmentation methods [9,14,16,17] and automated augmentation strategies [15] have shown effectiveness in EO applications. However, as discussed in Section 2, existing methods primarily focus on pixel-level transformations and fail to capture the broader semantic diversity required for EO tasks. The following techniques, including our proposed approach, form the basis for our comparative evaluation.
  • Baseline: The baseline serves as a control scenario where no data augmentation is applied beyond resizing the images to 224 × 224 pixels. This setup benchmarks the model’s performance on the EuroSAT dataset, providing a reference point for evaluating other augmentation techniques.
  • Basic Augmentation: Basic augmentation includes random resized cropping to 224 × 224 pixels and random horizontal flipping. These simple transformations help the model become less sensitive to variations in orientation and scale, which are common in EO imagery.
  • Advanced Augmentation: Advanced augmentation introduces additional variability through the following transformations (a torchvision sketch of the basic and advanced pipelines is given at the end of this subsection):
    Random horizontal and vertical flips;
    Random rotations between 0° and 360°;
    Random resized cropping to 224 × 224 pixels with a scale range from 0.7 to 1.0.
These transformations aim to improve generalization by introducing a broader range of spatial variations.
  • AutoAugment for Earth Observation: AutoAugment [15] applies transformations based on policies optimized for the ImageNet dataset. While originally designed for natural images, AutoAugment exposes models to diverse visual variations, which can enhance robustness in EO datasets, based on preliminary findings [33].
  • Our Augmentation: Our approach, as detailed in Section 3.1, leverages diffusion models to generate semantically rich and diverse synthetic EO images. We generate images at 512 × 512 resolution and resize them to 224 × 224 for consistency with other methods. To ensure domain-specific relevance, we employ a prompt specification (Prompt Spec) where all prompts begin with the following prefix:
“Generate a remote sensing satellite image capturing”
    Each prompt focuses on a specific class in the dataset, incorporating key visual and semantic details:
    • Annual Crop: Fields of crops arranged in orderly rows with agricultural machinery, capturing agricultural precision.
    • Forest: A forest landscape as seen from space, highlighting the canopy’s texture and diversity, with visible paths, clearings, or water bodies. Include signs of deforestation in a small area.
    • Herbaceous Vegetation: Areas dominated by herbaceous vegetation, such as meadows or grasslands, showing the texture and color variations of the vegetation.
    • Highway: A major highway traversing various landscapes, including bridges, interchanges, and adjacent urban or rural areas.
    • Industrial: A large industrial area featuring factories and warehouses, with clear indications of industrial activity such as large parking lots.
    • Pasture: A lush pasture area from above, featuring grazing livestock, with no trees and a nearby fire.
    • Permanent Crop: A vineyard showing permanent crop arrangements, rows of trees or vines, and possibly signs of ongoing maintenance or harvesting.
    • Residential: A dense residential area with a variety of housing units, surrounded by streets and green spaces.
    • River: A winding river cutting through diverse landscapes, with adjacent vegetation or urban areas.
    • Sea/Lake: A lake or coastal sea area as seen from above, highlighting the surrounding land, including beaches, docks, or natural vegetation.
    This detailed prompt specification ensures that the generated images incorporate semantic variations caused by natural changes, human impacts, and disasters. By explicitly targeting these key semantic axes, our method produces diverse and realistic EO images that traditional augmentation techniques cannot achieve.
Our augmentation method stands out by explicitly addressing the limitations of existing techniques. While traditional methods like basic and advanced augmentation introduce geometric and pixel-level transformations, they fail to address the broader semantic axes of EO imagery. AutoAugment adds automated variability but lacks domain-specific focus. In contrast, our augmentation explicitly targets semantic diversity, generating images that reflect real-world variations such as seasonal changes, urban development, and disaster impacts. This makes our approach particularly suited for EO applications, where capturing such complexity is essential for robust AI model performance.
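For reference, the Basic and Advanced augmentation pipelines listed above can be expressed with torchvision transforms as in the sketch below; the parameter values follow the descriptions in this subsection, while the Compose ordering is an assumption.

from torchvision import transforms

basic = transforms.Compose([
    transforms.RandomResizedCrop(224),     # random resized crop to 224 x 224
    transforms.RandomHorizontalFlip(),
])

advanced = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(degrees=360),               # rotations spanning 0°-360°
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),  # crop with scale 0.7-1.0
])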

3.2.3. Model Architecture

To evaluate the impact of our augmented dataset, we fine-tune two widely used variants of CLIP [34]: ResNet-50 (RN50) and Vision Transformer B/32 (ViT-B/32). CLIP-based architectures have demonstrated strong performance in transfer learning tasks and are well suited to handling the variability and complexity of EO imagery. These models allow us to effectively benchmark the improvements achieved by augmenting data with our proposed method.
For the training pipeline, we empirically determine the optimal hyperparameters through experimentation. We utilize the stochastic gradient descent (SGD) optimizer with a learning rate of $1 \times 10^{-5}$, momentum of 0.95, and weight decay of $1 \times 10^{-5}$. A cosine annealing learning rate scheduler is applied over 5 epochs, with a minimum learning rate set to $1 \times 10^{-8}$. Cross-entropy loss is used for classification, where the class labels serve as targets. Early stopping is implemented to prevent overfitting and ensure efficient training. All experiments are conducted on Google Cloud Platform using a machine configuration featuring 8 vCPUs, 52 GB of RAM, and a single NVIDIA V100 GPU. The models are implemented and trained using PyTorch.
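The classification fine-tuning setup described above is summarized in the sketch below (SGD with learning rate $1 \times 10^{-5}$, momentum 0.95, weight decay $1 \times 10^{-5}$, cosine annealing over 5 epochs down to $1 \times 10^{-8}$, cross-entropy loss). The tiny stand-in classifier and synthetic batches are assumptions used only to keep the example self-contained; in practice the CLIP RN50 or ViT-B/32 image encoder with a 10-class head takes their place.

import torch
from torch import nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 10))  # stand-in classifier
train_loader = [
    (torch.randn(4, 3, 224, 224), torch.randint(0, 10, (4,))) for _ in range(8)
]

optimizer = torch.optim.SGD(model.parameters(), lr=1e-5, momentum=0.95, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=5, eta_min=1e-8)
criterion = nn.CrossEntropyLoss()

for epoch in range(5):                    # cosine schedule applied over 5 epochs
    for images, labels in train_loader:
        logits = model(images)
        loss = criterion(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()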

3.2.4. Evaluation Metrics

To assess the performance of the models under different augmentation strategies, we use the top-1 and top-3 accuracy metrics on the test set. Top-1 accuracy measures the proportion of test images where the model’s highest probability prediction matches the true class label. Top-3 accuracy measures the proportion of test images where the true class label is among the model’s three highest probability predictions. These metrics provide a clear indication of the model’s classification performance.
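Both metrics reduce to a top-k check over the model's class scores, as in the short sketch below; the function name and tensor shapes are illustrative.

import torch

def topk_accuracy(logits, targets, k=1):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    topk = logits.topk(k, dim=1).indices               # (N, k) predicted class indices
    hit = (topk == targets.unsqueeze(1)).any(dim=1)    # true label within top-k?
    return hit.float().mean().item()

# usage: top1 = topk_accuracy(logits, labels, k=1); top3 = topk_accuracy(logits, labels, k=3)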

3.2.5. Experimental Procedure

For each data augmentation technique, we trained the CLIP variants using the parameters outlined above. We ensured that all other aspects of the training process were kept consistent to isolate the impact of the augmentation strategies. We then evaluated the trained models on the test set of the EuroSAT dataset to obtain the top-1 and top-3 accuracy metrics.
Additionally, to assess the capability of our generated synthetic images to generalize to real-world data, we conducted a zero-shot classification experiment. Zero-shot classification refers to the ability of a model to correctly classify instances from classes that it has not seen during training [35]. In our context, we fine-tuned the CLIP models exclusively on the synthetic images generated by our method and evaluated their performance on the original EuroSAT test set without any additional training. This approach tests the model’s ability to generalize from synthetic data to real-world data it has not encountered before.

4. Results

In this section, we present the results of our experiments, demonstrating the effectiveness of our proposed data augmentation approach in enriching semantic diversity and improving classification performance. We provide both quantitative metrics and qualitative visualizations to assess the impact of our method, particularly in relation to key semantic axes such as human activities and natural events.

4.1. Classification Performance

The proposed data augmentation approach demonstrates a measurable improvement in model generalization capabilities, as evidenced by its superior performance compared with other previously discussed augmentation methodologies. Notably, with the ResNet-50 variant of CLIP, our approach yielded a 3% increment in top-1 accuracy on the EuroSAT dataset over the next best data augmentation technique. As shown in Table 1, which compares the performance of different augmentation techniques, our method achieved a top-1 accuracy of 39% for the ResNet-50 model, a significant improvement over the advanced, basic, and baseline methods, which scored 36%, 34%, and 33%, respectively. In direct comparison with the AutoAugment strategy outlined by Cubuk et al. [15], our method demonstrates a substantial increase in performance, with a delta of +8% in top-1 accuracy on the ResNet-50 model (31% to 39%). Moreover, our approach attains the highest overall F1-score (0.36), outperforming AutoAugment (0.27) by a considerable margin, thereby indicating stronger class-level consistency in predictions.
Similarly, for the ViT-B/32 variant of CLIP, our approach achieved a top-1 accuracy of 90%, outperforming the AutoAugment strategy, which recorded an accuracy of 85%. This improvement is likewise reflected in the F1-scores, with our approach reaching 0.85, notably exceeding the next best approach, AutoAugment, at 0.80. This represents a 5% improvement in top-1 accuracy and a 0.05 improvement in F1-score over the best-performing alternative method.
Regarding the top-3 accuracy, as shown in Table 1, our approach leads to the highest top-3 accuracy with the ResNet-50 model, reaching 66%, which is an improvement over the other strategies. Both the Advanced Augmentation and Baseline strategies achieved a top-3 accuracy of 62%, while the Basic Augmentation strategy resulted in a lower accuracy of 60%. The AutoAugment strategy achieved the lowest top-3 accuracy of 55%.
With the ViT-B/32 variant, our approach again demonstrates superior performance, achieving a top-3 accuracy of 99%, outperforming the other strategies. The Advanced and Basic Augmentation strategies both recorded a top-3 accuracy of 96%, while the Baseline strategy achieved 95%, and the AutoAugment strategy achieved 94%.
In addition to these overall metrics, our method also provides stronger performance in class-by-class evaluations. Figure 4 compares the per-class F1-scores for both the ResNet-50 and ViT-B/32 models under various augmentation strategies. Our data augmentation method consistently yields competitive or superior performance across most classes. For instance, improvements are particularly notable in the “AnnualCrop” and “River” classes under the ResNet-50 model, while the ViT-B/32 model exhibits substantial gains in the “PermanentCrop” and “Residential” classes. These per-class analyses underscore our method’s capacity to enhance class-specific discriminative power, which is crucial for EO applications involving diverse or imbalanced datasets.
Moreover, the training dynamics for each augmentation strategy can be observed in Figure 5, which shows the validation accuracy over epochs for both the ResNet-50 (left) and ViT-B/32 (right) variants. Our approach not only achieves the highest final validation accuracy but also tends to converge more rapidly compared to the other methods, highlighting the effectiveness of comprehensive transformations in quickly boosting model performance.
Finally, to quantify computational overhead, we measured the average training time per epoch (ETT) for each augmentation technique and model variant. These values are reported in Table 1, revealing that more complex augmentation strategies generally incur higher ETTs. In particular, our augmentation approach, which demonstrates the best performance across metrics, also exhibits increased training time compared to simpler strategies. This added overhead can be attributed largely to the higher resolution of generated images ( 512 × 512 ), requiring additional resizing and transformations. Similarly, methods like AutoAugment incur overhead from their policy-search phase, where numerous sub-policies are tested and optimized. While our method achieves superior accuracy and class-level consistency, the computational cost may be a constraint in resource-limited settings.
Overall, these results highlight the effectiveness of our proposed data augmentation method and suggest that our method is robust and can significantly improve the performance of models in tasks related to Earth Observation and remote sensing, thereby confirming the value of our contribution to the domain of EO.

4.2. Zero-Shot Performance

To further assess the impact of our augmentation method, we conducted a zero-shot classification experiment as described in the experimental procedure. Table 2 presents the zero-shot performance of models fine-tuned exclusively on the synthetic images generated by our method, applied to the EuroSAT test set.
As shown in Table 2, the fine-tuned models exhibit substantial improvements in zero-shot accuracy over the original CLIP models. Specifically, the CLIP RN50 model achieves an increase of 16.97% in top-1 accuracy, while the CLIP ViT-B/32 model improves by 19.83%. This indicates that the synthetic images generated by our method effectively capture relevant features of EO data, enhancing the models’ generalization capabilities.

4.3. Qualitative Analysis

Despite the widespread use of the Fréchet Inception Distance (FID) metric for assessing the diversity and quality of images generated by generative models [36], we have opted not to employ it in our experiment due to several limitations. Recent studies have shown that FID has a limited representation of the intricate outputs produced by generative text-to-image models, as it may not effectively capture certain aspects of image diversity and semantic content [37]. Additionally, it has been demonstrated that FID is inconsistent with human perception [38], and its reliability increases with the volume of images, typically requiring thousands of samples for a dependable score. Our dataset volume was insufficient to meet this criterion. Therefore, we argue that assessing the impact of the generated images on the performance of the model provides a more pertinent measure of their utility for our specific EO task. By evaluating how the synthetic images enhance classification accuracy and generalization, we directly measure their effectiveness in enriching the dataset along key semantic axes.
Figure 6 presents examples of synthetic images generated using our method alongside corresponding images from the EuroSAT dataset, both resized to 128 × 128 pixels. The synthetic images exhibit realistic and diverse representations of the different land cover classes, effectively capturing key features and variations along the semantic axes. This enhanced diversity contributes to the models' improved ability to generalize and accurately classify unseen data. By incorporating variations in human activities and natural events, our synthetic images enrich the training data, enabling the AI models to better understand and interpret the complex environmental conditions present in EO imagery.
Additional examples are provided in the Appendix, which is divided into two sections. The first section (Appendix A.1) presents a detailed comparison of our generated images with those from EuroSAT, highlighting the quality and relevance of the synthetic images produced by our approach. The second section (Appendix A.2) showcases synthetic images generated using our proposed data augmentation method across four distinct EO applications: agriculture and crop monitoring, climate change and glacier monitoring, disaster management, and environmental monitoring. These examples illustrate the versatility of our approach in generating diverse and realistic data tailored to the specific needs of these applications, capturing essential semantic features and variations. This provides a valuable resource for enhancing AI models and supporting more robust and accurate analyses in the EO domain.

5. Discussion

Our experimentation demonstrates the effectiveness of integrating diffusion models into the data augmentation process for EO imagery. By employing meta-prompts for instruction generation and leveraging a fine-tuned diffusion model, we successfully generated semantically rich and diverse synthetic EO images. This approach addresses the limitations of traditional data augmentation techniques, which often fail to capture the necessary diversity across key semantic axes [18].
Compared to previous studies that have utilized traditional augmentation methods [16,17], our method introduces a novel approach to enriching EO datasets. The inclusion of domain-specific captions generated by vision–language models contributes to the fine-tuning of the diffusion model, enhancing its capability to produce images aligned with specific EO contexts.
While prior work has begun to explore the use of diffusion models in EO [23,24], our approach uniquely combines diffusion models with semantic prompts tailored to EO classes, focusing on enriching data along key semantic axes. This distinguishes our work from previous methods that may not explicitly target semantic diversity or incorporate human activities and natural events in the generated images.
Our findings indicate that integrating general-purpose vision–language models such as InstructBLIP [26] into the data augmentation pipeline can positively impact the quality of synthetic data, even though these models are not yet mature in capturing complex EO semantics. Despite some limitations, such as occasional inaccuracies in the generated captions, the overall effect on the diffusion model’s performance is beneficial. This suggests that the combination of vision–language models and diffusion models can overcome some of the challenges identified in prior research regarding data diversity and semantic richness in EO [23,24].
These results can be interpreted in the context of previous studies that have highlighted the importance of semantic richness in data augmentation for EO applications. While traditional methods enhance datasets through geometric transformations or simple alterations [9], they often lack the ability to introduce new semantic content. Our approach bridges this gap by generating synthetic images that not only increase data volume but also introduce new semantic variations, thereby enriching the dataset in a way that traditional methods cannot achieve.
To illustrate how our method differs from existing approaches, Table 3 compares the main features and capabilities of traditional augmentation techniques, generic diffusion models, and our EO-specific data augmentation approach.
These advancements have significant implications for the EO community and AI applications in this field. By producing diverse synthetic images that capture complex semantic variations, our method helps address data scarcity and diversity challenges. This progress can enhance a range of applications such as land cover classification, environmental monitoring, and disaster management, where accurate and varied data are essential for robust model performance.
The ability to generate semantically rich synthetic EO images also opens new opportunities for studying ecosystem evolution and resilience. By incorporating key semantic axes such as land cover changes, natural phenomena, and anthropogenic impacts, our approach enables the creation of datasets that better capture the temporal and spatial dynamics of ecosystems. Additionally, the prompt-based generation capabilities of our method allow for the creation of entirely new and novel datasets, tailored to specific research questions or application needs. This capability not only enhances traditional data augmentation but also facilitates the exploration of unique scenarios and conditions that are under-represented or missing in existing datasets. These synthetic datasets can support the development of AI models that are more adept at analyzing gradual changes, such as vegetation shifts due to climate change, or abrupt disturbances, such as wildfire impacts on biodiversity.
Moreover, the increased diversity provided by our method helps represent different stages of ecosystem recovery and resilience after natural disasters or human impacts. For example, synthetic images showing forest regrowth, wetland recovery, or the expansion of urban green spaces can offer valuable insights into how ecosystems respond to and recover from challenges. This approach supports the growing need for effective tools to study ecosystem recovery under various conditions, which is essential for making informed decisions in environmental management and sustainable development.
By improving AI models and addressing ecological challenges, our approach enables more nuanced modeling of ecosystem dynamics. This contributes to understanding ecosystem responses and supports efforts to build resilient and sustainable environments.
Finally, our approach aligns with the working hypotheses that leveraging advanced generative models can address the limitations of existing data augmentation techniques in EO. It demonstrates that diffusion models, when fine-tuned with domain-specific captions, can produce high-quality synthetic data that improve AI model performance.

6. Conclusions

This paper introduced a novel four-stage data augmentation approach utilizing generative diffusion models, specifically designed to enrich EO datasets along key semantic axes. By integrating meta-prompt instruction generation and vision–language models, we generated semantically rich synthetic images that capture the complex variations inherent in EO data, including human activities and natural events. Our method demonstrated substantial improvements in classification tasks, evidenced by increased top-1 and top-3 accuracy, enhanced F1-scores, and superior zero-shot classification results. These findings underscore the efficacy of our approach in addressing challenges related to data scarcity and diversity in EO.
To validate our methodology, we used the EuroSAT dataset, which proved to be well aligned with our objectives. EuroSAT’s diverse collection of multispectral Sentinel-2 images, representing various land-cover types across Europe, provided an ideal benchmark for evaluating the potential of our data augmentation strategy. Future research could extend this work by incorporating additional EO datasets such as BigEarthNet [39], SEN12MS [40], and UCMerced [41]. These datasets, with their diverse spectral and spatial characteristics, offer a promising opportunity to assess the adaptability and robustness of our method across varying EO scenarios.
By enhancing dataset richness along semantic axes, our method contributes to more accurate land-cover classification, environmental monitoring, and disaster management applications. It enables AI models to better generalize to real-world scenarios, improving their reliability and applicability in EO contexts.
For future work, we propose exploring hybrid strategies that combine our generative approach with traditional augmentation techniques, such as geometric transformations, to further increase data diversity. Investigating the use of advanced vision–language models, potentially including proprietary models capable of generating more precise captions, could improve the quality of the synthetic data. Additionally, developing EO-specific vision–language models and incorporating multispectral or hyperspectral data may expand the applicability and scope of our method. Evaluating our approach on other EO tasks, such as object detection or semantic segmentation, will further demonstrate its utility and generalizability.
In summary, our approach demonstrates the potential of combining diffusion models with advanced prompting and captioning techniques to enhance EO data augmentation along key semantic axes. This work lays the foundation for future advancements in AI applications within the EO domain, promoting the development of more robust and generalizable models capable of addressing complex environmental challenges.

Author Contributions

T.S. conducted the material preparation, collected and processed the dataset, prepared and trained the AI models, ran the experiments, performed the analysis, and wrote the first draft of the manuscript. B.R. revised the first draft of the manuscript. N.G. proposed the experiment idea. All authors contributed to the study's conception and design and commented on previous versions of the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The code and implementation details used in this work are publicly available at https://doi.org/10.5281/zenodo.14510711 (accessed on 22 January 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI        Artificial Intelligence
BLIP      Bootstrapping Language–Image Pre-training
CLIP      Contrastive Language–Image Pre-training
DM        Diffusion Model
EO        Earth Observation
ESA       European Space Agency
FID       Fréchet Inception Distance
LoRA      Low-Rank Adaptation
LULC      Land Use and Land Cover
MP        Meta-Prompt
PS        Prompt Specification
RN50      ResNet-50
SDG       Sustainable Development Goal
SGD       Stochastic Gradient Descent
U-Net     U-shaped Network
ViT-B/32  Vision Transformer Base Model with 32 × 32 Patches
VLM       Vision–Language Model

Appendix A

Appendix A.1. Synthetic Image Augmentations for Earth Observation

This appendix presents supplementary figures comparing real EuroSAT images with synthetic augmentations generated by our proposed approach. These figures illustrate how our data augmentation method enhances semantic diversity along the key semantic axes in EO imagery.
Each figure consists of four rows: the odd rows (first and third) show real EuroSAT images, while the even rows (second and fourth) display synthetic images generated by our proposed approach. These visual comparisons demonstrate the structural and feature-based fidelity of the synthetic data, while also highlighting the increased variability introduced by the generative model.
The original EuroSAT dataset contains images with a resolution of 64 × 64 pixels, whereas our proposed diffusion-based approach generates images at a resolution of 512 × 512 pixels. Direct comparisons between these resolutions are challenging due to the significant differences in detail and scale. To ensure meaningful and consistent comparisons, all images were resized to a resolution of 128 × 128 pixels. This intermediate resolution was selected to preserve sufficient detail from the synthetic images while retaining the key semantic axes, thereby reflecting the diversity of characteristics in the images. This approach ensures a fair and interpretable visual analysis of the generated augmentations.
The figures demonstrate the semantic fidelity and diversity of the synthetic data, highlighting their ability to replicate spatial and textural characteristics while introducing controlled variability. By enriching datasets with realistic patterns and features, the synthetic images facilitate tasks such as flood detection, urbanization analysis, biodiversity monitoring, and other related applications.
Figure A1. Comparison for the “Industrial” and “Pasture” classes. The synthetic images reflect the structural regularity of industrial sites and the organic spatial patterns characteristic of pastoral landscapes. These augmentations enhance datasets used for land-use classification, enabling improved monitoring of human-induced transformations, such as industrial expansion and its impact on surrounding ecosystems.
Figure A1. Comparison for the “Industrial” and “Pasture” classes. The synthetic images reflect the structural regularity of industrial sites and the organic spatial patterns characteristic of pastoral landscapes. These augmentations enhance datasets used for land-use classification, enabling improved monitoring of human-induced transformations, such as industrial expansion and its impact on surrounding ecosystems.
Information 16 00081 g0a1
Figure A2. Comparison for the “River” and “SeaLake” classes. The synthetic images accurately replicate the spatial and textural characteristics of aquatic features, including river contours, lake boundaries, and surrounding landscapes. These augmentations improve datasets for flood detection, water resource monitoring, and the assessment of environmental changes in aquatic systems.
Figure A3. Comparison for the “PermanentCrop” and “Residential” classes. The generated augmentations successfully capture the grid-like arrangement of agricultural fields and the structured patterns of urban areas. These augmentations support applications in urbanization studies, city management, and habitat fragmentation analysis.
Figure A4. Comparison for the “HerbaceousVegetation” and “Highway” classes. The synthetic images introduce variability in vegetation textures and highway structures, reflecting both natural and man-made landscapes. These augmentations improve datasets for tasks such as biodiversity monitoring, vegetation health assessment, and transportation infrastructure analysis.
Figure A5. Comparison for the “AnnualCrop” and “Forest” classes. The synthetic augmentations effectively capture the spatial and textural variability of croplands and forests, reflecting gradual changes due to seasonal variations or abrupt transformations caused by deforestation or extreme weather events. These enriched datasets enable more accurate AI models for crop monitoring, deforestation tracking, and climate impact studies.

Appendix A.2. Applications of Synthetic Image Augmentations in Earth Observation

This section highlights the applicability of our proposed approach across four key EO domains: agriculture and crop monitoring, climate change and glacier monitoring, disaster management, and environmental monitoring. The generated image samples, presented in Figure A6, illustrate the realism and diversity of our approach, tailored to the specific needs of each domain.
Figure A6 is organized into four rows, each corresponding to a distinct EO application. The first row demonstrates synthetic images designed for agriculture and crop monitoring, capturing crop patterns, irrigation systems, and field boundaries under typical growing season conditions. These images provide valuable data for monitoring crop health and yield prediction, enhancing AI models’ ability to identify variations in vegetation and agricultural practices.
The second row features images for climate change and glacier monitoring, depicting features such as glacial cracks, retreating ice lines, and meltwater lakes under daylight conditions. These images are essential for training models to track glacial retreat and assess the broader impacts of climate change on polar and glacial regions.
The third row highlights images tailored for disaster management, showing forested areas engulfed in fire with visible smoke. These images provide realistic scenarios for training AI systems in wildfire monitoring and disaster response. The inclusion of features like dense smoke and burning vegetation aids in rapid damage assessment and emergency planning.
The fourth row showcases synthetic images for environmental monitoring, focusing on rainforests with clear patches of deforestation and roads cutting through the landscape. These images replicate human impacts on forested ecosystems, supporting the development of models for detecting deforestation and land-use changes.
The generated images are versatile and realistic, effectively addressing the requirements of diverse EO applications. By mimicking the semantic and visual characteristics of real-world satellite imagery, these synthetic samples help bridge the domain gap and enable more robust and accurate AI models. Customizing the generation for each application ensures that the data align with specific use cases and support downstream tasks in the EO domain.
Figure A6. Synthetic images generated using our proposed approach for four distinct EO applications: agriculture and crop monitoring, climate change and glacier monitoring, disaster management, and environmental monitoring. Each row corresponds to a specific application, showcasing the diversity and relevance of the generated data.
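As a rough illustration of how such domain-tailored samples could be produced, the sketch below prompts a text-to-image diffusion pipeline with one description per application row of Figure A6; the checkpoint name "eo-finetuned-sd" is a hypothetical placeholder for an EO fine-tuned model, and the prompt wording is ours, not the exact instructions used in this work.

```python
import torch
from diffusers import StableDiffusionPipeline

# "eo-finetuned-sd" is a hypothetical, EO fine-tuned checkpoint, not a published model.
pipe = StableDiffusionPipeline.from_pretrained(
    "eo-finetuned-sd", torch_dtype=torch.float16
).to("cuda")

# One illustrative prompt per application row of Figure A6.
prompts = {
    "agriculture": "satellite image of irrigated crop fields with clear field boundaries",
    "glacier": "satellite image of a retreating glacier with cracks and meltwater lakes",
    "disaster": "satellite image of a forest wildfire with dense smoke plumes",
    "environment": "satellite image of rainforest with deforestation patches and access roads",
}

for name, prompt in prompts.items():
    image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
    image.save(f"synthetic_{name}.png")
```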

References

  1. Campbell, J.B.; Wynne, R.H.; Thomas, V.A. Introduction to Remote Sensing, 6th ed.; Guilford Press: New York, NY, USA, 2022. [Google Scholar]
  2. Yang, J.; Gong, P.; Fu, R.; Zhang, M.; Chen, J.; Liang, S.; Xu, B.; Shi, J.; Dickinson, R. The Role of Satellite Remote Sensing in Climate Change Studies. Nat. Clim. Chang. 2013, 3, 875–883. [Google Scholar] [CrossRef]
  3. Purkis, S.J.; Klemas, V.V. Remote Sensing and Global Environmental Change; John Wiley & Sons: Hoboken, NJ, USA, 2011. [Google Scholar]
  4. Andries, A.; Morse, S.; Murphy, R.J.; Lynch, J.; Woolliams, E.R. Using Data from Earth Observation to Support Sustainable Development Indicators: An Analysis of the Literature and Challenges for the Future. Sustainability 2022, 14, 1191. [Google Scholar] [CrossRef]
  5. Sousa, T. Towards Modeling and Predicting the Resilience of Ecosystems. In Proceedings of the 2023 ACM/IEEE International Conference on Model Driven Engineering Languages and Systems Companion (MODELS-C), Västerås, Sweden, 1–6 October 2023; pp. 159–165. [Google Scholar] [CrossRef]
  6. Sathyaraj, P.; Nirmala, G.; Vijayalakshmi, S.; Rajakumar, S. Artificial Intelligence: Applications, Benefits, and Future Challenges in the Monitoring and Prediction of Earth Observations. In Novel AI Applications for Advancing Earth Sciences; IGI Global: Hershey, PA, USA, 2024; pp. 1–18. [Google Scholar] [CrossRef]
  7. Schmitt, M.; Ahmadi, S.A.; Xu, Y.; Taskin, G.; Verma, U.; Sica, F.; Hansch, R. There Are No Data Like More Data- Datasets for Deep Learning in Earth Observation. IEEE Geosci. Remote Sens. Mag. 2023, 11, 63–97. [Google Scholar] [CrossRef]
  8. Perez, L.; Wang, J. The Effectiveness of Data Augmentation in Image Classification Using Deep Learning. arXiv 2017, arXiv:1712.04621. [Google Scholar]
  9. Shorten, C.; Khoshgoftaar, T.M. A Survey on Image Data Augmentation for Deep Learning. J. Big Data 2019, 6, 60. [Google Scholar] [CrossRef]
  10. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 10674–10685. [Google Scholar] [CrossRef]
  11. Bansal, M.A.; Sharma, D.R.; Kathuria, D.M. A Systematic Review on Data Scarcity Problem in Deep Learning: Solution and Applications. ACM Comput. Surv. CSUR 2022, 54, 208:1–208:29. [Google Scholar] [CrossRef]
  12. Elmes, A.; Alemohammad, H.; Avery, R.; Caylor, K.; Eastman, J.R.; Fishgold, L.; Friedl, M.A.; Jain, M.; Kohli, D.; Laso Bayas, J.C.; et al. Accounting for Training Data Error in Machine Learning Applied to Earth Observations. Remote Sens. 2020, 12, 1034. [Google Scholar] [CrossRef]
  13. Kansakar, P.; Hossain, F. A review of applications of satellite earth observation data for global societal benefit and stewardship of planet earth. Space Policy 2016, 36, 46–54. [Google Scholar] [CrossRef]
  14. Hendrycks, D.; Mu, N.; Cubuk, E.D.; Zoph, B.; Gilmer, J.; Lakshminarayanan, B. AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty. arXiv 2019, arXiv:1912.02781. [Google Scholar]
  15. Cubuk, E.D.; Zoph, B.; Mane, D.; Vasudevan, V.; Le, Q.V. AutoAugment: Learning Augmentation Policies from Data. arXiv 2018, arXiv:1805.09501. [Google Scholar]
  16. Abdelhack, M. A Comparison of Data Augmentation Techniques in Training Deep Neural Networks for Satellite Image Classification. arXiv 2020, arXiv:2003.13502v1. [Google Scholar]
  17. Illarionova, S.; Nesteruk, S.; Shadrin, D.; Ignatiev, V.; Pukalchik, M.; Oseledets, I. MixChannel: Advanced Augmentation for Multispectral Satellite Images. Remote Sens. 2021, 13, 2181. [Google Scholar] [CrossRef]
  18. Lalitha, V.; Latha, B. A Review on Remote Sensing Imagery Augmentation Using Deep Learning. Mater. Today Proc. 2022, 62, 4772–4778. [Google Scholar] [CrossRef]
  19. Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. In Proceedings of the Advances in Neural Information Processing Systems; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 6840–6851. [Google Scholar] [CrossRef]
  20. Ho, J.; Salimans, T. Classifier-Free Diffusion Guidance. In Proceedings of the NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, Online, 13 December 2021. [Google Scholar] [CrossRef]
  21. Trabucco, B.; Doherty, K.; Gurinas, M.; Salakhutdinov, R. Effective Data Augmentation with Diffusion Models. arXiv 2023, arXiv:2302.07944. [Google Scholar]
  22. Dhariwal, P.; Nichol, A. Diffusion Models Beat GANs on Image Synthesis. In Proceedings of the Neural Information Processing Systems; Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W., Eds.; Curran Associates Inc.: Red Hook, NY, USA, 2021; Volume 34, pp. 8780–8794. [Google Scholar] [CrossRef]
  23. Zhao, C.; Ogawa, Y.; Chen, S.; Yang, Z.; Sekimoto, Y. Label Freedom: Stable Diffusion for Remote Sensing Image Semantic Segmentation Data Generation. In Proceedings of the 2023 IEEE International Conference on Big Data, Sorrento, Italy, 15–18 December 2023. [Google Scholar] [CrossRef]
  24. Sebaq, A.; ElHelw, M. RSDiff: Remote Sensing Image Generation from Text Using Diffusion Model. arXiv 2023, arXiv:2309.02455. [Google Scholar] [CrossRef]
  25. Lu, X.; Wang, B.; Zheng, X.; Li, X. Exploring Models and Data for Remote Sensing Image Caption Generation. IEEE Trans. Geosci. Remote Sens. 2018, 56, 2183–2195. [Google Scholar] [CrossRef]
  26. Dai, W.; Li, J.; Li, D.; Tiong, A.M.H.; Zhao, J.; Wang, W.; Li, B.; Fung, P.; Hoi, S. InstructBLIP: Towards General-Purpose Vision-Language Models with Instruction Tuning. arXiv 2023, arXiv:2305.06500. [Google Scholar] [CrossRef]
  27. Li, J.; Li, D.; Xiong, C.; Hoi, S. BLIP: Bootstrapping Language-Image Pre-Training for Unified Vision-Language Understanding and Generation. arXiv 2022, arXiv:2201.12086. [Google Scholar] [CrossRef]
  28. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv 2015, arXiv:1505.04597. [Google Scholar]
  29. Hu, J.E.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Chen, W. LoRA: Low-rank Adaptation of Large Language Models. arXiv 2021, arXiv:2106.09685. [Google Scholar] [CrossRef]
  30. Ruiz, N.; Li, Y.; Jampani, V.; Pritch, Y.; Rubinstein, M.; Aberman, K. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 22500–22510. [Google Scholar]
  31. Ha, D.; Dai, A.; Le, Q.V. HyperNetworks. arXiv 2016, arXiv:1609.09106. [Google Scholar]
  32. Helber, P.; Bischke, B.; Dengel, A.; Borth, D. EuroSAT: A Novel Dataset and Deep Learning Benchmark for Land Use and Land Cover Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2019, 12, 2217–2226. [Google Scholar] [CrossRef]
  33. Li, Y.; Zhang, S.; Li, X.; Ye, F. Remote Sensing Image Classification with Few Labeled Data Using Semisupervised Learning. Wirel. Commun. Mob. Comput. 2023, 2023, e7724264. [Google Scholar] [CrossRef]
  34. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models from Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; Volume 139, pp. 8748–8763. [Google Scholar]
  35. Palatucci, M.; Pomerleau, D.; Hinton, G.E.; Mitchell, T.M. Zero-shot Learning with Semantic Output Codes. In Proceedings of the Advances in Neural Information Processing Systems; Bengio, Y., Schuurmans, D., Lafferty, J., Williams, C., Culotta, A., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2009; Volume 22. [Google Scholar]
  36. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In Proceedings of the Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
  37. Jayasumana, S.; Ramalingam, S.; Veit, A.; Glasner, D.; Chakrabarti, A.; Kumar, S. Rethinking FID: Towards a Better Evaluation Metric for Image Generation. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 9307–9315. [Google Scholar] [CrossRef]
  38. Ding, M.; Zheng, W.; Hong, W.; Tang, J. CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers. In Proceedings of the Advances in Neural Information Processing Systems; Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2022; Volume 35, pp. 16890–16902. [Google Scholar]
  39. Sumbul, G.; Charfuelan, M.; Demir, B.; Markl, V. Bigearthnet: A Large-Scale Benchmark Archive for Remote Sensing Image Understanding. In Proceedings of the IGARSS 2019–2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 5901–5904. [Google Scholar] [CrossRef]
  40. Schmitt, M.; Hughes, L.H.; Qiu, C.; Zhu, X.X. SEN12MS—A Curated Dataset of Georeferenced Multi-Spectral Sentinel-1/2 Imagery for Deep Learning and Data Fusion. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2019, IV-2/W7, 153–160. [Google Scholar] [CrossRef]
  41. Yang, Y.; Newsam, S. Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, New York, NY, USA, 2–5 November 2010; GIS ’10. pp. 270–279. [Google Scholar] [CrossRef]
Figure 1. Diagram illustrating transformations for data augmentation along key semantic axes in Earth Observation imagery. The central “initial image” represents a forest, with transformations categorized into Natural Changes, Human Impacts, and Disasters, demonstrating how our approach generates semantically diverse synthetic data beyond traditional augmentation methods.
Figure 2. An overview of our four-stage data augmentation process for generating semantically diverse synthetic images in the EO domain. Blue boxes denote the stages, blue arrows indicate the sequential flow between stages, black solid arrows represent input artifacts to each stage, and dashed arrows illustrate the output artifacts generated by the respective stages.
Figure 3. An example caption generated during the captioning stage of the data augmentation process, highlighting the characteristics of an EO image classified as a river.
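For context, captions of this kind can be produced with an off-the-shelf vision–language model. The sketch below uses the BLIP captioning checkpoint from Hugging Face as a stand-in, which may differ from the exact VLM configuration used in this work; the input path and text prefix are assumptions.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("eurosat_river_tile.png").convert("RGB")  # hypothetical input tile
# A short text prefix nudges the model toward EO-style descriptions.
inputs = processor(images=image, text="a satellite image of", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(out[0], skip_special_tokens=True))
```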
Figure 4. Comparison of per-class F1-scores for ResNet-50 (left) and ViT-B/32 (right) with various augmentation strategies, illustrating the effectiveness of our augmentation method across diverse classes.
Figure 5. Validation accuracy over epochs for the ResNet-50 (left) and ViT-B/32 (right) models, illustrating the convergence behavior under different augmentation strategies.
Figure 6. Examples of images generated using our method compared with images from the corresponding categories in the EuroSAT dataset.
Table 1. Metrics for CLIP models with different augmentation techniques on the EuroSAT dataset: top-1 and top-3 accuracy (%), F1-score, and average training time per epoch (ETT) in seconds.
Augmentation Technique | ResNet-50 (Top-1 / Top-3 / F1 / ETT) | ViT-B/32 (Top-1 / Top-3 / F1 / ETT)
Baseline               | 33% / 62% / 0.28 / 261               | 83% / 95% / 0.75 / 333
Basic Augmentation     | 34% / 60% / 0.30 / 264               | 81% / 96% / 0.72 / 337
Advanced Augmentation  | 36% / 62% / 0.33 / 293               | 81% / 96% / 0.70 / 375
AutoAugment            | 31% / 55% / 0.27 / 325               | 85% / 94% / 0.80 / 423
Our Augmentation       | 39% / 66% / 0.36 / 301               | 90% / 99% / 0.85 / 391
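The metrics in Table 1 can be computed with a few lines of standard code. The helper below is a generic sketch, not the evaluation script used in this work; it assumes `logits` and `labels` are CPU PyTorch tensors over the ten EuroSAT classes, and uses macro-averaged F1, which may differ from the paper's averaging scheme.

```python
import torch
from sklearn.metrics import f1_score

def top_k_accuracy(logits: torch.Tensor, labels: torch.Tensor, k: int) -> float:
    topk = logits.topk(k, dim=1).indices             # (N, k) predicted class indices
    hits = (topk == labels.unsqueeze(1)).any(dim=1)  # True where the label is in the top-k
    return hits.float().mean().item()

def evaluate(logits: torch.Tensor, labels: torch.Tensor) -> dict:
    preds = logits.argmax(dim=1)
    return {
        "top1": top_k_accuracy(logits, labels, 1),
        "top3": top_k_accuracy(logits, labels, 3),
        "f1": f1_score(labels.numpy(), preds.numpy(), average="macro"),
    }
```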
Table 2. Zero-shot top-1 accuracy (%) of CLIP models fine-tuned on synthetic images.
Model          | Original Top-1 | Our Top-1 | Improvement (%)
CLIP RN50      | 41.1%          | 58.07%    | +16.97%
CLIP ViT-B/32  | 49.4%          | 69.23%    | +19.83%
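For reference, the zero-shot protocol behind Table 2 can be approximated with the openly available CLIP package; the prompt template, preprocessing, and input path in this sketch are assumptions rather than the exact setup used in this work.

```python
import torch
import clip  # OpenAI CLIP package (https://github.com/openai/CLIP)
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

classes = ["AnnualCrop", "Forest", "HerbaceousVegetation", "Highway", "Industrial",
           "Pasture", "PermanentCrop", "Residential", "River", "SeaLake"]
# A simple prompt template; the exact wording used in the paper may differ.
text = clip.tokenize([f"a satellite photo of {c}" for c in classes]).to(device)

image = preprocess(Image.open("eurosat_tile.png")).unsqueeze(0).to(device)  # hypothetical tile
with torch.no_grad():
    logits_per_image, _ = model(image, text)
    pred = logits_per_image.softmax(dim=-1).argmax(dim=-1).item()
print(classes[pred])
```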
Table 3. Comparison of data augmentation approaches for EO imagery.
Feature                  | Traditional Methods                | Generic Diffusion Models            | Our Approach
Diversity                | Low (e.g., flips, rotations)       | Moderate (trained on broad imagery) | High (fine-tuned on EO)
Domain Adaptation        | Generic                            | Limited domain customization        | Fine-tuned for EO tasks
Semantic Diversity       | Not addressed                      | Implicitly addressed                | Explicitly targeted via prompts
EO Suitability           | Low (lacks domain realism)         | Moderate (not tailored to EO)       | High (EO fine-tuned)
Computational Efficiency | High                               | Low                                 | Balanced via LoRA
Scalability              | Scalable, but limited in diversity | Scalable, but high compute cost     | Domain-specific scalability
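To make the "Balanced via LoRA" entry concrete, the snippet below is a from-scratch sketch of a low-rank adapter wrapped around a frozen linear layer, illustrating why LoRA [29] keeps the number of trainable parameters small; it is an illustration of the idea, not the fine-tuning code used in this work, and the layer dimensions are arbitrary.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (the core LoRA idea)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # pretrained weights stay frozen
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)   # down-projection
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)  # up-projection
        nn.init.zeros_(self.lora_b.weight)  # the adapter starts as a zero update
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Only the two low-rank matrices are trainable, which keeps fine-tuning cheap.
layer = LoRALinear(nn.Linear(768, 768), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 12,288 trainable parameters vs. ~590k in the frozen base layer
```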
