Article

RSVQ-Diffusion Model for Text-to-Remote-Sensing Image Generation

1 Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, Changchun 130033, China
2 School of Optoelectronics, University of Chinese Academy of Sciences, Beijing 100049, China
* Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(3), 1121; https://doi.org/10.3390/app15031121
Submission received: 26 December 2024 / Revised: 18 January 2025 / Accepted: 21 January 2025 / Published: 23 January 2025

Abstract

Text-guided remote sensing image generation shows great potential in many practical applications; however, images generated by existing methods, such as generative adversarial networks, still suffer from low realism and unclear details. Moreover, the inherent spatial complexity of remote sensing images and the limited scale of publicly available datasets make it particularly challenging to generate high-quality remote sensing images from text descriptions. To address these challenges, this paper proposes the RSVQ-Diffusion model for remote sensing image generation, achieving high-quality text-to-remote-sensing image generation applicable to target detection, simulation, and other fields. Specifically, this paper designs a spatial position encoding mechanism to integrate the spatial information of remote sensing images during model training. Additionally, the Transformer module is improved by incorporating a short-sequence local perception mechanism into the diffusion image decoder, addressing issues of unclear details and regional distortions in generated remote sensing images. Compared with the VQ-Diffusion model, our proposed model achieves significant improvements in the Fréchet Inception Distance (FID), the Inception Score (IS), and the text–image alignment (Contrastive Language-Image Pre-training, CLIP) score: the FID decreased from 96.68 to 90.36, the CLIP score increased from 26.92 to 27.22, and the IS increased from 7.11 to 7.24.

1. Introduction

Remote sensing image generation holds significant importance in fields such as Earth observation, resource management, and environmental protection. When the cost of obtaining actual data is high or there is a lack of data, generating high-quality remote sensing images can provide essential support for modeling and analysis. Such image generation can not only enhance the application of few-shot learning in remote sensing but also expand the coverage of remote sensing tasks, including disaster monitoring, land use analysis, and ecosystem change tracking. With the rapid development of generative models such as generative adversarial networks (GANs) and diffusion models, remote sensing image generation technology has made significant progress in improving generation accuracy and efficiency. However, current technologies still face challenges such as the realism of generated images, resolution enhancement, and semantic accuracy. Therefore, there is a need for improved remote sensing image generation techniques to provide high-quality image data for remote sensing image processing tasks. Currently, there are three main architectural approaches to image generation methods: autoencoders, generative adversarial networks (GANs), and diffusion models.
Autoencoder neural networks, introduced by David E. Rumelhart et al. [1], are unsupervised learning algorithms that can be used for high-dimensional complex data processing. In 2010, researchers [2] applied autoencoders to image reconstruction and feature learning, marking the first use of autoencoder networks for image generation. Diederik P. Kingma et al. [3] proposed the variational autoencoder (VAE), a probabilistic generative model, which achieved significant progress in image generation and latent space interpolation. VAE has since become a popular method in image generation models. However, due to the limitations of the latent space and the insufficient handling of image space consistency, the realism and detail of images generated using autoencoder-based models often fall short compared to other generative models. Therefore, traditional autoencoders usually need to be combined with other generative models to better meet the needs of remote sensing image generation. For instance, David Berthelot et al. [4] introduced boundary equilibrium GANs (BEGANs), which balance image quality and diversity using both autoencoder loss and generative adversarial network loss.
Since their introduction by Ian J. Goodfellow et al. [5] in 2014, generative adversarial networks (GANs) have been widely applied in image generation. The principle behind GANs involves two adversarial neural networks (a generator and a discriminator) that compete with each other. This competition drives the generator to continuously improve its generated images to the point where the discriminator has difficulty distinguishing between the real and generated images, thereby enhancing the quality of the generated images. However, early GANs lacked controllability in their generated images, and the training process was unstable. Although conditional generative adversarial networks (cGANs) [6] introduced conditional generation for specific types of images, thereby improving controllability, they still struggled to capture image details when generating complex types of objects, often resulting in blurriness and distortion.
The introduction of diffusion models has provided new solutions in the field of image generation. The concept of diffusion models, proposed by Jascha Sohl-Dickstein et al. [7], is inspired by non-equilibrium statistical physics. The core idea involves systematically disrupting the data distribution structure through an iterative forward diffusion process and then learning the reverse diffusion process to restore the data structure, resulting in a highly flexible and manageable generative model. Jonathan Ho et al. [8] advanced this concept with the denoising diffusion probabilistic model (DDPM), which learns the data distribution through a progressive noise injection process and generates new samples through a reverse denoising process. Subsequent research by Prafulla Dhariwal et al. [9] demonstrated that diffusion models outperform generative adversarial networks in the field of image generation. This approach excels in generating image details, diversity, and realism, and it has gradually become a focus of research in remote sensing image generation.
Although models like OpenAI’s DALLE2 [10] and Robin Rombach et al.’s stable diffusion model [11] have made significant advancements in image generation, enabling the creation of images in various styles (e.g., anime, oil painting) based on text prompts and keywords, research focused on remote sensing image generation remains relatively scarce. Existing methods for remote sensing image generation, such as the multi-stage structured generative adversarial network (StrucGAN) proposed by Rui Zhao et al. [12], still face challenges like low semantic matching between generated images and textual descriptions, spatial structure errors, and poor fidelity in generated images. Consequently, remote sensing image generation encounters unique challenges, such as high resolution, multi-scale features [13], complex backgrounds, and a lack of annotated data. Specifically, tasks involving remote sensing images demand high resolution and richness in detail. Additionally, the complex and diverse backgrounds found in remote sensing images (e.g., urban areas, forests, and bodies of water) require models to accurately understand and describe these semantic details [14].
To address the aforementioned challenges, this paper employs a method combining diffusion models with the Transformer architecture [15] based on the VQ-Diffusion model [16]. The model is modified and trained to obtain the text-to-remote-sensing image generation model, RSVQ-Diffusion. Experimental results demonstrate that the modified model ensures the quality of the generated remote sensing images, producing spatially coherent images with significant improvements in diversity and realism. Moreover, it alleviates image distortion and improves semantic accuracy. The specific contributions of this work are as follows:
  • The architecture combining diffusion models with the Transformer is applied to remote sensing image generation. To address the spatial characteristics of remote sensing images, spatial position encoding is incorporated into the model’s image decoder, enhancing the Transformer model’s ability to capture the overall spatial positional information of remote sensing images during the sequence processing;
  • The concept of local feature extraction from sequences [17] is integrated with the self-attention mechanism of the Transformer, leading to the proposed TransLocalBlock module. By controlling the sequence length and combining it with the multi-head attention mechanism, this module enables the Transformer to focus on local information during the processing of long sequences;
  • We designed ablation experiments and compared them with existing text-to-remote-sensing image generation models. Through evaluation metrics and improvements in aspects such as the spatial structure, realism, and details of the generated images, the effectiveness of the proposed method is effectively demonstrated.
The organization of the paper is as follows: Section 2 describes various studies related to the generation of remote sensing images from text. Section 3 provides a detailed description of the proposed model’s improvement methods. Section 4 outlines the datasets and evaluation metrics used in the experiments, followed by a comprehensive presentation and analysis of the experimental results. Finally, Section 5 summarizes the proposed method and offers an outlook for future work.

2. Related Works

In recent years, generative adversarial networks (GANs) and diffusion models have been continuously applied in the field of image generation. Text-to-remote-sensing image generation methods based on diffusion models have also played a significant role in remote sensing image processing. The earliest application of GANs in remote sensing image processing was proposed by Zhu Lin et al. [18], who introduced two GAN frameworks: spectral classifier and spectral-spatial classifier for hyperspectral remote sensing image classification tasks. To address the issue of low resolution in generated remote sensing images, Boyu Pang et al. [19] proposed a conditional GAN for remote sensing image super-resolution, which guides the generator to produce corresponding high-resolution images based on input low-resolution images. To generate images with details that better match textual descriptions, attnGAN [20] introduces an attention mechanism that aligns the focus points of text descriptions with the corresponding regions of generated images and uses local attention at different generation stages. Later, attnGAN was improved by incorporating a multi-stage generation strategy to incrementally introduce image detail features and structurally process text descriptions to capture fine-grained textual information. Despite improvements in the quality of generated remote sensing images, issues such as blurriness, distortion, or deformation in certain areas still persist.
Subsequently, researchers have progressively applied diffusion models to the field of remote sensing image generation, often combining semantic segmentation networks like U-Net with attention mechanisms to achieve the diffusion process. The U-Net network extracts image features, while the attention mechanism computes the weights of the input conditional information to establish the matching relationship between images and their text. Among these, Samar Khanna et al. [21] proposed the Diffusionsat model, which is an improved version of the stable diffusion model that enables remote sensing image generation guided by metadata and text. Datao Tang et al. [22] introduced CRS-Diff, based on an improved diffusion model, which allows multi-condition inputs to control remote sensing image generation. Its adaptive feature extraction module dynamically adjusts the generation strategy according to different input conditions, further enhancing image quality and diversity.
Existing image generation models that commonly use convolutional neural networks (CNNs) are limited by the size of the convolutional kernels, hindering the extraction and learning of detailed image features. In contrast, the Transformer architecture, initially designed for natural language processing, can learn image features on a pixel-by-pixel basis. It also employs positional encoding to mark positions in the unfolded one-dimensional sequence of images, facilitating the generation of images rich in details. Yonghao Xu et al. [23] combined the Transformer architecture with energy models by leveraging two pre-trained models, VQVAE [24] and VQGAN [25]. They proposed a Hopfield network to maintain and update states during image generation, enhancing consistency between the generated images and textual descriptions, thereby resulting in remote sensing images with abundant details. The VQ-Diffusion model integrates the principles of diffusion and the Transformer architecture, demonstrating outstanding performance in general image generation. Therefore, this study uses VQ-Diffusion as the benchmark model for remote sensing image generation. To address issues such as unclear details, unreasonable spatial structures, and regional distortions in generated images, the model’s structure has been improved, significantly enhancing the quality of the generated remote sensing images.

3. Methods

The text-to-remote-sensing image generation model, RSVQ-Diffusion, proposed in this paper is an improvement of the VQ-Diffusion model. By incorporating the spatial characteristics of remote sensing images, we introduce a spatial position encoding mechanism based on pixel space coordinates. To address the issue of neglecting local information in the Transformer architecture when processing short sequences, we design a local perception mechanism for short-sequence feature extraction. The RSVQ-Diffusion model optimizes the diffusion image decoder, leading to improved performance in remote sensing image generation tasks. The RSVQ-Diffusion model significantly outperforms the original model in terms of the spatial structure, realism, diversity, and detail presentation of the generated remote sensing images.

3.1. RSVQ-Diffusion Model

3.1.1. VQ-Diffusion Model Network Architecture

The VQ-Diffusion model primarily comprises three components: the text encoder CLIP [26], the image encoder VQVAE, and the diffusion image decoder.
During the data preprocessing stage, the input text is first processed using byte pair encoding (BPE), which splits the text into smaller units (such as subwords or character-level representations), to enhance the model’s processing capabilities. Next, the text is encoded using the Contrastive Language-Image Pre-training (CLIP) model. The text is tokenized and converted into word embeddings and then processed through multiple Transformer layers to generate fixed-dimensional semantic vectors.
The VQVAE image encoder is used to encode image data into a discrete latent space for data compression. Convolutional neural networks extract image features and map them to a predefined set of discrete vectors, completing the quantization step. This method effectively reduces the dimensionality of the information and, during the decoding phase, reconstructs the discrete representations into the original images using deconvolutional networks, ensuring image quality.
Diffusion models are trained and inferred in the latent space, with the generated images decoded and output through VQVAE. During the training phase, noise is gradually added to real images, forming progressively blurred image sequences. In the inference phase, after inputting text descriptions, the model generates images starting from random noise; it then uses text embeddings as conditional inputs to guide the denoising process, resulting in high-quality images. The self-attention mechanism of the Transformer dynamically captures relevant information throughout this process, enhancing the quality and consistency of the generated images.
As illustrated in Figure 1, the architecture of the VQ-Diffusion model comprises a reverse denoising process in which the diffusion image decoder predicts the state of the next time step at each step. The diffusion decoder is built from multiple stacked TransformerBlocks, each of which includes the following processes:
  • AdaLayerNorm: this layer adaptively normalizes each batch of input data independently to enhance the model’s adaptability;
  • FullAttention: this mechanism computes the correlation between each element in the input sequence and the other elements, emphasizing important information through attention weights;
  • CrossAttention: similar to FullAttention, this mechanism helps the model better understand and map textual descriptions to the corresponding image features;
  • Multi-layer Perceptron (MLP): this component processes and transforms feature data, gradually bringing it closer to the characteristics of the target image.
By employing these network layers, the TransformerBlock performs multi-layer processing and feature transformation, achieving the complex generative process from text to image.
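To make the block structure concrete, the following PyTorch sketch assembles the components listed above into one decoder block. It is an illustrative approximation under our own assumptions rather than the authors’ implementation: the class names and the arguments timestep_dim, t_emb, and text_ctx are placeholders we introduce here.

```python
import torch
import torch.nn as nn

class AdaLayerNorm(nn.Module):
    """LayerNorm whose scale and shift are predicted from the diffusion timestep embedding."""
    def __init__(self, dim, timestep_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.proj = nn.Linear(timestep_dim, 2 * dim)

    def forward(self, x, t_emb):                      # x: (B, N, dim), t_emb: (B, timestep_dim)
        scale, shift = self.proj(t_emb).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

class TransformerBlock(nn.Module):
    """AdaLayerNorm -> self-attention -> cross-attention over text -> MLP, with residuals."""
    def __init__(self, dim, heads, timestep_dim):
        super().__init__()
        self.norm1 = AdaLayerNorm(dim, timestep_dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)   # FullAttention
        self.norm2 = AdaLayerNorm(dim, timestep_dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # CrossAttention
        self.norm3 = AdaLayerNorm(dim, timestep_dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, text_ctx, t_emb):            # text_ctx: (B, L, dim) text features
        h = self.norm1(x, t_emb)
        x = x + self.self_attn(h, h, h)[0]                      # attend over image tokens
        h = self.norm2(x, t_emb)
        x = x + self.cross_attn(h, text_ctx, text_ctx)[0]       # condition on the text embedding
        h = self.norm3(x, t_emb)
        return x + self.mlp(h)
```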
The VQ-Diffusion model is primarily designed for general image generation. However, there are significant differences between general images and remote sensing images. Firstly, spatial characteristics are one of the main distinctions. General images are typically captured by digital cameras or smartphone cameras, featuring high resolution and rich image details, making them suitable for visual representation in daily life. In contrast, remote sensing images are primarily obtained from remote sensing platforms such as satellites, drones, or aircraft, usually with lower spatial resolution but capable of covering large surface areas. These images often contain multispectral information (e.g., infrared, visible light, and radar bands), providing unique reflectance characteristics of surface features under different spectral bands [27]. This results in complex structures and textures across multiple scales, spectra, and temporal states in remote sensing images. Additionally, remote sensing images typically exhibit more complex and variable backgrounds, significantly influenced by climatic, geographic, and temporal factors. Moreover, the types and sources of noise and interference (e.g., atmospheric effects and sensor errors) in remote sensing images differ greatly from those in general images. Therefore, the generation and processing of remote sensing images demand higher technical requirements and more complex model designs. This paper primarily improves the VQ-Diffusion model by focusing on the spatial characteristics and local details of remote sensing images. The model is then trained on a remote sensing image dataset to enhance its capability in generating high-quality remote sensing images.

3.1.2. RSVQ-Diffusion Model Network Architecture

In the diffusion image decoder of the VQ-Diffusion model, the original Transformer architecture, when processing images unfolded into one-dimensional sequences, relies solely on sequential positional encoding. This leads to the loss of spatial information, resulting in image distortions and overlaps of objects. To address this issue, the RSVQ-Diffusion model incorporates spatial positional encoding of two-dimensional images into the original positional encoding. This allows the model to consider the spatial location information of remote sensing images during the generation process, thereby producing remote sensing images with reasonable spatial distribution.
Additionally, the original model’s Transformer self-attention mechanism processes the entire input sequence as a whole, overlooking relationships within local sequences. To address this issue, the RSVQ-Diffusion model introduces a local awareness mechanism. This mechanism divides long sequences into multiple short sequences and calculates attention weights within these short sequences, followed by averaging. This approach encourages the Transformer to focus on local detail information in the images, thereby enhancing the richness of details in the generated images. By alternately using the local awareness mechanism and the original self-attention mechanism, the RSVQ-Diffusion model achieves synchronous learning of both local and global information in the diffusion image decoder. This results in the generation of remote sensing images that are rich in detail and have a reasonable spatial structure.
This paper designs the TransLocalBlock module, which is interleaved with the TransformerBlock in the original model to introduce a local attention mechanism. This allows the RSVQ-Diffusion model to focus on information within short sequences during training, enhancing the learning of local features and improving computational efficiency and information capture capabilities when processing long sequences. In the RSVQ-Diffusion model, local attention primarily involves dividing the long sequence into multiple short sequences and employing a multi-head attention mechanism to focus on local information. The network structure of the RSVQ-Diffusion model is illustrated in Figure 2, where the TransLocalBlock module replaces the full attention mechanism in the original TransformerBlock with the proposed local perception mechanism (LMA).
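The interleaving of global and local blocks can be sketched as a simple stack. This is a hypothetical arrangement assuming both block types expose the same forward signature as the TransformerBlock sketch above; block factories are passed in rather than hard-coded, since the TransLocalBlock itself is sketched in Section 3.3.

```python
import torch.nn as nn

class InterleavedDecoder(nn.Module):
    """Alternates global TransformerBlocks with TransLocalBlocks so that the decoder
    learns global structure and local detail in turn."""
    def __init__(self, n_pairs, make_global_block, make_local_block):
        super().__init__()
        layers = []
        for _ in range(n_pairs):
            layers.append(make_global_block())   # full self-attention over the whole sequence
            layers.append(make_local_block())    # windowed attention over short sub-sequences
        self.layers = nn.ModuleList(layers)

    def forward(self, x, text_ctx, t_emb):
        for layer in self.layers:
            x = layer(x, text_ctx, t_emb)
        return x
```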

3.1.3. Loss Function

The loss function is constructed based on variational inference and reconstruction error, and it primarily comprises three components: the KL (Kullback–Leibler) divergence loss, the decoder negative log-likelihood loss, and the auxiliary loss.
Specifically, the KL divergence is an asymmetric measure used to quantify the difference between two probability distributions. It measures the “information loss” incurred when using distribution p to approximate distribution q. The calculation is detailed in Equation (1).

$$\mathrm{KL}_{\mathrm{loss}} = \sum_{t} q(x_0 \mid x_t)\,\log\frac{q(x_0 \mid x_t)}{p(x_t \mid x_0)} \tag{1}$$

where $q(x_0 \mid x_t)$ represents the posterior distribution of the real data, and $p(x_t \mid x_0)$ represents the posterior distribution predicted by the model.
The model also introduces a negative log-likelihood loss, which is used to evaluate the difference between the final generated data predicted by the model and the real data. The decoder’s negative log-likelihood loss is detailed in Equation (2).
$$\mathrm{NLL}_{\mathrm{decoder}} = -\,\mathbb{E}_{x \sim p(x)}\big[\log p(x_t \mid x_0)\big] \tag{2}$$

where $x$ is a sample drawn from the real distribution $p(x)$; $x_t$ represents the output generated by the decoder, and $x_0$ denotes the real input sample. $\log p(x_t \mid x_0)$ is the log of the conditional probability of generating the sample $x_t$ given the input $x_0$. The negative sign is introduced because log-likelihood values are typically negative; taking the negative ensures the loss is positive, and minimizing the NLL is equivalent to maximizing the likelihood.
For the auxiliary loss function in the model, refer to Equation (3).
$$\mathrm{KL}_{\mathrm{aux}} = \sum_{t} q(x_0 \mid x_{t-1})\,\log\frac{q(x_0 \mid x_{t-1})}{p(x_{t-1} \mid x_0)} \tag{3}$$

where $q(x_0 \mid x_{t-1})$ represents the posterior distribution of generating the real sample $x_0$ given the state $x_{t-1}$, capturing the feature information at time step $t-1$; and $p(x_{t-1} \mid x_0)$ represents the theoretical model distribution of generating the state $x_{t-1}$ given the real sample $x_0$, derived from real data.
These three loss components are combined through a conditional logic across time steps to form the final loss function, as detailed in Equation (4).
$$\mathrm{Loss} = \begin{cases} \dfrac{\mathrm{NLL}_{\mathrm{decoder}} + \mathrm{KL}_{\mathrm{aux}}}{p_t}, & t = 0 \\[6pt] \dfrac{\mathrm{KL}_{\mathrm{loss}} + \mathrm{KL}_{\mathrm{aux}}}{p_t}, & \text{otherwise} \end{cases} \tag{4}$$

where $p_t$ represents the weight associated with time step $t$, relevant to the model’s generative process at time $t$ and used for loss normalization to ensure consistency of the loss across different time steps and sample sizes.
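As a minimal sketch of how the three terms can be combined per Equation (4), assuming the per-sample terms have already been computed as tensors of shape (B,); the helper for the discrete KL term is likewise illustrative and not the authors’ code.

```python
import torch

def kl_term(log_q, log_p):
    """KL(q || p) for discrete token distributions; log_q, log_p: (B, N, K) log-probabilities.
    Sums over the codebook dimension and averages over token positions -> (B,)."""
    return (log_q.exp() * (log_q - log_p)).sum(dim=-1).mean(dim=-1)

def combined_loss(nll_decoder, kl_loss, kl_aux, p_t, t):
    """Eq. (4): at t == 0 use the decoder NLL, otherwise the KL term; add the auxiliary
    KL and normalize by the per-timestep weight p_t. All inputs are tensors of shape (B,)."""
    base = torch.where(t == 0, nll_decoder, kl_loss)
    return ((base + kl_aux) / p_t).mean()
```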

3.2. Spatial Positional Encoding

Compared to other encoding mechanisms, spatial positional encoding helps the model better learn the spatial structure and positional relationships within images. Since remote sensing images typically contain complex geographical information and spatial features, spatial positional encoding is particularly important for remote sensing image generation. Due to the Transformer’s tendency to overlook the two-dimensional spatial structure during sequence processing, this study improves upon the original positional encoding to develop spatial positional encoding. By combining spatial positional encoding with the positional encoding in the original network, the Transformer is enhanced to better learn the spatial features of remote sensing images. The spatial positional encoding, original positional encoding, and their fusion principle are illustrated in Figure 3. The two-dimensional coordinate information (x, y) of the input image is extracted and encoded through a sinusoidal calculation to obtain the spatial positional encoding. For the original positional encoding, the input image is unfolded into a one-dimensional sequence, and the positional information n of each sequence position is also encoded using a sinusoidal method. Finally, the encodings are fused to obtain the final combined positional encoding.
First, the calculation process for spatial positional encoding is detailed in Equations (5)–(7).
$$SPE_{\mathrm{row}}(i, 2k) = \sin\!\left(\frac{i}{H} \cdot 2\pi\right), \qquad SPE_{\mathrm{row}}(i, 2k+1) = \cos\!\left(\frac{i}{H} \cdot 2\pi\right) \tag{5}$$

where $H$ represents the height of the image; $i$ is the row position index, ranging over $i = 0, 1, \ldots, H-1$; and $k$ is the embedding dimension index, ranging over $k = 0, 1, \ldots, \frac{D}{2}-1$ (where $D$ is the embedding dimension, i.e., the dimension of the position representation).
$$SPE_{\mathrm{col}}(j, 2k) = \sin\!\left(\frac{j}{W} \cdot 2\pi\right), \qquad SPE_{\mathrm{col}}(j, 2k+1) = \cos\!\left(\frac{j}{W} \cdot 2\pi\right) \tag{6}$$

where $W$ represents the width of the image, and $j$ is the column position index, ranging over $j = 0, 1, \ldots, W-1$.
$$SPE(i, j) = \mathrm{concat}\big(SPE_{\mathrm{row}}(i),\ SPE_{\mathrm{col}}(j)\big) \tag{7}$$

The spatial positional encoding, $SPE(i,j)$, has the shape $(B, H+W, D)$, where $B$ is the batch size.
Second, the sequential positional encoding calculation of the Transformer architecture is detailed in Equation (8).
$$PE(i, 2k) = \sin\!\left(\frac{i}{10000^{2k/D}}\right), \qquad PE(i, 2k+1) = \cos\!\left(\frac{i}{10000^{(2k+1)/D}}\right) \tag{8}$$

where $i$ is the position index, ranging over $i = 0, 1, \ldots, \mathrm{num\_steps}-1$ (where num_steps is the length of the sequence); and $k$ is the embedding dimension index, ranging over $k = 0, 1, \ldots, \frac{D}{2}-1$. The resulting encoding has the shape $(B, N, D)$ (where $N$ represents the number of time steps).
Finally, the two positional encodings are combined during the forward propagation process of the positional encoding, as detailed in Equation (9).
$$UPE = SPE(i, j) \oplus PE(i) \tag{9}$$

where $\oplus$ represents the concatenation of the spatial and sinusoidal positional encodings along the sequence (token) dimension. The final unified positional encoding, $UPE$, takes the shape $(B, H+W+N, D)$.
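A compact sketch of Equations (5)–(9) is given below, assuming an even embedding dimension D; the function names and the way the batch dimension is expanded are our own choices for illustration.

```python
import math
import torch

def spatial_positional_encoding(B, H, W, D):
    """Eqs. (5)-(7): sinusoidal row/column encodings concatenated along the token axis
    to give a tensor of shape (B, H + W, D)."""
    def axis_encoding(length, size):
        pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)    # (length, 1)
        enc = torch.zeros(length, D)
        enc[:, 0::2] = torch.sin(pos / size * 2 * math.pi)              # even embedding indices
        enc[:, 1::2] = torch.cos(pos / size * 2 * math.pi)              # odd embedding indices
        return enc

    spe = torch.cat([axis_encoding(H, H), axis_encoding(W, W)], dim=0)  # (H + W, D)
    return spe.unsqueeze(0).expand(B, -1, -1)

def sequence_positional_encoding(B, N, D):
    """Eq. (8): standard sinusoidal encoding over the flattened token sequence -> (B, N, D)."""
    pos = torch.arange(N, dtype=torch.float32).unsqueeze(1)             # (N, 1)
    idx = torch.arange(D, dtype=torch.float32).unsqueeze(0)             # (1, D)
    angles = pos / torch.pow(10000.0, idx / D)
    pe = torch.zeros(N, D)
    pe[:, 0::2] = torch.sin(angles[:, 0::2])
    pe[:, 1::2] = torch.cos(angles[:, 1::2])
    return pe.unsqueeze(0).expand(B, -1, -1)

def unified_positional_encoding(B, H, W, N, D):
    """Eq. (9): concatenate spatial and sequential encodings -> (B, H + W + N, D)."""
    return torch.cat([spatial_positional_encoding(B, H, W, D),
                      sequence_positional_encoding(B, N, D)], dim=1)
```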

3.3. Local Perception Mechanism

Inspired by the concept of local perception, this paper designs a local perception mechanism (LMA) tailored for long sequences. Specifically, the input tensor is mapped into query (q), key (k), and value (v) vectors through linear layers, and these vectors are reshaped into multiple attention heads. Local attention is achieved by defining a fixed window size, thereby limiting the attention computation scope of each time step to the current step and its surrounding context. When calculating the attention score for each time step, the query vector performs a matrix multiplication with the key vectors within the window, and the scores are normalized using softmax to obtain attention weights. Then, the weights of all heads are averaged to form a global attention matrix. These weights are subsequently used to perform a weighted sum of the value vectors, yielding the final output, as detailed in Equation (10):
$$y = \mathrm{softmax}\!\left(\frac{q_i \cdot k_w^{T}}{\sqrt{h}},\ \mathrm{dim} = -1\right) \cdot v \tag{10}$$

where $q_i$ represents the query vector at position $i$, with a shape of $\mathbb{R}^{B \times n_{head} \times h}$ (where $B$ is the batch size, $n_{head}$ is the number of heads, and $h$ is the dimension of each head); $k_w$ represents the key vectors in the local window, with a shape of $\mathbb{R}^{B \times n_{head} \times h \times \mathrm{window\_size}}$ (where window_size is the size of the window); and $v$ represents the value vectors, with a shape of $\mathbb{R}^{B \times n_{head} \times T \times h}$ (where $T$ is the number of time steps). $\mathrm{softmax}(\cdot, \mathrm{dim} = -1)$ denotes the element-wise exponential followed by normalization so that the values sum to 1, with $\mathrm{dim} = -1$ indicating that the softmax is applied along the last dimension. The final output $y$ takes the shape $\mathbb{R}^{B \times T \times C}$ (where $C = n_{head} \times h$).
Finally, through linear projection and residual connections, the model can effectively integrate local information, enhancing feature representation capabilities while maintaining computational efficiency. The principle of the local perception mechanism module is illustrated in Figure 4.
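The mechanism can be sketched in PyTorch as below. This is an approximation under our own assumptions (a symmetric band mask realizes the fixed window, and the standard 1/√h scaling is used); it follows the description of averaging the heads’ attention weights before the weighted sum of values, but it is not the authors’ code, and the class name is a placeholder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalMultiheadAttention(nn.Module):
    """Sketch of the local perception mechanism (Eq. (10)): each query attends only to
    keys inside a fixed window around its position; the attention weights of all heads
    are then averaged before being applied to the value vectors."""
    def __init__(self, dim, n_head, window_size):
        super().__init__()
        assert dim % n_head == 0
        self.n_head, self.head_dim, self.window = n_head, dim // n_head, window_size
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                        # x: (B, T, C)
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        split = lambda z: z.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)                   # (B, n_head, T, head_dim)

        # band mask: position i may only attend to positions within +/- window_size
        idx = torch.arange(T, device=x.device)
        band = (idx[None, :] - idx[:, None]).abs() <= self.window   # (T, T)

        scores = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5   # (B, n_head, T, T)
        scores = scores.masked_fill(~band, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        attn = attn.mean(dim=1, keepdim=True)                    # average the heads' weights
        out = (attn @ v).transpose(1, 2).reshape(B, T, C)        # weighted sum of the values
        return x + self.proj(out)                                # linear projection + residual
```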

4. Experimental Results and Analysis

To validate the effective application and performance improvement of the proposed RSVQ-Diffusion model in the field of remote sensing image generation, both the original model and the RSVQ-Diffusion model were trained on publicly available remote sensing image datasets. Additionally, ablation experiments were designed to verify the improvement effects, and comparisons were made with existing text-to-remote-sensing image generation models.

4.1. Datasets and Evaluation Metrics

The Remote Sensing Image Caption Dataset (RSICD) [28] is a commonly used dataset in the field of multimodal learning for remote sensing. It contains 10,921 image–text pairs, covering 30 categories such as playgrounds, bridges, beaches, deserts, and cities. As a critical resource for training text-to-remote-sensing image models, it possesses significant academic value in the field of remote sensing image understanding due to its large scale, rich semantic descriptions, and multi-source image acquisition characteristics. This dataset not only provides 10,921 images along with five corresponding descriptions for each image, enhancing the model’s understanding and generalization of complex remote sensing scenes, but also presents challenges for accurate classification and recognition due to its high intra-class diversity and low inter-class variability. Moreover, the RSICD supports a multi-task evaluation framework, including but not limited to image classification, retrieval, and object counting, making it a comprehensive benchmark for evaluating model performance. Cross-dataset comparative studies further expand its academic application scope, and performance improvements of fine-tuned models on this dataset, such as the significant increase in top-1 accuracy of the CLIP model [29], further demonstrate the important role of the RSICD in advancing intelligent remote sensing image analysis technologies. Therefore, this paper selects the RSICD as the training dataset for the text-to-remote-sensing image generation model, with the training, validation, and test sets divided in an 8:1:1 ratio.
The evaluation metrics used in this study to assess the quality of generated images are commonly used in the field of image generation: Fréchet Inception Distance (FID) [30], Contrastive Language-Image Pre-training (CLIP) score, and Inception Score (IS) [31]. FID primarily measures the distribution distance between real and generated image features. First, the generated images are embedded into the latent feature space of the selected layer of the inception network, and the embeddings of the generated and real images are treated as two continuous multivariate Gaussian samples to facilitate the calculation of their means and covariances. The calculation process is detailed in Equation (11).
$$FID = \left\lVert \mu_r - \mu_g \right\rVert_2^2 + \mathrm{Tr}\!\left(\Sigma_r + \Sigma_g - 2\big(\Sigma_r \Sigma_g\big)^{\frac{1}{2}}\right) \tag{11}$$

where $\mu_r, \Sigma_r$ represent the mean and covariance of the feature distribution of the real images, and $\mu_g, \Sigma_g$ represent the mean and covariance of the feature distribution of the generated images. $\lVert \mu_r - \mu_g \rVert_2^2$ denotes the squared Euclidean distance between the means of the two feature distributions, and $\mathrm{Tr}$ denotes the trace, i.e., the sum of the diagonal elements of the square matrix.
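For reference, once real and generated images have been embedded with the same Inception network (feature extraction is not shown), FID can be computed roughly as follows; this is a sketch rather than the exact implementation used in the experiments.

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(feat_real, feat_gen):
    """Eq. (11): FID between Inception features of real and generated images.
    feat_real, feat_gen: (N, D) arrays of embeddings from the same Inception layer."""
    mu_r, mu_g = feat_real.mean(axis=0), feat_gen.mean(axis=0)
    sigma_r = np.cov(feat_real, rowvar=False)
    sigma_g = np.cov(feat_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)   # matrix square root of Σr·Σg
    covmean = covmean.real                                     # discard tiny imaginary parts
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(sigma_r + sigma_g - 2 * covmean))
```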
CLIP is used to evaluate the semantic alignment between generated images and their corresponding input semantic information. It measures their correlation by calculating the cosine similarity between image and text embeddings. The model converts images and texts into high-dimensional vectors, and the similarity between these vectors is calculated to obtain the CLIP score, as detailed in Equation (12).
$$\mathrm{CLIP}_{score}(I, T) = \tau \cdot \frac{f_i(I)}{\lVert f_i(I) \rVert} \cdot \frac{f_t(T)}{\lVert f_t(T) \rVert} \tag{12}$$

where $f_i(I)$ represents the feature vector extracted by the image encoder; $f_t(T)$ represents the feature vector extracted by the text encoder; $\lVert f_i(I) \rVert$ and $\lVert f_t(T) \rVert$ are the L2 norms used to normalize the respective feature vectors to unit length; and $\tau$ is a scaling parameter used to adjust the score range.
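A minimal sketch of this computation, assuming the image and text features have already been produced by a CLIP encoder; the default τ = 100 below is a common choice in CLIP-score implementations, not a value taken from the paper.

```python
import torch
import torch.nn.functional as F

def clip_score(image_features, text_features, tau=100.0):
    """Eq. (12): scaled cosine similarity between L2-normalized image and text embeddings.
    image_features, text_features: (B, D) tensors from a CLIP image/text encoder."""
    img = F.normalize(image_features, dim=-1)
    txt = F.normalize(text_features, dim=-1)
    return tau * (img * txt).sum(dim=-1)       # per-pair score, shape (B,)
```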
The Inception Score (IS) is a widely used metric for evaluating the performance of generative models. It employs a pre-trained Inception v3 network to classify generated images, assessing the clarity and diversity of these images. Specifically, the IS measures the network’s confidence in the classification of generated images and evaluates the diversity of the generated samples. A higher IS indicates higher quality images and greater diversity among the images. The IS calculation involves obtaining the classification probabilities for each generated image and then measuring the Kullback–Leibler (KL) divergence between the conditional distribution for each image and the marginal distribution of all images. The calculation process is shown in Equation (13).
$$IS = \exp\!\left(\frac{1}{N}\sum_{i=1}^{N} D_{KL}\big(p(y \mid x_i)\,\|\,p(y)\big)\right) \tag{13}$$

where $N$ represents the number of generated remote sensing images; $p(y \mid x_i)$ is the conditional probability distribution of category $y$ given image $x_i$; and $D_{KL}\big(p(y \mid x_i)\,\|\,p(y)\big)$ denotes the Kullback–Leibler divergence between the conditional distribution of image $x_i$ and the marginal distribution $p(y)$ of all images, reflecting the diversity of categories in the generated set.
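A sketch of the IS computation from classifier outputs; obtaining the softmax probabilities from a pre-trained Inception v3 network is assumed and not shown.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Eq. (13): exponential of the mean KL divergence between each image's class
    distribution p(y|x_i) and the marginal p(y). probs: (N, K) softmax outputs
    of a pre-trained Inception v3 classifier."""
    p_y = probs.mean(axis=0, keepdims=True)                    # marginal distribution p(y)
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))
```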

4.2. Experimental Setup

The VQ-Diffusion model, based on the diffusion principle and the Transformer network architecture, has hundreds of millions of parameters and requires a large dataset for training to effectively learn image features. However, remote sensing images are difficult to acquire and limited in quantity, which cannot fully support the model’s learning of image features. The Conceptual Captions dataset [32] consists of a large number of natural images and their descriptive texts, covering a wide range of scenes and object types with high visual diversity. Although this dataset is derived from natural images, the visual features it contains, such as object shapes, lighting, and perspective variations, have certain similarities to the common land-cover types and scenes found in remote sensing images. Therefore, pre-training with the Conceptual Captions dataset can help the model learn general visual features and provide effective initialization for training with remote sensing images. Especially in situations where remote sensing data are scarce, cross-domain knowledge transfer can significantly improve model performance and accelerate the training process. Hence, this study uses the VQ-Diffusion model pre-trained on the Conceptual Captions dataset to initialize training on remote sensing datasets.
In this experiment, efficient hardware and software configurations were employed for the training process. The hardware platform included an Intel Xeon Platinum 8352S CPU, 512 GB of memory, and an NVIDIA RTX 4090 GPU (Santa Clara, CA, USA) to ensure efficient computation while training large models. The operating system was Windows 11, and the deep learning framework was PyTorch 1.12.0. To ensure the stability and efficiency of the training process, the AdamW optimizer was selected; this optimizer helps prevent overfitting by combining L2 regularization and weight decay. Additionally, the ReduceLROnPlateauWithWarmup learning rate scheduler was used, which automatically adjusts the learning rate based on the performance on the validation set, ensuring stable convergence during different training stages. A weight decay value of 4.5 × 10−2 was chosen to control overfitting and help the model achieve better generalization during training. For the training parameters, after multiple adjustments to determine the optimal settings, a batch size of 4 was chosen to fully utilize the GPU memory, and the number of epochs was set to 100 to ensure the model could learn effective image features over sufficient training cycles. The initial learning rate was set to 0.3 × 10−6 with a minimum learning rate of 1.0 × 10−6, enabling rapid learning in the early stages and fine-tuning in the later stages with a smaller learning rate.
Details regarding the model training environment and parameter settings are provided in Table 1 and Table 2.
The VQ-Diffusion model, which combines diffusion principles and the Transformer architecture, has significant computational complexity. The diffusion model generates images through multiple iterative processes, with each iteration (denoising process) requiring complex calculations using the Transformer’s self-attention mechanism. Consequently, the training process demands substantial computational resources and longer training times. To accelerate training and reduce memory consumption, automatic mixed precision (AMP) training was employed during the model’s training on the RSICD, taking approximately 52 h on a single RTX 4090 GPU. Despite the introduction of a new spatial positional encoding process in the RSVQ-Diffusion model, the TransLocalBlock module replaces the original self-attention mechanism with a local awareness mechanism. This module divides long sequences into multiple short sequences and achieves parallel processing through the multi-head attention mechanism. Thus, the proposed improvements enhance the quality of the generated images while having minimal impact on the model’s training and inference speed.
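The optimizer, scheduler, and mixed-precision setup described above can be sketched as follows. The warmup variant of the plateau scheduler comes from the VQ-Diffusion codebase and is not reproduced here, so plain ReduceLROnPlateau stands in for it; the hyperparameter values mirror Table 2, and model(batch) returning the Equation (4) loss is an assumption.

```python
import torch

def make_training_objects(model, lr=0.3e-6, min_lr=1.0e-6, weight_decay=4.5e-2):
    """AdamW + plateau scheduler + AMP gradient scaler, mirroring the described setup."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", min_lr=min_lr)
    scaler = torch.cuda.amp.GradScaler()                 # automatic mixed precision
    return optimizer, scheduler, scaler

def train_step(model, batch, optimizer, scaler):
    """One AMP training step; the model is assumed to return the Eq. (4) loss for a batch."""
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():                      # run the forward pass in mixed precision
        loss = model(batch)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.detach()
```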

4.3. Ablation Experiments

We conducted a series of ablation experiments on the RSICD, and the results are detailed in Table 3, which lists the changes in model performance under different experimental settings. We evaluated the impacts of spatial positional encoding, the local awareness mechanism, and their combination. The model was quantitatively assessed using FID and CLIP scores. The results indicate that incorporating spatial positional encoding (VQ-Diffusion_SP) and the local awareness mechanism (VQ-Diffusion_LMA) effectively enhances model performance. The combination of both (RSVQ-Diffusion) yielded the most significant improvements. Specifically, the data in Table 3 demonstrate the improvements in image generation quality for each setting, further validating the effectiveness of these strategies in enhancing the performance of the VQ-Diffusion model.
Using the VQ-Diffusion model as a baseline, it can be observed that the VQ-Diffusion_SP model shows a significant reduction in FID compared to the baseline model. The VQ-Diffusion_LMA model achieves an even greater reduction, while the proposed improved model, RSVQ-Diffusion, achieves the largest reduction in FID relative to the original model. Regarding the CLIP text–image matching score, although the VQ-Diffusion_SP and VQ-Diffusion_LMA models exhibit similar performance, the RSVQ-Diffusion model shows a noticeable increase in this metric. Additionally, all three improved models show an increase in the IS evaluation metric compared to the baseline model. Although the IS improvement of RSVQ-Diffusion is not as large as that of VQ-Diffusion_LMA, considering the overall performance across the other metrics, the RSVQ-Diffusion model stands out with its outstanding overall effectiveness.
Figure 5 presents eight groups of text descriptions (a)–(h) and their corresponding generated images. The improvements made to the model to account for the spatial characteristics of remote sensing images, as well as the introduction of the local awareness mechanism, have significantly enhanced the image generation quality. The RSVQ-Diffusion model demonstrates high-resolution and more stable image generation capabilities.
Specifically, for group (a), the text description is “This theme park has a lake and some buildings near the river”. The VQ-Diffusion model fails to capture the keyword “river”. The images generated by the VQ-Diffusion_SP and VQ-Diffusion_LMA models show enhanced spatial positioning and details. The RSVQ-Diffusion model effectively generates images that include buildings, a lake, and a river, with reasonable ground features and clear boundaries. The differences from the real images highlight their diversity. For group (b), the text keywords are “building”, “river”, “tree”, and “overpass”. The VQ-Diffusion model shows deformed overpasses, missing buildings, and blurred details. The VQ-Diffusion_SP model generates features that match the text description, but the overpass appears deformed. The VQ-Diffusion_LMA model produces a reasonably shaped overpass but lacks the semantic information of the “river”. The RSVQ-Diffusion model accurately places the overpass, with the buildings and trees reasonably structured. The river detail in the upper right corner is slightly blurred. For group (c), the VQ-Diffusion model generates distorted details; the VQ-Diffusion_SP model’s mountain textures are blurred, and the VQ-Diffusion_LMA model generates mountains with reasonable colors but with a chaotic distribution. The RSVQ-Diffusion model produces mountains with continuous textures, naturally transitioning to the brown exposed ground, with reasonable spatial structure. For group (d), the RSVQ-Diffusion model generates farmland with semi-regular geometric shapes, better color transitions, and better spatial distribution compared to the original model’s generated images. From groups (e) to (h), the images generated by the VQ-Diffusion model exhibit target deformations, twisted spatial structures, and unreasonable color distributions, failing to reflect the corresponding textual information. The VQ-Diffusion_SP and VQ-Diffusion_LMA models generate images that reflect improvements in spatial structure and local details. The RSVQ-Diffusion model produces images with complete lake structures and clear boundaries, factory buildings that align with real target scene distributions and are neatly arranged, realistic tree colors, and well-organized oil tanks in the factories. The overall image quality generated by the RSVQ-Diffusion model surpasses the other three models.
Based on the evaluation metrics from the ablation experiments and the generated images, it is evident that the image generation capabilities of the RSVQ-Diffusion model significantly surpass those of the VQ-Diffusion model. The two improved versions, VQ-Diffusion_SP and VQ-Diffusion_LMA, each demonstrate visual enhancements that align with the principles of their respective modules. Although there are still differences between the images generated by the RSVQ-Diffusion model and real images, the generated images from the improved model show clear target details and spatial consistency, making the visuals more reasonable. They effectively reflect the scenes described in the text.

4.4. Comparison Experiments

A comparative analysis was conducted between the RSVQ-Diffusion model and the state-of-the-art text-to-remote-sensing image generation model, Txt2Img-MHN, using the RSICD test set. The evaluation employed FID, CLIP, and IS as the assessment metrics to quantitatively analyze the generated results. The comparative experiment results are shown in Table 4.
It can be observed that the RSVQ-Diffusion model outperforms the Txt2Img-MHN model in these metrics, and it also shows improvements compared to the VQ-Diffusion model. This indicates that the RSVQ-Diffusion model excels in remote sensing image generation quality, text–image matching, and diversity, meeting the practical application needs of the remote sensing image processing field.
Figure 6 showcases eight groups (a)–(h) of images generated by four models: Txt2Img-MHN with pre-trained models VQVAE and VQGAN, VQ-Diffusion, and RSVQ-Diffusion. Visually, although the RSVQ-Diffusion model’s generated images exhibit minor issues such as irregular marking lines on the football field in group (b) and missing demarcated areas in the bottom left of the baseball field in group (f), it outperforms the other three models in terms of ground feature details and boundaries, text–image alignment, and overall visual quality. The RSVQ-Diffusion model’s images are closer to real images.
For instance, in group (a), with the text “A bridge is built over the river”, the image generated by the Txt2Img-MHN (VQVAE) model has an unclear lower left edge of the bridge, making it difficult to discern ground features, and it fails to capture the local structural features of the image. The Txt2Img-MHN (VQGAN) model’s image shows a deformed and blurred bridge with no clear distinction from the background. The VQ-Diffusion model’s image has relatively clear boundaries but shows deformation at the upper right edge of the bridge and distortion in the bridge area. The RSVQ-Diffusion model’s image, however, presents a complete bridge structure with clear boundaries and shadows beneath the bridge, enhancing realism and spatial coherence. In group (b), the Txt2Img-MHN models generate images with blurred and deformed stadium boundaries and indistinguishable details. The VQ-Diffusion model’s stadium edge has an unreasonable shape, and the surrounding buildings are deformed. The RSVQ-Diffusion model generates a clear image of the stadium with a visible central circle and clear, reasonable boundaries. In group (c), the Txt2Img-MHN models produce images with basic spatial structures but blurred details, while the VQ-Diffusion model’s left black area has an unnatural color texture distribution. The RSVQ-Diffusion model generates images with a reasonably structured river and clear details. In group (d), the Txt2Img-MHN models’ images have blurred building boundaries and overlapped ground features. The VQ-Diffusion model’s image shows deformed and distorted building boundaries, while the RSVQ-Diffusion model generates images with neatly arranged buildings and clear boundaries, displaying visually realistic images. In groups (e) to (h), other models’ images exhibit blurred details, deformed ground features, twisted boundaries, and missing target areas. The RSVQ-Diffusion model’s images show a natural texture distribution at the beach–sea boundary, orderly structured baseball fields and buildings, a consistent spatial distribution of residential areas and roads, and clear, reasonable surrounding ground features of swimming pools. These target areas are rich in detail, have realistic colors, and have different regional structures that match the distribution of real remote sensing images, clearly outperforming the other models.
Based on the evaluation metrics and generated images from the comparative experiments, it is evident that the RSVQ-Diffusion model significantly outperforms the other three models. The images generated by the RSVQ-Diffusion model exhibit clearer ground features and boundaries, with minimal blurring and distortion. The spatial structure is more consistent, with reasonable positional relationships between ground features, avoiding overlapping and misalignment issues. Visually, the images more closely resemble real images with natural textures and materials. Additionally, the semantic information satisfies the scenes described in the input text.

5. Conclusions

The proposed RSVQ-Diffusion model achieves high-quality remote sensing image generation that aligns well with input semantic information. By incorporating the spatial characteristics of remote sensing images into the diffusion image decoder and employing the sequential local perception mechanism integrated with the Transformer architecture, the RSVQ-Diffusion model effectively generates target remote sensing images based on the input text. The generated images exhibit clear local details, reasonable spatial structure, and closely align with remote sensing image characteristics, achieving both controllability and realism. Additionally, we conducted a comprehensive comparative analysis to evaluate the impact of various improvements on the model’s image generation quality. The experimental results indicate that the RSVQ-Diffusion model outperforms other models across various evaluation metrics and in terms of generated image quality, such as detail richness and reduced distortion and deformation. The improvement in FID scores demonstrates higher realism and closer resemblance to real images. The increase in CLIP scores reflects better alignment of generated images with input text, and the enhancement in ISs highlights superior image diversity. This study extends the application of diffusion models and Transformer architectures to the field of remote sensing imagery, broadening the research scope of remote sensing image generation methods. The RSVQ-Diffusion model generates remote sensing images with high fidelity, spatial consistency, and rich details, which can enhance data for remote sensing image processing tasks, simulate specific scenarios, and provide robust support for fields such as environmental monitoring, urban planning, and agricultural management.
Although the proposed methods have significantly improved the realism of remote sensing image generation and text–image matching accuracy, existing remote sensing datasets lack extensively labeled text data; manual labeling is difficult, which limits the variety of generated images. There is therefore room for further improvement in the model’s generation capabilities and performance. In future work, we will focus on the following aspects: First, we aim to collect and construct diverse, high-resolution remote sensing image datasets to support large-scale generation tasks. Second, we will conduct in-depth research into Transformer architecture variants to optimize computational efficiency, enabling us to improve training efficiency and inference speed while maintaining robust model performance. Lastly, by incorporating control conditions, such as metadata and geolocation information, and utilizing prior information like time and weather conditions, we aim to generate more informative remote sensing images. This will enhance the model’s ability to generate realistic images, further improving its practical applications.

Author Contributions

Conceptualization, Y.F. and X.J.; methodology, X.G. and F.W.; software, X.G. and Y.F.; validation, X.G., Y.Z., T.F., C.L. and J.P.; formal analysis, X.G. and F.W.; resources, X.J.; data curation, X.G. and T.F.; writing—original draft preparation, X.G. and Y.F.; writing—review and editing, X.G. and F.W.; project administration, X.J.; funding acquisition, X.J. and F.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Science and Technology Development Plan Project of Jilin Province, China, grant number 20220101168JC, as well as funded by the National Key R&D Program of China, grant number 2022YFB3902300, and funded by the National Natural Science Foundation of China, grant number 62371352.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author (the data are not publicly available due to privacy).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning representations by back-propagating errors. Nature 1986, 323, 533–536. [Google Scholar] [CrossRef]
  2. Vincent, P.; Larochelle, H.; Lajoie, I.; Bengio, Y.; Manzagol, P.A. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res. 2010, 11, 3371–3408. [Google Scholar]
  3. Kingma, D.P. Auto-encoding variational bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
4. Berthelot, D.; Schumm, T.; Metz, L. BEGAN: Boundary Equilibrium Generative Adversarial Networks. arXiv 2017, arXiv:1703.10717.
5. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014.
6. Mirza, M.; Osindero, S. Conditional generative adversarial nets. arXiv 2014, arXiv:1411.1784.
7. Sohl-Dickstein, J.; Weiss, E.; Maheswaranathan, N.; Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 7–9 July 2015.
8. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851.
9. Dhariwal, P.; Nichol, A. Diffusion models beat GANs on image synthesis. Adv. Neural Inf. Process. Syst. 2021, 34, 8780–8794.
10. Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical text-conditional image generation with CLIP latents. arXiv 2022, arXiv:2204.06125.
11. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022.
12. Zhao, R.; Shi, Z. Text-to-remote-sensing-image generation with structured generative adversarial networks. IEEE Geosci. Remote Sens. Lett. 2021, 19, 8010005.
13. Song, W.; Nie, F.; Wang, C.; Jiang, Y.; Wu, Y. Unsupervised Multi-Scale Hybrid Feature Extraction Network for Semantic Segmentation of High-Resolution Remote Sensing Images. Remote Sens. 2024, 16, 3774.
14. Lu, D.; Cheng, S.; Wang, L.; Song, S. Multi-scale feature progressive fusion network for remote sensing image change detection. Sci. Rep. 2022, 12, 11968.
15. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
16. Gu, S.; Chen, D.; Bao, J.; Wen, F.; Zhang, B.; Chen, D.; Yuan, L.; Guo, B. Vector quantized diffusion model for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022.
17. Beltagy, I.; Peters, M.E.; Cohan, A. Longformer: The long-document transformer. arXiv 2020, arXiv:2004.05150.
18. Zhu, L.; Chen, Y.; Ghamisi, P.; Benediktsson, J.A. Generative adversarial networks for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2018, 56, 5046–5063.
19. Pang, B.; Zhao, S.; Liu, Y. The Use of a Stable Super-Resolution Generative Adversarial Network (SSRGAN) on Remote Sensing Images. Remote Sens. 2023, 15, 5064.
20. Xu, T.; Zhang, P.; Huang, Q.; Zhang, H.; Gan, Z.; Huang, X.; He, X. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018.
21. Khanna, S.; Liu, P.; Zhou, L.; Meng, C.; Rombach, R.; Burke, M.; Lobell, D.B.; Ermon, S. DiffusionSat: A generative foundation model for satellite imagery. arXiv 2023, arXiv:2312.03606.
22. Tang, D.; Cao, X.; Hou, X.; Jiang, Z.; Liu, J.; Meng, D. CRS-Diff: Controllable remote sensing image generation with diffusion model. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5638714.
23. Xu, Y.; Yu, W.; Ghamisi, P.; Kopp, M.; Hochreiter, S. Txt2Img-MHN: Remote sensing image generation from text using modern Hopfield networks. IEEE Trans. Image Process. 2023, 32, 5737–5750.
24. Van Den Oord, A.; Vinyals, O.; Kavukcuoglu, K. Neural discrete representation learning. In Proceedings of the Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
25. Esser, P.; Rombach, R.; Ommer, B. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021.
26. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sutskever, I. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021.
27. Zheng, F.; Li, W.; Wang, X.; Wang, L.; Zhang, X.; Zhang, H. A cross-attention mechanism based on regional-level semantic features of images for cross-modal text-image retrieval in remote sensing. Appl. Sci. 2022, 12, 12221.
28. Lu, X.; Wang, B.; Zheng, X.; Li, X. Exploring models and data for remote sensing image caption generation. IEEE Trans. Geosci. Remote Sens. 2017, 56, 2183–2195.
29. Liu, F.; Chen, D.; Guan, Z.; Zhou, X.; Zhu, J.; Ye, Q.; Zhou, J. RemoteCLIP: A vision language foundation model for remote sensing. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5622216.
30. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
31. Barratt, S.; Sharma, R. A note on the inception score. arXiv 2018, arXiv:1801.01973.
32. Sharma, P.; Ding, N.; Goodman, S.; Soricut, R. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 15–20 July 2018.
Figure 1. Architecture of the VQ-Diffusion model.
Figure 2. Architecture of the RSVQ-Diffusion model.
Figure 3. Principle of spatial positional encoding.
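For readers who want a concrete reference point for Figure 3, the sketch below implements a generic 2-D sinusoidal positional encoding for a flattened grid of image tokens, in which half of the channels encode the row index and half the column index. It is an illustration of the general idea only, not necessarily the spatial position encoding used in RSVQ-Diffusion; the grid size and embedding width are placeholder values.

```python
import torch

def sinusoidal_1d(length: int, dim: int) -> torch.Tensor:
    """Standard 1-D sinusoidal encoding of shape (length, dim)."""
    pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)      # (L, 1)
    idx = torch.arange(0, dim, 2, dtype=torch.float32)                # (dim/2,)
    freq = torch.exp(-torch.log(torch.tensor(10000.0)) * idx / dim)   # (dim/2,)
    enc = torch.zeros(length, dim)
    enc[:, 0::2] = torch.sin(pos * freq)
    enc[:, 1::2] = torch.cos(pos * freq)
    return enc

def spatial_2d_encoding(h: int, w: int, dim: int) -> torch.Tensor:
    """2-D encoding for an h x w token grid: half the channels encode the row,
    half encode the column; the grid is then flattened to (h*w, dim)."""
    assert dim % 2 == 0
    row = sinusoidal_1d(h, dim // 2)                                  # (h, dim/2)
    col = sinusoidal_1d(w, dim // 2)                                  # (w, dim/2)
    enc = torch.cat(
        [row.unsqueeze(1).expand(h, w, dim // 2),
         col.unsqueeze(0).expand(h, w, dim // 2)], dim=-1)            # (h, w, dim)
    return enc.reshape(h * w, dim)

# Placeholder example: a 32 x 32 grid of VQ tokens with 512-dim embeddings.
pe = spatial_2d_encoding(32, 32, 512)   # (1024, 512), added to the token embeddings
```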
Figure 4. Principle of the local perception mechanism.
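As a companion illustration for Figure 4, the sketch below restricts single-head self-attention to a fixed-size sliding window over the token sequence, in the spirit of Longformer [17]. It is a simplified stand-in rather than the paper's short-sequence local perception mechanism: it omits learned query/key/value projections and multi-head structure, and the window size and tensor shapes are placeholders.

```python
import torch

def local_attention_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask (seq_len, seq_len): True where attention is allowed,
    i.e. only between tokens within +/- window positions of each other."""
    idx = torch.arange(seq_len)
    return (idx.unsqueeze(0) - idx.unsqueeze(1)).abs() <= window

def local_self_attention(x: torch.Tensor, window: int = 8) -> torch.Tensor:
    """Single-head scaled dot-product attention restricted to a local window.
    x: (batch, seq_len, dim); no learned projections, for illustration only."""
    b, n, d = x.shape
    scores = x @ x.transpose(1, 2) / d ** 0.5                  # (b, n, n)
    mask = local_attention_mask(n, window).to(x.device)
    scores = scores.masked_fill(~mask, float("-inf"))          # block distant tokens
    return torch.softmax(scores, dim=-1) @ x                   # (b, n, d)

# Placeholder usage: a 32 x 32 token grid flattened to 1024 tokens per image.
tokens = torch.randn(2, 1024, 512)
out = local_self_attention(tokens, window=8)
```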
Figure 5. Ablation experiment results. The second to fifth columns show, respectively, the images generated by the original model, the model with spatial positional encoding, the model with the local perception mechanism, and the RSVQ-Diffusion model. Each row shows the images generated by the different models for the same input text.
Figure 6. Comparison experiment results. The first column shows the real images from the test set. The second and third columns display the images generated by the Txt2Img-MHN model under two different pre-trained models. The fourth column presents examples of images generated by the VQ-Diffusion model, and the fifth column shows the images generated by the RSVQ-Diffusion model. Each row represents the images generated by different models for the same input text.
Table 1. Training environment configuration.
Platform | Configuration Item | Configuration Value
Hardware Platform | CPU | Intel Xeon(R) Platinum 8352S CPU @ 2.2 GHz
Hardware Platform | Memory | 512 GB
Hardware Platform | GPU | RTX 4090
Hardware Platform | Graphics Memory | 24 GB
Software Platform | Operating System | Windows 11
Software Platform | Deep Learning Framework | PyTorch
Table 2. Training parameter settings.
Parameter | Value
Epochs | 100
Batch size | 4
Max learning rate | 0.3 × 10⁻⁶
Min learning rate | 1.0 × 10⁻⁶
Optimizer | AdamW
Weight decay | 4.5 × 10⁻²
Learning rate scheduler | Reduce LR On Plateau With Warmup
Input image size | (256, 256)
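As a hedged sketch of how the settings in Table 2 might be wired together in PyTorch, the snippet below combines AdamW with a plateau-based learning-rate scheduler and a manually coded warmup phase (PyTorch has no built-in "ReduceLROnPlateau with warmup"). The warmup length, plateau factor, and patience are assumptions not given in the paper, and the stand-in model and loss replace the real RSVQ-Diffusion training loop.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import ReduceLROnPlateau

# Stand-in network; the actual RSVQ-Diffusion model and data pipeline are not shown.
model = torch.nn.Linear(512, 512)

max_lr, min_lr = 0.3e-6, 1.0e-6                      # learning-rate bounds from Table 2
optimizer = AdamW(model.parameters(), lr=max_lr, weight_decay=4.5e-2)

# Plateau scheduler; the warmup is sketched manually before handing control over.
scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.5,
                              patience=5, min_lr=min_lr)

warmup_epochs = 5                                    # assumed, not specified in the paper
for epoch in range(100):                             # 100 epochs, batch size 4 (Table 2)
    if epoch < warmup_epochs:
        for group in optimizer.param_groups:         # linear learning-rate warmup
            group["lr"] = max_lr * (epoch + 1) / warmup_epochs
    val_loss = 1.0 / (epoch + 1)                     # placeholder for the validation loss
    if epoch >= warmup_epochs:
        scheduler.step(val_loss)                     # reduce LR when the loss plateaus
```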
Table 3. Ablation experiment results.
Model | VQ-Diffusion | VQ-Diffusion_SP | VQ-Diffusion_LMA | RSVQ-Diffusion
SP |  | √ |  | √
LMA |  |  | √ | √
FID↓ | 96.68 | 93.46 | 92.19 | 90.36
CLIP↑ | 26.91 | 26.80 | 26.80 | 27.22
IS↑ | 7.11 | 7.94 | 7.35 | 7.24
√ indicates that the module in the first column is used in the corresponding model.
Table 4. Comparison experiment results.
Model | FID↓ | CLIP↑ | IS↑
Txt2Img-MHN (VQVAE) | 175.36 | 21.35 | 3.51
Txt2Img-MHN (VQGAN) | 102.44 | 20.27 | 5.99
VQ-Diffusion | 96.68 | 26.92 | 7.11
RSVQ-Diffusion (Ours) | 90.36 | 27.22 | 7.24
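The FID, IS, and CLIP metrics reported in Tables 3 and 4 can be computed in spirit with off-the-shelf implementations; the sketch below uses torchmetrics as a stand-in, with random uint8 tensors in place of real and generated remote sensing images, so it only illustrates the evaluation interface and not the paper's exact evaluation protocol or CLIP backbone. Running it downloads pretrained Inception and CLIP weights.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.inception import InceptionScore
from torchmetrics.multimodal.clip_score import CLIPScore

# Random uint8 batches stand in for real and generated 256 x 256 images.
real = torch.randint(0, 256, (16, 3, 256, 256), dtype=torch.uint8)
fake = torch.randint(0, 256, (16, 3, 256, 256), dtype=torch.uint8)
captions = ["a dense residential area with many small buildings"] * 16  # placeholder prompts

fid = FrechetInceptionDistance(feature=2048)          # FID: lower is better
fid.update(real, real=True)
fid.update(fake, real=False)
print("FID:", fid.compute().item())

inception = InceptionScore(splits=2)                  # IS: higher is better
inception.update(fake)
is_mean, is_std = inception.compute()
print("IS:", is_mean.item())

# CLIP score between generated images and their text prompts: higher is better.
clip = CLIPScore(model_name_or_path="openai/clip-vit-base-patch32")
clip.update(fake, captions)
print("CLIP score:", clip.compute().item())
```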
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
