3.1. SSA-Enhanced Generator with Skip-Attention
SPRGAN follows a structural framework similar to CycleGAN [7] for unpaired image dehazing, which uses two pairs of generators and discriminators to facilitate unpaired image-to-image translation. In the SSA-enhanced generator, we employ the UNet architecture [35], where the encoding path extracts features from the input via four convolutional layers with downsampling. The model incorporates a residual structure (skip connection [36]), facilitating the transfer of features between encoding and decoding layers. Additionally, features from the lowest encoding layer are fed into the Spatial-Spectrum Attention Vision Transformer with Skip-Attention (SSA-SKIPAT) (Figure 3A). The SSA-SKIPAT module adopts a fusion technique that combines information from both the spectrum and spatial domains to effectively learn feature information. Simultaneously, the SKIPAT module reduces the computational complexity of MSA while maintaining the accuracy of the original model.
In the encoding block of the UNet architecture, the pre-processing layer initially converts the input image into a tensor with dimensions $C \times H \times W$. Subsequently, this tensor undergoes a series of transformations through four convolutional and pooling layers, resulting in an output feature with dimensions $C' \times \frac{H}{16} \times \frac{W}{16}$ (each of the four stages halves the spatial resolution). The output of the encoding block is then passed as input to the SSA-SKIPAT. The SSA-SKIPAT (Figure 3B) consists of three spectrum encoder blocks (Figure 3C), a position embedding block, and three spatial SKIPAT blocks (Figure 3D). Finally, the post-processing layer converts the output of the SSA-SKIPAT into the output image, with the same spatial dimensions as the input.
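To make the tensor shapes concrete, the following is a minimal PyTorch sketch of such a four-stage encoding path; the channel widths, activation choice, and pooling configuration are illustrative assumptions rather than the exact released architecture.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Illustrative UNet-style encoder: four conv + downsampling stages.

    Skip outputs are kept so the decoder can fuse them, mirroring the
    skip connections described in the text.
    """
    def __init__(self, in_ch=3, base_ch=64):
        super().__init__()
        chs = [base_ch, base_ch * 2, base_ch * 4, base_ch * 8]
        self.stages = nn.ModuleList()
        prev = in_ch
        for ch in chs:
            self.stages.append(nn.Sequential(
                nn.Conv2d(prev, ch, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2),          # halves H and W at each stage
            ))
            prev = ch

    def forward(self, x):
        skips = []
        for stage in self.stages:
            x = stage(x)
            skips.append(x)
        return x, skips                   # x: (B, 8*base_ch, H/16, W/16)

# e.g. a 3 x 256 x 256 image becomes a 512 x 16 x 16 feature map
feat, skips = Encoder()(torch.randn(1, 3, 256, 256))
print(feat.shape)  # torch.Size([1, 512, 16, 16])
```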
Spectrum encoder block. The spectrum encoder block (Figure 3C) is composed of a 2D Fast Fourier Transform (FFT), a spectrum attention weight matrix (three dimensions) and a 2D Inverse Fast Fourier Transform (IFFT). The 2D FFT of a spatial domain signal (image) is calculated using the following formula:
$$F(u, v) = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y)\, e^{-j 2\pi \left( \frac{u x}{M} + \frac{v y}{N} \right)}$$
where $F(u, v)$ represents the frequency domain representation, $f(x, y)$ is the input spatial domain signal/image, $(u, v)$ are the frequency domain coordinates, and $M$ and $N$ are the dimensions of the input signal/image.
Then, we use a weight matrix $W$ that modulates the spectral components of the image:
$$F'(u, v) = W(u, v) \cdot F(u, v)$$
Here, $F(u, v)$ denotes the Fourier coefficients at frequency $(u, v)$, $W(u, v)$ represents the attention weight matrix applied to the Fourier coefficients, and $F'(u, v)$ represents the modulated frequency domain representation. By multiplying $F(u, v)$ with $W(u, v)$, we modify the spectral representation of the image. This allows us to selectively emphasize or de-emphasize certain frequencies, effectively enhancing desired features or suppressing noise and unwanted artifacts.
At last, we use the 2D IFFT to reconstruct the spatial domain signal from its frequency domain representation, which is calculated as follows:
$$f'(x, y) = \frac{1}{M N} \sum_{u=0}^{M-1} \sum_{v=0}^{N-1} F'(u, v)\, e^{j 2\pi \left( \frac{u x}{M} + \frac{v y}{N} \right)}$$
where $f'(x, y)$ is the reconstructed spatial domain signal/image.
We employ the 2D FFT to convert the input image features from the spatial domain to the frequency domain and utilize the spectrum attention weight matrix to learn the relationship between the three-channel frequency domain information of the image. By restoring the amplitude of features at different positions in the 2D FFT spectrum of the hazy image, we can effectively reconstruct and restore the image: we apply different weights to various positions in the spectrum and perform a 2D IFFT to return to the image domain. As with spatial domain attention mechanisms, these weights are learned during network training and adaptively restore the critical regions of the spectrum. The Fourier Transform is well suited to dehazing because it separates the image's low-frequency components (general structures) from its high-frequency components (details and noise), allowing each to be manipulated independently. This facilitates noise reduction by isolating and suppressing irrelevant high-frequency elements, enhances detail preservation by targeting specific frequency bands, and allows for efficient filtering, such as using high-pass filters to enhance edges and low-pass filters to smooth hazy regions. Combined with trainable parameter matrices, this approach offers more effective and efficient image processing than direct spatial domain methods.
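As a concrete illustration, a spectrum attention block of this kind might be sketched in PyTorch as follows; the module structure and the use of a complex-valued learnable weight per channel and frequency bin are our assumptions for exposition, not the exact implementation.

```python
import torch
import torch.nn as nn

class SpectrumAttention(nn.Module):
    """Sketch of the spectrum encoder block:
    2D FFT -> learnable per-frequency weights -> 2D IFFT."""
    def __init__(self, channels, height, width):
        super().__init__()
        # One (real, imaginary) weight per channel and frequency bin, so
        # W(u, v) can re-scale (and phase-shift) each Fourier coefficient.
        self.weight = nn.Parameter(
            torch.randn(channels, height, width, 2) * 0.02
        )

    def forward(self, x):                       # x: (B, C, H, W), real
        F = torch.fft.fft2(x, norm="ortho")     # to the frequency domain
        W = torch.view_as_complex(self.weight)  # (C, H, W), complex
        F = F * W                               # F'(u, v) = W(u, v) * F(u, v)
        return torch.fft.ifft2(F, norm="ortho").real  # back to spatial domain
```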
However, the SSA mechanism we use faces the problem of excessive computational complexity. To address this issue, we introduce the SKIPAT mechanism, which skips the MSA block and directly passes the input features to the FFN block, reducing the computational complexity and improving the model's convergence speed.
Skip-Attention block. The MSA block in ViT encodes the similarity of each patch to every other patch as an $n \times n$ attention matrix. This operator is computationally expensive, with a complexity of $\mathcal{O}(n^2 d)$. As ViT scales, i.e., as $n$ increases, the complexity grows quadratically and this operation becomes a bottleneck. As per the analysis in [37], increasing the number of MSA layers does not significantly improve the ability to extract target features, but does increase the computational load. Therefore, we seek the most cost-effective design that preserves the effectiveness of MSA without substantially increasing computational complexity. To address these issues, we introduce the SKIPAT mechanism, as shown in Figure 3D,E, which skips the MSA block and directly passes the input features to the FFN block. The SKIPAT parametric function consists of two linear layers and an interposed depthwise convolution (DwC) [38], as follows:
$$\hat{Z}^{l} = \mathrm{FC}_2\left(\mathrm{DwC}\left(\mathrm{FC}_1\left(Z^{l-1}\right)\right)\right)$$
where $Z^{l-1}$ is the input feature of the $l$-th SKIPAT block, $\mathrm{FC}_1$ and $\mathrm{FC}_2$ are linear layers, $\mathrm{DwC}$ is the depthwise convolution operation, which is used to reduce the number of parameters and computational complexity, and $\hat{Z}^{l}$ is the output feature of the $l$-th SKIPAT block.
The patch embeddings are then input to the first linear layer $\mathrm{FC}_1: \mathbb{R}^{n \times d} \rightarrow \mathbb{R}^{n \times 2d}$, where $n$ is the number of patches and $d$ is the dimension of the patch embeddings. Subsequently, the depthwise convolution $\mathrm{DwC}: \mathbb{R}^{\sqrt{n} \times \sqrt{n} \times 2d} \rightarrow \mathbb{R}^{\sqrt{n} \times \sqrt{n} \times 2d}$ is applied to the output of the first linear layer, followed by the second linear layer $\mathrm{FC}_2: \mathbb{R}^{n \times 2d} \rightarrow \mathbb{R}^{n \times d}$. The output of the second linear layer is the output of the SKIPAT block, which is then passed to the FFN block. We use three SKIPAT blocks in each spatial encoder block, and the SPRGAN model consists of three spatial encoder blocks. Experimental results show that the SKIPAT mechanism can effectively improve the model's convergence speed while maintaining performance.
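Under the formulation above, a minimal PyTorch sketch of the SKIPAT parametric function could look like the following; the kernel size, the square patch grid, and the reshaping conventions are assumptions.

```python
import torch
import torch.nn as nn

class SkipAt(nn.Module):
    """Sketch of the SKIPAT parametric function, FC2(DwC(FC1(Z))),
    which stands in for the O(n^2) MSA block."""
    def __init__(self, dim, expansion=2):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Linear(dim, hidden)               # R^{n x d} -> R^{n x 2d}
        self.dwc = nn.Conv2d(hidden, hidden, kernel_size=3,
                             padding=1, groups=hidden)  # depthwise conv
        self.fc2 = nn.Linear(hidden, dim)               # R^{n x 2d} -> R^{n x d}

    def forward(self, z):                               # z: (B, n, d)
        B, n, _ = z.shape
        h = w = int(n ** 0.5)                           # assume a square patch grid
        z = self.fc1(z)
        z = z.transpose(1, 2).reshape(B, -1, h, w)      # to spatial layout for DwC
        z = self.dwc(z)
        z = z.reshape(B, -1, n).transpose(1, 2)         # back to (B, n, 2d)
        return self.fc2(z)

# e.g. a 14 x 14 patch grid with 256-dimensional embeddings
out = SkipAt(dim=256)(torch.randn(2, 196, 256))         # (2, 196, 256)
```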
3.2. Self-Supervised Pre-Training with Perlin Noise-Based Masks (PNM)
In traditional self-supervised pre-training, models are typically trained using raw image data alongside opaque masks for image inpainting tasks. However, for specific tasks like remote sensing image dehazing, this approach may be limiting. To address this limitation and better capture the complexity of remote sensing images, we introduce a technique based on Perlin Noise-Based Masks (PNM) (Figure 4).
PNM represents a unique type of mask utilized during pre-training in conjunction with the original images. Rather than employing traditional opaque masks, PNM introduces variability and complexity by incorporating Perlin Noise patterns. Perlin Noise is a type of gradient noise used in computer graphics to create natural-looking textures and smooth transitions. It is generated by combining multiple layers of noise at different frequencies and amplitudes.
The Perlin Noise formula for generating two-dimensional noise is given by:
$$P(x, y) = \sum_{i=0}^{n} A_i \cdot \mathrm{noise}_{G_i}\!\left(2^{i} x,\, 2^{i} y\right)$$
where $P(x, y)$ represents the value of two-dimensional Perlin Noise at point $(x, y)$; $A$ is a two-dimensional array representing the amplitude of the Perlin Noise; $G$ is a two-dimensional array representing the gradient vectors of the Perlin Noise, from which each single-octave term $\mathrm{noise}_{G_i}$ is built; $n$ is the order of the Perlin Noise; and fade is a fade function used for smooth transitions. Specifically, the fade function can be expressed as:
$$\mathrm{fade}(t) = 6 t^{5} - 15 t^{4} + 10 t^{3}$$
In practice, Perlin Noise is often computed using interpolation functions such as linear interpolation, cubic interpolation, etc. These interpolation functions help create smooth transitions between discrete noise values, resulting in continuous Perlin Noise patterns.
The formula for adding Perlin Noise to an image is:
$$I_{\mathrm{noisy}}(x, y) = I(x, y) + \mathrm{scale} \cdot P(x, y)$$
Here, $I_{\mathrm{noisy}}(x, y)$ represents the value of the noisy image at point $(x, y)$, $I(x, y)$ represents the value of the original image at point $(x, y)$, and scale is the scaling factor of the Perlin Noise. By changing the scale value, we can adjust the intensity of the Perlin Noise added to the image. Specifically, we generate Perlin Noise patterns with dimensions matching those of the input images. Each pixel in the noise pattern corresponds to a transmission coefficient, determining the opacity of the corresponding pixel in the original image. By applying Perlin Noise patterns as masks, we introduce semi-random variations in the opacity of different image regions, effectively simulating the diverse atmospheric conditions encountered in remote sensing imagery.
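For illustration, the following NumPy sketch generates a single-octave 2D Perlin noise pattern with the quintic fade function and applies it both additively and as a transmission-style mask; the lattice resolution, the scale value, and the single-octave simplification are assumptions.

```python
import numpy as np

def fade(t):
    # Perlin's quintic smoothstep: 6t^5 - 15t^4 + 10t^3
    return 6 * t**5 - 15 * t**4 + 10 * t**3

def perlin_2d(shape, res, rng):
    """Single-octave 2D Perlin noise; `shape` must be divisible by `res`."""
    d = (shape[0] // res[0], shape[1] // res[1])
    # Fractional coordinates of every pixel within its lattice cell
    grid = np.mgrid[0:res[0]:1/d[0], 0:res[1]:1/d[1]].transpose(1, 2, 0) % 1
    # Random unit gradient vectors at the lattice corners
    angles = 2 * np.pi * rng.random((res[0] + 1, res[1] + 1))
    grads = np.dstack((np.cos(angles), np.sin(angles)))
    tile = lambda g: g.repeat(d[0], 0).repeat(d[1], 1)
    # Dot products of the four corner gradients with the offset vectors
    n00 = (grid * tile(grads[:-1, :-1])).sum(2)
    n10 = (np.dstack((grid[..., 0] - 1, grid[..., 1])) * tile(grads[1:, :-1])).sum(2)
    n01 = (np.dstack((grid[..., 0], grid[..., 1] - 1)) * tile(grads[:-1, 1:])).sum(2)
    n11 = (np.dstack((grid[..., 0] - 1, grid[..., 1] - 1)) * tile(grads[1:, 1:])).sum(2)
    # Smooth bilinear interpolation using the fade function
    t = fade(grid)
    n0 = n00 * (1 - t[..., 0]) + t[..., 0] * n10
    n1 = n01 * (1 - t[..., 0]) + t[..., 0] * n11
    return np.sqrt(2) * ((1 - t[..., 1]) * n0 + t[..., 1] * n1)

rng = np.random.default_rng(0)
image = rng.random((256, 256))                # stand-in for one image band in [0, 1]
noise = perlin_2d((256, 256), (8, 8), rng)    # values roughly in [-1, 1]
noisy = np.clip(image + 0.5 * noise, 0, 1)    # I_noisy = I + scale * P
mask = (noise + 1) / 2                        # rescaled as a transmission map
masked = image * mask                         # per-pixel opacity, as in the PNM
```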
In the Perlin Noise Mask (PNM) pre-training step, we pre-train the SPRGAN generators on an image inpainting task. We use the PNM to generate a masked image from the original unmasked image; the generator is then trained to predict the original unmasked image using a pixel-wise loss. After this initial pre-training, our method employs an unpaired training strategy using CycleGAN's cycle consistency and adversarial losses. This allows the model to generate and reconstruct clear images from hazy inputs without needing direct pairs. This approach is well suited to real-world applications such as urban surveillance, agricultural monitoring, coastal surveillance, and traffic monitoring, where paired hazy and clear images are challenging to obtain. The model's robustness and adaptability ensure reliable dehazing performance across diverse scenarios.
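A single PNM pre-training step, under these assumptions, might be sketched as follows; `generator`, `optimizer`, and the choice of an L1 pixel-wise loss are hypothetical stand-ins.

```python
import torch
import torch.nn.functional as F

def pnm_pretrain_step(generator, optimizer, clean, pnm_mask):
    """One PNM inpainting step: mask the image, reconstruct it, L1 loss.

    clean:    (B, 3, H, W) unmasked images in [0, 1]
    pnm_mask: (B, 1, H, W) Perlin-noise transmission map in [0, 1]
    """
    masked = clean * pnm_mask              # occlude regions per the PNM
    recon = generator(masked)              # predict the unmasked image
    loss = F.l1_loss(recon, clean)         # pixel-wise reconstruction loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```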
During the Perlin Noise Mask (PNM) pre-training step with SPRGAN, the generators are initially trained on an image inpainting task to learn to reconstruct images effectively. This pre-training leverages techniques in the frequency domain, such as the Fourier Transform, to process hazy images. The Fourier Transform decomposes the image into frequency components, where higher frequencies correspond to fine details and noise, and lower frequencies represent general image structures. By manipulating these components, filters can selectively enhance details and suppress noise. After applying these transformations, the inverse Fourier Transform reconstructs the processed image back into the spatial domain, yielding clearer images with improved detail visibility and reduced haze effects.
Throughout the pre-training process, the model learns to handle these Perlin Noise-Based Masks, thereby enhancing its ability to address the challenges posed by haze and improve both super-resolution and dehazing capabilities. The advantage of using PNM lies in its ability to introduce realistic variability and complexity into the pre-training process, better simulating real-world atmospheric conditions encountered in remote sensing images. By incorporating Perlin Noise patterns, the model gains the capacity to adapt to diverse hazy environments, leading to improved performance and accelerated convergence speed in subsequent tasks.
3.3. Enhanced Objective with Rotation Loss
In remote sensing image dehazing, we often encounter scenarios where the orientation of the input image may vary. Rotation Loss calculates the mean absolute difference between corresponding pixels of two haze-free outputs: the result of rotating the input by 180 degrees and passing it through the generator, and the result of passing the original input through the generator and then rotating the output. It quantifies the pixel-level difference between the two images, reflecting variations in color, texture, and other pixel attributes.
This Rotation Loss helps evaluate the model’s ability to maintain consistency in dehazing performance across different orientations of the input images, thus enhancing the model’s robustness and generalization capability. Additionally, it provides insights into how well the model preserves structural information during the dehazing process.
$$\mathcal{L}_{RT} = \underset{(i, j)}{\mathrm{mean}} \left| G(R(x))(i, j) - R(G(x))(i, j) \right|$$
where $G(x)(i, j)$ represents the intensity value of the pixel at position $(i, j)$ in the haze-free image generated by the generator, $R$ represents the operation of rotating the image by 180 degrees, $\left| \cdot \right|$ denotes the absolute value, and $\mathrm{mean}$ computes the mean value over all pixel coordinates. With the inclusion of the RT Loss, we can formulate our objective as follows:
$$\mathcal{L} = \mathcal{L}_{adv} + \lambda_{1} \mathcal{L}_{cyc} + \lambda_{2} \mathcal{L}_{idt} + \lambda_{3} \mathcal{L}_{tv} + \mathcal{L}_{RT}$$
where $\mathcal{L}_{adv}$, $\mathcal{L}_{cyc}$, $\mathcal{L}_{idt}$ and $\mathcal{L}_{tv}$ are the adversarial loss, cycle-consistency loss, identity-consistency loss and total variation loss, respectively. The parameters $\lambda_{1}$, $\lambda_{2}$ and $\lambda_{3}$ control the relative importance of the three corresponding objectives.
The core idea behind Rotation Loss (RT Loss) is to enforce a constraint ensuring that the dehazing results remain consistent regardless of geometric transformations applied to the input images. When an image is geometrically transformed (e.g., rotated) and then input into the model, the resulting dehazed image should be nearly identical to the image obtained by first inputting the original image into the model and then applying the same transformation to the output. This constraint ensures consistent performance across different orientations. Additionally, simply augmenting the dataset with rotated images only ensures the model is trained on various angles, but does not guarantee uniform dehazing quality. Using RT Loss, the training process becomes more efficient, as it achieves better outcomes with fewer batches per epoch, enhancing the model’s robustness and adaptability to diverse scenarios.
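A minimal PyTorch sketch of the RT Loss follows; the `generator` handle and batch layout are assumptions, while `torch.rot90` with `k=2` implements the 180-degree rotation $R$.

```python
import torch

def rotation_loss(generator, hazy):
    """RT Loss: dehazing then rotating should match rotating then dehazing.

    hazy: (B, C, H, W) batch of hazy inputs.
    """
    rotated = torch.rot90(hazy, k=2, dims=(2, 3))                  # R(x)
    out_then_rot = torch.rot90(generator(hazy), k=2, dims=(2, 3))  # R(G(x))
    rot_then_out = generator(rotated)                              # G(R(x))
    return (out_then_rot - rot_then_out).abs().mean()  # mean |G(R(x)) - R(G(x))|
```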