1. Introduction
Extreme weather phenomena, encompassing haze, fog, and dust storms, significantly diminish the visibility of the surroundings, causing images captured under such conditions to suffer a substantial loss of color fidelity and content detail. This image degradation poses a formidable challenge to tasks such as object recognition, image segmentation, object tracking, and other advanced visual processing endeavors, ultimately resulting in a marked decline in the accuracy of predictions and outcomes derived from these high-level visual tasks. Therefore, as a fundamental vision-processing task, single-image dehazing has attracted significant attention in recent years. The purpose of single-image dehazing is to restore a clean scene from a hazy image. As is well established, the haze formation process is often mathematically modeled using the Atmospheric Scattering Model (ASM) [
1]. This representation is formulated as

$$I(x) = J(x)\,t(x) + A\,\big(1 - t(x)\big),$$

where $I(x)$ represents the hazy image that is observed, $J(x)$ signifies the corresponding haze-free image, $A$ denotes the global atmospheric light, capturing the environment’s overall illumination characteristics, and $t(x)$ denotes the transmission map, a function that quantifies the portion of light reaching the observer without being scattered by particles in the atmosphere. This model serves as a fundamental basis for analyzing and mitigating the effects of haze on visual data. The transmission map is determined by the distance $d(x)$ from the scene to the camera and the atmospheric scattering coefficient $\beta$, and it can be formulated as an exponential decay function with respect to the scene-to-camera distance:

$$t(x) = e^{-\beta d(x)}.$$
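To make the model concrete, below is a minimal NumPy sketch (not part of the original formulation) that synthesizes a hazy image from a clean image and a depth map using the ASM; the atmospheric light and scattering coefficient values are illustrative assumptions.

```python
import numpy as np

def synthesize_haze(clean, depth, A=0.9, beta=1.2):
    """Apply the ASM: I(x) = J(x) * t(x) + A * (1 - t(x)).

    clean : H x W x 3 array in [0, 1], the haze-free image J(x)
    depth : H x W array, the scene-to-camera distance d(x)
    A     : global atmospheric light (illustrative value)
    beta  : atmospheric scattering coefficient (illustrative value)
    """
    t = np.exp(-beta * depth)[..., None]   # transmission map t(x) = exp(-beta * d(x))
    hazy = clean * t + A * (1.0 - t)       # ASM composition
    return np.clip(hazy, 0.0, 1.0)

# Example with random placeholder data standing in for a real image/depth pair.
J = np.random.rand(256, 256, 3)
d = np.random.rand(256, 256) * 5.0
I = synthesize_haze(J, d)
```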
Given a hazy image, the problem of recovering its clean version is highly ill-posed. Conventional approaches rely on ASM [
1] and exploit hand-crafted prior assumptions to solve this challenge. For example, in order to estimate the transmission map, He et al. [
2] proposed a Dark Channel Prior that eliminates the halo phenomenon and blocking artifacts well. However, when the image tends towards whiteness (especially in sky regions), it causes color distortion and weakens haze removal. Berman et al. [
3] proposed a non-local prior to characterize the recovered images. Nevertheless, due to its excessive dependence on color classification accuracy, its performance degrades severely as the haze concentration increases. Compared with previous prior-based methods, the IDE [
4] method proposed by Ju et al. introduces the light absorption coefficient into the atmospheric scattering model [
1], which can boost the visibility of hazy images. However, in distant regions, the restored results often appear overly dark. Ju et al. proposed an extremely efficient single-image dehazing method named IDGCP [
5] based on a gamma correction prior (GCP) to restore high-quality images with only an unknown constant. Huang et al. [
6] introduced an innovative image dehazing technique that ingeniously integrates the strengths of various dehazing strategies, including pixel-level, local, non-local, and scene-aware approaches. This comprehensive approach addresses the limitations of relying exclusively on a single prior, thereby enhancing the effectiveness and versatility of image dehazing. These methods constrain the solution space to some extent and enhance image visibility. However, priors and assumptions tailored to particular scenes or atmospheric circumstances may lead to distorted recovered images.
In the past decade, convolutional neural networks (CNNs) [
7] have achieved significant advancements in the field of image dehazing. Many researchers have proposed a multitude of data-driven approaches. Early methods leveraging deep learning techniques, such as All-in-One Dehazing (AOD) [
8] and Multi-Scale CNN (MSCNN) [
9], have demonstrated the great success of CNNs. Unlike most previous models, which estimate the transmission map and the atmospheric light value separately, Li et al. [
8] unified these two parameters into one variable by designing an all-in-one dehazing network. The multi-scale convolutional neural network proposed by Ren et al. [
9] extracts useful features from hazy images to facilitate scene transmission map estimation. Instead of minimizing the loss between the dehazed image and the real image, the network is trained to optimize the accuracy of the reconstructed transmission map against its ground truth, enhancing dehazing effectiveness.
However, these methods still follow the traditional dehazing model in Equation (
1) and mostly use low-level image features. It has been proven that by stacking convolutional layers to obtain deeper features, the image quality can be improved. To directly obtain restored images, Qin et al. [
10] proposed an end-to-end feature fusion attention network (FFA). This method integrates channel-wise and pixel-wise attention mechanisms (CA and PA) to handle different features and uneven pixels, further enhancing the clarity and fidelity of the reconstructed image. However, if different levels of haze are weighted in the same way, the dehazing effect will be reduced. The above methods are prone to over-fitting, time-consuming, and have poor real-time performance. The PSNR [11] and SSIM [12] values of some state-of-the-art (SOTA) models and our proposed method are presented in Figure 1.
Although current CNN-based methods achieve remarkable performance, their model capacity is limited and depends heavily on feature extraction. Indeed, by leveraging an encoder–decoder CNN architecture [
7], the model effectively captures the nonlinear relationship between blurred and sharp image pairs, enabling superior feature extraction and overall performance. For instance, Jing et al. [
13] proposed a U-Net-based feature attention dehazing network, which adopted feature attention, dense connections, and residual dense blocks to solve the non-homogeneous dehazing task. Bianco et al. [
14] proposed HR-Dehazer, which utilizes encoder–decoder architecture to train the mapping between hazy images and dehazed images. Feng et al. [
15] proposed the URNet dehazing model. They use a hybrid convolution that combines standard convolution and dilated convolution in the network’s encoder to enlarge the receptive field and thus extract image features better. Lu et al. [
16] proposed a novel framework combining multi-scale processing, large convolutional kernels, and attention mechanisms. These designs can enhance the feature extraction and learning capabilities of networks. Shen et al. [
17] designed a two-stage image dehazing network. The first stage is amplitude-guided dehazing, and the second stage is phase-to-structure refinement. Although this method can reduce the probability of information redundancy and enhance the complementarity of features between different layers, it can only capture feature information within a limited domain. Due to the superior performance of self-attention [
18], Shi et al. [
19] built a Transformer branch combined with Transformer [
18] self-attention and built a convolutional neural network branch based on the locally varying attention module. When the two branches are combined, non-local features could be recovered through the Transformer [
18] branch, and local features could be obtained through the CNN branch to complete the recovery of the whole image. However, at present, this method works better for face super-resolution, and its role in the field of image dehazing remains to be further explored. The above methods can achieve excellent performance but lack integrated multi-layer feature extraction and fusion as well as an enhanced focus on intricate details. Therefore, they still have great room for improvement in extracting image features. To optimize feature extraction capabilities and refine the allocation of feature weights, we propose a multi-level feature fusion method and point-wise weighted attention.
In this work, we propose an adaptive multi-feature attention network (AMFAN) that applies to the removal of both homogeneous and non-homogeneous haze in images. A sketch of the main ideas is shown in
Figure 2. Inspired by the method in which Jing et al. [
13] add dense connections in both the encoder and decoder parts to improve the information utilization between non-adjacent layers, we utilize the basic framework of the haze reduction approach presented by Jing et al. [
13] and simultaneously design an attention multi-layer feature fusion module (AMLFF) and a point-wise weighted attention module (PWA) in the decoder part, which yield a balanced weight distribution between detailed information and structural information as well as a point-to-point attention weight map. Therefore, in the challenging non-homogeneous haze case, our method largely surpasses some previous methods. To ensure a balanced evaluation of both objective and subjective image quality, the training process incorporates not only image quality losses like PSNR [
11] and SSIM [
12] but also a loss function that captures visual perception. This approach comprehensively assesses the restored images’ fidelity and perceptual appeal.
To summarize, the key contributions outlined in this paper encompass the following aspects.
We propose an adaptive multi-feature attention network. The point-wise attention module (PWA) for enhancing the fusion strategy and the attention multi-layer feature fusion (AMLFF) module for balancing features across different layers are two key components of this network.
To enhance multi-scale feature fusion, we propose the point-wise attention (PWA) module, which integrates both global and localized features from the feature maps by focusing on important regions.
We introduce a feature fusion block (FFB) based on the idea of AFF by using the point-wise weighted attention module (PWA) to better balance the multi-scale features of the inputs. In addition, we use the proposed feature fusion block (FFB) to design an attention multi-layer feature fusion module (AMLFF) to fuse features from different layers adaptively and balance the weights between feature maps well.
In this article, the first section introduces the research background and the methodology employed in this paper, the second section discusses the development of image dehazing methods, the third section elaborates on the architecture of the introduced network framework and the loss functions used, and the fourth section demonstrates the effectiveness and advancement of the proposed method through extensive experiments.
3. Proposed Method
The subsequent section delves into the intricacies of our adaptive multi-feature attention network tailored for dehazing purposes, illustrated in
Figure 2.
As shown in
Figure 2, our adaptive multi-feature attention network is established based on the encoder–decoder structure. Specifically, we stack a layer of 3 × 3 convolution and a feature attention (FA) [
10] module at each encoder layer. The 3 × 3 convolution is used to refine the extracted features, and the FA [
10] module allows the network to prioritize salient and informative features while diminishing the influence of irrelevant or disruptive data. Finally, the features are passed through a downsampling layer to the next encoder layer. To reduce feature loss during image compression, we introduce dense connections [
33] between the different layers of the encoder. In the high-dimensional part of the network, we choose to keep the residual module of FAD-U-Net [
13] to realize the removal of image haze. In the decoder part, we introduce substantial modifications on the basis of FAD-U-Net [
13] and establish new attention modules to accomplish adaptive integration of image features across varying levels and further restore clear images. At the end, a layer of 3 × 3 convolution is used to obtain the final clear image.
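To illustrate the encoder design described above, the following is a minimal PyTorch sketch of one encoder stage (3 × 3 convolution, feature attention, downsampling). The FA block here is a simplified stand-in in the spirit of FFA [10] (channel attention followed by pixel attention), and the channel counts, reduction ratio, and use of a strided convolution for downsampling are assumptions of this sketch rather than the exact configuration of the network.

```python
import torch
import torch.nn as nn

class FABlock(nn.Module):
    """Simplified feature attention: channel attention followed by pixel attention."""
    def __init__(self, channels):
        super().__init__()
        self.ca = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, 1), nn.Sigmoid())
        self.pa = nn.Sequential(
            nn.Conv2d(channels, channels // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, 1, 1), nn.Sigmoid())

    def forward(self, x):
        x = x * self.ca(x)      # re-weight channels
        return x * self.pa(x)   # re-weight pixels

class EncoderLayer(nn.Module):
    """One encoder stage: 3x3 convolution -> feature attention -> downsampling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.refine = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.fa = FABlock(out_ch)
        self.down = nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1)

    def forward(self, x):
        feat = self.fa(torch.relu(self.refine(x)))
        return self.down(feat), feat  # downsampled output plus skip feature for dense connections

x = torch.randn(1, 3, 256, 256)
down, skip = EncoderLayer(3, 64)(x)
```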
The implementation details of the decoder are described in this section. Firstly, we specify the details of the point-wise weighted attention module, which aggregates the local and global information of high-level feature maps. This configuration not only enhances the network’s capability to prioritize haze-impacted areas but also effectively strengthens feature fusion, improving the network’s dehazing performance. Then, we elaborate on the attention multi-layer feature fusion module, which effectively balances features from different levels. Finally, we detail the loss functions employed throughout the training process.
3.1. Point-Wise Weighted Attention
Considering that features at different scales carry distinct weight information and that haze patches are non-uniformly distributed, we design a point-wise weighted attention module (PWA) at the pixel level to improve the weighting of the feature fusion strategy, as shown in
Figure 3b. Since spatial feature information is crucial to the integrity of the image, we adaptively scale features by exploiting channel attention to obtain the spatial inter-dependencies between channels. In this module, we initially integrate the global spatial information across channels into channel descriptors utilizing global average pooling [
34]; the operation formula is as follows:

$$g_{c} = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} X_{c}(i,j), \qquad g_{w} = \frac{1}{C \times H}\sum_{i=1}^{C}\sum_{j=1}^{H} X_{w}(i,j), \qquad g_{h} = \frac{1}{C \times W}\sum_{i=1}^{C}\sum_{j=1}^{W} X_{h}(i,j),$$

where $X_{c}(i,j)$ represents the value at position $(i,j)$ of the $C$-th channel; $X_{w}(i,j)$ represents the value at position $(i,j)$ of the $W$-th channel of the transposed feature map; $X_{h}(i,j)$ represents the value at position $(i,j)$ of the $H$-th channel of the transposed feature map; and $g_{c}$, $g_{w}$, and $g_{h}$ are the descriptors produced by the global average pooling function. After the average pooling on the channels, the shape of the feature map changes from $C \times H \times W$ to $C \times 1 \times 1$. Meanwhile, the input feature map is transposed so that the width and height dimensions, respectively, act as the channel dimension; average pooling is performed on the transposed feature maps, and their shapes are changed to $W \times 1 \times 1$ and $H \times 1 \times 1$, respectively.
Subsequently, the three obtained feature maps are sequentially passed through two layers of convolution and a ReLU activation function [
9], followed by a Sigmoid function, with the aim of deriving weights for different channels:

$$A_{c} = \sigma\big(Conv_{1\times1}(\delta(Conv_{1\times1}(g_{c})))\big), \qquad A_{w} = \sigma\big(Conv_{1\times1}(\delta(Conv_{1\times1}(g_{w})))\big), \qquad A_{h} = \sigma\big(Conv_{1\times1}(\delta(Conv_{1\times1}(g_{h})))\big),$$

where $g_{c}$, $g_{w}$, and $g_{h}$ denote the average pooling results on the channel, width, and height dimensions, respectively; $Conv_{1\times1}$ represents a convolution with a kernel size of 1 × 1; $\delta$ denotes the Rectified Linear Unit; and $\sigma$ denotes the Sigmoid function. Then, the three obtained attention maps with weights of different channels result in a point-wise weighted attention feature map enriched with global contextual data through multiplication:
$$F_{g} = A_{c} \otimes A_{w} \otimes A_{h},$$

where $A_{c}$ represents an attention feature map of size $C \times 1 \times 1$, $A_{w}$ represents an attention feature map of size $W \times 1 \times 1$, $A_{h}$ represents an attention feature map of size $H \times 1 \times 1$, and $\otimes$ denotes broadcast element-wise multiplication. To emphasize the importance of hazy regions and high-frequency feature details, we design an attention module capable of obtaining local contextual content.
The attention module has two convolutional layers, with a ReLU activation layer and a BN layer [
35] behind each convolutional layer. Through these convolution and BN layers, the point-wise weighted attention map with local contextual information can be obtained:
$$F_{l} = B\big(Conv(\delta(B(Conv(X))))\big),$$

where $X$ represents the input feature map, $Conv$ denotes a convolution layer, $B$ is the Batch Normalization (BN) [35], and $\delta$ denotes the Rectified Linear Unit. The size of the feature map is not changed; it keeps the original size of the input feature map, $C \times H \times W$.
Then, the obtained global and local attention maps are multiplied together. Finally, the resulting feature map is added point-wise to the original input feature map, and the final element-wise weighted attention feature map is obtained through the Sigmoid activation layer:
$$F_{PWA} = \sigma\big(X \oplus (F_{g} \otimes F_{l})\big),$$

where $X$ represents the input feature map, $F_{g}$ is the point-wise weighted attention feature map with global context information, $F_{l}$ denotes the point-wise weighted attention map with local context information, $\oplus$ and $\otimes$ denote element-wise addition and multiplication, and $\sigma$ represents the Sigmoid activation function. The point-wise weighted attention module (PWA) does not obtain only the channel weights or only the pixel weights but combines the two, so our proposed point-wise weighted attention module has stronger robustness.
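As a rough illustration of the PWA described above, below is a minimal PyTorch sketch. The use of 1 × 1 convolutions in both branches, the reduction ratio r, and fixing the feature-map height and width at construction time are assumptions of this sketch, not details confirmed by the paper.

```python
import torch
import torch.nn as nn

class PWA(nn.Module):
    """Sketch of the point-wise weighted attention module (PWA).

    Global branch: average pooling along the channel, height, and width views of the
    input, each followed by two 1x1 convolutions; the three weight vectors are
    broadcast-multiplied into a C x H x W attention map.
    Local branch: two 1x1 convolutions, each followed by BN (ReLU in between).
    Output: sigmoid(input + global_map * local_map), as described in the text.
    """
    def __init__(self, channels, height, width, r=4):
        super().__init__()
        def mlp(dim):
            return nn.Sequential(
                nn.Conv2d(dim, max(dim // r, 1), 1), nn.ReLU(inplace=True),
                nn.Conv2d(max(dim // r, 1), dim, 1), nn.Sigmoid())
        self.mlp_c, self.mlp_h, self.mlp_w = mlp(channels), mlp(height), mlp(width)
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels // r, 1), nn.BatchNorm2d(channels // r),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, 1), nn.BatchNorm2d(channels))

    def forward(self, x):                                                   # x: B x C x H x W
        b, c, h, w = x.shape
        gc = self.mlp_c(x.mean(dim=(2, 3), keepdim=True))                   # B x C x 1 x 1
        gh = self.mlp_h(x.permute(0, 2, 1, 3).mean(dim=(2, 3), keepdim=True))  # B x H x 1 x 1
        gw = self.mlp_w(x.permute(0, 3, 1, 2).mean(dim=(2, 3), keepdim=True))  # B x W x 1 x 1
        global_map = gc * gh.view(b, 1, h, 1) * gw.view(b, 1, 1, w)         # broadcast to B x C x H x W
        local_map = self.local(x)
        return torch.sigmoid(x + global_map * local_map)

y = PWA(channels=64, height=32, width=32)(torch.randn(2, 64, 32, 32))
```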
3.2. Adaptive Multi-Layer Feature Fusion
High-level features have low spatial resolution due to downsampling, which is beneficial for haze removal but compresses much of the semantic context information. Compared to high-level features, low-level features possess higher resolution, which significantly aids in the localization of haze regions. Therefore, in order to fuse the feature maps from the encoder and decoder, we propose an attention multi-layer feature fusion module (AMLFF) based on the point-wise weighted attention module (PWA) to redistribute the weights of the feature maps of different scales and then obtain the fused features. The architecture of the attention multi-layer feature fusion module (AMLFF) is shown in
Figure 4. Our proposed AMLFF achieves feature-level fusion through our proposed feature fusion block (FFB), which exploits the idea of AFF. The following formula shows the principle of AFF:
$$Z = M(X \cup Y) \otimes X + \big(1 - M(X \cup Y)\big) \otimes Y,$$

where $Z$ is the fused feature, $X$ and $Y$ are the two input features, $M$ represents MS-CAM, and $\cup$ denotes the initial feature integration. As shown in
Figure 4 and
Figure 5, our proposed attention multi-layer feature fusion module (AMLFF) is multi-input compared with AFF, while AFF is two-input. In addition, the AMLFF is based on our proposed point-wise weighted attention module (PWA), while AFF is based on MS-CAM. The architecture of MS-CAM is presented in
Figure 3a.
Exploiting the idea of AFF, we use our proposed point-wise weighted attention module (PWA) to design a feature fusion block (FFB). The formula of the feature fusion block is as follows:
$$Z = P(X \cup Y) \otimes X + \big(1 - P(X \cup Y)\big) \otimes Y,$$

where $X$ and $Y$ are the input features, $P$ represents our proposed point-wise weighted attention module (PWA), and $\cup$ is the initial feature concatenation.
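Following the formula above, a minimal sketch of the feature fusion block is shown below. It assumes the PWA class from the previous sketch is in scope, and it uses element-wise addition as the initial feature integration, which is an assumption of this sketch.

```python
import torch
import torch.nn as nn

class FFB(nn.Module):
    """Feature fusion block: AFF-style soft selection driven by PWA instead of MS-CAM."""
    def __init__(self, channels, height, width):
        super().__init__()
        self.pwa = PWA(channels, height, width)   # PWA sketch from Section 3.1

    def forward(self, x, y):
        m = self.pwa(x + y)             # initial integration, then point-wise weights in (0, 1)
        return m * x + (1.0 - m) * y    # Z = P(X ∪ Y) ⊗ X + (1 − P(X ∪ Y)) ⊗ Y

# Usage: z = FFB(64, 32, 32)(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
```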
Inspired by dehaze-flow [
36], the fusion of multiple sets of convolutional features helps improve the convergence of the network. Therefore, the proposed attention multi-layer feature fusion module (AMLFF) takes as input the up-sampled features from two decoder branches with different convolution kernel sizes, together with the down-sampled features from the encoder. It is believed that generating a variety of dehazing results for adaptive fusion can achieve better results. The attention multi-layer feature fusion module (AMLFF) processing can be expressed as follows:
$$F_{AMLFF} = B\big(\mathrm{FFB}(F_{d_{1}}, F_{d_{2}}, F_{e})\big),$$

where $F_{AMLFF}$ is the result of adaptive multi-layer feature fusion, $\mathrm{FFB}(\cdot)$ represents the feature fusion block (FFB) applied to the inputs, $F_{d_{1}}$ and $F_{d_{2}}$ denote the two groups of up-sampled decoder features, $F_{e}$ denotes the down-sampled encoder features, and $B$ is the Batch Normalization (BN).
The attention multi-layer feature fusion module (AMLFF) can not only effectively balance the weight of the decoder input but also improve the robustness of the network.
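One possible composition consistent with the description above (three inputs fused by cascaded feature fusion blocks and followed by BN) is sketched below; the nesting order of the two FFB applications is an assumption of this sketch, and the FFB modules are expected to follow the interface of the previous sketch.

```python
import torch
import torch.nn as nn

class AMLFF(nn.Module):
    """Attention multi-layer feature fusion: cascaded FFBs over three inputs, then BN."""
    def __init__(self, channels, ffb_dec: nn.Module, ffb_enc: nn.Module):
        super().__init__()
        self.ffb_dec = ffb_dec              # fuses the two up-sampled decoder features
        self.ffb_enc = ffb_enc              # fuses the result with the encoder feature
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, dec_1, dec_2, enc):
        fused = self.ffb_enc(self.ffb_dec(dec_1, dec_2), enc)
        return self.bn(fused)
```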
3.3. Loss Function
We utilized three loss functions as the optimization objective of the proposed network: the reconstruction loss $L_{rec}$, the image structure similarity loss $L_{SSIM}$ [37], and the contrastive regularization $L_{CR}$:

$$L = \lambda_{1} L_{rec} + \lambda_{2} L_{SSIM} + \lambda_{3} L_{CR},$$

where $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ are the weights of the three loss terms.
Reconstruction loss. This loss can facilitate feature selection in model optimization, and it has been proven that the training effect of the $L_{1}$ loss is better than that of the $L_{2}$ loss [38]. The reconstruction loss is defined as

$$L_{rec} = \big\| J - \mathrm{AMFAN}(I) \big\|_{1},$$

where $J$ denotes the corresponding clean image, $I$ stands for the input hazy image, and $\mathrm{AMFAN}(\cdot)$ is the proposed adaptive multi-feature attention network.
SSIM loss. SSIM [
12] has a good correlation with the human visual system, so we utilize the Structural Similarity Index (SSIM) loss to further enhance the visual quality of the defogged images. Hence, we define the SSIM loss as follows:
$$L_{SSIM} = 1 - \frac{1}{N}\sum_{i=1}^{N} \mathrm{SSIM}(i),$$

where $N$ represents the number of pixels and $\mathrm{SSIM}(i)$ represents the SSIM value at pixel $i$.
Contrastive regularization. Contrastive learning aims to learn a representation that is pulled closer to “positive” samples in the latent space and pushed away from “negative” samples. Therefore, contrastive regularization (CR), which is capable of generating a better-quality image, is incorporated into the loss function as:
$$L_{CR} = \sum_{i} \omega_{i}\, D\big(G_{i}(I),\, G_{i}(\mathrm{AMFAN}(I)),\, G_{i}(J)\big),$$

where $D(\cdot,\cdot,\cdot)$ represents a loss function that calculates the differences or similarity between three feature representations, $G$ [39] is the fixed pre-trained model, e.g., VGG-19, whose $i$-th hidden layer extracts the features $G_{i}(\cdot)$, and $\mathrm{AMFAN}(\cdot)$ is the proposed adaptive multi-feature attention network. Specifically, for the non-ablation mode, where ablation equals False, $D$ is defined as

$$D\big(G_{i}(I),\, G_{i}(\mathrm{AMFAN}(I)),\, G_{i}(J)\big) = \frac{\big\| G_{i}(J) - G_{i}(\mathrm{AMFAN}(I)) \big\|_{1}}{\big\| G_{i}(I) - G_{i}(\mathrm{AMFAN}(I)) \big\|_{1} + \epsilon}.$$

For the ablation mode, when ablation equals True, $D$ is defined as

$$D\big(G_{i}(I),\, G_{i}(\mathrm{AMFAN}(I)),\, G_{i}(J)\big) = \big\| G_{i}(J) - G_{i}(\mathrm{AMFAN}(I)) \big\|_{1}.$$

In this context, $\omega_{i}$ represents the weight of each layer, $\| \cdot \|_{1}$ denotes the $L_{1}$ loss, and $\epsilon$ is a small constant used for numerical stability.
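For illustration, below is a minimal PyTorch sketch of such a contrastive regularization term under the stated assumptions: the chosen VGG-19 layer indices, equal layer weights, and the value of ε are illustrative placeholders, and torchvision is assumed to be available for the pre-trained backbone.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg19

class ContrastiveRegularization(nn.Module):
    """Contrastive regularization: pull the restored image toward the clean image
    (positive) and away from the hazy input (negative) in VGG-19 feature space."""
    def __init__(self, layer_ids=(3, 8, 15), weights=(1.0, 1.0, 1.0), eps=1e-7):
        super().__init__()
        self.vgg = vgg19(weights="IMAGENET1K_V1").features.eval()  # fixed pre-trained model
        for p in self.vgg.parameters():
            p.requires_grad_(False)
        self.layer_ids, self.weights, self.eps = set(layer_ids), weights, eps

    def _features(self, x):
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.layer_ids:
                feats.append(x)
        return feats

    def forward(self, restored, clean, hazy, ablation=False):
        loss = 0.0
        fr, fc, fh = self._features(restored), self._features(clean), self._features(hazy)
        for w, r, c, h in zip(self.weights, fr, fc, fh):
            pos = F.l1_loss(r, c)                      # distance to the positive (clean) sample
            if ablation:                               # ablation mode: positive term only
                loss = loss + w * pos
            else:                                      # non-ablation mode: ratio with negative
                neg = F.l1_loss(r, h)
                loss = loss + w * pos / (neg + self.eps)
        return loss
```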
4. Experiments
This section showcases both the quantitative and qualitative experimental results, demonstrating the effectiveness of our innovative methodology. A comprehensive experimental campaign is undertaken, comparing the performance of our method against the cutting-edge techniques on benchmark datasets such as O-HAZE [
40] and RESIDE [
41]. The O-HAZE [
40] dataset includes 40 sets of hazy and corresponding ground-truth images, while 5 such sets are reserved for testing purposes, enabling a rigorous evaluation. Furthermore, we perform an exhaustive ablation analysis, meticulously examining the contribution of each component within our network architecture, thereby demonstrating their individual and collective effectiveness.
4.1. Experiment Setup
Implementation details. The proposed network can be trained in a completely end-to-end manner, eliminating the need for the pre-training of individual sub-modules. We implement our experiments utilizing the PyTorch framework. For training, we employ the Adam optimizer, with its exponential decay rates set to 0.9 and 0.999, respectively. Each model undergoes a total of 1000 epochs of training, with an initial learning rate of 0.0001, which undergoes a step-wise reduction by a factor of three-quarters every 100 epochs. The hyperparameters of our hybrid loss function are set to $(\lambda_{1}, \lambda_{2}, \lambda_{3}) = (1, 0.55, 0.001)$.
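A minimal sketch of this training configuration is given below; the proposed network is replaced by a stand-in module so that the snippet is self-contained, and the StepLR scheduler with gamma = 0.75 every 100 epochs mirrors the stated step-wise reduction by a factor of three-quarters.

```python
import torch

# Stand-in for the proposed network so that the snippet runs; replace with the actual model.
model = torch.nn.Conv2d(3, 3, 3, padding=1)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.75)

for epoch in range(1000):
    # ... one epoch of training: forward pass, hybrid loss, backward pass, optimizer.step() ...
    scheduler.step()   # multiply the learning rate by 0.75 every 100 epochs
```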
Datasets. To validate the proposed method, we evaluate it on both synthetic and real-world datasets: the synthetic RESIDE [
41] dataset and the real-world O-HAZE [
40] dataset. On the O-HAZE [
40] dataset, we perform a qualitative and quantitative comparison of our method with state-of-the-art approaches, including NLD [
3], IDE [
4], FFA [
10], MSCNN [
9], MSBDN [
42], and FAD-U-Net [
13]. RESIDE [
41] is a comprehensive synthetic dataset. We compared with the state-of-the-art methods, which include NLD [
3], FFA [
10], DehazeNet [
22], MSCNN [
9], and MSBDN [
42].
Evaluation metric and competitors. For quantitative assessment, we utilized the Peak Signal-to-Noise Ratio (PSNR) [
11], Structural Similarity Index (SSIM) [
12], and Reduced-Reference Partial Discrepancy (RRPD) [
43] metrics. These metrics are standard for evaluating image quality in dehazing tasks, with higher PSNR [
11] and SSIM [
12] indicating superior image recovery and lower RRPD signifying optimized gray value distribution for enhanced image clarity. Our comparisons encompass both prior-based methods, such as NLD [
3] and IDE [
4], and image translation-based approaches, including FFA [
10] and FAD-U-Net [
13], to validate the efficacy of our novel approach.
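For reference, PSNR can be computed directly from the mean squared error between a restored image and its ground truth; the short sketch below assumes images normalized to [0, 1].

```python
import numpy as np

def psnr(restored, ground_truth, max_val=1.0):
    """Peak Signal-to-Noise Ratio: 10 * log10(MAX^2 / MSE)."""
    mse = np.mean((restored.astype(np.float64) - ground_truth.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return 10.0 * np.log10((max_val ** 2) / mse)

# Example with placeholder arrays in [0, 1].
clean = np.random.rand(256, 256, 3)
noisy = np.clip(clean + 0.01 * np.random.randn(*clean.shape), 0.0, 1.0)
print(psnr(noisy, clean))
```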
4.2. Comparison with State-of-the-Art Methods
Results on Synthetic Dataset. We selected SOTS-OUT, a data subset of RESIDE [
41] containing outdoor scenes, as the baseline synthetic dataset for training and testing the network. The quantitative results of our method and other state-of-the-art methods on the synthetic dataset are shown in
Table 1, and the qualitative results are exhibited in
Figure 6 and
Figure 7. As depicted in
Table 1, our method achieved a PSNR [
11] of 29.55 dB and an SSIM [
12] of 0.96. In the outdoor testing scenario, our PSNR [
11] score surpasses the second-best method by a notable margin of 1.67 dB, solidifying the effectiveness of our proposed approach.
Figure 6 exhibits enlarged blocks of the restored images on the SOTS-OUT dataset. Specifically, the texture details of the image recovered by IDE [
4] are fuzzy, and the boundaries between objects in the image recovered by FFA [
10], MSCNN [
9], and MSBDN [
42] are blurry. Our proposed method is able to recover the edge regions better than other methods. In
Figure 7, we compare our network with the competing methods in terms of the quality of the restored images on the SOTS dataset. In particular, the colors of images recovered by NLD [
3] are severely distorted. FFA [
10] and DehazeNet [
22] leave hazy residues in the sky area of the recovered images. Similarly, although the visual performance of images recovered by MSBDN [
42] is excellent, the texture between floors is still not clearly restored. The images recovered by MSCNN [
9] often have higher color saturation compared to the ground truth. In comparison, our method is able to achieve high evaluation metrics while maintaining superior visual performance.
Results on Realistic Dataset. We have chosen the outdoor image dataset O-HAZE [
40] as the realistic benchmark dataset to train and test our method. The quantitative results on this dataset are shown in
Table 2; our method achieved a PSNR [11] of 21.24 dB, an SSIM [12] of 0.76, and an RRPD [43] of 12.85, surpassing the second-place method by 3.12 dB in PSNR [11] on the O-HAZE [40] testing set.
As shown in
Figure 8, our network exhibits remarkable performance in haze removal while preserving vital information like colors and edges. Notably, other popular dehazing methods exhibit limitations. For instance, NLD [
3] suffers from significant color distortion, leading to the loss of important visual details. Similarly, IDE [
4] struggles to accurately estimate color information. On the other hand, FFA [
10] and MSCNN [
9] leave noticeable hazy areas and produce images with low overall contrast. In contrast, our network effectively mitigates these issues. The images recovered by MSBDN [
42] exhibit good visual effects, but when compared to the ground truth, the overall tone appears darker. Although the image recovered by the MFAF [
44] method has better color recovery, it has obvious haze residue. FAD-U-Net [
13] generates blurry edges, and noise remains. Although our method is based on FAD-U-Net, the visual effects of the recovered images far surpass those of FAD-U-Net. As illustrated in
Figure 6, our method effectively preserves the vibrancy of overall colors while recovering intricate details.
4.3. Ablation Study
To demonstrate the effectiveness of each component of the proposed model, we performed an ablation study on the O-HAZE datasets [
40].
We first adopted FAD-U-Net as our baseline dehazing network. Specifically, we augmented the base network with various modules as follows: (1) base + AMLFF: Add the attention multi-layer feature fusion module into the baseline. (2) base + PWA: Add the point-wise weighted attention module into the baseline. (3) base + PWA + AMLFF: Add both the point-wise weighted attention module and the attention multi-layer feature fusion module into the baseline. (4) base + SK + PWA + AMLFF: Add the single-kernel, point-wise weighted attention, and attention multi-layer feature fusion modules into the baseline. (5) Ours: Add the multi-kernel, point-wise weighted attention, and attention multi-layer feature fusion modules into the baseline.
Table 3 presents the averaged SSIM [
12] and PSNR [
11] results for dehazed images obtained from various model configurations. Figure 9 shows the PSNR [11] values of the different models at different epochs. Based on
Table 3, we put forward the following conclusions. (1) The attention network that combines global and local information performs better than the network with only global attention. As can be seen from the results, fully integrating both local and global information effectively enhances the performance of the network. (2) The placement of the Sigmoid activation function, and whether the attention map is multiplied with or added to the feature map, are questions worth verifying. From the PSNR [11] and SSIM [12] values in the table, it can be found that multiplying the obtained attention map with the feature map and then passing the result through the Sigmoid function works best. (3) It can also be found from the table that our proposed PWA outperforms MS-CAM, achieving higher PSNR [11] and SSIM [12] values, so our proposed PWA can improve the feature fusion strategy. (4) The effect of our proposed three-input AMLFF is superior to that of the two-input feature fusion strategy AFF, and the fusion of multi-scale features is beneficial to haze removal. This also shows that our proposed multi-scale feature fusion strategy AMLFF has better performance. (5) As evident from the table, feature fusion through multi-layer convolutions facilitates better convergence of the network, further enhancing its performance.
4.4. Comparisons of FLOPs and Parameters
To quantify the computational complexity and storage requirements of the proposed method and to demonstrate its superiority in these aspects, we have selected Floating Point Operations (FLOPs) and the parameter count as indicators of the model’s runtime cost and complexity. FLOPs reflect the amount of computation required for the model to perform a single forward pass: the higher the FLOPs, the more computational resources the model demands, resulting in a longer runtime. The parameter count represents the total number of trainable parameters in the model: the more parameters, the larger the storage space required by the model, which in turn necessitates more training data and computational resources.
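As an illustration of how such numbers can be obtained, the sketch below counts trainable parameters directly with PyTorch and, assuming the third-party thop package is installed, estimates the cost of a single 512 × 512 forward pass; the stand-in model is a placeholder for the actual network.

```python
import torch
import torch.nn as nn

# Placeholder model standing in for the actual dehazing network.
model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.Conv2d(64, 3, 3, padding=1))

params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {params / 1e6:.3f} M")

# Cost of one forward pass on a 512 x 512 input (requires the `thop` package).
try:
    from thop import profile
    macs, _ = profile(model, inputs=(torch.randn(1, 3, 512, 512),))
    print(f"FLOPs: {2 * macs / 1e9:.2f} G")   # one multiply-accumulate is commonly counted as two FLOPs
except ImportError:
    pass
```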
The FLOPs and parameters for the state-of-the-art methods are shown in
Table 4. We chose images with a resolution of 512 × 512 as the test benchmark for FLOPs. As shown in
Table 4, although MSBDN [
42], FFA [
10], and SCANet [
32] use very few parameters, their FLOPs far exceed those of FAD-U-Net [
13] and the proposed method. In comparison, although the proposed method is an improvement based on FAD-U-Net [
13], its FLOPs value is 1.02 G lower than that of FAD-U-Net [
13], indicating that the improvement in this paper can effectively enhance the computation speed for image dehazing tasks. By integrating the PSNR [
11] and SSIM [
12] metrics from MSBDN [
42], FFA [
10], and FAD-U-Net [
13], it becomes evident that the method presented in this paper is capable of maintaining superior restoration performance while also achieving a lower memory footprint and higher computational efficiency. This demonstrates the effectiveness of the proposed approach in balancing restoration quality with resource utilization, making it an attractive solution for image dehazing tasks.