1. Introduction
Forests are a vital resource on Earth, playing a crucial role in air purification, noise reduction, soil and water conservation, natural oxygen production, and climate regulation [1]. However, forest fires, as frequent natural disasters, not only consume trees and other forest resources but also pose serious threats to humans, animals [2], and the environment. Therefore, early detection and mitigation of forest fires is essential [3]. In recent years, the combination of deep learning techniques with UAV imagery has shown great potential for advancing forest fire identification [4]. Aerial imagery technology, particularly the use of UAVs equipped with optical cameras, has emerged as an important tool for wildfire prevention. These UAVs are capable of real-time monitoring and have gained popularity due to their versatility [5], high speed, and persistence. Their ability to integrate images from different flight altitudes enables wider coverage and the production of detailed images, making them the preferred choice for wildfire monitoring [6]. However, airborne cameras are susceptible to interference from relative motion, attitude changes, atmospheric turbulence, and other factors, resulting in motion blur in captured images [7]. This significantly reduces the visibility of forest fires and the accuracy of feature detection, segmentation, and object recognition. Thus, research on forest fire image deblurring based on deep learning, combined with UAV imagery, holds significant potential for advancing forest fire recognition [8].
Image deblurring has been studied extensively, and its theories and methods are relatively systematic and mature. Image deblurring methods can be categorized into blind deblurring [9,10,11,12] and non-blind deblurring [13,14,15,16], depending on whether the blur kernel is unknown or known. Traditional image deblurring methods exhibit certain limitations when applied to the practical task of forest fire image deblurring, including the requirement for a significant amount of prior knowledge, the production of low-quality restored images, and the tendency to introduce ringing artifacts. In response to these challenges, many researchers have turned to deep learning techniques for image deblurring. For instance, Schuler et al. [17] employed a deep neural network to estimate the depth features of a blurred image, subsequently transforming these features into the frequency domain to estimate the blur kernel; this approach allows non-blind deblurring using traditional methods. Similarly, Li et al. [18] used image prior information as a binary classifier, trained by a deep convolutional neural network, to achieve image recovery. Despite their potential, these methods are constrained by the accuracy of blur kernel estimation and exhibit low execution efficiency [19]. While effective for specific scene models, they lack robust generalization capabilities for addressing more challenging real-world deblurring tasks [20].
Recently, the deep learning community has shifted its focus to end-to-end blind motion deblurring strategies [21,22], bypassing explicit blur kernel estimation by directly mapping blurred images to clear images. Nah et al. [23] first introduced the DeepDeblur algorithm to confront the challenge of blurred images in dynamic scenes. Drawing on the concept of progressing from coarse to fine details, their deep convolutional neural network was designed to operate across multiple scales. While the DeepDeblur algorithm significantly enhances image deblurring capabilities, it is characterized by an exceedingly large number of model parameters [24]. To tackle this issue, Tao et al. [25] proposed a scale-recurrent network (SRN) capable of substantially reducing computation time by sharing network weights across different scales. Furthermore, Zhang et al. [22] proposed a multi-patch hierarchical network (DMPHN), which uses patch cutting instead of downsampling and cascades feature maps in the encoder–decoder process, leading to a drastic reduction in the amount of model computation. Kupyn et al. [26] successively proposed two deep network models, DeblurGAN and DeblurGAN-v2, using the generative ability of generative adversarial networks (GANs) to restore high-quality clear images. On this foundation, Zhang et al. [27] innovated by integrating two GANs: the blur GAN (BGAN) is utilized to generate images that closely resemble real motion blur, while the deblurring GAN (DBGAN) learns the process of recovering blurred images. Cho et al. [28] proposed a multi-input, multi-output U-Net (MIMO-UNet) and introduced asymmetric feature fusion to effectively merge multi-scale features and gradually improve image clarity from the lower subnet to the upper subnet.
Although image deblurring algorithms have made significant progress on mainstream datasets, restoring real-world blurred images to clear ones remains challenging.
To address the aforementioned issues, this study proposes a forest fire image deblurring model based on the MIMO-UNet algorithm. This model effectively reduces motion blur in UAV images which are captured during forest fire monitoring. The key contributions of this study are as follows:
We propose a multi-branch dilated convolution attention residual module to enhance the focus of the residual block. By employing dilated convolutions with different dilation factors, features from various receptive fields are captured. Integrating the residual block with a parallel attention block improves the network’s ability to process multi-scale features.
To further enhance the deblurring effect, we devise a spatial–frequency domain fusion module. This module not only extracts the information in the spatial and frequency domains, but also effectively combines them to reduce information loss.
We propose a multi-channel convolutional attention residual module, which efficiently captures image details and context information by processing features of different scales in parallel. This approach effectively addresses information loss and insufficient reconstruction quality in the decoder.
To improve the generalization performance of the model, this study proposes a weighted loss function which contains multi-scale content loss, multi-scale high-frequency information loss and multi-scale structure loss. In this way, the internal texture details of the image and the lost high-frequency information can be recovered, and the deblurring effect can be enhanced comprehensively.
The rest of the paper is organized as follows: Section 2 describes our dataset; Section 3 presents the proposed network architecture; Section 4 reports the experimental results and performance analysis; Section 5 provides the discussion; and Section 6 concludes this paper.
3. The Proposed Method
In this section, we describe the structure of the network and elaborate on the details of its main components: the preprocessing module, the multi-branch dilated convolution attention residual module in the encoder, the spatial–frequency domain fusion module, and the multi-channel convolution attention residual module in the decoder.
3.1. Structural Description of the Network
The network proposed in this study adopts multi-scale inputs and outputs with a coarse-to-fine structure strategy. We divide the network into four parts: a preprocessing module (PM) for shallow feature extraction, an encoder module (EM) for deep information extraction, a spatial–frequency domain fusion module (SFFM), and a decoder module (DM) for image restoration and reconstruction, where $B_k$ (k = 1, 2, 3) represents the multi-scale input blurred images and $\hat{S}_k$ (k = 1, 2, 3) represents the multi-scale output restored images, as shown in Figure 2.
EM consists of three sub-encoder modules, i.e., EM1, EM2, and EM3, and is used to extract features from blurred images at different scales. Initially, the input blurred image is scaled to obtain three blurry images with different resolutions, namely B1, B2, and B3. Then, EM employs multi-branch dilated convolution attention residual modules (MDAMs) for feature extraction, enhancing the capture of detailed information. In EM2 and EM3, feature fusion is optimized using the feature attention module, which integrates the features of adjacent scales (k = 1, 2 and k = 2, 3). The SFFM merges features across the spatial and frequency domains at different encoder scales before passing them to the decoder, thereby improving feature utilization and reducing information loss. DM consists of DM1, DM2, and DM3; the inputs of DM1 and DM2 are obtained by fusing the decoder output of the previous layer with the output of the SFFM. In addition, we designed a multi-channel convolution attention residual module (MCAM) in the DM to extract multi-scale feature information.
3.2. Preprocessing Module
Due to the continuous smoothness and sparsity of motion-blurred images, it is essential to employ receptive fields of varying sizes for effective feature extraction. To address this problem, PM uses multiple convolution modules connected in series and in parallel for shallow feature extraction before EM, where different receptive fields are captured using different convolution kernel sizes, namely 3 × 3 and an effective 5 × 5 (achieved by two cascaded 3 × 3 convolutions). Then, 1 × 1 convolutions are used to integrate the extracted features. This not only streamlines the output channels but also enhances the effectiveness of back-propagation while mitigating the risk of vanishing gradients. A local skip connection between the input and output layers ensures accurate synthesis of the final output, as shown in Figure 3.
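For concreteness, the following PyTorch sketch illustrates this structure. The channel counts, layer names, and the 1 × 1 projection used for the skip connection are illustrative assumptions rather than the exact configuration of our implementation.

```python
# A minimal sketch of the preprocessing module (PM): parallel 3x3 and effective 5x5
# branches, 1x1 fusion, and a local skip connection (channel sizes are assumptions).
import torch
import torch.nn as nn

class PreprocessingModule(nn.Module):
    def __init__(self, in_ch=3, out_ch=32):
        super().__init__()
        self.branch3 = nn.Sequential(                   # 3x3 receptive field
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))
        self.branch5 = nn.Sequential(                   # effective 5x5 via two cascaded 3x3 convs
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))
        self.fuse = nn.Conv2d(2 * out_ch, out_ch, 1)    # 1x1 conv integrates both branches
        self.skip = nn.Conv2d(in_ch, out_ch, 1)         # projects input for the skip connection

    def forward(self, x):
        feats = torch.cat([self.branch3(x), self.branch5(x)], dim=1)
        return self.fuse(feats) + self.skip(x)          # local skip connection to the output
```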
3.3. The Multi-Branch Dilated Convolution Attention Residual Module
Residual modules in deep neural networks often overlook image blur caused by limited receptive fields, leading to an irreversible loss of resolution and edge details. To address this problem, our study proposes an effective multi-branch dilated convolution attention residual module (MDAM) in EM, in which the dilated convolutions use a multi-branch structure to further enhance the expression of image features [33], thereby mitigating blur and preserving feature information. Multiple MDAMs can be interconnected to realize feature reuse and maximize the utilization of feature information. This module comprises a multi-branch dilated convolution residual module (MDCM) and a parallel attention module (AM), as shown in Figure 4.
MDCM consists of two convolution blocks and a dilated convolution block. The dilated convolution block is composed of multiple dilated convolutions with different dilation rates in parallel, which can be expressed as:

$$F_1 = \delta\left(D_1(F_{in})\right), \quad F_3 = \delta\left(D_3(F_{in})\right), \quad F_5 = \delta\left(D_5(F_{in})\right)$$

where $D_1$, $D_3$, and $D_5$ represent the dilated convolutions with dilation factors of 1, 3, and 5, respectively, $\delta$ represents the ReLU activation function, $F_{in}$ is the input feature, and $F_1$, $F_3$, and $F_5$ represent the outputs of the dilated convolutions with the different dilation factors.
The dilated convolution introduces a “dilation rate” parameter into the traditional convolution operation, so that the sampling points of the convolution kernel are no longer contiguous but are spaced at fixed intervals, thereby expanding the receptive field. The dilation rate determines this sampling interval: a larger dilation rate allows the convolution kernel to span a larger area and thus cover a wider receptive field.
In the last layer of the dilated convolution module, a dilated convolution with a dilation factor of 1 combines the features from the different receptive fields, and a 1 × 1 convolution reduces the number of channels for integration. Finally, we superimpose the fused features onto the input features to obtain the output. The output feature can be written as:

$$F_c = Conv_{1\times1}\left(\delta\left(D_1\left([F_1, F_3, F_5]\right)\right)\right), \qquad F_{out} = F_{in} + F_c$$

where $Conv_{1\times1}$ represents the 1 × 1 convolution layer for information integration, $[\cdot]$ denotes channel-wise concatenation, and $F_c$ and $F_{out}$ represent the fused convolution feature and the output feature, respectively.
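As an illustration, a minimal PyTorch sketch of this multi-branch dilated convolution residual block is given below. The channel count and 3 × 3 kernel size are assumptions; only the structure (parallel dilation rates 1, 3, and 5, fusion, channel reduction, and the residual connection) follows the description above.

```python
# Sketch of the multi-branch dilated convolution residual module (MDCM).
import torch
import torch.nn as nn

class MDCM(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        # parallel dilated convolutions with dilation factors 1, 3 and 5
        self.d1 = nn.Conv2d(ch, ch, 3, padding=1, dilation=1)
        self.d3 = nn.Conv2d(ch, ch, 3, padding=3, dilation=3)
        self.d5 = nn.Conv2d(ch, ch, 3, padding=5, dilation=5)
        self.relu = nn.ReLU(inplace=True)
        # dilation-1 convolution combines the multi-receptive-field features,
        # then a 1x1 convolution reduces the channels for integration
        self.combine = nn.Conv2d(3 * ch, 3 * ch, 3, padding=1, dilation=1)
        self.reduce = nn.Conv2d(3 * ch, ch, 1)

    def forward(self, x):
        f1 = self.relu(self.d1(x))
        f3 = self.relu(self.d3(x))
        f5 = self.relu(self.d5(x))
        fused = self.reduce(self.relu(self.combine(torch.cat([f1, f3, f5], dim=1))))
        return x + fused   # residual: superimpose fused features on the input
```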
To address the issue of non-uniform blur distribution in images and to leverage the varying importance of information across different spatial and channel dimensions, a novel parallel attention module (AM) is proposed. This module encompasses three distinctive branches: one specifically designed for spatial attention, another for preserving the original image features, and a third branch dedicated to implementing channel attention.
The spatial attention mechanism consists of three cascaded modules, each composed of a convolution layer and an activation layer, and the channel attention mechanism consists of one pooling layer, one activation layer, and one convolution layer, all utilizing 3 × 3 convolution kernels. The formulas are as follows:

$$F_{sa} = \delta\left(C\left(\delta\left(C\left(\delta\left(C(F_{in})\right)\right)\right)\right)\right), \qquad F_{ca} = C\left(\delta\left(P(F_{in})\right)\right)$$

where $F_{in}$ is the input feature of the parallel attention module, $\delta$ represents the ReLU activation function, $C(\cdot)$ represents the convolution, $P(\cdot)$ represents the pooling operation, $F_{sa}$ is the output of the spatial attention mechanism, and $F_{ca}$ is the output of the channel attention mechanism. The final output feature of the parallel attention module is obtained by fusing the two attention branches with the branch that preserves the original features.
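The following sketch shows one plausible realization of the parallel attention module. Only the composition of each branch is specified above, so the additive fusion of the three branches used here is an assumption.

```python
# Sketch of the parallel attention module (AM): a spatial-attention branch (three cascaded
# conv + ReLU modules), a channel-attention branch (pooling, activation, convolution), and
# an identity branch preserving the original features; the fusion is assumed to be additive.
import torch
import torch.nn as nn

class ParallelAttention(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        self.spatial = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True))
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        f_sa = self.spatial(x)          # spatial attention branch
        f_ca = self.channel(x)          # channel attention branch (broadcast over H x W)
        return x + f_sa + f_ca          # identity branch preserves the original features
```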
3.4. Spatial–Frequency Domain Fusion Module
In the traditional U-Net architecture, skip connections directly transfer encoder features to the decoder, which prevents the decoder from making full use of the multi-scale features generated in EM. Furthermore, building on the multi-scale frequency reconstruction (MSFR) loss function of MIMO-UNet for recovering lost high-frequency components, this study designs a novel multi-scale feature fusion module within the skip connections, referred to as the spatial–frequency domain fusion module (SFFM). The EM outputs three multi-scale features, each of which is divided into two branches. One branch undergoes a 2D real fast Fourier transform, followed by feature extraction in the frequency domain with a 3 × 3 convolution and ReLU activation. The other branch conducts feature extraction in the spatial domain using a 3 × 3 convolution and ReLU activation. The two branches are then fused, resized to combine features of different scales, and fed into the DM, as shown in Figure 5.
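A minimal sketch of the spatial–frequency branch pair for a single encoder scale is given below, assuming torch.fft.rfft2 / irfft2 for the 2D real FFT; the actual SFFM additionally resizes and combines the features of all three scales before feeding them to the decoder.

```python
# Sketch of spatial-frequency domain fusion for one encoder scale (channel sizes assumed).
import torch
import torch.nn as nn

class SpatialFrequencyFusion(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        self.spatial = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True))
        # frequency branch operates on real and imaginary parts stacked along the channel axis
        self.freq = nn.Sequential(nn.Conv2d(2 * ch, 2 * ch, 3, padding=1), nn.ReLU(inplace=True))
        self.fuse = nn.Conv2d(2 * ch, ch, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        # frequency-domain branch: real FFT -> conv/ReLU on (real, imag) -> inverse FFT
        spec = torch.fft.rfft2(x)
        spec = self.freq(torch.cat([spec.real, spec.imag], dim=1))
        real, imag = spec.chunk(2, dim=1)
        f_freq = torch.fft.irfft2(torch.complex(real, imag), s=(h, w))
        # spatial-domain branch and fusion of the two branches
        f_spat = self.spatial(x)
        return self.fuse(torch.cat([f_spat, f_freq], dim=1))
```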
3.5. Multi-Channel Convolution Attention Residual Module
The basic task of the decoder is to reconstruct a clear, high-quality image from the feature representation. However, problems such as information loss and suboptimal reconstruction quality often occur with conventional decoder modules. To mitigate these challenges, we integrate an innovative multi-channel convolution attention residual module into DM, as shown in Figure 6, which effectively captures image details and context information by processing features of different scales in parallel through a multi-channel structure.
Each channel consists of a different convolution layer, which is specifically used to extract the features of the corresponding layer. This approach ensures that irrelevant information is eliminated, enabling the extraction of deeper and more comprehensive information through the parallel attention module, as in the encoder. This method, which combines multi-channel feature extraction and attention mechanisms, not only significantly reduces the information loss in the reconstruction process, but also improves the overall quality of the reconstructed image.
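One possible sketch of such a multi-channel convolution attention residual block is shown below. The kernel sizes of the parallel channels are assumptions, and ParallelAttention refers to the sketch given in Section 3.3.

```python
# Illustrative sketch of the multi-channel convolution attention residual module (MCAM):
# parallel convolution channels at different scales, channel reduction, parallel attention,
# and a residual connection (kernel sizes and channel counts are assumptions).
import torch
import torch.nn as nn

class MCAM(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        # each channel uses a different convolution to extract features at its own scale
        self.c1 = nn.Conv2d(ch, ch, 1)
        self.c3 = nn.Conv2d(ch, ch, 3, padding=1)
        self.c5 = nn.Conv2d(ch, ch, 5, padding=2)
        self.reduce = nn.Conv2d(3 * ch, ch, 1)
        self.attention = ParallelAttention(ch)   # as sketched in Section 3.3

    def forward(self, x):
        multi = torch.cat([self.c1(x), self.c3(x), self.c5(x)], dim=1)
        return x + self.attention(self.reduce(multi))   # residual connection
```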
3.6. Loss Function
Based on the coarse-to-fine structure, the whole model is divided into three stages, and each stage outputs a restored image. During deblurring, there are losses in the spatial and frequency domains as well as structural losses caused by unstable model training. Therefore, a weighted loss strategy based on multi-scale content loss, multi-scale high-frequency information loss, and multi-scale structural loss [34] is adopted to ensure supervision and improve the deblurring effect. It is assumed that $\hat{S}_k$ (k = 1, 2, 3) represents the output multi-scale restored images and $S_k$ (k = 1, 2, 3) represents the corresponding real clear images.
1. Multi-scale content loss
The L1 distance between the real clear images at different scales and the restored images output by the model is used as the multi-scale content loss:

$$L_{MAE} = \sum_{k=1}^{3} \left\| \hat{S}_k - S_k \right\|_1$$

The L1 distance does not excessively penalize large error values, which is conducive to preserving the edge features of the image.
2. Multi-scale high-frequency information loss
This study utilizes the fast Fourier transform ($\mathcal{F}$) to quantify the high-frequency information loss between the restored image and the reference clear image:

$$L_{MSFR} = \sum_{k=1}^{3} \left\| \mathcal{F}(\hat{S}_k) - \mathcal{F}(S_k) \right\|_1$$
3. Structural loss
The structural loss is based on the structural similarity (SSIM) index, which can be expressed as:

$$SSIM(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$$

where $\mu_x$ and $\mu_y$ are the mean pixel values of the pixel blocks centered at each pixel, $\sigma_x^2$, $\sigma_y^2$, and $\sigma_{xy}$ are the corresponding variances and covariance, and $c_1$ and $c_2$ are small constants introduced for numerical stability. The multi-scale structural similarity loss ($L_{SL}$) can be expressed as:

$$L_{SL} = \sum_{k=1}^{3} \left( 1 - SSIM(\hat{S}_k, S_k) \right)$$
4. Total loss function
The total loss function is calculated as follows:

$$L_{total} = \lambda_1 L_{MAE} + \lambda_2 L_{MSFR} + \lambda_3 L_{SL}$$

where $L_{MAE}$ is the mean absolute error (MAE) content loss, $L_{MSFR}$ is the multi-scale frequency reconstruction (MSFR) loss, $L_{SL}$ is the structural loss (SL), and $\lambda_1$, $\lambda_2$, and $\lambda_3$ are set to 0.1, 0.01, and 0.08, respectively. The proportion assigned to each loss is based on the variability of its values during training: losses with lower volatility are assigned a smaller proportion of the model optimization impact. The weighting coefficients in the equation above are derived from this principle and determined experimentally.
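The weighted loss can be sketched as follows. The SSIM term from the third-party pytorch_msssim package and the absence of per-scale normalization are assumptions; the weights follow the values stated above, and restored/sharp are the three output and ground-truth scales.

```python
# Hedged sketch of the weighted multi-scale loss: L1 content loss, FFT-based high-frequency
# loss, and an SSIM-based structural loss, combined with the weights given in the text.
import torch
import torch.nn.functional as F
from pytorch_msssim import ssim  # third-party SSIM implementation (assumption)

def total_loss(restored, sharp, w_mae=0.1, w_msfr=0.01, w_sl=0.08):
    l_mae = l_msfr = l_sl = 0.0
    for s_hat, s in zip(restored, sharp):          # k = 1, 2, 3 scales
        l_mae += F.l1_loss(s_hat, s)               # multi-scale content loss (L1)
        # multi-scale high-frequency loss: L1 distance between FFT spectra
        l_msfr += F.l1_loss(torch.view_as_real(torch.fft.rfft2(s_hat)),
                            torch.view_as_real(torch.fft.rfft2(s)))
        l_sl += 1.0 - ssim(s_hat, s, data_range=1.0)   # multi-scale structural loss
    return w_mae * l_mae + w_msfr * l_msfr + w_sl * l_sl
```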
5. Discussion
Forest resources are crucial ecological assets essential for the survival of human society. Detection of forest fires holds significant importance in safeguarding the ecological security of a country. With the advancement of information technology, the utilization of drones for forest fire detection is on the rise [39], accompanied by escalating demands for high-quality aerial imagery. Motion blur often arises when UAV images are captured in flight. Numerous deep learning-based methods exist for image deblurring, but they exhibit certain limitations. Therefore, this study presents a comprehensive investigation into the deblurring challenge encountered in forest fire images. Leveraging the inherent strengths of deep learning in end-to-end image deblurring, we propose a novel image deblurring model based on the MIMO-UNet algorithm.
To address the challenge of insufficient clear–blurred image pairs in forest fire scenes, this study generated a dataset comprising such pairs. The comparative experiments are shown in Table 7 and Figure 13. Ablation studies with different module combinations demonstrate that the SFFM, MDAM, MCAM, and PM notably enhance both the PSNR and SSIM metrics on our self-built forest fire dataset. Our model demonstrated exceptional performance compared with other models, with an SSIM only 0.03 lower than that of the MPRNet model. We attribute this difference to MPRNet’s multi-stage progressive restoration strategy; specifically, the original-resolution subnetwork (ORSNet) in the final stage is believed to enhance image quality without sacrificing image structure and semantic information. Therefore, in future work, we aim to prioritize reducing semantic information loss and further enhancing deblurring performance.
In this study, it was found that there is still a certain gap between the subjective visual perception of the deblurred images and the effect reflected by the objective evaluation indices. How to establish an image restoration quality evaluation index more in line with subjective perception is a problem that remains to be solved. In future work, we intend to use machine learning techniques for no-reference image quality assessment, enabling a more comprehensive evaluation of deblurring results. The model presented in this paper still requires extensive training time on high-configuration computers; however, computational complexity and training duration can be reduced by optimizing the network structure. Given that edge computing devices typically possess lower computational power than conventional computers, the model can be further compressed through techniques such as quantization and pruning, which reduce the model size or employ a more lightweight network structure, consequently reducing computational demands. As a result, the network model can be efficiently deployed on edge computing devices for image restoration.
Based on the forest fire detection system developed by our team [40,41] (the whole system consists of a UAV, a Raspberry Pi controller, an OAK-D camera, and a GPS module), we aim to build a forest fire detection model based on multi-task learning, consisting of three tasks (a deblurring task, a detection task, and a segmentation task). In order to enhance the user experience and ease of operation, we intend to build a cross-platform HMI using PyQt5. This design not only improves the utility of the system but also ensures reliability and stability in the field environment. By deploying our models on UAVs, we can achieve true real-time recognition and response capabilities to more effectively monitor and deal with emergencies such as forest fires. This integrated solution will unlock new potential for forest fire prevention and improve the ability to respond to disasters in a timely manner, thereby enhancing the protection of human life and property.
6. Conclusions
In this study, we built and validated an innovative spatial–frequency domain fusion network model with significant improvements over MIMO-UNet for image deblurring tasks. By introducing an advanced MDAM in EM, we not only enlarged the receptive field of the model but also effectively suppressed redundant information, thereby improving the overall performance of the network. In addition, in the multi-scale feature fusion module, we abandoned the traditional U-Net skip connections and adopted a strategy of combining spatial-domain and frequency-domain information, which significantly reduced the information loss in the feature fusion process and improved the recovery of fine details. The MCAM in DM improved the reconstruction of local and contextual information. During model training, we used a weighted loss function, which not only improved the stability of the model but also optimized the deblurring performance.
By training and testing on the self-built forest fire dataset, our model outperformed the comparison models in various performance indices and achieved excellent results in the experimental comparisons. Especially when processing forest fire images, our model highlights the texture details of the recovered images, which is of great importance for wildfire monitoring and management. The LGF and SMD were used to evaluate the deblurring effect on forest fire images in real scenes, and our model achieved the best performance in the comparative experiments. The final experimental results show that the proposed forest fire image deblurring model achieves a PSNR of 32.26 dB, an SSIM of 0.955, an LGF of 10.93, and an SMD of 34.31. In experiments without reference images, the model performs well in terms of LGF and SMD. It is worth noting that, compared with the baseline and other commonly used image deblurring models, the proposed model achieves consistent improvements across the evaluation metrics.