1. Introduction
Multimodal image fusion synthesizes multiple images of the same scene, captured simultaneously by different sensors, into a single image that carries richer information and is easier for the human visual system to interpret. As a typical form of heterogeneous sensor fusion, infrared and visible image fusion (IVIF) has already been applied in many fields, such as military reconnaissance, video surveillance, vehicle night navigation, and target detection and recognition [1]. Information from different modalities is characterized by difference, complementarity, and correlation. Infrared sensors capture the thermal radiation emitted by objects, highlighting heat regions that are invisible in visible images and working around the clock; however, infrared images are often blurred because of their low spatial resolution. Visible images, by contrast, offer high spatial resolution and rich texture details, but they are easily degraded by rain, fog, or poor lighting [2,3]. IVIF techniques exploit the desirable characteristics of both imaging mechanisms: the aim is to combine the complementary information of the multi-sensor images [4] while eliminating the redundancy and contradictions between them, so that the raw data can be used far more efficiently.
In the past few years, research on IVIF approaches has drawn extensive attention. In [5,6], researchers surveyed many existing IVIF methods and analyzed their problems. Here, we further divide IVIF methods into three categories according to their differences in fusion theory and architecture. The traditional methods comprise multi-scale transform (MST) methods [7], sparse representation (SR) methods [8], saliency-based methods [9], subspace-based methods [10], hybrid methods [11] and others [12,13]. Deep learning methods employ a neural network to complete one or all of the three key steps of image fusion (i.e., feature extraction, fusion and image reconstruction). According to how the fused image is obtained, this paper divides them into two modes: end-to-end image fusion methods and combinatorial-based image fusion methods. The end-to-end deep learning methods adopt autoencoder (AE), convolutional neural network (CNN) or generative adversarial network (GAN) architectures; our method falls into this category. The other mode combines traditional and deep learning approaches and is therefore termed combinatorial-based. Representative examples include the combination of a pulse coupled neural network (PCNN) with multi-scale transformation [14], CNN with Laplacian pyramid decomposition [15], CNN with saliency-based methods [16], and CNN with SR [17]. Although the above IVIF approaches have achieved impressive fusion performance, they still suffer from some drawbacks, especially the traditional and combinatorial-based methods. The main problems are threefold. First, it is challenging to design efficient image transformation and representation methods. The traditional methods apply the same transformation and representation to heterogeneous source images, which loses the differential information between modalities; new image representation approaches are therefore needed to boost fusion performance, and image decomposition is usually time-consuming as well. Second, designing complex activity-level measurements, feature extraction, or fusion operations by hand increases computational cost and algorithmic complexity, which further limits practicality. Third, although deep learning techniques have been introduced into combinatorial-based methods, they are only used for feature extraction or result reconstruction, so the limitations of traditional image fusion methods remain.
In view of the above disadvantages, one research focus is to design IVIF models in an end-to-end fashion, since end-to-end methods largely circumvent the shortcomings of the traditional and combinatorial-based methods. For instance, DenseFuse [18] and TSFNet [19] use a pre-trained AE architecture to extract features from the source images and then reconstruct the fused image, achieving relatively promising fusion performance. DeepFuse [20] and RXDNFuse [21] are representative CNN-based methods, which guide the model to produce fused images through specially designed unsupervised learning metrics. The GAN-based methods FusionGAN [22], D2WGAN [23] and GANMcC [24] all apply adversarial games to reduce the difference in probability distribution between the fused images and the source images, thereby promoting the preservation of original information.
Generally speaking, feature extraction and feature fusion are the two key steps in the design of IVIF algorithms, and the motivation for our paper is twofold. First, the key to image fusion is to design a more comprehensive neural-network-based feature extraction strategy; indeed, training a network with strong feature extraction ability is the fundamental goal of most IVIF algorithms. However, all of the above models focus only on single-scale features of the source images. For example, Refs. [18,21,23] all extract features with convolution kernels at a single scale, so the fusion results do not preserve the original features across the full range of scales. Second, a prerequisite for producing a fused image with highlighted targets and abundant texture is selecting the important and salient features to be blended. Nevertheless, the handcrafted fusion strategies adopted by most IVIF methods, such as channel-wise concatenation or pixel-wise addition, cannot integrate significant features into the fusion result in a way that is consistent with human visual perception. As a result, significant information in the source images is lost, and the reconstructed image exhibits fewer gray levels and low contrast.
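For concreteness, the following is a minimal sketch of the two handcrafted fusion rules criticized above, written in PyTorch; the tensor names and shapes are illustrative assumptions rather than part of our implementation.

```python
import torch

def fuse_by_concat(feat_ir: torch.Tensor, feat_vis: torch.Tensor) -> torch.Tensor:
    """Channel-wise concatenation: stacks the two feature maps along the
    channel axis and leaves all weighting to later convolutions."""
    return torch.cat([feat_ir, feat_vis], dim=1)

def fuse_by_addition(feat_ir: torch.Tensor, feat_vis: torch.Tensor) -> torch.Tensor:
    """Pixel-wise addition: averages the two feature maps, treating every
    location and every channel as equally important."""
    return 0.5 * (feat_ir + feat_vis)

# Dummy feature maps of shape (batch, channels, height, width).
ir = torch.randn(1, 64, 128, 128)
vis = torch.randn(1, 64, 128, 128)
print(fuse_by_concat(ir, vis).shape)    # torch.Size([1, 128, 128, 128])
print(fuse_by_addition(ir, vis).shape)  # torch.Size([1, 64, 128, 128])
```

Neither rule adapts to the image content, which is precisely what motivates the attention-driven fusion introduced next.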
To solve the problems mentioned above, we propose a novel GAN-based IVIF method with multi-scale feature extraction (MFE) and joint attention fusion (JAF), called MJ-GAN. On the one hand, the multi-scale information in the multimodal images is considered: the highlighted objects in infrared images and the textures in visible images are automatically captured by multiple MFE modules. In addition, an improved self-attention structure, which performs contextual information mining and attention learning, is introduced into the MFE modules to strengthen the interaction among multi-grained features. On the other hand, there is compelling evidence that the human visual system (HVS) automatically pays more attention to salient features or regions rather than to the whole image. We therefore design a JAF network based on channel attention and spatial attention to emphasize the salient and important features of the source images during the feature fusion stage, so that the fused images are more consistent with human visual perception. Moreover, it is well known that the stronger the discriminative ability of the discriminators, the better the fused images produced by the generator. For this purpose, the loss functions of the dual discriminator are designed based on the idea of the SCD loss [25]. Specifically, we build a dual adversarial mechanism between the source images and their contributions to the fused image, in order to reduce the discrepancy between their probability distributions.
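As a rough illustration of the idea behind joint channel and spatial attention (not the exact JAF architecture, whose configuration is given in Section 3), the PyTorch sketch below re-weights the concatenated infrared and visible features first per channel and then per spatial location; all layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class JointAttentionSketch(nn.Module):
    """Illustrative channel + spatial attention over concatenated features."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Channel attention: squeeze spatial dims, then re-weight each channel.
        self.channel_mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention: a 7x7 convolution over channel-pooled maps.
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, feat_ir: torch.Tensor, feat_vis: torch.Tensor) -> torch.Tensor:
        x = torch.cat([feat_ir, feat_vis], dim=1)   # merge the two streams
        x = x * self.channel_mlp(x)                 # channel re-weighting
        avg_map = x.mean(dim=1, keepdim=True)       # per-pixel statistics
        max_map, _ = x.max(dim=1, keepdim=True)
        x = x * self.spatial_conv(torch.cat([avg_map, max_map], dim=1))
        return x

fused = JointAttentionSketch(channels=128)(torch.randn(1, 64, 128, 128),
                                           torch.randn(1, 64, 128, 128))
print(fused.shape)  # torch.Size([1, 128, 128, 128])
```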
To visually exhibit the superiority of our method, we select several representative methods, namely DenseFuse [18], CSR [17], FusionGAN [22] and GANMcC [24], for comparison, as shown in Figure 1. All comparison methods produce fusion results with blurry thermal targets, insufficient textures, and halos along the edges. By contrast, the fused image generated by our method retains the high-contrast heat sources, preserves the richest and most natural background texture details, and accords with human visual perception.
The contributions and characteristics of this work can be summarized as follows.
To adequately preserve the global information of the source images, multi-scale feature extraction (MFE) modules are introduced into the two-stream generator to extract source-image features at different scales for fusion (an illustrative sketch is given after this list).
To focus more on the important and salient features during the fusion step, we select and merge significant features via a joint attention fusion (JAF) network.
To improve the discriminative ability of the discriminators, a dual adversarial mechanism between the source images and their contributions is designed, which drives the generator to transfer more original information into the final fused images.
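As a simple illustration of the first contribution above, the sketch below extracts features at several scales in parallel with convolutions of different kernel sizes and concatenates them; the kernel sizes and channel counts are illustrative assumptions and do not reproduce the exact MFE configuration described in Section 3.

```python
import torch
import torch.nn as nn

class MultiScaleBlockSketch(nn.Module):
    """Illustrative multi-scale feature extraction: parallel convolutions with
    different receptive fields, concatenated along the channel axis."""
    def __init__(self, in_channels: int, out_channels_per_branch: int = 16):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels, out_channels_per_branch,
                          kernel_size=k, padding=k // 2),
                nn.LeakyReLU(0.2, inplace=True),
            )
            for k in (3, 5, 7)  # three receptive-field sizes
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Each branch sees the same input but at a different scale.
        return torch.cat([branch(x) for branch in self.branches], dim=1)

# One single-channel infrared (or visible) input patch of shape (batch, 1, H, W).
features = MultiScaleBlockSketch(in_channels=1)(torch.randn(1, 1, 128, 128))
print(features.shape)  # torch.Size([1, 48, 128, 128])
```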
The rest of our paper is organized as follows. Section 2 reviews works closely related to our method, including several end-to-end image fusion methods, the attention mechanism (AM) and FusionGAN. Section 3 presents the details of our algorithm, including the overall framework of the proposed method, the network architectures and the loss functions. Extensive comparison experiments on publicly available datasets are reported in Section 4, together with generalization and ablation experiments. Concluding remarks and a discussion of our work are given in Section 5.
5. Conclusions
We designed a GAN-based end-to-end method with multi-scale feature extraction (MFE) and joint attention fusion (JAF) networks (named MJ-GAN), together with two specific, stronger discriminators, which achieves more promising fusion performance in IVIF tasks. The novelty of our method is that the generator extracts features at different scales and uses the attention mechanism (AM) to fuse features in a saliency-aware way, so the heuristic design difficulties faced by combinatorial-based and conventional fusion algorithms are overcome. Furthermore, the dual discriminator, with its strong discriminative ability, injects more information into the fused image through the adversarial relationship between the two kinds of networks. Importantly, a hybrid loss function guides the fusion direction and governs which types of information from the source inputs are preserved in the final fused image. Extensive experiments demonstrated the superiority of the proposed method over other representative and state-of-the-art algorithms in terms of both subjective visual quality and objective evaluation metrics.
Although the proposed image fusion method achieves competitive performance in infrared and visible image fusion tasks, several issues deserve to be highlighted. First, the proposed method mainly targets grayscale image fusion, which limits its practical applications. Second, the designed loss function focuses only on retaining the primary and secondary information of the source images and neglects improving the visual perception quality of the fused image. Third, there is still room for improvement in extracting and fusing useful features from the source images. In the future, we will therefore try to extend the applicable fields and conditions of our method, for example to nighttime infrared and visible image fusion, multi-focus image fusion, and multi-exposure image fusion.