1. Introduction
With the development of science and technology, more and more data are collected by sensors and more and more information is obtained from them, ranging from the black-and-white and color images taken by early optical cameras to the hyperspectral, infrared, and radar images produced by today's sensors. As image types have multiplied, so have their uses. However, each of these images is generated by a single sensor, and a single sensor has limitations: it cannot capture all the information about a scene, so no single image contains the complete scene. Image fusion technology combines the information collected by different sensors into a new image. The new image retains the information of the original images while reducing the redundancy among them, which improves the utility of the images.
Traditional image fusion methods can extract and fuse image features well, but these algorithms have defects that introduce noise and degrade the quality of the fused image. The emergence of deep learning has opened a new research direction for image fusion, and a large number of deep-learning-based fusion algorithms have appeared. Convolutional neural networks have a strong ability to extract image features. However, as traditional convolutional neural networks entered a bottleneck period, many researchers gradually turned away from them to study the Transformer. Some scholars continued to study convolutional neural networks and in 2022 proposed a pure convolutional network, ConvNeXt [1], which uses fewer activation functions and larger convolution kernels. Although ConvNeXt does not introduce new structures or methods, its reduced use of activation functions gives it faster inference and higher accuracy than the Swin Transformer [1]. There is no doubt that convolutional neural networks extract image features well. Researchers have used pre-trained AlexNet [2], VGGNet [3], GoogleNet [4,5], ResNet [6], DenseNet [7], CNN [8], and similar networks to extract deep image features and restore them to images after fusion. Besides serving as feature extractors, neural networks can also be used for end-to-end image fusion; representative network models include FusionGAN [9], IFCNN [10], and PPTFusion [11].
In the self-coding network framework, the network consists of an encoder and a decoder: the encoder extracts image features, and the decoder restores the features to images. Because the encoder and decoder are independent, the structure of a self-coding network is very flexible and extensible, and a large number of fusion algorithms based on self-coding networks have been produced. In 2018, researchers proposed the first image fusion algorithm based on a self-coding network and named it DenseFuse [12,13]; NestFuse [14,15] followed in 2020.
This paper borrows the idea of the Inception module in GoogleNet to build an encoder that extracts features from the input images. The basic structure of the Inception module is shown in Figure 1. The Inception module has three important parts. First, 1 ∗ 1 convolution kernels adjust the channel dimension to reduce the computation of the subsequent feature maps. Second, the convolution kernel size and output feature map of each branch are different, which gives the module its multi-scale character. Finally, the feature maps of all branches are concatenated to obtain the complete set of feature maps.
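To make this structure concrete, a minimal Inception-style block is sketched below in PyTorch (the framework used later in the experiments). The branch layout and channel counts are illustrative assumptions, not the exact configuration of the module in Figure 1.

```python
# A minimal Inception-style block: 1x1 convolutions to adjust channel
# dimensions, parallel branches with different kernel sizes, and
# concatenation of the branch outputs along dim=1.
import torch
import torch.nn as nn


class InceptionBlock(nn.Module):
    def __init__(self, in_channels, out_per_branch=16):
        super().__init__()
        # Branch 1: 1x1 convolution only.
        self.branch1 = nn.Conv2d(in_channels, out_per_branch, kernel_size=1)
        # Branch 2: 1x1 convolution to reduce channels, then a 3x3 convolution.
        self.branch2 = nn.Sequential(
            nn.Conv2d(in_channels, out_per_branch, kernel_size=1),
            nn.Conv2d(out_per_branch, out_per_branch, kernel_size=3, padding=1),
        )
        # Branch 3: 1x1 convolution to reduce channels, then a 5x5 convolution.
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_channels, out_per_branch, kernel_size=1),
            nn.Conv2d(out_per_branch, out_per_branch, kernel_size=5, padding=2),
        )
        # Branch 4: 3x3 max pooling followed by a 1x1 convolution.
        self.branch4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_channels, out_per_branch, kernel_size=1),
        )

    def forward(self, x):
        # All branches preserve the spatial size, so the outputs can be
        # concatenated along the channel dimension.
        return torch.cat(
            [self.branch1(x), self.branch2(x), self.branch3(x), self.branch4(x)],
            dim=1,
        )
```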
The main work and contributions of this article are as follows:
The Inception module is added to the network model to strengthen its feature extraction ability.
Dense blocks are added to the branches of the network and are used for feature extraction and image generation.
The use of activation functions is reduced; an activation function is applied only after the first convolution.
2. The Architecture of the Network
The network model proposed in this paper is mainly composed of three parts: encoder, fusion strategy, and decoder, as shown in Figure 2.
2.1. The Encoder Architecture of the Network
The encoder network is used to extract image features. Branch 1 uses a larger convolution kernel to perform preliminary feature extraction on the input image, then passes the result through the Inception module for multi-scale feature extraction, and finally through a smaller convolution kernel to obtain feature 1. Branch 2 is structured similarly to branch 1, but it applies two convolution operations before feeding the features into the Inception module to obtain feature 2. Branch 3 is similar to branch 2, with one more convolution applied to the structure of branch 2 to obtain feature 3. Branch 4 uses dense connections to extract features from the image and obtain feature 4. Features 1, 2, 3, and 4 are concatenated on dim = 1 so that the resulting features contain the features of all four branches. The encoder network is shown in Figure 3, and a detailed table of encoder parameters is given in Table 1.
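The following PyTorch sketch illustrates this four-branch layout. The kernel sizes and channel counts are hypothetical placeholders rather than the exact settings in Table 1, and InceptionBlock refers to the module sketched in Section 1.

```python
# An illustrative sketch of the four-branch encoder; channel counts are
# assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn


class DenseBlock(nn.Module):
    """Three 3x3 convolutions with dense (concatenating) connections."""

    def __init__(self, in_channels, growth=16):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, growth, 3, padding=1)
        self.conv2 = nn.Conv2d(in_channels + growth, growth, 3, padding=1)
        self.conv3 = nn.Conv2d(in_channels + 2 * growth, growth, 3, padding=1)

    def forward(self, x):
        x1 = torch.cat([x, self.conv1(x)], dim=1)
        x2 = torch.cat([x1, self.conv2(x1)], dim=1)
        return self.conv3(x2)


class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Branch 1: larger kernel -> Inception -> smaller kernel.
        # Following the paper, an activation is used only after the first convolution.
        self.b1_pre = nn.Sequential(nn.Conv2d(1, 16, 5, padding=2), nn.ReLU())
        self.b1_inc = InceptionBlock(16, out_per_branch=8)   # outputs 32 channels
        self.b1_post = nn.Conv2d(32, 16, 3, padding=1)
        # Branch 2: two convolutions -> Inception.
        self.b2_pre = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.Conv2d(16, 16, 3, padding=1)
        )
        self.b2_inc = InceptionBlock(16, out_per_branch=4)   # outputs 16 channels
        # Branch 3: the branch-2 structure followed by one more convolution.
        self.b3_pre = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1),
            nn.Conv2d(16, 16, 3, padding=1),
            nn.Conv2d(16, 16, 3, padding=1),
        )
        self.b3_inc = InceptionBlock(16, out_per_branch=4)   # outputs 16 channels
        # Branch 4: dense connections.
        self.b4 = DenseBlock(1, growth=16)                   # outputs 16 channels

    def forward(self, x):
        f1 = self.b1_post(self.b1_inc(self.b1_pre(x)))
        f2 = self.b2_inc(self.b2_pre(x))
        f3 = self.b3_inc(self.b3_pre(x))
        f4 = self.b4(x)
        # Concatenate the four branch features along dim=1 (64 channels total).
        return torch.cat([f1, f2, f3, f4], dim=1)
```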
2.2. The Decoder Architecture of the Network
The architecture of the decoder is shown in Figure 4. In the design of the decoder, we did not use multiple 3 ∗ 3 convolution kernels to operate on the feature channels, as other image fusion algorithms do. Instead, we added an Inception module to the decoder to reduce the number of network parameters while retaining as many features as possible, followed by a dense connection module. The dense connection module extracts features more effectively and is used here to reduce the number of feature channels, which achieved good results. Finally, the decoder reduces the number of feature channels to 1 through a 3 ∗ 3 convolution kernel and applies a Sigmoid activation function to restore the image. The specific network parameters are shown in Table 2.
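A corresponding sketch of the decoder is given below, again with assumed channel counts rather than the exact values of Table 2; InceptionBlock and DenseBlock are the modules sketched earlier.

```python
# An illustrative decoder sketch: an Inception block, a dense block used to
# reduce the number of feature channels, and a final 3x3 convolution with a
# Sigmoid to restore a single-channel image.
import torch.nn as nn


class Decoder(nn.Module):
    def __init__(self, in_channels=64):
        super().__init__()
        self.inception = InceptionBlock(in_channels, out_per_branch=16)  # 64 channels
        self.dense = DenseBlock(64, growth=16)                           # 16 channels
        self.out = nn.Sequential(nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, features):
        return self.out(self.dense(self.inception(features)))
```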
2.3. The Loss Function of the Network Model
This paper uses the improved SSIM [16,17] function as the loss function of the network; it is mainly used to calculate the structural similarity between images. The most important feature of a self-coding network is that the encoder extracts image features and the decoder restores the feature map to an image. In the training phase, the main task is to enable the encoder to extract as many features as possible and to make the image restored by the decoder close to the source image. Therefore, it is effective to use SSIM to measure the error between the source image and the image restored by the decoder. Its calculation formula is:
$$\mathrm{SSIM}(A,B)=\frac{(2\mu_A\mu_B+C_1)(2\sigma_{AB}+C_2)}{(\mu_A^2+\mu_B^2+C_1)(\sigma_A^2+\sigma_B^2+C_2)}$$
where $\mu_A$ and $\mu_B$ represent the average values of images A and B, $\sigma_A$ and $\sigma_B$ represent their standard deviations, $\sigma_{AB}$ represents their covariance, and $C_1$, $C_2$ are small constants that stabilize the division. The value of SSIM lies in [0, 1]; the closer it is to 1, the higher the structural similarity of image A and image B, and vice versa. So the loss function is designed as follows:
$$Loss = 1 - \mathrm{SSIM}(O, I)$$
where $O$ is the image restored by the decoder and $I$ is the source image.
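As an illustration, the reconstruction loss can be implemented as in the sketch below; it uses the pytorch_msssim package as one possible SSIM implementation and is not necessarily identical to the implementation used in our experiments.

```python
# A minimal sketch of the SSIM-based reconstruction loss. The pytorch_msssim
# package is one common SSIM implementation; the paper's own code may differ.
from pytorch_msssim import ssim


def ssim_loss(output, source):
    # output, source: tensors of shape (N, 1, H, W) with values in [0, 1].
    # SSIM close to 1 means structurally identical, so the loss is 1 - SSIM.
    return 1.0 - ssim(output, source, data_range=1.0, size_average=True)
```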
2.4. The Fusion Strategy of the Network Model
This paper mainly uses the strategy of averaging the features, and the calculation formula is as follows:
$$F(i,j)=\frac{\Phi_a(i,j)+\Phi_b(i,j)}{2}$$
where $F$ represents the fused feature map, $\Phi_a$ and $\Phi_b$ represent the features extracted from the source images $a$ and $b$, $(i,j)$ represents the corresponding pixel position of the image, and the range of $(i,j)$ depends on the size of the input image.
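In code, this strategy is a single element-wise operation on the two encoder outputs, for example:

```python
# The averaging fusion strategy applied to the two encoder feature maps
# (phi_a and phi_b are the features of source images a and b).
def fuse_average(phi_a, phi_b):
    # Element-wise mean at every channel and pixel position (i, j).
    return (phi_a + phi_b) / 2.0
```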
3. Experiments and Results
This paper uses 80,000 images from the MSCOCO dataset [18], 19 multi-exposure images from the Exposure dataset [19,20], 50 pairs of images from the Road dataset [21], and 21 pairs of images from the TNO dataset [18] as the training and test datasets for the network.
The entire experiment was conducted in the following environment: CPU: AMD R7 1700, GPU: NVIDIA RTX 3060, memory: 32 GB, PyTorch 1.10.1+cu113.
3.1. Training the Network
In the training stage, this paper ignores the fusion strategy and mainly uses the encoder to extract image features and the decoder to restore the image. The Adam optimizer is selected as the optimizer of the network, with a batch size of 6, an image size of 128 ∗ 128, a learning rate of 0.0001, and 20 training epochs. The MSCOCO dataset is divided into samples of 5000, 20,000, and 60,000 images, and the network is trained on them to obtain the model.
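A sketch of this training procedure under the stated settings is given below. The data loader is a placeholder, and Encoder, Decoder, and ssim_loss refer to the earlier sketches rather than the exact training code.

```python
# Training sketch: Adam optimizer, learning rate 1e-4, batch size 6,
# 128x128 inputs, 20 epochs. The dataset here is a random placeholder;
# in practice it would be grayscale crops from the MSCOCO training images.
import torch
from torch.utils.data import DataLoader, TensorDataset

dummy_images = torch.rand(60, 1, 128, 128)
train_loader = DataLoader(TensorDataset(dummy_images), batch_size=6, shuffle=True)

encoder, decoder = Encoder(), Decoder()
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4
)

for epoch in range(20):
    for (images,) in train_loader:
        optimizer.zero_grad()
        restored = decoder(encoder(images))   # the fusion strategy is ignored here
        loss = ssim_loss(restored, images)    # SSIM-based reconstruction loss
        loss.backward()
        optimizer.step()
```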
3.2. Image Fusion
In the image fusion stage, two identical encoders extract features from two different source images to obtain two feature maps. The two feature maps are merged into one feature map through the fusion strategy, and the decoder restores the merged feature map to an image, outputting a 320 ∗ 320 fused image.
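Putting the earlier sketches together, the fusion stage can be expressed as follows; img_a and img_b are placeholders standing in for the two preprocessed source images.

```python
# Fusion-stage sketch: encode both sources, average the feature maps,
# and decode the result into the fused image.
import torch

img_a = torch.rand(1, 1, 320, 320)   # placeholder for source image a
img_b = torch.rand(1, 1, 320, 320)   # placeholder for source image b

encoder.eval()
decoder.eval()
with torch.no_grad():
    fused_features = fuse_average(encoder(img_a), encoder(img_b))
    fused_image = decoder(fused_features)   # values in [0, 1] via the Sigmoid
```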
3.3. Evaluation of Experimental Results and Image Quality
Common metrics for evaluating fused images are Entropy (EN) [22], Mutual Information (MI) [23], Structural Similarity (SSIM) [24], Multi-Scale SSIM (MS-SSIM) [25], Visual Information Fidelity (VIF) [26], Spatial Frequency (SF) [27], Image Quality [28], Noise [29], Definition (DF) [30], and Standard Deviation (SD). For the noise metric, a lower value is better; for all other metrics, a higher value indicates better performance.
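For illustration, two of the simpler metrics, EN and SD, can be computed as in the following sketch, which assumes an 8-bit grayscale fused image; it is not the exact evaluation code used for the tables below.

```python
# Sketch of two evaluation metrics computed with NumPy on an 8-bit grayscale
# fused image. The remaining metrics (MI, SSIM, VIF, ...) also require the
# source images and are not shown here.
import numpy as np


def entropy(img):
    # EN: Shannon entropy of the grayscale histogram (higher is better).
    hist, _ = np.histogram(img, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))


def standard_deviation(img):
    # SD: spread of pixel intensities around the mean (higher is better).
    return float(np.std(img.astype(np.float64)))
```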
3.4. The Road Dataset Experiments
This paper uses the above metrics to evaluate the images produced by the proposed fusion model. The more advanced algorithms in the field of image fusion at present, IFCNN [10], DenseFuse [12], CBF [31], CNN [8,32], DeepDecFusion [23], MEF-GAN [33], FusionGAN [9], and DualBranchFusion [14], are compared, and the experimental results are shown in Table 3.
As Table 3 shows, the network model proposed in this paper is optimal in 3 of the 10 evaluation indexes and suboptimal in another 4. Figure 5 shows that the texture of the wall cannot be seen in the red frame of the infrared image, but it is clearly visible in the visible image. The experimental comparison shows that, apart from the over-exposed result of the MEF-GAN model, the proposed model and the other compared models retain most of the texture information of the background wall. However, the brightness of the images fused by DeepDecFusion and DenseFuse is not as high as that of the proposed model. In addition, in the blue frame, the model proposed in this paper retains the contour information of the flowerbed plants to the greatest extent.
Besides, in the blue rectangle, the outlines of the car and the person are not clear in DeepDecFusion and FusionGAN, whereas DenseFuse, IFCNN, CNN, DeepDecFusion, and Ours show clear car outlines. In the purple rectangle, a crack can be seen in the infrared image but is almost invisible in the visible image; MEF-GAN is closer to the visible image, DenseFuse, DeepDecFusion, IFCNN, and FusionGAN show the crack only faintly, while CNN and Ours show it clearly.
In summary, the network model proposed in this paper retains, to a considerable extent, the various kinds of information in the infrared and visible images.
3.5. Other Experiments
3.5.1. TNO Dataset
For infrared and visible image fusion, the TNO dataset contains 21 pairs of different infrared and visible images. In this paper, these images are converted into grayscale images for the experiments and comparison. The specific experimental results are as follows:
The red font in Table 4 indicates that the network model proposed in this paper has the highest score in 4 of the 7 evaluation indicators. In Figure 6, the clouds in the sky cannot be seen in the red box of the visible image, but the trees in the white box can be clearly seen. In the red box of the infrared image, many clouds and their contours can be seen, but the trees cannot. The other images are obtained by the various image fusion algorithms: CBF contains a lot of noise, while IFCNN and DenseFuse retain more details of the trees and clouds. The image fused by the network model proposed in this paper retains both the cloud information in the red box of the infrared image and the tree information in the white box of the visible image, and no noise degrades the quality of the fused image.
Therefore, the network model proposed in this paper can better retain the information of both infrared and visible images.
Besides, to prove the effectiveness of the proposed network model, we also fuse and test other images, such as multi-exposure images and images with different focuses. The specific experimental results are as follows:
3.5.2. Exposure Dataset
For multi-exposure fusion, the Exposure dataset contains a total of 19 color images with different exposures. In this paper, these color images are converted into grayscale images for the experiments and comparison. The specific experimental results are as follows:
The red font in Table 5 indicates the highest score for each indicator; the network model proposed in this paper has the highest score in 4 of the 7 evaluation indicators. In Figure 7, the background in both the blue box and the red box of image A is under-exposed, so image A as a whole is under-exposed. The background in the blue box of image B is well exposed, while the red box is over-exposed, so image B as a whole is over-exposed. The other images are obtained through the various image fusion algorithms: the CNN, DenseFuse, and IFCNN images are over-exposed. In contrast, in the image fused by the network model proposed in this paper, the content in the blue box is visible and conforms to the distribution of the light sources, and in the red box the grid under the light can be seen without the over-exposure effect of the light.
In conclusion, the network model proposed in this paper better preserves the information in under-exposed and over-exposed images. Thus, the proposed model is effective and also achieves good results in other areas.
3.6. Ablation Experiments
In the ablation experiment, the Inception modules and dense blocks on each branch are replaced with convolution kernels of size 3 ∗ 3; the other parts are identical to the proposed network model.
The red numbers in Table 6 indicate the optimal value in each column. The values with a gray background are the scores of the network model proposed in this paper on each dataset, and the values without a background are the scores of the ablation models. Out of a total of 30 indicators over all datasets, our model obtained 24 of the highest scores. Thus, the Inception modules in the network model of this article, as well as the dense blocks, are useful and work well on most of the datasets.
4. Conclusions
Deep learning has achieved good results in many fields. This paper organically combines deep learning with image fusion and proposes an image fusion algorithm based on deep learning. Table 4, Table 5 and Table 6 show that our model achieves good results on various datasets.
The multi-branch, multi-scale deep learning image fusion network model proposed in this paper extracts image features by adding multiple branches and introducing dense blocks in the design of the encoder. An activation function is not applied after every convolution in the network, and rather than relying on a single activation function, several different activation functions are used. In the design of the loss function, we use the SSIM value between images as the loss function of the network. In the design of the decoder, we use dense blocks to reduce the dimensionality of the features. The images fused by the proposed model largely preserve the content of the original images and look realistic and natural. The experiments prove that the proposed network model achieves the best results on most of the datasets.
In the future, we will try more image fusion methods and study how to reduce the hyperparameters of the network. We will then improve the network efficiency and strive to solve more problems in the field of image fusion.
Author Contributions
Conceptualization, Z.C.; methodology, Z.C.; software, Z.C.; validation, Y.D., Z.C., F.G. and Z.L.; formal analysis, Z.C.; data curation, Z.C. and Z.L.; writing—original draft preparation, Z.C.; writing—review and editing, Z.C. and Y.D.; visualization, Z.C. and Z.L.; supervision, Y.D.; funding acquisition, Y.D. and F.G. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the National Natural Science Foundation of China grant No. 61772295, 61572270, the PHD foundation of Chongqing Normal University (No. 19XLB003), the Science and Technology Research Program of Chongqing Municipal Education Commission (KJZD.M202-000501), Chongqing Technology Innovation and Application Development Special General Project (cstc-2020jscxlyjsAX0002) and Chongqing Technology Foresight and Institutional Innovation Project (cstc2021-jsyj-.yzys-bAX0011).
Data Availability Statement
Conflicts of Interest
We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and no professional or other personal interest of any nature or kind in any product, service, or company that could be construed as influencing the position presented in the manuscript entitled “Multi-branch Multi-scale Deep Learning Image Fusion Algorithm Based On DenseNet”.
Abbreviations
The following abbreviations are used in this manuscript:
EN | Entropy |
MI | Mutual information |
SSIM | Structural similarity |
MS-SSIM | Multi-scale SSIM |
VIF | Visual information fidelity |
SF | Spatial Frequency |
| Image Quality |
| Noise |
DF | Definition |
SD | Standard Deviation |
References
- Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. arXiv 2022, arXiv:2201.03545. [Google Scholar]
- Krizhevsky, A.; Sutskever, I.; Hinton, G. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25; Curran Associates, Inc.: Red Hook, NY, USA, 2012. [Google Scholar]
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Rabinovich, A. Going Deeper with Convolutions; IEEE Computer Society: Washington, DC, USA, 2014. [Google Scholar]
- Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
- Iandola, F.; Moskewicz, M.; Karayev, S.; Girshick, R.; Keutzer, K. Densenet: Implementing efficient convnet descriptor pyramids. arXiv 2014, arXiv:1404.1869. [Google Scholar]
- Liu, Y.; Chen, X.; Peng, H.; Wang, Z. Multi-focus image fusion with a deep convolutional neural network. Inf. Fusion 2017, 36, 191–207. [Google Scholar] [CrossRef]
- Ma, J.; Wei, Y.; Liang, P.; Chang, L.; Jiang, J. Fusiongan: A generative adversarial network for infrared and visible image fusion. Inf. Fusion 2019, 48, 11–26. [Google Scholar] [CrossRef]
- Zhang, Y.; Liu, Y.; Sun, P.; Yan, H.; Zhao, X.; Zhang, L. Ifcnn: A general image fusion framework based on convolutional neural network. Inf. Fusion 2020, 54, 99–118. [Google Scholar] [CrossRef]
- Fu, Y.; Xu, T.; Wu, X.; Kittler, J. PPT Fusion: Pyramid Patch Transformerfor a Case Study in Image Fusion. arXiv 2021, arXiv:2107.13967. [Google Scholar]
- Hui, L.; Wu, X.J. Densefuse: A fusion approach to infrared and visible images. IEEE Trans. Image Process. 2018, 28, 2614–2623. [Google Scholar]
- Hui, L.; Wu, X.J.; Kittler, J. Infrared and visible image fusion using a deep learning framework. In Proceedings of the International Conference on Pattern Recognition 2018, Beijing, China, 18–20 August 2018. [Google Scholar]
- Fu, Y.; Wu, X.J. A Dual-branch Network for Infrared and Visible Image Fusion. arXiv 2021, arXiv:2101.09643. [Google Scholar]
- Li, H.; Wu, X.J.; Durrani, T. Nestfuse: An infrared and visible image fusion architecture based on nest connection and spatial/channel attention models. IEEE Trans. Instrum. Meas. 2020, 69, 9645–9656. [Google Scholar] [CrossRef]
- Zhou, W.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar]
- Kumar, N.; Hoffmann, N.; Oelschlägel, M.; Koch, E.; Kirsch, M.; Gumhold, S. Structural Similarity based Anatomical and Functional Brain Imaging Fusion. In Proceedings of the International Conference on Medical Image Computing and Computer Assisted Intervention and International Workshop on Mathematical Foundations of Computational Anatomy, Shenzhen, China, 17 October 2019; Springer: Cham, Switzerland, 2019. [Google Scholar]
- Fu, Y.; Wu, X.J.; Durrani, T. Image fusion based on generative adversarial network consistent with perception. Inf. Fusion 2021, 72, 110–125. [Google Scholar] [CrossRef]
- Prabhakar, K.R.; Srikar, V.S.; Babu, R.V. Deepfuse: A deep unsupervised approach for exposure fusion with extreme exposure image pairs. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
- Nejati, M.; Samavi, S.; Shirani, S. Multi-focus image fusion using dictionary-based sparse representation. Inf. Fusion 2015, 25, 72–84. [Google Scholar] [CrossRef]
- Xu, H.; Ma, J.; Le, Z.; Jiang, J.; Guo, X. Fusiondn: A unified densely connected network for image fusion. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020. [Google Scholar]
- Roberts, J.W.; Van Aardt, J.A.; Ahmed, F.B. Assessment of image fusion procedures using entropy, image quality, and multispectral classification. J. Appl. Remote Sens. 2008, 2, 023522. [Google Scholar]
- Fu, Y.; Wu, X.J.; Kittler, J. Effective method for fusing infrared and visible images. J. Electron. Imaging 2021, 30, 033013. [Google Scholar] [CrossRef]
- Xydeas, C.S.; Petrovic, V. Objective image fusion performance measure. Mil. Tech. Cour. 2000, 56, 181–193. [Google Scholar]
- Qu, G.; Zhang, D.; Yan, P. Information measure for performance of image fusion. Electron. Lett. 2002, 38, 313–315. [Google Scholar] [CrossRef] [Green Version]
- Liu, Y.; Chen, X.; Wang, Z.; Wang, Z.J.; Ward, R.K.; Wang, X. Deep learning for pixel-level image fusion: Recent advances and future prospects. Inf. Fusion 2018, 42, 158–173. [Google Scholar] [CrossRef]
- Eskicioglu, A.M.; Fisher, P.S. Image quality measures and their performance. IEEE Trans. Commun. 1995, 43, 2959–2965. [Google Scholar]
- Xu, H.; Ma, J.; Jiang, J.; Guo, X.; Ling, H. U2fusion: A unified unsupervised image fusion network. IEEE Trans. Pattern Anal. Machine Intell. 2020, 4, 502–518. [Google Scholar]
- Li, H.; Wu, X.J.; Kittler, J. Rfn-nest: An end-to-end residual fusion network for infrared and visible images. Inf. Fusion 2021, 73, 72–86. [Google Scholar]
- Li, H.; Wu, X.J.; Kittler, J. Mdlatlrr: A novel decomposition method for infrared and visible image fusion. IEEE Trans. Image Process. 2020, 29, 4733–4746. [Google Scholar] [CrossRef]
- Kumar, B.K.S. Multifocus and multispectral image fusion based on pixel significance using discrete cosine harmonic wavelet transform. Signal Image Video Process. 2013, 7, 1125–1143. [Google Scholar] [CrossRef]
- Liu, Y.; Chen, X.; Cheng, J.; Peng, H.; Wang, Z. Infrared and visible image fusion with convolutional neural networks. Int. J. Wavelets Multiresolut. Inf. Process. 2018, 16, 1850018. [Google Scholar] [CrossRef]
- Xu, H.; Ma, J.; Zhang, X.P. Mef-gan: Multi-exposure image fusion via generative adversarial networks. IEEE Trans. Image Process. 2020, 29, 7203–7216. [Google Scholar] [CrossRef]