1. Introduction
Infrared detectors operate effectively both day and night, particularly excelling in low-light conditions by capturing the thermal emissions from objects. In contrast to full-color visible images, single-channel infrared imagery lacks color and textural detail, diverging from human visual perception. Translating infrared to the visible spectrum can imbue infrared images with color, enhancing human perception under all lighting conditions [1]. However, bridging the gap between these spectral domains to generate semantically rich color images remains challenging. Deep learning approaches, particularly convolutional neural networks (CNNs), have advanced the colorization of infrared images [2]. While supervised methods like conditional generative adversarial networks (cGANs) [3] show promise in this domain, they typically depend on extensive paired datasets, which are impractical to collect in real-world settings due to high costs and time constraints. Learning to map between unpaired infrared and visible images thus presents a valuable alternative.
Unsupervised infrared image colorization, which aims to map grayscale thermal images into the visible spectrum without matched references, is an emerging focus in imaging research. Traditional methods, relying on preset knowledge or multispectral fusion, often fall short due to rigid color translation rules and a lack of flexibility [4,5,6,7]. Recent developments in deep learning, with architectures such as CycleGAN [8], CUT [9], and UNIT [10], have made strides in unpaired image-to-image translation, offering innovative strategies for cross-domain colorization [11,12,13]. Despite these advancements, significant challenges persist. Infrared and visible imagery differ fundamentally in color space and information content, complicating the task of inferring accurate semantic colors from monochromatic data without labeled guidance. Current unsupervised methods struggle with semantic interpretation, hindering their ability to establish the complex mappings between domains; consequently, colorization results often lack naturalness and semantic precision. Furthermore, a distinct domain gap exists between infrared and visible images. The two domains correspond to different wavelength ranges, which directly affects their imaging mechanisms and the information they carry: visible imaging usually relies on the reflection of an external light source, whereas thermal infrared imaging is based mainly on the thermal radiation emitted by objects themselves. The same object therefore exhibits entirely different visual characteristics in the two bands; for example, an object that appears in vibrant colors in the visible domain appears in the infrared domain only as a temperature distribution. While visible images provide rich color and texture information, thermal infrared images mainly reflect temperature distributions. Single-channel infrared grayscale images not only lack semantic color information but also suffer from structural blurring and missing texture [14,15], which increases the difficulty of cross-domain feature learning. In summary, existing unsupervised image translation methods are limited when processing infrared images: they struggle to predict chromaticity and structural texture that match the semantic content from infrared grayscale information, so the generated colorized images often fall short of human visual expectations.
To address these problems, we propose a dual-branch colorization network for infrared images that realizes unsupervised infrared-to-visible translation based on perceptual features and multiple contrastive learning. The generator of the colorization network is improved by combining multiscale residual blocks with an attention mechanism. Our method aims to capture image features at different scales, preserve and enhance detail, and generate colors consistent with human visual perception. We note that the PatchNCE loss learns patch-level similarity by maximizing the mutual information between corresponding patches of the input and output images, while the generative adversarial loss captures image-level similarity through the adversarial game between the generated image and the target-domain images. To learn multi-level feature information, we combine contrastive learning with the generative adversarial framework in the patch-wise contrastive guidance branch (PwCGB) and design a composite loss function. In addition, considering the importance of semantic information for infrared-to-visible translation, we propose a perceptual contrastive loss. Inspired by the perceptual loss, the perceptual contrastive guidance branch (PCGB) extracts high-dimensional features of the infrared image, the generated colorized image, and the visible image with a pre-trained VGG16 network. By minimizing the representation distance of positive sample pairs and maximizing that of negative sample pairs, the similarity and difference between the infrared and visible domains are optimized in feature space. This perceptual contrastive learning strategy, combined with high-dimensional feature extraction, improves colorization quality and makes the semantic colors of the generated images more accurate. To further improve colorization quality and enhance the feature extraction capability of the generator, our multiscale residual block is added in the down-sampling stage of the generator, where parallel residual branches with different convolutional kernel sizes acquire feature information at different scales. In addition, we combine channel attention and residual connections by designing feature fusion attention residual blocks in the up-sampling stage of the generator. Through the attention mechanism, multiscale information is better integrated, so low-level and high-level features are combined more efficiently during up-sampling, improving colorization accuracy.
Therefore, the main contributions of this paper are as follows:
We propose a dual-branch network for unsupervised infrared image colorization, in which the large differences between the infrared and visible domains are bridged through multiple contrastive learning.
We propose perceptual contrastive loss to enhance the similarity between the generated image and the visible image in the high-dimensional feature space, making the colorized image more compatible with human visual perception.
A multiscale residual module is designed to help the encoder process feature maps at different scales, enhancing the generator network’s feature extraction capability.
A feature fusion attention residual block is designed to integrate multiscale feature information, focusing on important features during up-sampling to produce higher-quality colorized images.
3. Proposed Method
3.1. Architecture
Figure 2 depicts the architecture of DC-Net, featuring two principal branches: the patch-wise contrastive guidance branch (PwCGB) for local and global feature extraction, and the perceptual contrastive guidance branch (PCGB), which focuses on semantic detail aligned with human visual perception. DC-Net operates via the dual-path framework illustrated in Figure 1c. In PwCGB, the input infrared image $x$ is converted to the color output $\hat{y}$ by the generator $G$ ($x \rightarrow \hat{y}$), which is then reduced to the infrared representation $\hat{x}$ by the generator $F$ ($\hat{y} \rightarrow \hat{x}$). DC-Net employs a contrastive loss rather than a cycle-consistency loss to compare the infrared input with the generated color image, and uses the generative adversarial loss to constrain $\hat{y}$ and $y$ as well as $\hat{x}$ and $x$. In PCGB, the input infrared image $x$, the color output $\hat{y}$, and the visible image $y$ are processed through a pre-trained VGG16 network to obtain the feature maps used for the contrastive loss computation. With this multi-level feature learning approach, DC-Net facilitates high-fidelity infrared-to-visible conversion.
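To make the dual-path data flow concrete, the sketch below traces one forward pass in PyTorch. The tiny convolutional stand-ins, tensor sizes, and the generator names $G$ and $F$ follow the reconstruction above and are illustrative assumptions only; the actual generators are the MRA-UNet described in Section 3.2.

```python
# Minimal sketch of the DC-Net dual-path data flow (illustrative stand-ins only;
# the real generators are MRA-UNet and the losses are defined in Section 3.4).
import torch
import torch.nn as nn

def tiny_generator(in_ch, out_ch):
    # Placeholder for the MRA-UNet generator: a small conv stack to show tensor shapes.
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(32, out_ch, 3, padding=1), nn.Tanh(),
    )

G = tiny_generator(1, 3)   # infrared -> color (x -> y_hat)
F = tiny_generator(3, 1)   # color -> infrared (y_hat -> x_hat)

x = torch.rand(2, 1, 128, 128)   # batch of single-channel infrared inputs
y = torch.rand(2, 3, 128, 128)   # unpaired visible images from the target domain

y_hat = G(x)        # PwCGB: colorized output, compared with y via the adversarial loss
x_hat = F(y_hat)    # reduced infrared representation, compared with x

# PCGB: x, y_hat, and y are later fed through a frozen VGG16 to obtain the
# feature maps used for the perceptual contrastive loss (see Section 3.3).
print(y_hat.shape, x_hat.shape)   # torch.Size([2, 3, 128, 128]) torch.Size([2, 1, 128, 128])
```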
3.2. Patch-Wise Contrastive Guidance Branch
In the PwCGB, we apply a patch-based strategy to dissect the input image into small chunks, selecting positive and negative samples randomly. Positive samples correspond to areas with target similarities, while negatives exhibit differences. Through contrastive learning, we refine color and texture matching across similar patches, differentiating features where needed to bring the synthetic colorized infrared images closer to actual visible-domain images in terms of local qualities. For comprehensive feature extraction, we incorporate generative adversarial loss. This fosters a convergence of characteristics between the colorized infrared and visible images, aligning global colors and textures.
U-Net is commonly used as a generator in generative adversarial networks, featuring a down-sampling path (encoder) and an up-sampling path (decoder) to learn high-resolution feature mappings. While the traditional U-shaped structure can fuse contextual information, U-Net relies solely on convolution and pooling operations, which are insufficient to capture multiscale information, and its direct feature fusion cannot selectively focus on important features. To improve the generator's capacity to capture and represent the intricate features of input infrared images, we developed a multiscale residual attention U-Net (MRA-UNet) as the generator in the PwCGB branch.
3.2.1. Multiscale Residual Block
We introduce a multiscale residual block (MRB) in the encoder, as shown in Figure 3a. Residual blocks from ResNet are combined with convolution kernels of different sizes to construct a parallel two-branch structure. This structure helps the generator better extract multiscale low-frequency texture information through cross-layer residual connections, strengthening the network's feature learning and information transfer capabilities. The multiscale residual module extracts features at different scales in the down-sampling stage, capturing richer image details and improving feature extraction capability. MRB also passes important information directly to deeper layers, reducing the loss of feature information.
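The following PyTorch sketch illustrates one way such a parallel two-branch residual block could be built. The specific kernel sizes (3x3 and 5x5), the 1x1 fusion convolution, and the channel count are assumptions for illustration, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class MultiscaleResidualBlock(nn.Module):
    """Sketch of a parallel two-branch residual block; the 3x3/5x5 kernel choice
    and the 1x1 fusion layer are illustrative assumptions."""
    def __init__(self, channels):
        super().__init__()
        self.branch3 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.branch5 = nn.Sequential(
            nn.Conv2d(channels, channels, 5, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 5, padding=2),
        )
        # Fuse the two scales back to the original channel count.
        self.fuse = nn.Conv2d(2 * channels, channels, 1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        multi = self.fuse(torch.cat([self.branch3(x), self.branch5(x)], dim=1))
        return self.act(multi + x)   # cross-layer residual connection

feat = torch.rand(1, 64, 64, 64)
print(MultiscaleResidualBlock(64)(feat).shape)   # torch.Size([1, 64, 64, 64])
```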
3.2.2. Channel Attention Residual Block
As shown in Figure 3b, we combine the channel attention module with residual connections to design the Channel Attention Residual Block (CARB) and introduce the Feature Fusion Attention Residual Block (CFARB) in the decoder. The attention mechanism guides the network to learn effective features faster by modeling the interdependence between feature channels and adaptively fusing the features of each channel. This allows better integration of multiscale information, enabling a more effective combination of low-level and high-level features during up-sampling and thus improving colorization accuracy. It also helps the colorization generator capture more useful channel feature information.
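A minimal sketch of a channel attention residual block of this kind is given below, using squeeze-and-excitation style channel attention inside a residual connection. The reduction ratio and layer layout are assumed hyperparameters for illustration.

```python
import torch
import torch.nn as nn

class ChannelAttentionResidualBlock(nn.Module):
    """Sketch of a CARB-style block: channel attention wrapped in a residual
    connection; the reduction ratio of 8 is an assumed hyperparameter."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.attention = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                                     # squeeze: global channel statistics
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(), # per-channel weights
        )

    def forward(self, x):
        feat = self.body(x)
        return x + feat * self.attention(feat)   # reweight channels, then add the skip path

feat = torch.rand(1, 64, 32, 32)
print(ChannelAttentionResidualBlock(64)(feat).shape)   # torch.Size([1, 64, 32, 32])
```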
3.3. Perceptual Contrastive Guidance Branch
The perceptual loss aims to measure the perceptual similarity between the model’s generated output and real data. Leveraging a pre-trained CNN, specifically VGG16, it assesses high-level feature correspondence. This measure has proven effective in visual applications like image synthesis and style transfer, prompting its use in our PCGB branch for infrared-to-visible image translation, emphasizing semantic content to yield colorized outputs closely matching the visible spectrum.
As shown in Figure 4, we extract features from visible, colorized infrared, and infrared images using VGG16 and then implement perceptual contrastive learning with sampled blocks from identical spatial locations across these features. This enables the model to integrate semantic and color attributes pertinent to both spectra. To broaden diversity, negative samples are drawn not only from the infrared feature maps but also from the visible feature maps: specifically, grayscale and non-corresponding blocks from visible images serve as negatives, enhancing structural and content fidelity in the colorized results and ensuring their alignment with human visual expectations.
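As a sketch of the feature extraction step, the snippet below taps several intermediate layers of a frozen, pre-trained VGG16 from torchvision. The tapped layer indices and the choice to replicate the single-channel infrared image to three channels are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class VGGFeatures(nn.Module):
    """Sketch of the frozen VGG16 feature extractor used for perceptual contrastive
    learning; the tapped layer indices are illustrative assumptions."""
    def __init__(self, layer_ids=(3, 8, 15, 22)):
        super().__init__()
        self.vgg = vgg16(weights="IMAGENET1K_V1").features.eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)          # keep the perceptual network fixed
        self.layer_ids = set(layer_ids)

    def forward(self, img):
        feats, x = [], img
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.layer_ids:
                feats.append(x)              # multi-level feature maps for contrastive sampling
        return feats

extractor = VGGFeatures()
ir = torch.rand(1, 3, 224, 224)              # infrared input replicated to 3 channels
for f in extractor(ir):
    print(f.shape)
```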
3.4. Loss Function
In this section, we discuss the loss functions employed in both the patch-wise contrastive guidance branch (PwCGB) and the perceptual contrastive guidance branch (PCGB). The composite loss function in the PwCGB branch is a combination of the contrastive loss ($\mathcal{L}_{\mathrm{con}}$) and the generative adversarial loss ($\mathcal{L}_{\mathrm{GAN}}$).
3.4.1. Contrastive Loss
The initial step involves selecting an anchor vector $q$ as the query vector. Subsequently, one positive sample $k^{+}$ and $N$ negative samples $k^{-}$ are chosen. The contrastive loss is then computed to compare the similarity features between the infrared and visible domains in the cross-domain coloring task. $\mathcal{L}_{\mathrm{con}}$ serves as a constraint for the colorized infrared image generator $G$, eliminating the need for paired visible images as label information. The contrastive loss is formulated as

$$\mathcal{L}_{\mathrm{con}} = -\log \frac{\exp\!\left(q \cdot k^{+} / \tau\right)}{\exp\!\left(q \cdot k^{+} / \tau\right) + \sum_{n=1}^{N} \exp\!\left(q \cdot k_{n}^{-} / \tau\right)},$$

where the temperature parameter $\tau$ is a fixed value set to 0.07.
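A minimal PyTorch sketch of this patch-wise InfoNCE-style loss is shown below. The feature normalization, embedding dimension, and patch counts are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def patch_info_nce(query, positive, negatives, tau=0.07):
    """Sketch of the patch-wise contrastive (InfoNCE) loss.
    query:     (B, C)    anchor patch embeddings from the generated image
    positive:  (B, C)    corresponding patch embeddings from the input infrared image
    negatives: (B, N, C) embeddings of N non-corresponding patches
    """
    query = F.normalize(query, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    pos_logit = (query * positive).sum(dim=-1, keepdim=True)             # (B, 1)
    neg_logits = torch.bmm(negatives, query.unsqueeze(-1)).squeeze(-1)   # (B, N)
    logits = torch.cat([pos_logit, neg_logits], dim=1) / tau
    # The positive pair sits at index 0 of each row.
    target = torch.zeros(query.size(0), dtype=torch.long)
    return F.cross_entropy(logits, target)

q = torch.rand(256, 128)          # 256 sampled patches, 128-D embeddings
k_pos = torch.rand(256, 128)
k_neg = torch.rand(256, 255, 128)
print(patch_info_nce(q, k_pos, k_neg))
```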
3.4.2. Generative Adversarial Loss
To better restore the chromatic and luminance information of the overall infrared image, a generative adversarial loss is introduced in the PwCGB branch. For any input infrared image
, the generator
G and discriminator
D collaborate to encourage the generated colorized infrared image
to compete with the target domain visible image
, aiming for more accurate color information. The representation of the generative adversarial loss
is as follows:
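The snippet below sketches the corresponding discriminator and generator terms in their standard binary cross-entropy form; the exact adversarial variant and discriminator output shape used in the paper are assumptions here.

```python
import torch
import torch.nn.functional as F

def gan_losses(d_real_logits, d_fake_logits):
    """Sketch of the adversarial terms in standard binary cross-entropy form;
    the paper's exact variant is not reproduced here."""
    real_lbl = torch.ones_like(d_real_logits)
    fake_lbl = torch.zeros_like(d_fake_logits)
    # Discriminator: real -> 1, generated -> 0.
    d_loss = F.binary_cross_entropy_with_logits(d_real_logits, real_lbl) + \
             F.binary_cross_entropy_with_logits(d_fake_logits, fake_lbl)
    # Generator: fool the discriminator into labelling generated images as real.
    g_loss = F.binary_cross_entropy_with_logits(d_fake_logits, real_lbl)
    return d_loss, g_loss

d_real = torch.randn(2, 1, 30, 30)   # assumed patch-level discriminator logits
d_fake = torch.randn(2, 1, 30, 30)
print(gan_losses(d_real, d_fake))
```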
3.4.3. Perceptual Contrastive Loss in PCGB Branch
Through a fixed pre-trained VGG16 network, the infrared image, the colorized infrared image, and the visible image are individually fed to the network, yielding the corresponding feature maps $V(x)$, $V(\hat{y})$, and $V(y)$. The query vector $q$ is then sampled from $V(\hat{y})$, and the contrastive loss is computed with the corresponding positive samples and negative samples drawn from $V(x)$ and $V(y)$ (as described in Section 3.3), making the generated color information and structural details more consistent with their semantic features. The perceptual contrastive loss $\mathcal{L}_{\mathrm{pc}}$ is formulated as

$$\mathcal{L}_{\mathrm{pc}} = \sum_{i} w_{i} \left( -\log \frac{\exp\!\left(q_{i} \cdot k_{i}^{+} / t\right)}{\exp\!\left(q_{i} \cdot k_{i}^{+} / t\right) + \sum_{n} \exp\!\left(q_{i} \cdot k_{i,n}^{-} / t\right)} \right),$$

where $i$ indexes the feature layers extracted by VGG16, $w_{i}$ is the corresponding layer weight, and $t$ is a fixed temperature parameter. Based on the two aforementioned loss functions, the overall loss function can be defined as

$$\mathcal{L} = \lambda_{1} \mathcal{L}_{\mathrm{con}} + \lambda_{2} \mathcal{L}_{\mathrm{GAN}} + \lambda_{3} \mathcal{L}_{\mathrm{pc}},$$

where $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ represent the weights for the contrastive loss, the global (adversarial) feature loss, and the perceptual contrastive loss, respectively.
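A minimal sketch of how the layer-weighted perceptual contrastive loss and the overall objective could be assembled is given below. Equal layer weights, the temperature default of 0.07, the tensor shapes, and the mapping of the reported component weights (0.5, 0.5, 1 from Section 4.1.3) to the three terms are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def info_nce(q, k_pos, k_neg, t=0.07):
    # Same InfoNCE form as in the PwCGB branch (see Section 3.4.1).
    q, k_pos, k_neg = (F.normalize(v, dim=-1) for v in (q, k_pos, k_neg))
    pos = (q * k_pos).sum(-1, keepdim=True)
    neg = torch.bmm(k_neg, q.unsqueeze(-1)).squeeze(-1)
    logits = torch.cat([pos, neg], dim=1) / t
    return F.cross_entropy(logits, torch.zeros(q.size(0), dtype=torch.long))

def perceptual_contrastive(queries, positives, negatives, weights):
    """Layer-weighted perceptual contrastive loss over VGG16 feature levels.
    Each argument is a list with one entry per tapped VGG layer; equal layer
    weights and the temperature value are illustrative assumptions."""
    return sum(w * info_nce(q, kp, kn)
               for w, q, kp, kn in zip(weights, queries, positives, negatives))

def total_loss(l_con, l_gan, l_pc, lam=(0.5, 0.5, 1.0)):
    # Overall objective; the weight-to-term assignment is assumed.
    return lam[0] * l_con + lam[1] * l_gan + lam[2] * l_pc

qs  = [torch.rand(64, 128) for _ in range(4)]
kps = [torch.rand(64, 128) for _ in range(4)]
kns = [torch.rand(64, 255, 128) for _ in range(4)]
l_pc = perceptual_contrastive(qs, kps, kns, weights=[1.0] * 4)
print(total_loss(torch.tensor(0.9), torch.tensor(0.7), l_pc))
```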
4. Experiments
4.1. Experiments Settings
4.1.1. Datasets
Our models are trained and evaluated on different datasets for the different infrared bands. In the NIR band, we used the NIRScene dataset [34], which consists of 477 images in 9 categories captured in both RGB and NIR modalities; the scene categories include countryside, field, forest, indoor, mountain, old building, street, city, and lake. In the thermal infrared band, we randomly selected several images from the multispectral pedestrian detection dataset KAIST [33] as the training set. The KAIST pedestrian dataset consists of 95,328 images in total, each available as both an RGB color image and an infrared image. The KAIST dataset captures a variety of regular traffic scenes, including schoolyards, streets, and the countryside.
4.1.2. Quantitative Evaluation Metrics
Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), and Mean Squared Error (MSE) were employed in this study as quantitative evaluation metrics to measure the quality of colorized infrared images. The quantitative experimental results are reported as the average scores across all images in the test set.
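For reference, the snippet below sketches how PSNR, SSIM, and MSE could be computed for one image pair with scikit-image; it is an assumed evaluation setup, not the exact script used in the paper.

```python
import numpy as np
from skimage.metrics import (peak_signal_noise_ratio, structural_similarity,
                             mean_squared_error)

def evaluate_pair(colorized, reference):
    """Sketch of per-image PSNR/SSIM/MSE computation (8-bit RGB assumed)."""
    psnr = peak_signal_noise_ratio(reference, colorized, data_range=255)
    ssim = structural_similarity(reference, colorized, channel_axis=-1, data_range=255)
    mse = mean_squared_error(reference, colorized)
    return psnr, ssim, mse

ref = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
out = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
print(evaluate_pair(out, ref))   # scores are averaged over the whole test set in the paper
```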
4.1.3. Training Details
For training, we randomly select 1000 unpaired images from the KAIST dataset and crop them to a fixed size using a sliding window. Our model is trained on a single NVIDIA 2080Ti GPU with a batch size of 1. During training, we optimize with the Adam optimizer, using an initial learning rate of 0.0001 for 200 epochs, with a total training time of 24.38 h. To compute the contrastive loss and the perceptual contrastive loss, we select layers 4, 8, 12, and 16, sampling 256 patches for each layer. An important issue during training is how to assign appropriate weights to the adversarial and contrastive losses: weights that are too large or too small may lead to undesirable results. For example, if the weight of the adversarial loss is too high, details may be lost in the generated image, while an excessive contrastive-loss weight may inhibit generation diversity. After experiments, we set the loss weights of the components to 0.5, 0.5, and 1.
4.2. Quantitative Testing
To quantitatively compare our method with other unsupervised image translation algorithms, tests are performed on the test sets of the NIR and KAIST datasets, respectively. Table 1 and Table 2 show the quantitative results of CycleGAN, CUT, FastCUT, IR-colorization [35], TIC-CGAN [31], and our method on the KAIST and NIR datasets, respectively. The results show that our method achieves the best performance on every quantitative metric, with a particularly clear advantage on the KAIST thermal infrared dataset. On the KAIST dataset, our method shows improvements over CycleGAN of 28.9% in PSNR, 39.0% in SSIM, and 48.4% in MSE; over CUT of 23.8% in PSNR, 35.5% in SSIM, and 43.1% in MSE; over FastCUT of 14.3% in PSNR, 32.4% in SSIM, and 44.1% in MSE; over IR-colorization of 12.5% in PSNR, 17.7% in SSIM, and 50% in MSE; and over TIC-CGAN of 11.0% in PSNR, 7.3% in SSIM, and 40% in MSE. Our method also achieves the best measurement results on the NIR dataset.
To compare the performance of our method with other unsupervised image translation methods more intuitively, Figure 5 presents the average measurement results on the KAIST dataset as a line graph, and Figure 6 presents the corresponding results on the NIR dataset.
4.3. Qualitative Testing
Figure 7 shows the colorization results of different unsupervised image translation methods on the KAIST dataset. For the street scene, each method recovers the approximate color. However, CUT, CycleGAN, and FastCUT fail to accurately recover the structure and details of the “vehicle” target on the street, and the taillights on the “vehicle” are not correctly colored. For the campus scene, the first three methods only recover the colors of the sky and the ground but seriously lose the color and structural information of the targets such as “buildings” and “lane lines”. IR-colorization successfully recovers the colors of taillights but loses most of the feature colors of targets such as “crosswalks” and “buildings”. TIC-CGAN performs well in recovering “lane lines” and “crosswalks”, but the results are structurally ambiguous and lack texture information. Our method preserves the structural information of targets such as trees, buildings, vehicles, and lane lines while also restoring their colors based on human eye perception.
Figure 8 shows the colorization results of different unsupervised image translation methods on the NIR-Scene test set. For the mountain scene, CUT, CycleGAN, and FastCUT all exhibit coloring disorder, incorrectly assigning roads the color of mountains. In the indoor scenes, CUT and FastCUT show varying degrees of texture mosaicing, while CycleGAN blurs the “chandelier”. In contrast, IR-colorization and TIC-CGAN perform well for the mountain scene, but the “chandelier” still appears blurred in the indoor scene. The colorization of our method matches the semantic information, and the texture is clear. For the lake scene, both CycleGAN and our method color accurately, while CUT, FastCUT, IR-colorization, and TIC-CGAN make coloring errors. In addition, only our method successfully recovers the color of the mountains around the lake.
To examine the effectiveness of DC-Net in detail, texture, and coloring, Figure 9 shows how each method colorizes a target, using a car as an example on the thermal infrared dataset. CUT and CycleGAN show the most obvious edge distortion, TIC-CGAN loses the most detail, and only our method accurately colors the car body and lights. On the NIR dataset, we take the chandelier as an example: CUT and FastCUT both show severe mosaic artifacts, while CycleGAN, IR-colorization, and TIC-CGAN produce blurred details.
Overall, the qualitative results show that our method performs well in converting infrared images to visible images, generating images with accurate colors and rich details. All the unsupervised image translation methods shown can recover low-frequency background content such as the “sky” and “trees” well. However, for rapidly changing details, textures, and edges, such as the “vehicle”, “crosswalk”, and “lane line”, the existing unsupervised image translation methods cannot generate high-quality structural detail in the colorized infrared images. Our method adeptly manages scenes with significant temperature fluctuations, such as object edges or localized regions with notable temperature variation. In these conditions, our approach reproduces high-frequency content, effectively capturing intricate details and local features within the image, and generates color information akin to that found in visible images of the target domain.
4.4. Ablation Study
To evaluate the impact of the different parts of our dual-branch structure, consisting of PwCGB and PCGB, we conducted ablation studies in which we train the network with the weight of each loss component set to 0 in turn to test its effect. In the PwCGB branch, we remove the contrastive loss $\mathcal{L}_{\mathrm{con}}$ and the multiscale residual attention generator; to test the effectiveness of PCGB, we remove the perceptual contrastive loss $\mathcal{L}_{\mathrm{pc}}$. We performed ablation experiments on three scenes from the KAIST dataset, and the colorization results are shown in Figure 10, Figure 11 and Figure 12. In the school scene, removing both the perceptual contrastive loss and the MRA-UNet results in an overall dark color and the loss of most building details. In the street scene, removing the contrastive loss, the perceptual contrastive loss, and the MRA-UNet results in blurred building structures and the loss of small targets in the blue box. In the traffic scene, targets such as “lane lines” and “vehicles” show severe artifacts and are not colored correctly. The average quantitative results after removing the different parts are shown in Table 3.
5. Conclusions
This paper presents a novel unsupervised method for translating unpaired infrared images to visible images using a semantic-aware dual-branch contrastive learning network. The proposed network leverages contrastive learning for the transformation process and enriches it with high-level perceptual features extracted by pre-trained deep learning models. These features guide the colorization process, resulting in images that align more closely with human visual perception. In addition, the multiscale residual attention generator in PwCGB efficiently learns both local and global image features through multi-layer residual blocks; the residual connections enable the model to better capture detailed information, reduce information loss, and help generate more realistic color images. The feature fusion attention residual block introduced in the generator enables finer tuning of feature responses across channels, improving detail retention and color accuracy during colorization. Experimental results indicate that our method effectively infers RGB values from infrared grayscale information, yielding high-quality colorized infrared images with accurate colors and detailed textures that perform well in infrared-to-visible translation tasks. Nevertheless, in real-world scenarios where high-temperature objects diffuse thermal radiation, similar grayscale values in surrounding areas can cause indistinct boundaries in the colorized results; future work will focus on addressing this issue to enhance boundary clarity. We also consider deploying the colorization model in real-time surveillance and defense systems, and in future research we will investigate model pruning and quantization techniques to reduce model size and computational demands and improve inference speed.