V2T-GAN: Three-Level Refined Light-Weight GAN with Cascaded Guidance for Visible-to-Thermal Translation
Abstract
1. Introduction
2. Related Work
2.1. Infrared Image Simulation
2.2. Pixel-Level Image Conversion Tasks
2.3. Conditional Generative Adversarial Network
2.4. Lightweight Network
3. V2T-GAN
3.1. First-Level Network
- GConv divides the input channels into g even, non-overlapping groups;
- A standard convolution is performed independently on each group;
- The per-group results are concatenated along the channel dimension (a minimal code sketch follows this list).
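The three steps above can be written compactly with PyTorch's grouped convolution, which performs the channel split, per-group convolution, and channel-wise concatenation internally. This is a minimal sketch, not the paper's implementation; the framework (PyTorch), channel counts, kernel size, and group number g = 4 are illustrative assumptions.

```python
# Minimal sketch of GConv (grouped convolution). The channel split,
# per-group standard convolution, and channel-wise concatenation are all
# handled internally by nn.Conv2d when groups > 1.
import torch
import torch.nn as nn

gconv = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3,
                  padding=1, groups=4)   # 16 and 32 must be divisible by g = 4

x = torch.randn(1, 16, 64, 64)           # illustrative input
print(gconv(x).shape)                    # torch.Size([1, 32, 64, 64])
```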
- The first step is to perform a depthwise convolution (DConv);
- The second step is a standard convolution with a 1 × 1 kernel that adjusts the number of output channels (see the sketch after this list).
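These two steps describe a depthwise separable convolution (DSConv), one of the lightweight convolutions compared in Section 4.3.4. A minimal PyTorch sketch, with illustrative channel counts and kernel size, is shown below.

```python
# Minimal sketch of a depthwise separable convolution (DSConv):
# a depthwise convolution (DConv) followed by a 1x1 pointwise convolution
# that sets the number of output channels.
import torch
import torch.nn as nn

class DSConv(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        # Step 1: depthwise convolution (one filter per input channel).
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   padding=kernel_size // 2, groups=in_channels)
        # Step 2: 1x1 convolution to adjust the number of output channels.
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

print(DSConv(16, 32)(torch.randn(1, 16, 64, 64)).shape)  # [1, 32, 64, 64]
```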
- First perform a standard convolution with a 1 × 1 kernel to adjust the number of output channels;
- Then apply DConv (a sketch follows this list).
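Reversing the order, with the 1 × 1 convolution applied before the depthwise convolution, matches the BSConv variant compared in Section 4.3.4. A minimal sketch under the same assumptions as the previous ones:

```python
# Minimal sketch of BSConv (blueprint separable convolution): the DSConv
# order is reversed, with the 1x1 convolution applied before the depthwise
# convolution (DConv).
import torch
import torch.nn as nn

class BSConv(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        # Step 1: 1x1 convolution to adjust the number of output channels.
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        # Step 2: depthwise convolution on the adjusted channels.
        self.depthwise = nn.Conv2d(out_channels, out_channels, kernel_size,
                                   padding=kernel_size // 2, groups=out_channels)

    def forward(self, x):
        return self.depthwise(self.pointwise(x))

print(BSConv(16, 32)(torch.randn(1, 16, 64, 64)).shape)  # [1, 32, 64, 64]
```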
- Perform a standard convolution with a 1 × 1 kernel; the number of output channels in this step is C1 = [Cin/r], where C1 is the number of output channels of the first step, Cin is the number of input channels, and r is a manually set ratio;
- Apply GConv to the output of the first step, with the number of groups equal to C1 (the number of output channels of the first step) and C1 × (r − 1) output channels;
- Concatenate the outputs of the first and second steps to obtain the final result (a code sketch follows this list).
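These three steps describe the GhostModule variant compared in Section 4.3.4. The sketch below interprets [Cin/r] as a ceiling and uses illustrative channel counts and ratio r = 2; both are assumptions made only for illustration.

```python
# Minimal sketch of the Ghost module described above: a 1x1 convolution
# produces C1 = ceil(Cin / r) primary channels, a GConv with C1 groups
# produces C1 * (r - 1) additional channels, and the two are concatenated.
import math
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    def __init__(self, in_channels, r=2, kernel_size=3):
        super().__init__()
        c1 = math.ceil(in_channels / r)
        # Step 1: standard 1x1 convolution with C1 output channels.
        self.primary = nn.Conv2d(in_channels, c1, kernel_size=1)
        # Step 2: grouped convolution, groups = C1, C1*(r-1) output channels.
        self.cheap = nn.Conv2d(c1, c1 * (r - 1), kernel_size,
                               padding=kernel_size // 2, groups=c1)

    def forward(self, x):
        y = self.primary(x)
        # Step 3: concatenate along the channel dimension.
        return torch.cat([y, self.cheap(y)], dim=1)

print(GhostModule(16, r=2)(torch.randn(1, 16, 64, 64)).shape)  # [1, 16, 64, 64]
```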
3.2. Second-Level Network
3.2.1. Target Task Generator G2
3.2.2. Second-Level Auxiliary Task Network Gg
3.3. Third-Level Network
3.4. Loss Function
4. Experiments
4.1. Experimental Details and Evaluation Metrics
4.1.1. Dataset
4.1.2. Evaluation Metrics
4.1.3. Training Setup
4.2. Results
4.3. Ablation Study
4.3.1. Three-Level Network Structure
4.3.2. Auxiliary Task
4.3.3. Edge Auxiliary Information
4.3.4. Lightweight Convolution
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Acknowledgments
Conflicts of Interest
References
- Kim, S. Sea-based infrared scene interpretation by background type classification and coastal region detection for small target detection. Sensors 2015, 15, 24487–24513.
- Mu, C.P.; Peng, M.S.; Dong, Q.X.; Gao, X.; Zhang, R.H. Infrared Image Simulation of Ground Maneuver Target and Scene Based on OGRE. Appl. Mech. Mater. 2014, 716–717, 932–935.
- Yang, M.; Li, M.; Yi, Y.; Yang, Y.; Wang, Y.; Lu, Y. Infrared simulation of ship target on the sea based on OGRE. Laser Infrared 2017, 47, 53–57.
- Eigen, D.; Puhrsch, C.; Fergus, R. Depth map prediction from a single image using a multi-scale deep network. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; Volume 3, pp. 2366–2374.
- Laina, I.; Rupprecht, C.; Belagiannis, V.; Tombari, F.; Navab, N. Deeper depth prediction with fully convolutional residual networks. In Proceedings of the 2016 4th International Conference on 3D Vision, 3DV 2016, Stanford, CA, USA, 25–28 October 2016; pp. 239–248.
- Noh, H.; Hong, S.; Han, B. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 13–16 December 2015; pp. 1520–1528.
- Badrinarayanan, V.; Handa, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Robust Semantic Pixel-Wise Labelling. arXiv 2015, arXiv:1505.07293.
- Yin, Z.; Shi, J. GeoNet: Unsupervised Learning of Dense Depth, Optical Flow and Camera Pose. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1983–1992.
- Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017; pp. 5967–5976.
- Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2242–2251.
- Hwang, S.; Park, J.; Kim, N.; Choi, Y.; Kweon, I.S. Multispectral pedestrian detection: Benchmark dataset and baseline. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1037–1045.
- Zhou, Q.; Bai, T.; Liu, M.; Qiu, C. Near Infrared Scene Simulation Based on Visual Image. Infrared Technol. 2015, 37, 11–15.
- Li, M.; Xu, Z.; Xie, H.; Xing, Y. Infrared Image Generation Method and Detail Modulation Based on Visible Light Images. Infrared Technol. 2018, 40, 34–38.
- Wang, P.; Shen, X.; Lin, Z.; Cohen, S.; Price, B.; Yuille, A. Towards unified depth and semantic prediction from a single image. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 2800–2809.
- Xu, D.; Ricci, E.; Ouyang, W.; Wang, X.; Sebe, N. Multi-scale continuous CRFs as sequential deep networks for monocular depth estimation. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017; pp. 161–169.
- Xu, D.; Wang, W.; Tang, H.; Liu, H.; Sebe, N.; Ricci, E. Structured Attention Guided Convolutional Neural Fields for Monocular Depth Estimation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3917–3925.
- Qi, X.; Liao, R.; Liu, Z.; Urtasun, R.; Jia, J. GeoNet: Geometric Neural Network for Joint Depth and Surface Normal Estimation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 283–291.
- Ranjan, A.; Jampani, V.; Balles, L.; Kim, K.; Sun, D.; Wulff, J.; Black, M.J. Competitive collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 12232–12241.
- Jiao, J.; Cao, Y.; Song, Y.; Lau, R. Look deeper into depth: Monocular depth estimation with semantic booster and attention-driven loss. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; Volume 11219 LNCS, pp. 55–71.
- Mirza, M.; Osindero, S. Conditional Generative Adversarial Nets. arXiv 2014, arXiv:1411.1784.
- Hu, L.; Zhang, Y. Facial Image Translation in Short-Wavelength Infrared and Visible Light Based on Generative Adversarial Network. Guangxue Xuebao/Acta Opt. Sin. 2020, 40, 0510001.
- Ma, S.; Fu, J.; Chen, C.W.; Mei, T. DA-GAN: Instance-Level Image Translation by Deep Attention Generative Adversarial Networks. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5657–5666.
- Mejjati, Y.A.; Richardt, C.; Cosker, D.; Tompkin, J.; Kim, K.I. Unsupervised Attention-guided Image-to-Image Translation. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; pp. 3693–3703.
- Tang, H.; Xu, D.; Sebe, N.; Wang, Y.; Corso, J.J.; Yan, Y. Multi-channel attention selection gan with cascaded semantic guidance for cross-view image translation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 2412–2421.
- Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6848–6856.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1097–1105.
- Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520.
- Mehta, S.; Rastegari, M.; Shapiro, L.; Hajishirzi, H. ESPNetv2: A light-weight, power efficient, and general purpose convolutional neural network. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 9182–9192.
- Haase, D.; Amthor, M. Rethinking depthwise separable convolutions: How intra-kernel correlations lead to improved mobilenets. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Virtual, 14–19 June 2020; pp. 14588–14597.
- Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. GhostNet: More features from cheap operations. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Virtual, 14–19 June 2020; pp. 1577–1586.
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer Assisted Intervention, Munich, Germany, 5–9 October 2015; pp. 234–241.
- Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861.
- Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6627–6638.
- Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 586–595.
- Jia, R.; Qiu, Z.; Cui, J.; Wang, Y. Deep multi-scale encoder-decoder convolutional network for blind deblurring. J. Comput. Appl. 2019, 9081, 2552–2557.
- Ramamonjisoa, M.; Du, Y.; Lepetit, V. Predicting sharp and accurate occlusion boundaries in monocular depth estimation using displacement fields. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Virtual, 14–19 June 2020; pp. 14636–14645.
- Lin, G.; Milan, A.; Shen, C.; Reid, I. RefineNet: Multi-path refinement networks for high-resolution semantic segmentation. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, 21–26 July 2017; pp. 5168–5177.
- Regmi, K.; Borji, A. Cross-View Image Synthesis Using Conditional GANs. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3501–3510.
- Zhu, P.; Abdal, R.; Qin, Y.; Wonka, P. SEAN: Image synthesis with semantic region-adaptive normalization. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Virtual, 14–19 June 2020; pp. 5103–5112.
- Tang, H.; Xu, D.; Yan, Y.; Torr, P.H.S.; Sebe, N. Local Class-Specific and Global Image-Level Generative Adversarial Networks for Semantic-Guided Scene Generation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Virtual, 14–19 June 2020; Volume 1, pp. 7867–7876.
| Methods | Abs Rel ↓ | Avg log10 ↓ | RMS ↓ | δ < 1.25 ↑ | PSNR ↑ | SSIM ↑ |
|---|---|---|---|---|---|---|
| Pix2pix [9] | 0.248 | 0.107 | 0.906 | 0.571 | 22.431 | 0.985 |
| X-Fork [38] | 0.314 | 0.130 | 1.074 | 0.480 | 20.692 | 0.984 |
| Selection-GAN [24] | 0.284 | 0.112 | 0.958 | 0.554 | 21.976 | 0.982 |
| SEAN [39] | 0.293 | 0.114 | 0.966 | 0.564 | 21.804 | 0.983 |
| LG-GAN [40] | 0.262 | 0.102 | 0.886 | 0.616 | 22.601 | 0.989 |
| Ours | 0.247 | 0.099 | 0.850 | 0.623 | 22.908 | 0.990 |

(↓: lower is better; ↑: higher is better. The same convention is used in the tables that follow.)
| Network Structure | Abs Rel ↓ | Avg log10 ↓ | RMS ↓ | δ < 1.25 ↑ | PSNR ↑ | SSIM ↑ |
|---|---|---|---|---|---|---|
| One-level | 0.254 | 0.100 | 0.859 | 0.617 | 22.838 | 0.988 |
| Two-level | 0.254 | 0.099 | 0.853 | 0.619 | 22.872 | 0.990 |
| Three-level | 0.247 | 0.099 | 0.850 | 0.623 | 22.908 | 0.990 |
| Setup | Abs Rel ↓ | Avg log10 ↓ | RMS ↓ | δ < 1.25 ↑ | PSNR ↑ | SSIM ↑ |
|---|---|---|---|---|---|---|
| −Gs, Gg | 0.257 | 0.103 | 0.876 | 0.609 | 22.674 | 0.989 |
| −Gg | 0.255 | 0.102 | 0.870 | 0.611 | 22.678 | 0.989 |
| −Gs | 0.249 | 0.101 | 0.855 | 0.615 | 22.811 | 0.990 |
| Ours | 0.247 | 0.099 | 0.850 | 0.623 | 22.908 | 0.990 |
| Methods | Abs Rel ↓ | Avg log10 ↓ | RMS ↓ | δ < 1.25 ↑ | PSNR ↑ | SSIM ↑ |
|---|---|---|---|---|---|---|
| −Ie | 0.248 | 0.099 | 0.850 | 0.616 | 22.922 | 0.989 |
| Ours | 0.247 | 0.099 | 0.850 | 0.623 | 22.908 | 0.990 |
| Methods | Abs Rel ↓ | Avg log10 ↓ | RMS ↓ | δ < 1.25 ↑ | PSNR ↑ | SSIM ↑ | Params |
|---|---|---|---|---|---|---|---|
| BSConv | 0.256 | 0.105 | 0.898 | 0.601 | 22.442 | 0.989 | 5.081M |
| DSConv | 0.259 | 0.108 | 0.916 | 0.593 | 22.257 | 0.989 | 5.126M |
| GhostModule | 0.260 | 0.105 | 0.891 | 0.604 | 22.565 | 0.987 | 15.263M |
| GConv | 0.247 | 0.099 | 0.850 | 0.623 | 22.908 | 0.990 | 15.235M |