Progressively Unsupervised Generative Attentional Networks with Adaptive Layer-Instance Normalization for Image-to-Image Translation
Abstract
1. Introduction
- We propose a framework that improves image-to-image translation through a progressive block-training approach. This technique lets the network acquire distinct features during different training phases, which yields several notable advantages: reduced VRAM usage, training speed that matches or surpasses competing methods on the same hardware, and successful image translation at higher resolutions (a schematic code sketch follows this list).
- Furthermore, we propose a new research direction centered on the exploration and refinement of progressive image-to-image translation techniques, with the aim of improving both the quality of the results and the overall efficiency of image-to-image translation models.
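To make the progressive block-training idea concrete, the sketch below grows a generator one block per training phase and freezes previously trained blocks, so each phase only stores gradients and optimizer state for the newest block. This is a minimal illustration under assumed design choices: `make_block`, the block width, and the three-phase schedule are hypothetical stand-ins, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

def make_block(in_ch: int, out_ch: int) -> nn.Module:
    # Hypothetical conv block; the paper's actual block design may differ.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.InstanceNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class ProgressiveGenerator(nn.Module):
    # Grows by one block per phase; earlier blocks are frozen, so each
    # phase optimizes only the newest block, bounding VRAM usage.
    def __init__(self, width: int = 64):
        super().__init__()
        self.width = width
        self.blocks = nn.ModuleList()
        self.to_rgb = nn.Conv2d(width, 3, kernel_size=1)

    def add_block(self):
        in_ch = 3 if len(self.blocks) == 0 else self.width
        for p in self.parameters():          # freeze everything trained so far
            p.requires_grad_(False)
        self.blocks.append(make_block(in_ch, self.width))
        self.to_rgb = nn.Conv2d(self.width, 3, kernel_size=1)  # fresh output head

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return torch.tanh(self.to_rgb(x))

G = ProgressiveGenerator()
for phase in range(3):  # e.g., three phases at increasing resolutions
    G.add_block()
    opt = torch.optim.Adam((p for p in G.parameters() if p.requires_grad), lr=1e-4)
    # ... adversarial training for this phase's block goes here ...
```

Because the optimizer is rebuilt over the trainable parameters only, activations and optimizer buffers for frozen blocks never accumulate gradient state, which is the mechanism behind the VRAM savings claimed above.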
2. Related Work
2.1. Generative Adversarial Networks (GANs)
2.2. Image-to-Image Translation
3. Methodology
3.1. Generator
3.2. Discriminator
3.3. Loss Function
4. Experiments
4.1. Baseline
4.2. Dataset
- AnimeFace2CoserFace. We collected around 30,000 images of cosplayers dressed as anime characters from Flickr and around 50,000 images of anime girls from Pixiv, a Japanese online art community. We used a pre-trained MXNet model from the light anime face detector repository to crop the faces and resized the crops to 1024 × 1024.
- Anime2Coser. We gathered around 20,000 images of cosplayers from Flickr and 25,000 bust shots of anime characters from Pixiv. We used the same pre-trained MXNet detector, expanding the detection area by 30% to crop bust shots, and resized them to 1024 × 1024.
- AttackOnHuman. We collected around 30,000 face images from Attack on Titan, a Japanese dark-fantasy anime television series, cropping the faces from video frames with the light anime face detector. For the human domain, we used around 15,000 cropped images from the Flickr-Faces-HQ (FFHQ) dataset. Finally, we resized all images to 512 × 512. The generated results are shown in Figure 2. (A preprocessing sketch follows this list.)
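The crop-and-resize preprocessing described above can be approximated with OpenCV; the sketch below assumes a `detect_faces` callable returning `(x, y, w, h)` boxes as a stand-in for the light anime face detector (whose MXNet API is not reproduced here). The 30% box expansion mirrors the Anime2Coser bust-crop description.

```python
import cv2

def expand_box(x, y, w, h, scale, img_w, img_h):
    # Grow a detection box by `scale` (e.g., 0.3 for a 30% larger area),
    # clamped to the image borders.
    dx, dy = int(w * scale / 2), int(h * scale / 2)
    x0, y0 = max(0, x - dx), max(0, y - dy)
    x1, y1 = min(img_w, x + w + dx), min(img_h, y + h + dy)
    return x0, y0, x1, y1

def crop_and_resize(path, detect_faces, size=1024, expand=0.0):
    # Crop each detected face (optionally expanded for bust shots) and
    # resize to `size` x `size`. `detect_faces` is a hypothetical detector.
    img = cv2.imread(path)
    if img is None:
        return []
    img_h, img_w = img.shape[:2]
    crops = []
    for (x, y, w, h) in detect_faces(img):
        x0, y0, x1, y1 = expand_box(x, y, w, h, expand, img_w, img_h)
        crop = cv2.resize(img[y0:y1, x0:x1], (size, size),
                          interpolation=cv2.INTER_AREA)
        crops.append(crop)
    return crops
```

For face datasets one would call `crop_and_resize(path, detect_faces, size=1024)`, and for bust datasets `expand=0.3`.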
4.3. Training
4.4. Qualitative Evaluation
5. Ablation Study
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Kim, J.; Kim, M.; Kang, H.; Lee, K. U-GAT-IT: Unsupervised generative attentional networks with adaptive layer-instance normalization for image-to-image translation. arXiv 2019, arXiv:1907.10830.
- Mo, S.; Cho, M.; Shin, J. InstaGAN: Instance-aware image-to-image translation. arXiv 2018, arXiv:1812.10889.
- Pathak, D.; Krahenbuhl, P.; Donahue, J.; Darrell, T.; Efros, A.A. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2536–2544.
- Iizuka, S.; Simo-Serra, E.; Ishikawa, H. Globally and locally consistent image completion. ACM Trans. Graph. 2017, 36, 1–14.
- Zhang, R.; Isola, P.; Efros, A.A. Colorful image colorization. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 649–666.
- Zhang, R.; Zhu, J.Y.; Isola, P.; Geng, X.; Lin, A.S.; Yu, T.; Efros, A.A. Real-time user-guided image colorization with learned deep priors. arXiv 2017, arXiv:1705.02999.
- Dong, C.; Loy, C.C.; He, K.; Tang, X. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 295–307.
- Kim, J.; Lee, J.K.; Lee, K.M. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654.
- Gatys, L.A.; Ecker, A.S.; Bethge, M. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2414–2423.
- Huang, X.; Belongie, S. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1501–1510.
- Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232.
- Li, J. Twin-GAN: Unpaired cross-domain image translation with weight-sharing GANs. arXiv 2018, arXiv:1809.00946.
- Isola, P.; Zhu, J.Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1125–1134.
- Li, C.; Liu, H.; Chen, C.; Pu, Y.; Chen, L.; Henao, R.; Carin, L. ALICE: Towards understanding adversarial learning for joint distribution matching. Adv. Neural Inf. Process. Syst. 2017, 30, 5501–5509.
- Wang, T.C.; Liu, M.Y.; Zhu, J.Y.; Tao, A.; Kautz, J.; Catanzaro, B. High-resolution image synthesis and semantic manipulation with conditional GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8798–8807.
- Larsson, G.; Maire, M.; Shakhnarovich, G. Learning representations for automatic colorization. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 577–593.
- Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
- Anoosheh, A.; Agustsson, E.; Timofte, R.; Van Gool, L. ComboGAN: Unrestrained scalability for image domain translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–23 June 2018; pp. 783–790.
- Choi, Y.; Choi, M.; Kim, M.; Ha, J.W.; Kim, S.; Choo, J. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8789–8797.
- Huang, X.; Liu, M.Y.; Belongie, S.; Kautz, J. Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 172–189.
- Kim, T.; Cha, M.; Kim, H.; Lee, J.K.; Kim, J. Learning to discover cross-domain relations with generative adversarial networks. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, NSW, Australia, 6–11 August 2017; pp. 1857–1865.
- Liu, M.Y.; Breuel, T.; Kautz, J. Unsupervised image-to-image translation networks. Adv. Neural Inf. Process. Syst. 2017, 30, 700–708.
- Royer, A.; Bousmalis, K.; Gouws, S.; Bertsch, F.; Mosseri, I.; Cole, F.; Murphy, K. XGAN: Unsupervised image-to-image translation for many-to-many mappings. In Domain Adaptation for Visual Understanding; Springer: Berlin/Heidelberg, Germany, 2020; pp. 33–49.
- Taigman, Y.; Polyak, A.; Wolf, L. Unsupervised cross-domain image generation. arXiv 2016, arXiv:1611.02200.
- Yi, Z.; Zhang, H.; Tan, P.; Gong, M. DualGAN: Unsupervised dual learning for image-to-image translation. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2849–2857.
- Song, G.; Luo, L.; Liu, J.; Ma, W.C.; Lai, C.; Zheng, C.; Cham, T.J. AgileGAN: Stylizing portraits by inversion-consistent transfer learning. ACM Trans. Graph. 2021, 40, 1–13.
- Gokaslan, A.; Ramanujan, V.; Ritchie, D.; Kim, K.I.; Tompkin, J. Improving shape deformation in unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 649–665.
- Lee, H.Y.; Tseng, H.Y.; Huang, J.B.; Singh, M.; Yang, M.H. Diverse image-to-image translation via disentangled representations. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 35–51.
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144.
- Chen, Y.; Lai, Y.K.; Liu, Y.J. CartoonGAN: Generative adversarial networks for photo cartoonization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 9465–9474.
- Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4681–4690.
- Tang, H.; Sebe, N. Total generate: Cycle in cycle generative adversarial networks for generating human faces, hands, bodies, and natural scenes. IEEE Trans. Multimed. 2021, 24, 2963–2974.
- Liu, G.; Tang, H.; Latapie, H.M.; Corso, J.J.; Yan, Y. Cross-view exocentric to egocentric video synthesis. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 20–24 October 2021; pp. 974–982.
- Karras, T.; Aila, T.; Laine, S.; Lehtinen, J. Progressive growing of GANs for improved quality, stability, and variation. arXiv 2017, arXiv:1710.10196.
- Mirza, M.; Osindero, S. Conditional generative adversarial nets. arXiv 2014, arXiv:1411.1784.
- Tang, H.; Liu, H.; Sebe, N. Unified generative adversarial networks for controllable image-to-image translation. IEEE Trans. Image Process. 2020, 29, 8916–8929.
- Perarnau, G.; Van De Weijer, J.; Raducanu, B.; Álvarez, J.M. Invertible conditional GANs for image editing. arXiv 2016, arXiv:1611.06355.
- Tang, H.; Xu, D.; Liu, G.; Wang, W.; Sebe, N.; Yan, Y. Cycle in cycle generative adversarial networks for keypoint-guided image generation. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 2052–2060.
- Tang, H.; Wang, W.; Xu, D.; Yan, Y.; Sebe, N. GestureGAN for hand gesture-to-gesture translation in the wild. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018; pp. 774–782.
- Tang, H.; Bai, S.; Zhang, L.; Torr, P.H.; Sebe, N. XingGAN for person image generation. In Proceedings of the European Conference on Computer Vision, Virtual, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 717–734.
- Tang, H.; Xu, D.; Sebe, N.; Wang, Y.; Corso, J.J.; Yan, Y. Multi-channel attention selection GAN with cascaded semantic guidance for cross-view image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2417–2426.
- Tang, H.; Xu, D.; Yan, Y.; Torr, P.H.; Sebe, N. Local class-specific and global image-level generative adversarial networks for semantic-guided scene generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 7870–7879.
- Benaim, S.; Wolf, L. One-sided unsupervised domain mapping. Adv. Neural Inf. Process. Syst. 2017, 30, 752–762.
- Tang, H.; Xu, D.; Wang, W.; Yan, Y.; Sebe, N. Dual generator generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the Asian Conference on Computer Vision, Perth, WA, Australia, 2–6 December 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 3–21.
- Wang, Y.; van de Weijer, J.; Herranz, L. Mix and match networks: Encoder-decoder alignment for zero-pair image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5467–5476.
- Xu, D.; Wang, W.; Tang, H.; Liu, H.; Sebe, N.; Ricci, E. Structured attention guided convolutional neural fields for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3917–3925.
- Liang, X.; Zhang, H.; Xing, E.P. Generative semantic manipulation with contrasting GAN. arXiv 2017, arXiv:1708.00315.
- Kastaniotis, D.; Ntinou, I.; Tsourounis, D.; Economou, G.; Fotopoulos, S. Attention-aware generative adversarial networks (ATA-GANs). In Proceedings of the 2018 IEEE 13th Image, Video, and Multidimensional Signal Processing Workshop (IVMSP), Zagori, Greece, 10–12 June 2018; IEEE: New York, NY, USA, 2018; pp. 1–5.
- Yang, C.; Kim, T.; Wang, R.; Peng, H.; Kuo, C.C.J. Show, attend, and translate: Unsupervised image translation with self-regularization and attention. IEEE Trans. Image Process. 2019, 28, 4845–4856.
- Alami Mejjati, Y.; Richardt, C.; Tompkin, J.; Cosker, D.; Kim, K.I. Unsupervised attention-guided image-to-image translation. Adv. Neural Inf. Process. Syst. 2018, 31, 3697–3707.
- Mao, X.; Li, Q.; Xie, H.; Lau, R.Y.; Wang, Z.; Paul Smolley, S. Least squares generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2794–2802.
- Andersson, F.; Arvidsson, S. Generative adversarial networks for photo to Hayao Miyazaki style cartoons. arXiv 2020, arXiv:2005.07702.
- Jung, C.; Kwon, G.; Ye, J.C. Exploring patch-wise semantic relation for contrastive learning in image-to-image translation tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 18260–18269.
- Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695.
- Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Adv. Neural Inf. Process. Syst. 2017, 30, 6629–6640.
| Image Size | CycleGAN | UNIT | U-GAT-IT | Our Model |
|---|---|---|---|---|
| 256 × 256 | 97.789 | 105.372 | 91.148 | 86.930 |
| 512 × 512 | 122.083 | 118.668 | 148.175 | 74.344 |
| Image Size | CycleGAN | UNIT | U-GAT-IT | Hneg-SRC | Our Model |
|---|---|---|---|---|---|
| 256 × 256 | 72.430 | 81.085 | 111.162 | 95.137 | 77.978 |
| 512 × 512 | 94.022 | 86.205 | 115.539 | 101.931 | 70.253 |
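Given that Heusel et al.'s FID paper appears in the references, the per-resolution scores in the two tables above are presumably Fréchet Inception Distance values (lower is better). As a minimal sketch of how such scores can be computed, assuming `torchmetrics` (with its image extras, which pull in torch-fidelity) is available and images are uint8 batches:

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# FID compares Inception-v3 feature statistics of real vs. generated images.
fid = FrechetInceptionDistance(feature=2048)

# Placeholder batches; in practice, feed every real and translated image.
real = torch.randint(0, 256, (16, 3, 256, 256), dtype=torch.uint8)
fake = torch.randint(0, 256, (16, 3, 256, 256), dtype=torch.uint8)

fid.update(real, real=True)
fid.update(fake, real=False)
print(float(fid.compute()))  # lower FID = closer distributions
```

Stable FID estimates need thousands of samples per side, so in practice one streams the whole test set through `update` before calling `compute`.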
| | U-GAT-IT | Our Model |
|---|---|---|
| Correctness of Human Face to Anime Face | 63.04% | 37.25% |
| Correctness of Anime Face to Human Face | 58.40% | 48.51% |
| | CycleGAN | UNIT | U-GAT-IT | Our Model |
|---|---|---|---|---|
| Generated image accuracy | 5.4% | 10.7% | 22.9% | 61% |
| Image Size | CycleGAN | UNIT | U-GAT-IT | Hneg-SRC | Our Model |
|---|---|---|---|---|---|
| 64 × 64 | 3203 MB | 3019 MB | 4225 MB | 3205 MB | 6393 MB |
| 128 × 128 | 4281 MB | 3829 MB | 4969 MB | 5555 MB | 7223 MB |
| 256 × 256 | 10,011 MB | 7627 MB | 8423 MB | 20,401 MB | 9855 MB |
| 512 × 512 | 31,225 MB | 22,689 MB | 22,169 MB | 11,991 MB | 16,007 MB |
| 1024 × 1024 | 81,920 MB | 60,081 MB | 74,633 MB | 43,119 MB | 29,000 MB |
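Peak-VRAM figures like those in the table above can be collected with PyTorch's CUDA memory statistics; a minimal sketch, assuming a `train_step` callable (hypothetical here) that performs one forward/backward pass:

```python
import torch

def measure_peak_vram_mb(train_step, n_iters: int = 10) -> float:
    # Reset the allocator's high-water mark, run a few iterations,
    # then read back the peak allocation in MB. Note this tracks the
    # caching allocator, which can differ slightly from nvidia-smi.
    torch.cuda.reset_peak_memory_stats()
    for _ in range(n_iters):
        train_step()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / (1024 ** 2)
```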