RSCAN: Residual Spatial Cross-Attention Network for High-Fidelity Architectural Image Editing by Fusing Multi-Latent Spaces
Abstract
1. Introduction
- We propose a multi-level spatial feature extractor module that maps the image to the F space of the synthesis network, enabling more accurate reconstruction of architectural images rich in line details.
- We fuse multiple latent spaces through a residual cross-attention module: the high-dimensional feature space F, which excels at reconstruction, and the low-dimensional space W, which excels at editing. By learning the mapping from the W space to the F space, manipulations made in the W space retain their original editing effects while inducing the correct changes in the F-space features (see the sketch after this list).
- We design a self-supervised training method that maps images to the F space more rapidly and learns the correct F-space change for a given W-space variation. On the LSUN Church dataset, our method outperforms existing methods in both qualitative and quantitative evaluations.
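The following PyTorch sketch illustrates the two ideas above: a small multi-level extractor that maps an image to a spatial F-space feature map, and a residual cross-attention block that converts a W-space change into an F-space residual. It is a minimal sketch, not the RSCAN implementation: the layer sizes, the module names (`SpatialFeatureExtractor`, `ResidualCrossAttention`), and the choice of F tokens as queries with the W-space change as keys/values are illustrative assumptions.

```python
# Minimal sketch of the contributions above, under assumed shapes and names.
import torch
import torch.nn as nn

class SpatialFeatureExtractor(nn.Module):
    """Maps an image to a spatial feature map aligned with an F-space layer (assumed design)."""
    def __init__(self, f_channels=512, f_size=16):
        super().__init__()
        chans = [3, 64, 128, 256, f_channels]
        blocks = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            blocks += [nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                       nn.BatchNorm2d(c_out), nn.LeakyReLU(0.2)]
        self.backbone = nn.Sequential(*blocks)
        self.pool = nn.AdaptiveAvgPool2d(f_size)  # match the assumed F-layer resolution

    def forward(self, image):                     # (B, 3, H, W)
        return self.pool(self.backbone(image))    # (B, f_channels, f_size, f_size)

class ResidualCrossAttention(nn.Module):
    """Predicts an F-space residual from the change applied in W space (assumed query/key roles)."""
    def __init__(self, f_channels=512, w_dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(f_channels, heads, batch_first=True)
        self.to_kv = nn.Linear(w_dim, f_channels)
        self.norm = nn.LayerNorm(f_channels)

    def forward(self, f_feat, delta_w):            # f_feat: (B, C, H, W); delta_w: (B, n_latents, w_dim)
        b, c, h, w = f_feat.shape
        queries = f_feat.flatten(2).transpose(1, 2)   # (B, H*W, C): F tokens as queries
        kv = self.to_kv(delta_w)                      # W-space change as keys/values
        residual, _ = self.attn(self.norm(queries), kv, kv)
        # Residual connection: edited F = original F + attention-predicted change.
        return f_feat + residual.transpose(1, 2).reshape(b, c, h, w)

# Example wiring (shapes only); delta_w would come from an editing direction in W.
extractor = SpatialFeatureExtractor()
fuse = ResidualCrossAttention()
img = torch.randn(1, 3, 256, 256)
delta_w = torch.randn(1, 14, 512)
f_edited = fuse(extractor(img), delta_w)           # (1, 512, 16, 16)
```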
2. Related Work
2.1. Architectural Image Editing
2.2. Latent Space for GAN Inversion
Method | Publication | Type | Latent Space | Details | Weaknesses
---|---|---|---|---|---
Image2StyleGAN [23] | ICCV’19 | O | W+ | Embeds images into the W+ space via per-image optimization. | High time complexity; poor reconstruction quality.
mGANPrior [22] | CVPR’20 | O | Z | Inverts images to the Z space and proposes adaptive channel adjustment to improve reconstruction. | Artifacts appear during editing.
PSP [24] | CVPR’21 | L | W+ | Uses an encoder to extract features and map them to the W+ space. | Poor reconstruction quality; the W+ codes lie far from the W space, so many editing effects are lost.
E4E [16] | TOG’21 | L | W+ | Inverts to the W+ space and uses adversarial training to keep the W+ vectors close to the W space. | Low reconstruction quality.
StyleSpace [25] | CVPR’21 | O | S | Explores the S space and proposes a method for detecting disentangled control channels. | The S space still struggles to improve reconstruction quality and weakens editing effects.
BDInvert [26] | ICCV’21 | O | W+, F | Proposes a GAN inversion method for the F/W+ space. | Long computation time; does not support large-scale edits such as structure and pose changes.
PTI [27] | TOG’22 | H | W | Inverts to the W space and fine-tunes the generator around the pivot code for reconstruction. | Long computation time; requires re-tuning for each input, and the tuning degrades generation quality.
HyperStyle [18] | CVPR’22 | H | W | Uses a hypernetwork to predict modulations of the generator weights. | Reconstruction quality improves greatly, but many editing effects are lost.
HyperInverter [19] | CVPR’22 | L | W | Inverts to the W space and uses a hypernetwork to predict residual weights that restore lost image details. | Reconstruction quality improves, but the predicted weights are hard to associate with the W space.
HFGI [8] | CVPR’22 | L | W+, F | Retains and edits image-specific details with a distortion consultation branch. | Features retain too much spatial dependency, causing severe artifacts.
StyleRes [28] | CVPR’23 | L | W, F | Learns residual features and uses a cycle-consistency loss to learn how features transform under editing. | Relies on many encoders, which makes training difficult and causes artifacts.
CLCAE [29] | CVPR’23 | L | W, W+, F | Aligns images with the W space via contrastive learning and transforms W vectors to the W+ and F spaces with cross-attention. | Reconstruction is incomplete; the W+ and F spaces are reconstructed under W-space guidance, which weakens editing effects.
Katsumata et al. [12] | WACV’24 | L | Z, F | Extends the Z space to Z+ and integrates it into advanced inversion algorithms such as F/W+. | The Z+ space is a lower-dimensional counterpart of the W+ space, sacrificing reconstruction and editing quality.
GradStyle [13] | arXiv’24 | L | W+, F | Computes residual features and aligns these details with selective attention mechanisms. | The original features must track changes in the edited features, which is harder to learn across two high-dimensional spaces.
Ours | 2024 | L | W, F | Extracts image features into the F space and uses cross-attention to learn how W-space variations change the F-space features. | Cross-attention and modulated convolution use different computations, so the W-space transfer is incomplete.
2.3. Residual Network and Cross-Attention Mechanism
3. Method
3.1. Overview
3.2. Spatial Feature Extractor Module
3.3. Residual Cross-Attention Module
3.4. Training Details
3.5. Loss Function
4. Experiment
4.1. Datasets and Evaluation Metrics
- Pixel-level L2 distance: measures image differences by computing pixel-wise discrepancies.
- PSNR: builds on the L2 distance and evaluates image quality as the ratio of peak signal power to noise power.
- SSIM: assesses image quality holistically in terms of luminance, contrast, and structure, taking the structural information of the image into account.
- LPIPS: uses a pre-trained neural network to approximate the human visual system’s perception and captures fine-grained differences between images.
- FID: assesses the overall quality and style consistency of images at a higher level by comparing the distance between generated and real images in a feature space (a computation sketch for these metrics follows this list).
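As a concrete illustration, the sketch below computes the per-image metrics with NumPy and scikit-image. The library function names are real, but treating images as arrays in [0, 1] and the helper names (`l2_distance`, `evaluate_pair`) are assumptions made for this example. LPIPS and FID additionally need pretrained networks, so they are only indicated in comments.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def l2_distance(x, y):
    # Mean squared pixel-level discrepancy between two images in [0, 1].
    return float(np.mean((x - y) ** 2))

def evaluate_pair(real, fake):
    return {
        "L2": l2_distance(real, fake),
        "PSNR": peak_signal_noise_ratio(real, fake, data_range=1.0),
        # channel_axis=-1 marks the last axis as the color channels of an (H, W, 3) image.
        "SSIM": structural_similarity(real, fake, data_range=1.0, channel_axis=-1),
        # LPIPS: distance between deep features of a pretrained network (e.g. lpips.LPIPS(net="alex")).
        # FID: Frechet distance between Inception feature statistics of the real and generated image sets.
    }

# Toy example: a real image and a slightly perturbed "reconstruction".
real = np.random.rand(256, 256, 3)
fake = np.clip(real + 0.05 * np.random.randn(256, 256, 3), 0.0, 1.0)
print(evaluate_pair(real, fake))
```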
4.2. Experiment Setting
4.3. Comparisons with Other Methods in Terms of Reconstruction Quality
4.3.1. Qualitative Evaluation
4.3.2. Quantitative Evaluation
4.4. Comparisons with Other Methods in Terms of Editing Effects
4.4.1. Qualitative Evaluation
4.4.2. Quantitative Evaluation
4.5. Ablation Study
4.5.1. Impact of Mapping Space and Loss Function on Reconstruction Quality
4.5.2. Impact of Residual Cross-Attention on Editing Effects
4.6. Image Blending
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Jiang, S.; Yan, Y.; Lin, Y.; Yang, X.; Huang, K. Sketch to building: Architecture image translation based on GAN. J. Phys. Conf. Ser. 2022, 2278, 012036.
- Nauata, N.; Hosseini, S.; Chang, K.H.; Chu, H.; Cheng, C.Y.; Furukawa, Y. House-gan++: Generative adversarial layout refinement network towards intelligent computational agent for professional architects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13632–13641.
- Brock, A.; Donahue, J.; Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. arXiv 2018, arXiv:1809.11096.
- Luan, F.; Paris, S.; Shechtman, E.; Bala, K. Deep photo style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4990–4998.
- Sangkloy, P.; Lu, J.; Fang, C.; Yu, F.; Hays, J. Scribbler: Controlling deep image synthesis with sketch and color. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5400–5409.
- Karras, T.; Laine, S.; Aittala, M.; Hellsten, J.; Lehtinen, J.; Aila, T. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8110–8119.
- Xia, W.; Zhang, Y.; Yang, Y.; Xue, J.H.; Zhou, B.; Yang, M.H. Gan inversion: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 3121–3138.
- Wang, T.; Zhang, Y.; Fan, Y.; Wang, J.; Chen, Q. High-fidelity gan inversion for image attribute editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11379–11388.
- Shannon, C.E. Coding theorems for a discrete source with a fidelity criterion. IRE Nat. Conv. Rec. 1959, 4, 1.
- Tishby, N.; Zaslavsky, N. Deep learning and the information bottleneck principle. In Proceedings of the 2015 IEEE Information Theory Workshop (ITW), Seattle, WA, USA, 3 November 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 1–5.
- Song, Q.; Li, G.; Wu, S.; Shen, W.; Wong, H.S. Discriminator feature-based progressive GAN inversion. Knowl.-Based Syst. 2023, 261, 110186.
- Katsumata, K.; Vo, D.M.; Liu, B.; Nakayama, H. Revisiting Latent Space of GAN Inversion for Robust Real Image Editing. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 5313–5322.
- Li, H.; Huang, M.; Zhang, L.; Hu, B.; Liu, Y.; Mao, Z. Gradual Residuals Alignment: A Dual-Stream Framework for GAN Inversion and Image Attribute Editing. arXiv 2024, arXiv:2402.14398.
- Gatys, L.A.; Ecker, A.S.; Bethge, M. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2414–2423.
- Chen, Y.; Vu, T.A.; Shum, K.C.; Yeung, S.K.; Hua, B.S. Time-of-Day Neural Style Transfer for Architectural Photographs. In Proceedings of the 2022 IEEE International Conference on Computational Photography (ICCP), Pasadena, CA, USA, 1–5 August 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–12.
- Tov, O.; Alaluf, Y.; Nitzan, Y.; Patashnik, O.; Cohen-Or, D. Designing an encoder for stylegan image manipulation. ACM Trans. Graph. 2021, 40, 1–14.
- Su, W.; Ye, H.; Chen, S.Y.; Gao, L.; Fu, H. Drawinginstyles: Portrait image generation and editing with spatially conditioned stylegan. IEEE Trans. Vis. Comput. Graph. 2022, 29, 4074–4088.
- Alaluf, Y.; Tov, O.; Mokady, R.; Gal, R.; Bermano, A. Hyperstyle: Stylegan inversion with hypernetworks for real image editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 18511–18521.
- Dinh, T.M.; Tran, A.T.; Nguyen, R.; Hua, B.S. Hyperinverter: Improving stylegan inversion via hypernetwork. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11389–11398.
- Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695.
- Kawar, B.; Zada, S.; Lang, O.; Tov, O.; Chang, H.; Dekel, T.; Mosseri, I.; Irani, M. Imagic: Text-based real image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 6007–6017.
- Gu, J.; Shen, Y.; Zhou, B. Image processing using multi-code gan prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 3012–3021.
- Abdal, R.; Qin, Y.; Wonka, P. Image2stylegan: How to embed images into the stylegan latent space? In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4432–4441.
- Richardson, E.; Alaluf, Y.; Patashnik, O.; Nitzan, Y.; Azar, Y.; Shapiro, S.; Cohen-Or, D. Encoding in style: A stylegan encoder for image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 2287–2296.
- Wu, Z.; Lischinski, D.; Shechtman, E. Stylespace analysis: Disentangled controls for stylegan image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12863–12872.
- Kang, K.; Kim, S.; Cho, S. Gan inversion for out-of-range images with geometric transformations. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 13941–13949.
- Roich, D.; Mokady, R.; Bermano, A.H.; Cohen-Or, D. Pivotal tuning for latent-based editing of real images. ACM Trans. Graph. 2022, 42, 1–13.
- Pehlivan, H.; Dalva, Y.; Dundar, A. Styleres: Transforming the residuals for real image editing with stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 1828–1837.
- Liu, H.; Song, Y.; Chen, Q. Delving stylegan inversion for image editing: A foundation latent space viewpoint. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 10072–10082.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010.
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229.
- Chen, C.F.R.; Fan, Q.; Panda, R. Crossvit: Cross-attention multi-scale vision transformer for image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 357–366.
- Shen, Y.; Gu, J.; Tang, X.; Zhou, B. Interpreting the latent space of gans for semantic face editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9243–9252.
- Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 586–595.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105.
- Mescheder, L.; Geiger, A.; Nowozin, S. Which training methods for gans do actually converge? In Proceedings of the International Conference on Machine Learning (PMLR), Stockholm, Sweden, 10–15 July 2018; pp. 3481–3490.
- Mechrez, R.; Shechtman, E.; Zelnik-Manor, L. Photorealistic style transfer with screened poisson equation. arXiv 2017, arXiv:1709.09828.
- Yu, F.; Seff, A.; Zhang, Y.; Song, S.; Funkhouser, T.; Xiao, J. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv 2015, arXiv:1506.03365.
- Xu, Z.; Tao, D.; Zhang, Y.; Wu, J.; Tsoi, A.C. Architectural style classification using multinomial latent logistic regression. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part I; Springer: Berlin/Heidelberg, Germany, 2014; pp. 600–615.
- Almohammad, A.; Ghinea, G. Stego image quality and the reliability of PSNR. In Proceedings of the 2nd International Conference on Image Processing Theory, Tools and Applications, Paris, France, 7–10 July 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 215–220.
- Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
- Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Adv. Neural Inf. Process. Syst. 2017, 30, 6629–6640.
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
- Zhou, B.; Lapedriza, A.; Khosla, A.; Oliva, A.; Torralba, A. Places: A 10 million image database for scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 1452–1464.
Method | PSNR↑ | SSIM↑ | FID↓ | L2↓ | LPIPS↓ |
---|---|---|---|---|---|
PSP [24] | 17.6227 | 0.4464 | 40.9037 | 0.1621 | 0.2279 |
E4E [16] | 15.9481 | 0.4175 | 41.9608 | 0.1991 | 0.3163 |
HyperInverter [19] | 17.0909 | 0.4511 | 35.2321 | 0.1671 | 0.1773 |
HyperStyle [18] | 19.4163 | 0.4999 | 39.6117 | 0.1284 | 0.1303 |
CLCAE [29] | 19.4931 | 0.5628 | 51.3415 | 0.1353 | 0.1493 |
Ours | 21.6889 | 0.7288 | 29.1387 | 0.1171 | 0.1154 |
Method | Gothic | Greek | Byzantine | Baroque | Style Total | Trees | Clouds | Glass | Element Total
---|---|---|---|---|---|---|---|---|---
PSP [24] | 2.2884 | 1.2974 | 1.2148 | 0.7145 | 5.5151 | 0.0131 | 0.0125 | 0.0151 | 0.0407 |
E4E [16] | 3.3226 | 1.6265 | 2.5017 | 1.7826 | 9.2334 | 0.0912 | 0.0586 | 0.1045 | 0.2543 |
HyperInverter [19] | 1.3079 | 0.0679 | 0.8607 | 0.3201 | 2.5566 | 0.0569 | 0.0251 | 0.0291 | 0.1111 |
HyperStyle [18] | 1.3246 | 0.4062 | 0.6361 | 0.3893 | 2.7562 | 0.0622 | 0.0165 | 0.0258 | 0.1045 |
CLCAE [29] | 1.4427 | 0.4762 | 1.4154 | 0.6469 | 3.9812 | 0.0191 | 0.0389 | 0.0548 | 0.1128 |
Ours | 4.1377 | 2.0002 | 3.3117 | 2.0877 | 11.5373 | 0.2003 | 0.0784 | 0.1551 | 0.4338 |
Configuration | PSNR↑ | SSIM↑ | FID↓
---|---|---|---
W+ Space | 15.9481 | 0.4175 | 41.9608 |
w/o RL | 17.2613 | 0.4813 | 38.6894 |
w/o GL | 18.5846 | 0.5546 | 32.7428 |
Ours | 21.6889 | 0.7288 | 29.1387 |
Editing Type | W Space | Attention | Ours
---|---|---|---
Style | 3.8752 | 10.8075 | 11.5373 |
Element | 0.1204 | 0.3927 | 0.4338 |