Temporal Adaptive Attention Map Guidance for Text-to-Image Diffusion Models
Abstract
1. Introduction
- We propose a temporal adaptive attention map guidance method that generates images accurately aligned with the input prompt by accounting for the characteristics of the pre-trained diffusion model at each time step.
- We divide the diffusion sampling steps into four intervals: initial, layout, shape, and refinement. Using the proposed loss function, we dynamically assign a distinct loss term to each interval, refining the attention maps so that the optimization matches the characteristics of that interval (a minimal sketch follows this list).
- We introduce an initial seed filtering method that rejects unpromising initial noise, randomly selects a new seed, and restarts generation to ensure better alignment between the generated image and the input text.
- Extensive experiments show that our method achieves significant improvements over existing approaches.
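As referenced in the second bullet, the following is a minimal sketch of the temporal adaptive idea: the sampling steps are partitioned into four intervals and a different attention-map loss is evaluated depending on which interval the current step falls into. The interval boundaries, the helper names (`INTERVALS`, `interval_of`, `temporal_adaptive_loss`), and the specific loss expressions are illustrative assumptions, not the paper's exact formulation.

```python
import torch

# Assumed interval boundaries for a 50-step schedule, counting the step index t down from 50 to 1.
# The actual boundaries and loss definitions follow the paper; everything below is illustrative.
INTERVALS = [(50, 46, "initial"), (45, 36, "layout"), (35, 16, "shape"), (15, 1, "refinement")]

def interval_of(t: int) -> str:
    """Return the name of the interval that step index t falls into."""
    for high, low, name in INTERVALS:
        if low <= t <= high:
            return name
    raise ValueError(f"step index {t} is outside the sampling schedule")

def temporal_adaptive_loss(attn_maps: torch.Tensor, subject_token_ids, t: int) -> torch.Tensor:
    """Pick a loss term according to the interval the current step belongs to.

    attn_maps:         cross-attention maps of shape (tokens, H, W), softmax-normalized.
    subject_token_ids: indices of the subject tokens in the prompt.
    """
    phase = interval_of(t)
    subject_maps = attn_maps[subject_token_ids]           # (S, H, W)
    peaks = subject_maps.flatten(1).max(dim=1).values     # strongest response of each subject

    if phase == "initial":
        return torch.zeros((), device=attn_maps.device)   # no guidance in the earliest steps
    if phase == "layout":
        # Encourage every subject token to claim some region (an Attend-and-Excite-style term).
        return (1.0 - peaks).max()
    if phase == "shape":
        # Keep exciting neglected subjects and additionally penalize overlap between subjects.
        overlap = torch.zeros((), device=attn_maps.device)
        for i in range(len(subject_token_ids)):
            for j in range(i + 1, len(subject_token_ids)):
                overlap = overlap + (subject_maps[i] * subject_maps[j]).mean()
        return (1.0 - peaks).max() + overlap
    # Refinement: only a mild correction so that fine image details are not disturbed.
    return 0.1 * (1.0 - peaks).max()
```

In practice such a loss is used to update the latent at each step through its gradient, which is what the guided-sampling sketch after Algorithm 1 illustrates.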
2. Related Works
3. Preliminary
3.1. Stable Diffusion Model
3.2. Attention Layer
4. Proposed Method
4.1. Layout Interval
4.2. Shape Interval
4.3. Seed Filtering
4.4. Refinement Interval
Algorithm 1 Full algorithm
Input: pre-trained text-to-image generation diffusion model, decoder, and input prompt
Output: generated image I
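Because only the input/output line of Algorithm 1 survives above, the sketch below outlines how the pieces could plausibly fit together: seed filtering first, then interval-guided sampling that reuses `temporal_adaptive_loss` from the earlier sketch, and finally decoding. It assumes a diffusers-style scheduler interface (`timesteps`, `step(...).prev_sample`), a UNet wrapper that also returns cross-attention maps, and illustrative values for the probe length, threshold, retry count, and step size; none of these names or numbers are taken from the paper.

```python
import torch

@torch.no_grad()
def passes_seed_filter(unet, scheduler, latent, text_emb, subject_token_ids,
                       probe_steps: int = 3, threshold: float = 0.2) -> bool:
    """Hypothetical seed filter: run a few denoising steps and reject the seed
    if any subject token never reaches a minimal cross-attention response."""
    z = latent.clone()
    attn_maps = None
    for t in scheduler.timesteps[:probe_steps]:
        noise_pred, attn_maps = unet(z, t, text_emb)      # wrapper assumed to expose attention maps
        z = scheduler.step(noise_pred, t, z).prev_sample
    peaks = attn_maps[subject_token_ids].flatten(1).max(dim=1).values
    return bool((peaks > threshold).all())

def generate(unet, scheduler, decoder, text_emb, subject_token_ids,
             max_retries: int = 5, step_size: float = 20.0):
    """Sketch of the overall pipeline: seed filtering followed by interval-guided sampling."""
    # 1. Seed filtering: resample the initial noise until a seed passes the check.
    latent = torch.randn(1, 4, 64, 64)
    for _ in range(max_retries):
        if passes_seed_filter(unet, scheduler, latent, text_emb, subject_token_ids):
            break
        latent = torch.randn(1, 4, 64, 64)                # reject and draw a new random seed

    # 2. Guided sampling: refine the latent with the interval-specific loss at each step.
    num_steps = len(scheduler.timesteps)
    for i, t in enumerate(scheduler.timesteps):
        step_index = num_steps - i                        # counts down from 50 to 1, matching INTERVALS
        latent = latent.detach().requires_grad_(True)
        noise_pred, attn_maps = unet(latent, t, text_emb)
        loss = temporal_adaptive_loss(attn_maps, subject_token_ids, step_index)
        if loss.requires_grad:
            grad, = torch.autograd.grad(loss, latent)
            latent = latent - step_size * grad            # nudge the latent toward better attention
        with torch.no_grad():
            latent = scheduler.step(noise_pred.detach(), t, latent).prev_sample

    # 3. Decode the final latent into an image (0.18215 is the Stable Diffusion latent scale).
    return decoder(latent.detach() / 0.18215)
```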
5. Experiments
5.1. Experimental Settings
5.2. Quantitative Comparison
5.3. Qualitative Comparison
5.4. Ablation Study
5.4.1. Ablation Study on Components
5.4.2. Ablation Study on Sampling Steps
6. Discussion
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical text-conditional image generation with clip latents. arXiv 2022, arXiv:2204.06125. [Google Scholar]
- Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar]
- Saharia, C.; Chan, W.; Saxena, S.; Li, L.; Whang, J.; Denton, E.L.; Ghasemipour, K.; Gontijo Lopes, R.; Karagol Ayan, B.; Salimans, T.; et al. Photorealistic text-to-image diffusion models with deep language understanding. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; pp. 36479–36494. [Google Scholar]
- Dhariwal, P.; Nichol, A. Diffusion models beat gans on image synthesis. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–14 December 2021; pp. 8780–8794. [Google Scholar]
- Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–12 December 2020; pp. 6840–6851. [Google Scholar]
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 5485–5551. [Google Scholar]
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–12 December 2020; pp. 1877–1901. [Google Scholar]
- Chefer, H.; Alaluf, Y.; Vinker, Y.; Wolf, L.; Cohen-Or, D. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. ACM Trans. Graph. (TOG) 2023, 42, 1–10. [Google Scholar] [CrossRef]
- Dahary, O.; Patashnik, O.; Aberman, K.; Cohen-Or, D. Be yourself: Bounded attention for multi-subject text-to-image generation. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 432–448. [Google Scholar]
- Guo, X.; Liu, J.; Cui, M.; Li, J.; Yang, H.; Huang, D. Initno: Boosting text-to-image diffusion models via initial noise optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 9380–9389. [Google Scholar]
- Chen, M.; Laina, I.; Vedaldi, A. Training-free layout control with cross-attention guidance. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 4–8 January 2024; pp. 5331–5341. [Google Scholar]
- Patashnik, O.; Garibi, D.; Azuri, I.; Averbuch-Elor, H.; Cohen-Or, D. Localizing object-level shape variations with text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 22994–23004. [Google Scholar]
- Tao, M.; Tang, H.; Wu, F.; Jing, X.Y.; Bao, B.K.; Xu, C. Df-gan: A simple and effective baseline for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16494–16504. [Google Scholar]
- Xu, T.; Zhang, P.; Huang, Q.; Zhang, H.; Gan, Z.; Huang, X.; He, X. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1316–1324. [Google Scholar]
- Zhang, H.; Xu, T.; Li, H.; Zhang, S.; Wang, X.; Huang, X.; Metaxas, D.N. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5908–5916. [Google Scholar]
- Zhu, M.; Pan, P.; Chen, W.; Yang, Y. Dm-gan: Dynamic memory generative adversarial networks for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5795–5803. [Google Scholar]
- Rani, P.; Kumar, D.; Sudhakar, N.; Prakash, D.; Shubham. Text-to-Image Synthesis using BERT Embeddings and Multi-Stage GAN. In Proceedings of the International Conference on Innovative Computing and Communications, Delhi, India, 17–18 February 2023; pp. 157–167. [Google Scholar]
- Deng, Z.; He, X.; Peng, Y. LFR-GAN: Local Feature Refinement based Generative Adversarial Network for Text-to-Image Generation. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 19, 1–18. [Google Scholar] [CrossRef]
- Zhang, H.; Koh, J.Y.; Baldridge, J.; Lee, H.; Yang, Y. Cross-modal contrastive learning for text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Online, 19–25 June 2021; pp. 833–842. [Google Scholar]
- Chang, H.; Zhang, H.; Barber, J.; Maschinot, A.; Lezama, J.; Jiang, L.; Yang, M.H.; Murphy, K.P.; Freeman, W.T.; Rubinstein, M.; et al. Muse: Text-To-Image Generation via Masked Generative Transformers. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 4055–4075. [Google Scholar]
- Ding, M.; Yang, Z.; Hong, W.; Zheng, W.; Zhou, C.; Yin, D.; Lin, J.; Zou, X.; Shao, Z.; Yang, H.; et al. Cogview: Mastering text-to-image generation via transformers. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–14 December 2021; pp. 19822–19835. [Google Scholar]
- Ramesh, A.; Pavlov, M.; Goh, G.; Gray, S.; Voss, C.; Radford, A.; Chen, M.; Sutskever, I. Zero-shot text-to-image generation. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; pp. 8821–8831. [Google Scholar]
- Yu, J.; Xu, Y.; Koh, J.Y.; Luong, T.; Baid, G.; Wang, Z.; Vasudevan, V.; Ku, A.; Yang, Y.; Ayan, B.K.; et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv 2022, arXiv:2206.10789. [Google Scholar]
- Gu, J.; Zhai, S.; Zhang, Y.; Susskind, J.M.; Jaitly, N. Matryoshka diffusion models. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024; pp. 1–29. [Google Scholar]
- Balaji, Y.; Nah, S.; Huang, X.; Vahdat, A.; Song, J.; Zhang, Q.; Kreis, K.; Aittala, M.; Aila, T.; Laine, S.; et al. ediff-i: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv 2022, arXiv:2211.01324. [Google Scholar]
- Segalis, E.; Valevski, D.; Lumen, D.; Matias, Y.; Leviathan, Y. A picture is worth a thousand words: Principled recaptioning improves image generation. arXiv 2023, arXiv:2310.16656. [Google Scholar]
- Xue, Z.; Song, G.; Guo, Q.; Liu, B.; Zong, Z.; Liu, Y.; Luo, P. Raphael: Text-to-image generation via large mixture of diffusion paths. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; pp. 41693–41706. [Google Scholar]
- Liu, N.; Li, S.; Du, Y.; Torralba, A.; Tenenbaum, J.B. Compositional visual generation with composable diffusion models. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 423–439. [Google Scholar]
- Feng, W.; He, X.; Fu, T.J.; Jampani, V.; Akula, A.; Narayana, P.; Basu, S.; Wang, X.E.; Wang, W.Y. Training-free structured diffusion guidance for compositional text-to-image synthesis. In Proceedings of the International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023; pp. 1–21. [Google Scholar]
- Agarwal, A.; Karanam, S.; Joseph, K.; Saxena, A.; Goswami, K.; Srinivasan, B.V. A-star: Test-time attention segregation and retention for text-to-image synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 2283–2293. [Google Scholar]
- Li, Y.; Keuper, M.; Zhang, D.; Khoreva, A. Divide & Bind Your Attention for Improved Generative Semantic Nursing. In Proceedings of the British Machine Vision Conference, Aberdeen, UK, 20–24 November 2023; pp. 1–12. [Google Scholar]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
- Szeliski, R. Computer Vision: Algorithms and Applications, 2nd ed.; Springer: New York, NY, USA, 2022. [Google Scholar]
- Liang, V.W.; Zhang, Y.; Kwon, Y.; Yeung, S.; Zou, J.Y. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; pp. 17612–17625. [Google Scholar]
- Sheynin, S.; Ashual, O.; Polyak, A.; Singer, U.; Gafni, O.; Nachmani, E.; Taigman, Y. Knn-diffusion: Image generation via large-scale retrieval. arXiv 2022, arXiv:2204.02849. [Google Scholar]
- Li, J.; Li, D.; Xiong, C.; Hoi, S. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 12888–12900. [Google Scholar]
- Ghosh, D.; Hajishirzi, H.; Schmidt, L. Geneval: An object-focused framework for evaluating text-to-image alignment. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; pp. 52132–52152. [Google Scholar]
- Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1280–1289. [Google Scholar]
- Chen, K.; Wang, J.; Pang, J.; Cao, Y.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Xu, J.; et al. MMDetection: Open mmlab detection toolbox and benchmark. arXiv 2019, arXiv:1906.07155. [Google Scholar]
- Lewis, M.; Nayak, N.V.; Yu, P.; Yu, Q.; Merullo, J.; Bach, S.H.; Pavlick, E. Does clip bind concepts? Probing compositionality in large image models. In Proceedings of the Findings of the Association for Computational Linguistics: EACL 2024, St. Julian’s, Malta, 17–22 March 2024; pp. 1487–1500. [Google Scholar]
Method | Full (↑) | Min (↑) | Text (↑)
---|---|---|---
SD [2] | 0.3164 (−7.93%) | 0.2205 (−14.88%) | 0.7675 (−8.07%) |
AnE [8] | 0.3386 (−1.46%) | 0.2537 (−2.09%) | 0.8086 (−3.15%) |
DnB [31] | 0.3336 (−2.92%) | 0.2472 (−4.59%) | 0.8065 (−3.39%) |
InitNO [10] | 0.3420 (−0.48%) | 0.2563 (−1.07%) | 0.8233 (−1.39%) |
Ours | 0.3437 | 0.2591 | 0.8349 |
Method | Full (↑) | Min (↑) | Text (↑)
---|---|---|---
SD [2] | 0.3359 (−8.58%) | 0.2376 (−13.62%) | 0.7641 (−7.63%) |
AnE [8] | 0.3634 (−1.11%) | 0.2727 (−0.82%) | 0.8153 (−1.44%) |
DnB [31] | 0.3564 (−3.01%) | 0.2649 (−3.67%) | 0.8068 (−2.47%) |
InitNO [10] | 0.3668 (−0.19%) | 0.2735 (−0.55%) | 0.8227 (−0.55%) |
Ours | 0.3675 | 0.2750 | 0.8272 |
Method | Full (↑) | Min (↑) | Text (↑)
---|---|---|---
SD [2] | 0.3451 (−4.74%) | 0.2493 (−8.22%) | 0.7924 (−5.77%) |
AnE [8] | 0.3606 (−0.47%) | 0.2707 (−0.33%) | 0.8344 (−0.78%) |
DnB [31] | 0.3525 (−2.71%) | 0.2638 (−2.88%) | 0.8312 (−1.16%) |
InitNO [10] | 0.3630 (0.20%) | 0.2719 (0.09%) | 0.8414 (0.06%) |
Ours | 0.3623 | 0.2716 | 0.8409 |
Method | Full (↑) | Min (↑) | Text (↑)
---|---|---|---
SD [2] | 0.3641 (−3.22%) | 0.2363 (−12.72%) | 0.7033 (−5.01%) |
AnE [8] | 0.3730 (−0.85%) | 0.2676 (−1.18%) | 0.7247 (−2.12%) |
DnB [31] | 0.3679 (−2.21%) | 0.2619 (−3.26%) | 0.7245 (−2.14%) |
InitNO [10] | 0.3751 (−0.31%) | 0.2683 (−0.90%) | 0.7354 (−0.66%) |
Ours | 0.3762 | 0.2708 | 0.7404 |
Metric | SD [2] | AnE [8] | DnB [31] | InitNO [10] | Ours
---|---|---|---|---|---
GenEval score (↑) | 0.110 (−72.36%) | 0.301 (−24.37%) | 0.281 (−29.40%) | 0.372 (−6.53%) | 0.398
Method | Seed Filtering | Dual Optimization | | Full (↑) | Min (↑) | Text (↑)
---|---|---|---|---|---|---
M1 | ✓ | ✓ | ✓ | 0.3434 | 0.2591 | 0.8340 |
M2 | ✓ | ✓ | 0.3421 (−0.38%) | 0.2582 (−0.35%) | 0.8316 (−0.29%) | |
M3 | ✓ | ✓ | 0.3426 (−0.23%) | 0.2583 (−0.31%) | 0.8326 (−0.17%) | |
M4 | ✓ | ✓ | 0.3442 (0.23%) | 0.2593 (0.08%) | 0.8308 (−0.38%) |
Method | T | | | | Full (↑) | Min (↑) | Text (↑)
---|---|---|---|---|---|---|---
Ours (50 steps) | 50 | 45 | 35 | 15 | 0.3437 | 0.2591 | 0.8349 |
Ours (25 steps) | 25 | 22 | 17 | 7 | 0.3438 | 0.2595 | 0.8296 |
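For context on the metrics: the Full, Min, and Text columns in the tables above are consistent with the CLIP-based protocol popularized by Attend-and-Excite [8], i.e., full-prompt image-text similarity, the minimum image-text similarity over per-subject sub-prompts, and text-text similarity between the prompt and a caption of the generated image. Assuming that reading, the sketch below shows how such scores are commonly computed; the `clip_model`, `tokenizer`, `preprocess`, and `captioner` arguments are assumed interfaces (OpenAI-CLIP-style `encode_image`/`encode_text` calls and a BLIP-style captioner), not part of the paper.

```python
import torch
import torch.nn.functional as F

def cosine(a: torch.Tensor, b: torch.Tensor) -> float:
    """Cosine similarity between two feature vectors (batch size 1)."""
    return F.cosine_similarity(a, b, dim=-1).item()

@torch.no_grad()
def clip_scores(clip_model, tokenizer, preprocess, captioner, image, prompt, subject_prompts):
    """Illustrative computation of the three CLIP-based scores.

    Full: similarity between the image and the full prompt.
    Min:  lowest similarity between the image and any single-subject prompt
          (e.g., "a cat", "a frog"), so one neglected subject drags the score down.
    Text: similarity between the input prompt and a caption generated for the image
          (the captioner is assumed to be a BLIP-style model).
    """
    img_feat = clip_model.encode_image(preprocess(image).unsqueeze(0))
    full = cosine(img_feat, clip_model.encode_text(tokenizer([prompt])))

    per_subject = [cosine(img_feat, clip_model.encode_text(tokenizer([p])))
                   for p in subject_prompts]

    caption = captioner(image)                            # assumed captioning call
    text = cosine(clip_model.encode_text(tokenizer([caption])),
                  clip_model.encode_text(tokenizer([prompt])))
    return full, min(per_subject), text
```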
Jung, S.; Heo, Y.S. Temporal Adaptive Attention Map Guidance for Text-to-Image Diffusion Models. Electronics 2025, 14, 412. https://doi.org/10.3390/electronics14030412