Visualizing Ambiguity: Analyzing Linguistic Ambiguity Resolution in Text-to-Image Models
Abstract
1. Introduction
- Performing a comparative analysis of how multiple text-to-image diffusion models resolve linguistic ambiguity, compared with human resolution;
- Presenting the Visual Linguistic Ambiguity Benchmark (V-LAB) dataset for analyzing and evaluating the interpretation of linguistic ambiguity by text-to-image diffusion models;
- Identifying three failure modes induced by three different types of linguistic ambiguity in text-to-image prompts, together with prompt engineering guidelines to mitigate their effects.
2. Related Work
2.1. Text-to-Image Generation
Vision Language Models
2.2. Ambiguities in Text-to-Image Models
2.3. Prompt Engineering for Generative AI Models
3. Background: Linguistic Ambiguity
3.1. Syntactical Ambiguity
3.2. Lexical Ambiguity
3.3. Figurative Ambiguity
4. Methodology
4.1. Prompt Generation
4.1.1. Syntactical Ambiguity Prompts
4.1.2. Lexical Ambiguity Prompts
4.1.3. Figurative Ambiguity Prompts
4.2. Image Generation
Visual Linguistic Ambiguity Benchmark (V-LAB) Dataset
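As a concrete illustration of the generation step, the sketch below shows how images for a set of ambiguous prompts could be produced with an open-source Stable Diffusion checkpoint through the Hugging Face diffusers library. The checkpoint name, sampler settings, and number of images per prompt are assumptions for this sketch; the study's exact configuration is not specified here.

```python
import torch
from diffusers import StableDiffusionPipeline

# Assumed checkpoint and GPU setup; this outline does not pin these details down.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Two syntactically ambiguous prompts drawn from the benchmark tables.
prompts = [
    "A man walking next to a girl holding an umbrella.",
    "The police arrested the man with a gun.",
]

# Generate several images per prompt so that consistency across runs can later be inspected.
images_per_prompt = 4  # assumed value
for prompt in prompts:
    for i in range(images_per_prompt):
        image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
        image.save(f"{prompt[:20].replace(' ', '_')}_{i}.png")
```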
4.3. Models Evaluation
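This outline does not detail the evaluation protocol, which may rely on human annotation of the generated images. Purely as one illustrative automatic check, and not necessarily the authors' method, a CLIP model can score which of a prompt's two interpretations an image most resembles; the checkpoint, captions, and file name below are assumptions.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Standard OpenAI CLIP checkpoint, chosen for illustration.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def score_interpretations(image_path: str, interpretations: list[str]) -> list[float]:
    """Return a softmax probability for each candidate interpretation given the image."""
    image = Image.open(image_path)
    inputs = processor(text=interpretations, images=image, return_tensors="pt", padding=True)
    logits = model(**inputs).logits_per_image  # shape: (1, number of interpretations)
    return logits.softmax(dim=-1).squeeze(0).tolist()

# Example with a lexically ambiguous prompt from the benchmark (hypothetical file name).
probs = score_interpretations(
    "crane_next_to_house_0.png",
    ["a bird called a crane next to a house", "a construction crane next to a house"],
)
print(probs)
```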
5. Results and Analysis
5.1. Alignment with Human Resolution
5.1.1. Syntactical Ambiguity
- When provided with syntactically ambiguous prompts, both models tend to exhibit mixed interpretations almost 50% of the time.
- DALL-E tends to interpret syntactical ambiguity in a manner more aligned with human interpretation than Stable Diffusion.
- Both models exhibit inconsistency in interpreting syntactical ambiguity when running the same prompt multiple times.
5.1.2. Lexical Ambiguity
- Lexical ambiguity is often interpreted in a manner aligned with human resolution (43% and 41% by DALL-E and Stable Diffusion, respectively).
- Mixed interpretations appear in both models, being more frequent in Stable Diffusion (9% margin).
- Both models exhibit inconsistent interpretations of prompts with lexical ambiguity.
5.1.3. Figurative Ambiguity
- Stable Diffusion tends to interpret figurative ambiguity in a manner more aligned with human interpretation than DALL-E.
- DALL-E frequently generates mixed interpretations for figurative ambiguity, reflecting its challenge in resolving figurative prompts with a single coherent meaning.
- Both models performed well in terms of consistent depictions in the case of figurative ambiguity, with Stable Diffusion achieving higher consistency (90% of the prompts) than DALL-E (80%).
5.2. Stable Diffusion vs. DALL-E
- For prompts with syntactical ambiguity, DALL-E outperformed Stable Diffusion in matching human interpretations.
- Both models mixed interpretations at similar rates for syntactically ambiguous prompts, doing so in nearly half of the generated images.
- For prompts with lexical ambiguity, the two models performed similarly in matching human interpretations.
- DALL-E tended to generate more mixed interpretations than Stable Diffusion for lexically ambiguous prompts.
- For prompts with figurative ambiguity, Stable Diffusion outperformed DALL-E in aligning with human resolution by a large margin (30%).
- Stable Diffusion is more consistent than DALL-E in depicting a single interpretation per image (fewer mixed interpretations).
- Both models always captured the figurative meaning of the prompt, either on its own or alongside the literal interpretation.
6. Failure Modes
- Misaligned Interpretations: The model generates an image depicting an interpretation that, although valid, differs from the one the user intended.
- Mixed Interpretations: The model depicts mixed interpretations of the prompt in a single generated image.
- Inconsistent Interpretations: The model depicts inconsistent or different interpretations of the same prompt across multiple generations of images.
- Prompts that generate unintended interpretations due to syntactical ambiguity can be reconstructed to clearly specify the relationships between the entities in the prompt. Figure 13 shows an example of images generated with two clarified prompts based on one of the syntactically ambiguous prompts used in the study.
- Prompts that generate unintended interpretations due to lexical ambiguity can be reconstructed by adding context that clarifies the intended sense of the ambiguous word. Figure 14 shows an example of images generated with two clarified prompts based on one of the lexically ambiguous prompts used in the study.
- Prompts that generate unintended interpretations due to figurative ambiguity can be reconstructed by removing or replacing the figurative/metaphoric words with their literal alternatives. A minimal sketch of applying these three strategies follows below.
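The following sketch illustrates how the three reconstruction strategies above could be applied in practice. The rephrased prompts are written for this example only; they are not the exact clarified prompts shown in Figures 13 and 14, although the ambiguous originals come from the benchmark tables below.

```python
# Hypothetical prompt rewrites illustrating the three mitigation strategies.
# They are not the clarified prompts used in the paper's figures.
disambiguated = {
    # Syntactical ambiguity: state the relationship between entities explicitly.
    "A man walking next to a girl holding an umbrella.": [
        "A man holding an umbrella, walking next to a girl.",
        "A man walking next to a girl who is holding an umbrella.",
    ],
    # Lexical ambiguity: add context that pins down the intended sense of the word.
    "A crane next to the house": [
        "A tall construction crane next to the house.",
        "A long-legged crane bird standing next to the house.",
    ],
    # Figurative ambiguity: replace the figurative phrase with its literal meaning.
    "The kid is under the weather.": [
        "The kid is feeling sick.",
    ],
}

for ambiguous, variants in disambiguated.items():
    print(f"Ambiguous prompt: {ambiguous}")
    for variant in variants:
        print(f"  Clarified: {variant}")
```

Each clarified variant can then be passed to the same generation pipeline sketched in Section 4.2 to check that the resulting images depict the intended interpretation.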
7. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
Syntactically ambiguous prompts with their two possible interpretations and the share of human respondents who chose the first interpretation:

Prompt | Interpretation 1 | Interpretation 2 | Human Preference (Interpretation 1)
---|---|---|---
A boy wearing a shirt with a dog. | The shirt has a dog picture. | The dog is next to the boy. | 80%
A man walking next to a girl holding an umbrella. | The man held the umbrella. | The girl held the umbrella. | 70%
He painted on the table. | He painted directly on the table. | He painted on a medium on the table. | 70%
I saw the painting of the tree with the magnifying glass. | I saw the tree next to the glasses. | I saw through the glasses. | 80%
The cat on the mat with the red rose. | The mat had a rose design. | The rose is next to the cat. | 100%
The chicken is ready to eat. | The cooked chicken is ready. | The alive chicken is ready for its meal. | 70%
The girl drew on the notebook with black lines. | The girl used black lines. | The notebook had black lines. | 60%
The kite flew over the field with a rainbow. | The kite has a rainbow design. | The field had a rainbow over it. | 80%
The man walked the street with the torch. | The street had a torch. | The man held a torch. | 90%
The police arrested the man with a gun. | The man had the gun. | The police had the gun. | 60%
Lexically ambiguous prompts with the two possible senses of the ambiguous word and the share of human respondents who chose the first sense:

Prompt | Sense 1 | Sense 2 | Human Preference (Sense 1)
---|---|---|---
A bat near the field | A kind of animal | A type of athletic tool | 90%
A bow displayed in the market | A type of weapon | A knot made by twisting ribbons | 70%
A crane next to the house | A kind of bird | A mechanical machine | 60%
A key on the desk | A shaped metal tool | A button on a keyboard | 100%
A mole on a hand | A kind of animal | A blemish on the skin | 100%
Glasses on the table | A pair of optical lenses | A drinking vessel | 70%
The bank next to the park | A financial institution | Riverside land | 80%
The man carried the light bag | A source of illumination | An adjective meaning lightweight | 80%
The painting with the dates | A type of fruit | A calendar day | 80%
The woman saw the big wave | A long body of water | A hand gesture | 100%
Figurative prompts and their intended figurative interpretations:

Prompt | Figurative Interpretation
---|---
He hit the road in the morning. | He left or started a journey in the morning.
He is feeling blue. | He is feeling sad or down.
She is a ray of sunshine. | She brings happiness and positivity.
Stars on the red carpet. | Famous people are present at an event.
The city that never sleeps. | A city that is always lively and active.
The kid is under the weather. | The child is feeling unwell or sick.
The market at the heart of the city. | The market is centrally located or very important to the city.
The night sky was a blanket of stars. | The sky was filled with stars, appearing like a covering.
The student is burning the midnight oil. | The student is studying or working late into the night.
The two sisters looked like two peas in a pod. | The two sisters look very similar or are very close.
Example prompts with the percentage of human respondents choosing each interpretation:

Prompt | Interpretation 1 | Interpretation 2 | Interpretation 1 (%) | Interpretation 2 (%)
---|---|---|---|---
He painted on the table. | He painted directly on the table. | He painted on a medium on the table. | 70% | 30%
The painting with the dates | A type of fruit | A calendar day | 80% | 20%
Interpretation outcomes for syntactically ambiguous prompts:

Model | Aligned with Human Interpretation | Alternative Interpretation | Mixed Interpretation
---|---|---|---
DALL-E | 32% | 23% | 45%
Stable Diffusion | 20% | 34% | 46%
Interpretation outcomes for lexically ambiguous prompts:

Model | Aligned with Human Interpretation | Alternative Interpretation | Mixed Interpretation
---|---|---|---
DALL-E | 53% | 20% | 27%
Stable Diffusion | 41% | 38% | 21%
Interpretation outcomes for figuratively ambiguous prompts:

Model | Aligned with Human Interpretation | Alternative Interpretation | Mixed Interpretation
---|---|---|---
DALL-E | 34% | 0% | 56%
Stable Diffusion | 63% | 0% | 27%
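To make the shape of these tables concrete, the snippet below tallies hypothetical per-image annotation labels into the kind of percentage breakdown reported above. The label names and counts are invented for this sketch; the paper's annotation scheme and denominators may differ.

```python
from collections import Counter

# Hypothetical annotation labels assigned to each generated image.
annotations = {
    "DALL-E": ["human_aligned", "mixed", "alternative", "mixed", "human_aligned"],
    "Stable Diffusion": ["human_aligned", "human_aligned", "mixed", "alternative", "alternative"],
}

for model, labels in annotations.items():
    counts = Counter(labels)
    total = len(labels)
    shares = {label: f"{100 * count / total:.0f}%" for label, count in counts.items()}
    print(model, shares)
```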
Share and Cite
Elsharif, W.; Alzubaidi, M.; She, J.; Agus, M. Visualizing Ambiguity: Analyzing Linguistic Ambiguity Resolution in Text-to-Image Models. Computers 2025, 14, 19. https://doi.org/10.3390/computers14010019