Article

Visualizing Ambiguity: Analyzing Linguistic Ambiguity Resolution in Text-to-Image Models

1 College of Science and Engineering, Hamad Bin Khalifa University, Doha 34110, Qatar
2 Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong 999077, China
* Authors to whom correspondence should be addressed.
Computers 2025, 14(1), 19; https://doi.org/10.3390/computers14010019
Submission received: 26 November 2024 / Revised: 30 December 2024 / Accepted: 4 January 2025 / Published: 8 January 2025
(This article belongs to the Special Issue Natural Language Processing (NLP) and Large Language Modelling)

Abstract:
Text-to-image models have demonstrated remarkable progress in generating visual content from textual descriptions. However, the presence of linguistic ambiguity in the text prompts poses a potential challenge to these models, possibly leading to undesired or inaccurate outputs. This work conducts a preliminary study and provides insights into how text-to-image diffusion models resolve linguistic ambiguity through a series of experiments. We investigate a set of prompts that exhibit different types of linguistic ambiguities with different models and the images they generate, focusing on how the models’ interpretations of linguistic ambiguity compare to those of humans. In addition, we present a curated dataset of ambiguous prompts and their corresponding images known as the Visual Linguistic Ambiguity Benchmark (V-LAB) dataset. Furthermore, we report a number of limitations and failure modes caused by linguistic ambiguity in text-to-image models and propose prompt engineering guidelines to minimize the impact of ambiguity. The findings of this exploratory study contribute to the ongoing improvement of text-to-image models and provide valuable insights for future advancements in the field.

1. Introduction

Advancements in text-to-image models continue to transform visual content creation, allowing for more creativity and personalization [1,2]. Diffusion models, in particular, have led to the launch of many systems that reshaped the field; aided by their impressive capabilities, they excel at bridging language and visual perception, unlocking new possibilities and expanding human imagination [3].
The ability to generate images from text empowers users to transform their ideas into creative visual content. However, unlocking the full potential of these models through plain text is not an easy task, which has led to the rise of prompt engineering, where dedicated efforts are made to craft AI models’ prompts so as to achieve optimal results [4]. Since these models rely primarily on text, often in English, and since linguistics studies language structure and its long-range dependencies, it is crucial to investigate prompt design from a linguistic perspective [5].
This work targets the issue of linguistic ambiguity and how it is handled by text-to-image models. For example, the sentence “Glasses on the table” exhibits lexical ambiguity: it can mean either drinking glasses or reading glasses sitting on a table. As shown in Figure 1, when this sentence is used as a prompt, Stable Diffusion generates both interpretations over multiple runs, which, in addition to causing inconsistencies, can diverge from the user’s preferences. Such issues, among many others, point to the importance of understanding prompt structure from a linguistic perspective [6]. This study takes a comparative approach, contrasting the models’ interpretations of linguistic ambiguity with human interpretations through a set of experiments, while also reporting other findings and failure modes of these models. We present the set of ambiguous prompts and images generated in our experiments as the Visual Linguistic Ambiguity Benchmark (V-LAB) dataset. The main contributions of this work are summarized as follows:
  • Performing a comparative analysis of linguistic ambiguity resolution by multiple text-to-image diffusion models against human resolution;
  • Presenting the Visual Linguistic Ambiguity Benchmark (V-LAB) dataset for analyzing and evaluating the interpretation of linguistic ambiguity by text-to-image diffusion models;
  • Identifying three failure modes caused by the presence of three different types of linguistic ambiguity in text-to-image models’ prompts, as well as proposing prompt engineering guidelines to mitigate their effects.
The following section provides a summary of related work. Next, a brief linguistic background explains the different types of linguistic ambiguity targeted in this work. The subsequent section details the methods followed in this study, followed by the results and analysis, based on which a number of challenges are identified along with suggested prompt engineering guidelines to counter them. Finally, the findings are summarized together with several potential directions for future work.

2. Related Work

This section provides an overview of the literature related to this study. Our work deals with text-to-image generative models and linguistic ambiguity analysis and does not aim to provide an extensive survey of related work; we refer interested readers to recent surveys on this topic [7].

2.1. Text-to-Image Generation

Generative adversarial networks (GANs) marked the beginning of text-to-image models, introducing the generator–discriminator model. Building upon GANs, conditional GANs (CGANs) were introduced, allowing for more controlled generation and leading to the development of text-to-image models [8,9]. Variational autoencoders (VAEs), along with the development of attention mechanisms and the transformer model, have played a crucial role in advancing the field [10].
Additionally, vision language models such as OpenAI’s CLIP provided means for better guiding the generation process and were paired with popular GAN-based models, such as BigGAN and VQGAN+CLIP [11,12,13]. However, in recent times, diffusion models have risen in prominence, demonstrating advantages over their GAN counterparts [14]. The current state-of-the-art text-to-image models include diffusion-based architectures such as OpenAI’s DALL-E and its successors, DALL-E 2 and the most recent DALL-E 3, among others such as Midjourney, Stability AI’s Stable Diffusion, and Google’s Imagen [3,15,16,17,18,19].

Vision Language Models

Text-to-image generation is currently also being performed through vision language models that are designed to process and integrate visual and textual data, enabling seamless interaction across these modalities. By leveraging large-scale multimodal training, models such as GPT-4, Claude, and Gemini excel in tasks requiring a deep understanding of both images and text, such as image captioning, visual question answering (VQA), and text-to-image generation [20,21,22].

2.2. Ambiguities in Text-to-Image Models

Optimizing AI-generated images’ alignment with textual prompts involves not only matching the textual input but also capturing the intent and nuance behind the text. Previous works have tackled this issue from different angles, such as addressing the compositional generation problem [23,24,25]. A human preference score was proposed in [26]. Additionally, the inverse problem of concept discovery with text-to-image models was investigated in [27].
Compositional image generation in text-to-image models refers to the ability of these models to generate images that accurately reflect the relationships and arrangements of objects described in the text prompt. It involves combining multiple elements, such as objects, attributes, spatial relationships, and contextual details, in a coherent and meaningful way within the generated image. However, these models face many issues, and several works have proposed solutions targeting various sub-problems of compositional generation. For instance, StructureDiffusion introduced a training-free method to enhance compositional generations [28]. T2I-CompBench was proposed as a compositional generation benchmark dataset for evaluating text-to-image models’ performance [29]. ARO was introduced as a benchmark to systematically evaluate the ability of vision language models (VLMs) to understand different types of relationships and attributes [30]. On the linguistic side, SynGen proposed a method to enrich the diffusion process with syntactic information [31].
However, part of optimizing prompt–image alignment lies in resolving ambiguities that might exist in the prompt. TIED proposed a method to disambiguate prompts by soliciting clarifications from the user [32]. The Probabilistic Uncertainty Modeling (PUM) module focused on the semantic ambiguity of visual relationships [33]. A recent study reported the problem of semantic leakage, in which DALL-E modified multiple entities with a single noun, highlighting the issue of inductive bias in text-to-image models [34]. Very recently, Guo et al. designed a visual representation for comparing prompt–image pairs and exploring editing history. Their Image Variant Graph represents prompt differences as edges between corresponding images and visualizes distances through projection, allowing users to better understand the impact of prompt changes and gain more effective control over image generation [35].
Despite all efforts to improve model controllability, to the best of our knowledge, the impact of linguistic ambiguity on text-to-image models has not yet been studied in detail.

2.3. Prompt Engineering for Generative AI Models

Prompt engineering is a widely used tool to guide generative AI models in generating desired outputs. Initially developed in the NLP field with large language models (LLMs), it is now extensively applied to text-to-image models. Various techniques have been proposed to improve user experiences with popular LLMs like ChatGPT across diverse fields and applications [36,37,38].
In the text-to-image domain, significant research has focused on addressing diverse challenges. Yang et al. proposed a framework to generate user-aligned prompts by fine-tuning an LLM with reinforcement learning [33]. PromptMagician introduced a system that enables users to visually explore and refine prompts for improved outcomes [39]. Liu et al. studied effective keywords and hyper-parameters for image generation, subsequently developing guidelines to enhance prompt effectiveness [40]. Context Optimization (CoOp) leverages vision language models like CLIP to overcome limitations of manually crafted prompts [41].
Recently, Hao et al. proposed a framework to adapt user inputs into model-preferred prompts using fine-tuning and reinforcement learning [4]. This approach enhances aesthetic quality while preserving user intent. Similarly, the Prompt Auto-Editing (PAE) method refines prompts through reinforcement learning to achieve greater precision in control [42]. However, the linguistic aspects of prompt engineering for generative AI models, including the role of syntax and semantics, remain largely unexplored.

3. Background: Linguistic Ambiguity

Effective communication through text requires addressing linguistic ambiguities, which often overlap across various forms [43]. Linguistic ambiguities are generally classified into three types: (i) phonetic, (ii) semantic, and (iii) syntactic. Since phonetic ambiguity deals with sounds rather than text, it is outside the scope of this study. We focus on syntactic and semantic ambiguities, including their subsets: lexical and figurative.

3.1. Syntactical Ambiguity

Syntactic ambiguity, or amphibology, occurs when a sentence can be interpreted in multiple ways due to its structure [44]. For instance, “I saw a man on a hill with a telescope” can mean either that the man was using a telescope on a hill or that the speaker saw a man with a telescope while on a hill. Similarly, “old men and women” can refer to elderly men and women collectively or distinguish between the two groups. These examples highlight the importance of context in understanding intended meanings.

3.2. Lexical Ambiguity

Lexical ambiguity, a subset of semantic ambiguity, arises when a word or phrase has multiple meanings [45]. For example, in “They are hunting dogs”, the word “hunting” can describe dogs used for hunting or indicate that people are hunting dogs. Additional context is often needed for disambiguation.

3.3. Figurative Ambiguity

Figurative ambiguity occurs when a phrase can be interpreted both literally and figuratively [46]. This is common in irony and metaphors. In fields like artificial intelligence, where text descriptions are converted to visual representations, such ambiguities can result in inaccurate outputs. Thus, precise language interpretation is crucial.

4. Methodology

A popular strategy for linguistic analysis is for language processing studies to begin with sample sentences that vary in a specific linguistic feature [47]; researchers then rely on their own intuitions to evaluate this aspect. Building on this approach, we investigate how text-to-image models handle ambiguous prompts and resolve linguistic ambiguity compared to humans.
Our methodology begins with a set of ambiguous prompts representing the three types of ambiguity identified in this study. We then generate images using these prompts to evaluate the models’ interpretations. Finally, we analyze the results and draw conclusions about the models’ performance. This study’s methodology follows the workflow described in Figure 2 and consists of three main steps: prompt generation, image generation, and model evaluation.

4.1. Prompt Generation

For each of the three classes of linguistic ambiguity considered in this study, we developed a set of prompts that align with the conditions of each type. To minimize potential bias from the authors’ selection, we utilized ChatGPT-4o to generate the prompts used in this study. The details of the prompt generation process for each type of ambiguity are described as follows.
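As an illustration of this step, the sketch below shows how the same prompt-generation requests could be issued programmatically through the OpenAI API rather than the ChatGPT web interface used in the study; the model name, instruction suffix, and line-based parsing are assumptions made for this example.

```python
# Hypothetical sketch of the prompt-generation step using the OpenAI Python SDK
# instead of the ChatGPT web interface; the model name, instruction suffix, and
# line-based parsing below are assumptions made for illustration only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_ambiguous_prompts(instruction: str, n: int = 10) -> list[str]:
    """Request candidate ambiguous prompts from a chat model and return them as a list."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": f"{instruction} Return exactly {n} sentences, one per line."}],
    )
    text = response.choices[0].message.content
    # Keep non-empty lines; manual curation (as done in the study) would follow this step.
    lines = [line.strip("-•0123456789. ").strip() for line in text.splitlines() if line.strip()]
    return lines[:n]

syntactic_candidates = generate_ambiguous_prompts(
    "Generate sentences with syntactical ambiguity to be used as prompts for text-to-image models."
)
```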

4.1.1. Syntactical Ambiguity Prompts

The set of prompts with syntactical ambiguity was generated by prompting ChatGPT-4o with the prompt “Generate sentences with syntactical ambiguity to be used as prompts for text-to-image models”. The model then returned multiple examples of syntactically ambiguous prompts, each with two possible interpretations, which we refer to as interpretation A and interpretation B. Syntactical ambiguity in the English language does not follow a fixed sentence structure. Upon examining datasets of collected Stable Diffusion prompts, such as https://huggingface.co/datasets/isidentical/random-stable-diffusion-prompts (accessed on 1 January 2024), we note the frequent occurrence of prepositional phrases, which often results in syntactical ambiguity. Building on this, the majority of the prompts selected in this category follow that structure, and a total of 10 prompts were selected. The set of syntactically ambiguous prompts is defined as
P_s = {p_s^1, p_s^2, p_s^3, ..., p_s^10},   p_s^i ∈ {INT_A^i, INT_B^i},   i ∈ {1, 2, ..., 10},
where P_s is the set of prompts with syntactical ambiguity, p_s^i is an individual syntactically ambiguous prompt, and INT_A^i and INT_B^i are the possible interpretations of the ambiguous prompt p_s^i. The list of syntactically ambiguous prompts, along with their interpretations, is given in Table 1.

4.1.2. Lexical Ambiguity Prompts

To develop our set of prompts with lexical ambiguity, we first prompted ChatGPT-4o to list words with lexical ambiguity that have two possible visual meanings using the prompt “Generate words with lexical ambiguity that have two visual meanings”. We then prompted the model to use these words in sentences such that the sentences become ambiguous and cannot be disambiguated directly from cues within them, using the prompt “Use the list of lexically ambiguous words to generate lexically ambiguous sentences to be used as prompts for text-to-image models”. The prompts in this category consist of a lexically ambiguous word in a visual setup, each with two possible interpretations: interpretation A and interpretation B. The set of lexically ambiguous prompts is defined as
P_l = {p_l^1, p_l^2, p_l^3, ..., p_l^10},   p_l^i ∈ {INT_A^i, INT_B^i},   i ∈ {1, 2, ..., 10},
where P_l is the set of prompts with lexical ambiguity, p_l^i is an individual lexically ambiguous prompt, and INT_A^i and INT_B^i are the possible interpretations of the ambiguous prompt p_l^i. The list of prompts with lexical ambiguity, along with their interpretations, is given in Table 2.

4.1.3. Figurative Ambiguity Prompts

The prompts with figurative ambiguity used in this category are widely used metaphors or idioms. The prompt “Generate sentence with figurative ambiguity containing metaphors or idioms to be used as prompts for text-to-image models” was used with ChatGPT-4o to generate prompts for this category. The set of prompts with figurative ambiguity is defined as
P_f = {p_f^1, p_f^2, p_f^3, ..., p_f^10},   p_f^i ∈ {INT_F^i, INT_L^i},   i ∈ {1, 2, ..., 10},
where P_f is the set of prompts with figurative ambiguity, p_f^i is an individual prompt with figurative ambiguity, INT_F^i is the figurative interpretation of the prompt, and INT_L^i is the literal interpretation of the prompt p_f^i.
The list of prompts with figurative ambiguity, along with their figurative interpretations, is listed in Table 3.

4.2. Image Generation

The images in the experimental stage were generated using two diffusion models.
Stable Diffusion [18]: Stable Diffusion is currently one of the most powerful and widely used diffusion models, known for its impressive capabilities in generating high-quality, realistic images from text, in addition to other features such as image-to-image generation. The general architecture of Stable Diffusion consists of a variational autoencoder, forward and reverse diffusion, a noise predictor, and text conditioning. In this study, we used Stable Diffusion XL, accessed through https://stablediffusionweb.com/ (accessed on 4 July 2024) with GPU enabled, random seeds, and the guidance scale set to its default value of 7.
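The study accessed Stable Diffusion XL through the web interface above; for readers who prefer a programmatic route, the following minimal sketch shows roughly equivalent generation with the Hugging Face diffusers library, keeping the guidance scale at 7. The checkpoint name and the decision to leave the seed unset are assumptions, not the exact configuration of the web service.

```python
# A minimal sketch, assuming the stabilityai/stable-diffusion-xl-base-1.0 checkpoint,
# of programmatic SDXL generation with the diffusers library; the study itself used
# the stablediffusionweb.com interface, so this is an approximation of that setup.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

prompt = "Glasses on the table"  # one of the lexically ambiguous prompts from Table 2
# Seed left unset (fresh randomness per run, as in the study); guidance scale kept at 7.
for run in range(10):
    image = pipe(prompt, guidance_scale=7.0).images[0]
    image.save(f"sd_pl6_run{run:02d}.png")
```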
DALL-E [15,17]: Developed by OpenAI, DALL-E is known for its ability to transform abstract and complex prompts into images in a creative way. DALL-E is based on the transformer model, incorporating attention mechanisms and autoregressive techniques, and it utilizes text conditioning to understand and align text with images. In this study, we accessed DALL-E 3 [48] through ChatGPT-4o at https://chatgpt.com/ (accessed on 4 July 2024); to ensure comparable results, ChatGPT was instructed to use the given prompt without the modifications it usually applies to users’ input prompts.
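Again, the experiments used the ChatGPT web interface; a hedged alternative is to call DALL-E 3 directly through the OpenAI Images API, as sketched below. Note that the API may still revise prompts internally, so verbatim prompt handling is an assumption to verify rather than a guarantee.

```python
# Hedged sketch: calling DALL-E 3 directly through the OpenAI Images API rather than
# the ChatGPT interface used in the study; the service may still rewrite prompts,
# which is why the returned revised prompt is printed for inspection.
from openai import OpenAI

client = OpenAI()

result = client.images.generate(
    model="dall-e-3",
    prompt="Glasses on the table",  # passed verbatim, mirroring the no-modification instruction
    n=1,                            # DALL-E 3 generates one image per request
    size="1024x1024",
)
print(result.data[0].url)             # URL of the generated image
print(result.data[0].revised_prompt)  # any rewrite applied by the service
```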
The two models were accessed through web interfaces rather than APIs to resemble the experience of most users. For each category of ambiguity, 10 images were generated for each of the 10 prompts, resulting in 200 images per investigated ambiguity type and a total of 600 images evaluated in this study. Figure 3, Figure 4 and Figure 5 show samples of the images generated from prompts with syntactical, lexical, and figurative ambiguity, respectively.

Visual Linguistic Ambiguity Benchmark (V-LAB) Dataset

The set of prompts generated and used in this study constitutes our Visual Linguistic Ambiguity Benchmark (V-LAB). Overall, V-LAB consists of 30 ambiguous prompts and a total of 600 generated images, where each of Stable Diffusion and DALL-E was used to generate 10 images for each prompt in the dataset.
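To make the dataset organization concrete, the sketch below shows one possible record layout for a V-LAB entry; the field names and example values are illustrative assumptions, not the published schema of the Kaggle dataset.

```python
# Illustrative only: one possible record layout for a V-LAB entry. The field names and
# example values are assumptions for this sketch, not the published schema of the dataset.
from dataclasses import dataclass, field

@dataclass
class VLabEntry:
    prompt_id: str                 # e.g., "p_s3", "P_l6", or "P_f2"
    ambiguity_type: str            # "syntactic", "lexical", or "figurative"
    prompt: str                    # the ambiguous prompt text
    interpretation_a: str          # INT_A (or the figurative meaning INT_F)
    interpretation_b: str          # INT_B (or the literal meaning INT_L)
    images: dict[str, list[str]] = field(default_factory=dict)  # model name -> 10 image paths

entry = VLabEntry(
    prompt_id="P_l6",
    ambiguity_type="lexical",
    prompt="Glasses on the table",
    interpretation_a="Pair of optical lenses",
    interpretation_b="Drinking instrument",
)
```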

4.3. Model Evaluation

The models in this study were evaluated in two stages. First, a human evaluation of the prompts used in the study was performed by a total of 10 participants. We designed a survey form that includes the prompts with syntactical and lexical ambiguity used in this study to learn how users interpret ambiguous prompts. The survey lists the syntactically and lexically ambiguous prompts along with their possible interpretations, and participants were asked to choose the interpretation they found more aligned with each prompt. For each prompt, we refer to the first listed interpretation in the survey as INT_A and the second one as INT_B. The survey form can be found at https://forms.gle/Ty1qAUhLczi3WQdc8 (accessed on 4 July 2024). Table 1 and Table 2 report the selections by human evaluators for the interpretations of prompts with syntactical and lexical ambiguity. For each prompt, the most-voted interpretation is labeled INT_H (human selection). In the case of figurative ambiguity, the figurative interpretation is referred to as INT_H, since it is usually the interpretation perceived by humans.
The second stage was the model assessment, which was performed by the authors, who viewed the images and chose which interpretation was depicted by the model in each generated image. In the cases of syntactical and lexical ambiguity, the authors chose from three possible interpretations: INT_A, INT_B (INT_F and INT_L in the case of figurative ambiguity), and a third case, INT_M (mixed interpretations), where both previous interpretations were depicted in the image.
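A minimal sketch of the tallying step that follows this assessment is given below, assuming each annotation is stored as a (model, prompt_id, label) triple with labels already mapped to INT_H, INT_N, or INT_M; the storage format is an assumption made for illustration.

```python
# A minimal sketch of the tallying step, assuming each annotation is stored as a
# (model, prompt_id, label) triple whose label is already mapped to INT_H, INT_N, or INT_M.
from collections import Counter

def interpretation_shares(annotations: list[tuple[str, str, str]]) -> dict[str, dict[str, float]]:
    """Return, per model, the share of images labeled INT_H, INT_N, and INT_M."""
    per_model: dict[str, Counter] = {}
    for model, _prompt_id, label in annotations:
        per_model.setdefault(model, Counter())[label] += 1
    return {
        model: {lab: counts[lab] / sum(counts.values()) for lab in ("INT_H", "INT_N", "INT_M")}
        for model, counts in per_model.items()
    }

# Example with three annotated images for one model:
print(interpretation_shares([("DALL-E", "p_s1", "INT_H"),
                             ("DALL-E", "p_s1", "INT_M"),
                             ("DALL-E", "p_s2", "INT_H")]))
```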
Figure 3. Sample of images generated from prompts with syntactical ambiguity using Stable Diffusion in the top row and DALL-E in the bottom row.
Figure 4. Sample of images generated from prompts with lexical ambiguity using Stable Diffusion in the top row and DALL-E in the bottom row.
Figure 5. Sample of images generated from prompts with figurative ambiguity using Stable Diffusion in the top row and DALL-E in the bottom row.

5. Results and Analysis

This section reports and analyzes the results obtained from both the human evaluation and model assessments performed in the study. We considered two main angles: first, how the text-to-image models’ resolution of ambiguous prompts aligns with human resolution, and second, how the two models perform on the different categories of ambiguity.

5.1. Alignment with Human Resolution

When generating images using ambiguous prompts with two possible interpretations INT_A and INT_B, there are three possibilities for ambiguity resolution in the generated image: first, the image depicts interpretation A solely (INT_A); second, the image depicts interpretation B solely (INT_B); or, lastly, the image depicts both interpretations simultaneously, INT_M (mixed interpretations).
Throughout the assessment of the generated images, the evaluators chose from the three aforementioned cases. For each prompt, we labeled the interpretation selected by the majority of votes from the evaluators as INT_H (human interpretation) and the one with fewer votes as INT_N (non-human interpretation). For example, in Table 4, INT_B of prompt P_l^9 was favored by evaluators 80% of the time; hence, for that prompt, INT_B is referred to as INT_H. In the case of figurative ambiguity, the figurative meaning of the prompt was labeled INT_H and the literal meaning INT_N for all the prompts. Next, we detail the models’ performance for each ambiguity category.
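The relabeling rule described above can be summarized in a few lines of code; the sketch below assumes the survey responses are available as per-prompt vote counts for INT_A and INT_B, which is an assumption about data layout rather than the authors' exact tooling.

```python
# Sketch of the relabeling rule described above, assuming survey responses are available
# as per-prompt vote counts for INT_A and INT_B (the data layout is an assumption).
def map_to_human_labels(votes_a: int, votes_b: int) -> dict[str, str]:
    """Map INT_A/INT_B to INT_H (majority choice) and INT_N (minority choice)."""
    if votes_a >= votes_b:
        return {"INT_A": "INT_H", "INT_B": "INT_N"}
    return {"INT_A": "INT_N", "INT_B": "INT_H"}

# For prompt P_l^9, 8 of 10 evaluators chose INT_B, so INT_B becomes INT_H:
print(map_to_human_labels(votes_a=2, votes_b=8))  # {'INT_A': 'INT_N', 'INT_B': 'INT_H'}
```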

5.1.1. Syntactical Ambiguity

A total of 200 images were generated from 10 prompts exhibiting syntactical ambiguity, with 10 images per prompt and 100 images from each of DALL-E and Stable Diffusion. Table 5 summarizes the two models’ interpretations as depicted in the generated images. Out of the 200 images, 40 from DALL-E and 20 from Stable Diffusion solely depicted an interpretation matching the majority human interpretation INT_H, adding up to 30% of the images. In contrast, 46 images from Stable Diffusion and 45 images from DALL-E depicted mixed interpretations INT_M, so 45.5% of the images exhibited mixed interpretations. The Venn diagram in Figure 6 visualizes the generated results; the intersection of the two interpretation types (INT_H and INT_N) represents the images with mixed interpretations.
It can be observed that, while the models performed similarly in mixing interpretations, DALL-E outperformed Stable Diffusion in matching human resolution in the cases of exclusive depiction. On a prompt level, Figure 7 details how each of the diffusion models interpreted each of the 10 prompts over 10 generated images. The models exhibited inconsistency in how the ambiguous prompts were interpreted, with single prompts receiving different interpretations, and hence depictions, over multiple generations. A total of 40% of the prompts in DALL-E and 50% in Stable Diffusion exhibited more than a single interpretation when run multiple times. Based on the observed findings, the following can be concluded:
  • When provided with syntactically ambiguous prompts, both models tend to exhibit mixed interpretations almost 50% of the time.
  • DALL-E tends to interpret syntactical ambiguity in a manner more aligned with human interpretation.
  • Both models exhibit inconsistency in interpreting syntactical ambiguity when running the same prompt multiple times.

5.1.2. Lexical Ambiguity

Table 6 shows that, when faced with lexically ambiguous prompts, the models tended to interpret the ambiguity as the human interpretation INT_H in the majority of the generated images (43% for DALL-E and 41% for Stable Diffusion). The Venn diagram in Figure 8 shows that both models tended to perform similarly, with DALL-E generating both more mixed interpretations INT_M and more interpretations matching human resolution INT_H than Stable Diffusion.
The heat maps in Figure 9 also show that both models exhibited inconsistency in interpreting the lexical ambiguity in the prompts. Both models depicted consistent interpretations in the images only 50% of the time, with Stable Diffusion being slightly more inconsistent. Based on these statistics, we conclude the following:
  • Lexical ambiguity is often interpreted in a manner aligned with human resolution (43% and 41% by DALL-E and Stable Diffusion, respectively).
  • Mixed interpretations appear in both models, being more frequent in DALL-E (27% vs. 21%; Table 6).
  • Both models exhibit inconsistent interpretations of prompts with lexical ambiguity.

5.1.3. Figurative Ambiguity

Table 7 summarizes the two models’ interpretations as depicted in the generated images. Stable Diffusion succeeded in capturing the figurative meaning of the prompts in most cases, unlike DALL-E, which did so in only 34% of the cases. On the other hand, DALL-E tended to mix interpretations in 56% of its generations. It is also notable that neither model ever interpreted the prompts solely literally, which suggests that the models are capable of capturing the figurative meaning even when mixing it with the literal one. The results are visualized in Figure 10, which illustrates that INT_N interpretations never appear on their own in the images generated by either model.
Interestingly, both models performed better with figurative ambiguity compared to the other two categories in terms of consistent interpretations, as shown in Figure 11. Stable Diffusion almost always depicted the same interpretation, while DALL-E was consistent in 80% of the prompts. Based on the observed statistics, the following can be concluded:
  • Stable Diffusion tends to interpret figurative ambiguity in a manner more aligned with human interpretation (INT_H) compared to DALL-E.
  • DALL-E frequently generates mixed interpretations for figurative ambiguity, reflecting its challenge in resolving figurative prompts with a single coherent meaning.
  • Both models performed well in terms of consistent depictions in the case of figurative ambiguity, with Stable Diffusion achieving higher consistency (90% of the prompts) than DALL-E (80%).

5.2. Stable Diffusion vs. DALL-E

By examining the pie charts in Figure 12, we can visually compare the performance of Stable Diffusion and DALL-E over the entire dataset of 600 images from the three ambiguity categories.
In the case of syntactical ambiguity (100 images per model), we have the following results:
  • DALL-E outperformed Stable Diffusion in matching human interpretations in prompts with syntactical ambiguity.
  • Both models performed similarly in mixing interpretations, which occurred in nearly half of the generated images.
In the case of lexical ambiguity, we have the following results:
  • The two models performed similarly in matching human interpretations.
  • DALL-E tended to generate more mixed interpretations than Stable Diffusion.
In the case of figurative ambiguity, we have the following results:
  • Stable Diffusion outperformed DALL-E in aligning with human resolution by a large margin (30%).
  • Stable Diffusion is more consistent than DALL-E in depicting a single interpretation per image (fewer mixed interpretations).
  • Both models always managed to interpret the figurative meaning of the prompt, whether solely or simultaneously with the literal interpretation.

6. Failure Modes

Based on the results and analysis in the previous section, we report a number of failure modes that can arise due to linguistic ambiguity, as observed in the conducted study. These failure modes take one of the following forms:
  • Misaligned Interpretations: The model generates an image that depicts an interpretation that, although correct, differs from the one the user intended.
  • Mixed Interpretations: The model depicts mixed interpretations of the prompt in a single generated image.
  • Inconsistent Interpretations: The model depicts inconsistent or different interpretations of the same prompt over multiple generations of images.
To account for the aforementioned problems, we suggest the following prompt engineering guidelines:
  • Prompts that generate unintended interpretations due to syntactical ambiguity can be reconstructed to clearly specify the relationships within the prompt. Figure 13 shows an example of images generated with two clarified prompts based on one of the syntactically ambiguous prompts used in the study.
  • Prompts that generate unintended interpretations due to lexical ambiguity can be reconstructed by adding context to clarify the intended interpretation of the ambiguous word. Figure 14 shows an example of images generated with two clarified prompts based on one of the lexically ambiguous prompts used in the study.
  • Prompts that generate unintended interpretations due to figurative ambiguity can be reconstructed by removing or replacing the figurative/metaphoric words with their direct alternatives.
Figure 13, Figure 14 and Figure 15 show examples of how applying the proposed guidelines mitigates the discrepancies encountered in some of the images generated in this study.
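To make the guidelines concrete, the snippet below pairs each failure-prone prompt from the study with possible clarified rewrites; apart from "He is sad", which appears in Figure 15, the rewordings are illustrative assumptions rather than the exact prompts used for Figures 13 and 14.

```python
# Illustrative disambiguation pairs following the guidelines above; apart from "He is sad"
# (Figure 15), the clarified wordings are assumptions, not the exact rewrites behind
# Figures 13 and 14.
disambiguated = {
    # Syntactical: make the attachment of the prepositional phrase explicit.
    "A man walking next to a girl holding an umbrella": [
        "A man holding an umbrella while walking next to a girl",
        "A man walking next to a girl, and the girl is holding an umbrella",
    ],
    # Lexical: add context that pins down the intended sense of the ambiguous word.
    "Glasses on the table": [
        "A pair of reading glasses on the table",
        "Two drinking glasses on the table",
    ],
    # Figurative: replace the idiom with its direct meaning.
    "He is feeling blue": ["He is sad"],
}
```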

7. Conclusions and Future Work

This paper provided a preliminary exploratory analysis of linguistic ambiguity resolution in text-to-image models, with a particular emphasis on Stable Diffusion and DALL-E. Our comparative analysis reveals how the two models differ in interpreting the three types of linguistic ambiguity present in text prompts (syntactical, lexical, and figurative) and how their resolution compares to human resolution. We identified three main failure modes caused by the investigated ambiguities and proposed prompt engineering guidelines for avoiding unwanted results.
Future work will proceed in two main directions. First, we will develop real-time guidance that provides prompt recommendations to users during interactions with text-to-image generative systems, improving the user experience of effective image generation. The second direction will focus on the technical side of model architectures as a way to enhance model interpretation by adding control to the diffusion process.

Author Contributions

Conceptualization, W.E.; methodology, W.E. and J.S.; validation, M.A. (Marco Agus) and M.A. (Mahmood Alzubaidi); formal analysis, W.E. and M.A. (Marco Agus); investigation, W.E. and M.A. (Mahmood Alzubaidi); data curation, W.E. and M.A. (Mahmood Alzubaidi); writing—original draft preparation, W.E. and J.S.; writing—review and editing, M.A. (Marco Agus); visualization, W.E.; supervision, M.A. (Marco Agus) and J.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

This study has been approved by the Hamad Bin Khalifa University Institutional Review Board (HBKU-IRB). IRB Protocol Reference Number: HBKU-IRB-2025-17.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The Visual Linguistic Ambiguity Benchmark (V-LAB) dataset consisting of the prompts and images used in this study is publicly available at https://www.kaggle.com/datasets/walaadil/visual-linguistic-ambiguity-benchmark-v-lab/ (accessed on 4 July 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Brack, M.; Friedrich, F.; Kornmeier, K.; Tsaban, L.; Schramowski, P.; Kersting, K.; Passos, A. Ledits++: Limitless image editing using text-to-image models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–24 June 2024; pp. 8861–8870. [Google Scholar]
  2. Ruiz, N.; Li, Y.; Jampani, V.; Wei, W.; Hou, T.; Pritch, Y.; Wadhwa, N.; Rubinstein, M.; Aberman, K. Hyperdreambooth: Hypernetworks for fast personalization of text-to-image models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–24 June 2024; pp. 6527–6536. [Google Scholar]
  3. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar]
  4. Hao, Y.; Chi, Z.; Dong, L.; Wei, F. Optimizing prompts for text-to-image generation. Adv. Neural Inf. Process. Syst. 2024, 36. [Google Scholar]
  5. Sainburg, T.; Mai, A.; Gentner, T.Q. Long-range sequential dependencies precede complex syntactic production in language acquisition. Proc. R. Soc. B 2022, 289, 20212657. [Google Scholar] [CrossRef]
  6. Zhang, L.; Rao, A.; Agrawala, M. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 3836–3847. [Google Scholar]
  7. Alhabeeb, S.K.; Al-Shargabi, A.A. Text-to-Image Synthesis With Generative Models: Methods, Datasets, Performance Metrics, Challenges, and Future Direction. IEEE Access 2024, 12, 24412–24427. [Google Scholar] [CrossRef]
  8. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
  9. Reed, S.; Akata, Z.; Yan, X.; Logeswaran, L.; Schiele, B.; Lee, H. Generative adversarial text to image synthesis. In Proceedings of the International Conference on Machine Learning. PMLR, New York, NY, USA, 19–24 June 2016; pp. 1060–1069. [Google Scholar]
  10. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
  11. Brock, A.; Donahue, J.; Simonyan, K. Large Scale GAN Training for High Fidelity Natural Image Synthesis. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  12. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the ICML, Online, 18–24 July 2021. [Google Scholar]
  13. Crowson, K.; Biderman, S.; Kornis, D.; Stander, D.; Hallahan, E.; Castricato, L.; Raff, E. Vqgan-clip: Open domain image generation and editing with natural language guidance. arXiv 2022, arXiv:2204.08583. [Google Scholar]
  14. Sohl-Dickstein, J.; Weiss, E.; Maheswaranathan, N.; Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the International Conference on Machine Learning. PMLR, Lille, France, 6–11 July 2015; pp. 2256–2265. [Google Scholar]
  15. Ramesh, A.; Pavlov, M.; Goh, G.; Gray, S.; Voss, C.; Radford, A.; Chen, M.; Sutskever, I. Zero-Shot Text-to-Image Generation. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; Proceedings of Machine Learning Research. Volume 139, pp. 8821–8831. [Google Scholar]
  16. Midjourney AI. 2022. Available online: https://www.midjourneyfree.ai/ (accessed on 2 July 2024).
  17. Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical text-conditional image generation with clip latents. arXiv 2022, arXiv:2204.06125. [Google Scholar]
  18. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar]
  19. Saharia, C.; Chan, W.; Saxena, S.; Li, L.; Whang, J.; Denton, E.L.; Ghasemipour, K.; Gontijo Lopes, R.; Karagol Ayan, B.; Salimans, T.; et al. Photorealistic text-to-image diffusion models with deep language understanding. Adv. Neural Inf. Process. Syst. 2022, 35, 36479–36494. [Google Scholar]
  20. Team, G.; Anil, R.; Borgeaud, S.; Alayrac, J.B.; Yu, J.; Soricut, R.; Schalkwyk, J.; Dai, A.M.; Hauth, A.; Millican, K.; et al. Gemini: A family of highly capable multimodal models. arXiv 2023, arXiv:2312.11805. [Google Scholar]
  21. Anthropic. Introducing the Next Generation of Claude; Anthropic: San Francisco, CA, USA, 2024. [Google Scholar]
  22. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. Gpt-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
  23. Feng, W.; Zhu, W.; Fu, T.J.; Jampani, V.; Akula, A.; He, X.; Basu, S.; Wang, X.E.; Wang, W.Y. Layoutgpt: Compositional visual planning and generation with large language models. Adv. Neural Inf. Process. Syst. 2024, 36. [Google Scholar]
  24. Li, B.; Lin, Z.; Pathak, D.; Li, J.; Fei, Y.; Wu, K.; Xia, X.; Zhang, P.; Neubig, G.; Ramanan, D. Evaluating and Improving Compositional Text-to-Visual Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 5290–5301. [Google Scholar]
  25. Du, Y.; Durkan, C.; Strudel, R.; Tenenbaum, J.B.; Dieleman, S.; Fergus, R.; Sohl-Dickstein, J.; Doucet, A.; Grathwohl, W.S. Reduce, reuse, recycle: Compositional generation with energy-based diffusion models and mcmc. In Proceedings of the International Conference on Machine Learning. PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 8489–8510. [Google Scholar]
  26. Wu, X.; Sun, K.; Zhu, F.; Zhao, R.; Li, H. Human preference score: Better aligning text-to-image models with human preference. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 2096–2105. [Google Scholar]
  27. Liu, N.; Du, Y.; Li, S.; Tenenbaum, J.B.; Torralba, A. Unsupervised compositional concepts discovery with text-to-image generative models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 2085–2095. [Google Scholar]
  28. Feng, W.; He, X.; Fu, T.J.; Jampani, V.; Akula, A.R.; Narayana, P.; Basu, S.; Wang, X.E.; Wang, W.Y. Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  29. Huang, K.; Sun, K.; Xie, E.; Li, Z.; Liu, X. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. Adv. Neural Inf. Process. Syst. 2023, 36, 78723–78747. [Google Scholar]
  30. Yuksekgonul, M.; Bianchi, F.; Kalluri, P.; Jurafsky, D.; Zou, J. When and why Vision-Language Models behave like Bags-of-Words, and what to do about it? In Proceedings of the International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  31. Rassin, R.; Hirsch, E.; Glickman, D.; Ravfogel, S.; Goldberg, Y.; Chechik, G. Linguistic binding in diffusion models: Enhancing attribute correspondence through attention map alignment. Adv. Neural Inf. Process. Syst. 2024, 36. [Google Scholar]
  32. Mehrabi, N.; Goyal, P.; Verma, A.; Dhamala, J.; Kumar, V.; Hu, Q.; Chang, K.W.; Zemel, R.; Galstyan, A.; Gupta, R. Resolving ambiguities in text-to-image generative models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; pp. 14367–14388. [Google Scholar]
  33. Yang, G.; Zhang, J.; Zhang, Y.; Wu, B.; Yang, Y. Probabilistic modeling of semantic ambiguity for scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12527–12536. [Google Scholar]
  34. Rassin, R.; Ravfogel, S.; Goldberg, Y. DALLE-2 is Seeing Double: Flaws in Word-to-Concept Mapping in Text2Image Models. In Proceedings of the Fifth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, Abu Dhabi, United Arab Emirates, 8 December 2022; pp. 335–345. [Google Scholar]
  35. Guo, Y.; Shao, H.; Liu, C.; Xu, K.; Yuan, X. PrompTHis: Visualizing the Process and Influence of Prompt Editing during Text-to-Image Creation. IEEE Trans. Vis. Comput. Graph. 2024, 1–12. [Google Scholar] [CrossRef] [PubMed]
  36. Meskó, B. Prompt engineering as an important emerging skill for medical professionals: Tutorial. J. Med Internet Res. 2023, 25, e50638. [Google Scholar] [CrossRef]
  37. Giray, L. Prompt engineering with ChatGPT: A guide for academic writers. Ann. Biomed. Eng. 2023, 51, 2629–2633. [Google Scholar] [CrossRef]
  38. Spurlock, K.D.; Acun, C.; Saka, E.; Nasraoui, O. ChatGPT for Conversational Recommendation: Refining Recommendations by Reprompting with Feedback. arXiv 2024, arXiv:2401.03605. [Google Scholar]
  39. Feng, Y.; Wang, X.; Wong, K.K.; Wang, S.; Lu, Y.; Zhu, M.; Wang, B.; Chen, W. Promptmagician: Interactive prompt engineering for text-to-image creation. IEEE Trans. Vis. Comput. Graph. 2023, 30, 295–305. [Google Scholar] [CrossRef]
  40. Liu, V.; Chilton, L.B. Design Guidelines for Prompt Engineering Text-to-Image Generative Models. In Proceedings of the CHI Conference on Human Factors in Computing Systems, New Orleans, LA, USA, 29 April–5 May 2022; pp. 1–23. [Google Scholar]
  41. Zhou, K.; Yang, J.; Loy, C.C.; Liu, Z. Learning to prompt for vision-language models. Int. J. Comput. Vis. 2022, 130, 2337–2348. [Google Scholar] [CrossRef]
  42. Mo, W.; Zhang, T.; Bai, Y.; Su, B.; Wen, J.R.; Yang, Q. Dynamic Prompt Optimizing for Text-to-Image Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 26627–26636. [Google Scholar]
  43. MacDonald, M.C. The interaction of lexical and syntactic ambiguity. J. Mem. Lang. 1993, 32, 692–715. [Google Scholar] [CrossRef]
  44. Fortuny, J.; Payrató, L. Ambiguity in Linguistics 1. Stud. Linguist. 2024, 78, 1–7. [Google Scholar] [CrossRef]
  45. Kreishan, L.; Abbadi, R.; Al-Saidat, E. Disambiguating ambiguity: A comparative analysis of lexical decision-making in native and non-native English speakers. Int. J. Engl. Lang. Lit. Stud. 2024, 13, 139–156. [Google Scholar] [CrossRef]
  46. Liu, A.; Wu, Z.; Michael, J.; Suhr, A.; West, P.; Koller, A.; Swayamdipta, S.; Smith, N.A.; Choi, Y. We’re Afraid Language Models Aren’t Modeling Ambiguity. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023. [Google Scholar]
  47. Baggio, G.; Van Lambalgen, M.; Hagoort, P. Language, linguistics and cognition. Handb. Philos. Sci. 2012, 14, 325–355. [Google Scholar]
  48. Betker, J.; Goh, G.; Jing, L.; Brooks, T.; Wang, J.; Li, L.; Ouyang, L.; Zhuang, J.; Lee, J.; Guo, Y.; et al. Improving image generation with better captions. Comput. Sci. 2023, 2, 8. [Google Scholar]
Figure 1. Example of a scenario where a prompt with lexical ambiguity results in a correct interpretation by the text-to-image model, yet it still mismatches the user’s intention/expectation.
Figure 2. Workflow for assessing text-to-image models’ resolution of linguistic ambiguity, including prompt generation, image generation, and evaluation using DALL-E and Stable Diffusion models. Human judgment and model assessments are integrated for analyzing outputs of lexical, syntactic, and figurative ambiguity prompts.
Figure 6. Venn diagram illustrating the interpretations by DALL-E and Stable Diffusion of the 100 images each model generated from prompts with syntactical ambiguity. The diagram displays the number of images where the model depicted the human-like interpretation INT_H, the non-human-like interpretation INT_N, and both interpretations INT_M.
Figure 7. Heat maps illustrating the distribution of the interpretations for each syntactical ambiguity prompt in the 10 images generated by each of DALL-E and Stable Diffusion (SD).
Figure 8. Venn diagram illustrating the interpretations by DALL-E and Stable Diffusion of prompts with lexical ambiguity. The diagram displays the number of images where the model depicted the human-like interpretation INT_H, the non-human-like interpretation INT_N, and both interpretations INT_M.
Figure 9. Heat maps illustrating the distribution of the interpretations for each lexical ambiguity prompt in the 10 images generated by each of DALL-E and Stable Diffusion (SD).
Figure 10. Venn diagram illustrating the interpretations by DALL-E and Stable Diffusion of prompts with figurative ambiguity. The diagram displays the number of images where the model depicted the human-like interpretation INT_H, the non-human-like interpretation INT_N, and both interpretations INT_M.
Figure 11. Heat maps illustrating the distribution of the interpretations for each figurative ambiguity prompt in the 10 images generated by each of DALL-E and Stable Diffusion (SD).
Figure 12. Pie charts comparing the distribution of interpretation of prompts with syntactical, lexical, and figurative ambiguity across Stable Diffusion and DALL-E. The charts illustrate the proportions of interpretations matching human selection (INT_H), interpretations not matching human selection (INT_N), and mixed interpretations for each type of ambiguity, highlighting differences in how each model handles various forms of linguistic ambiguity.
Figure 13. Sample of images generated with two clarified prompts based on the prompt with syntactical ambiguity “A man walking next to a girl holding an umbrella” using Stable Diffusion (SD) in the top row and DALL-E in the bottom row.
Figure 14. Sample of images generated with two clarified prompts based on the prompt with lexical ambiguity “Glasses on the table” using Stable Diffusion in the top row and DALL-E in the bottom row.
Figure 15. Sample of images generated with the clarified prompt “He is sad” derived from the prompt with figurative ambiguity: “He is feeling blue”.
Table 1. The set of prompts with syntactical ambiguity used in this study, along with their possible interpretations INT_A and INT_B. INT_H refers to the interpretation preferred by the majority of votes from human evaluators for each prompt.

| Prompt | INT_A | INT_B | INT_H |
|---|---|---|---|
| p_s^1: A boy wearing a shirt with a dog. | The shirt has a dog picture. | The dog is next to the boy. | INT_A (80%) |
| p_s^2: A man walking next to a girl holding an umbrella. | The man held the umbrella. | The girl held the umbrella. | INT_A (70%) |
| p_s^3: He painted on the table. | He painted directly on the table. | He painted on a medium on the table. | INT_B (70%) |
| p_s^4: I saw the painting of the tree with the magnifying glass. | I saw the tree next to the glasses. | I saw through the glasses. | INT_B (80%) |
| p_s^5: The cat on the mat with the red rose. | The mat had a rose design. | The rose is next to the cat. | INT_A (100%) |
| p_s^6: The chicken is ready to eat. | The cooked chicken is ready. | The alive chicken is ready for its meal. | INT_B (70%) |
| p_s^7: The girl drew on the notebook with black lines. | The girl used black lines. | The notebook had black lines. | INT_A (60%) |
| p_s^8: The kite flew over the field with a rainbow. | The kite has a rainbow design. | The field had a rainbow over it. | INT_B (80%) |
| p_s^9: The man walked the street with the torch. | The street had a torch. | The man held a torch. | INT_B (90%) |
| p_s^10: The police arrested the man with a gun. | The man had the gun. | The police had the gun. | INT_B (60%) |
Table 2. The set of prompts with lexical ambiguity used in this study, along with their possible interpretations INT_A and INT_B. INT_H refers to the interpretation preferred by the majority of votes from human evaluators for each prompt.

| Prompt | INT_A | INT_B | INT_H |
|---|---|---|---|
| P_l^1: bat near the field | A kind of animal | A type of athletic tool | INT_A (90%) |
| P_l^2: A bow displayed in the market | A type of weapon | A knot made by twisting ribbons | INT_A (70%) |
| P_l^3: A crane next to the house | A kind of bird | A mechanical machine | INT_B (60%) |
| P_l^4: A key on the desk | Shaped metal tool | A button on a board | INT_A (100%) |
| P_l^5: A mole on a hand | A kind of animal | A blemish on the skin | INT_B (100%) |
| P_l^6: Glasses on the table | Pair of optical lenses | Drinking instrument | INT_B (70%) |
| P_l^7: The bank next to the park | Financial institute | River-side land | INT_A (80%) |
| P_l^8: The man carried the light bag | A source of illumination | Adjective (weight) | INT_B (80%) |
| P_l^9: The painting with the dates | A type of fruit | Calendar day | INT_B (80%) |
| P_l^10: The woman saw the big wave | A long body of water | A hand gesture | INT_A (100%) |
Table 3. The set of prompts with figurative ambiguity, along with their figurative interpretations INT_F.

| Prompt | Figurative Interpretation INT_F |
|---|---|
| P_f^1: He hit the road in the morning. | He left or started a journey in the morning. |
| P_f^2: He is feeling blue. | He is feeling sad or down. |
| P_f^3: She is a ray of sunshine. | She brings happiness and positivity. |
| P_f^4: Stars on the red carpet. | Famous people are present at an event. |
| P_f^5: The city that never sleeps. | A city that is always lively and active. |
| P_f^6: The kid is under the weather. | The child is feeling unwell or sick. |
| P_f^7: The market at the heart of the city. | The market is centrally located or very important to the city. |
| P_f^8: The night sky was a blanket of stars. | The sky was filled with stars, appearing like a covering. |
| P_f^9: The student is burning the midnight oil. | The student is studying or working late into the night. |
| P_f^10: The two sisters looked like two peas in a pod. | The two sisters look very similar or are very close. |
Table 4. Examples illustrating how the interpretations of prompts (INT_A and INT_B) are mapped in this study. The interpretation most voted by humans as the likely reading of the prompt is labeled INT_H, and the interpretation with fewer votes is labeled INT_N. In the first example, INT_A was selected by human evaluators as the likely interpretation of the prompt; hence, throughout the analysis, it is referred to as INT_H (human selection) for the corresponding prompt, while INT_B is referred to as INT_N (non-human selection).

| Prompt | INT_A | INT_B | INT_H | INT_N |
|---|---|---|---|---|
| He painted on the table. | He painted directly on the table. | He painted on a medium on the table. | INT_A (70%) | INT_B (30%) |
| The painting with the dates | A type of fruit | Calendar day | INT_B (80%) | INT_A (20%) |
Table 5. Interpretations of prompts with syntactical ambiguity in images generated by DALL-E and Stable Diffusion. INT_H refers to the case where the model interprets the ambiguous prompt similarly to human resolution in an image, INT_N to the case where the model interprets the prompt differently from the human interpretation, and INT_M to the case where the generated image exhibits both INT_H and INT_N.

| Model | INT_H | INT_N | INT_M |
|---|---|---|---|
| DALL-E | 32% | 23% | 45% |
| Stable Diffusion | 20% | 34% | 46% |
Table 6. Lexical ambiguity interpretations in images generated by DALL-E and Stable Diffusion. INT_H refers to the case where the model interprets the ambiguous prompt similarly to human resolution, INT_N to the case where the model interprets the prompt differently from the human interpretation, and INT_M to the case where the generated image exhibits both INT_H and INT_N.

| Model | INT_H | INT_N | INT_M |
|---|---|---|---|
| DALL-E | 53% | 20% | 27% |
| Stable Diffusion | 41% | 38% | 21% |
Table 7. Figurative ambiguity interpretations in images generated by DALL-E and Stable Diffusion. INT_H refers to the case where the model interprets the ambiguous prompt similarly to human resolution, INT_N to the case where the model interprets the prompt differently from the human interpretation, and INT_M to the case where the generated image exhibits both INT_H and INT_N.

| Model | INT_H | INT_N | INT_M |
|---|---|---|---|
| DALL-E | 34% | 0% | 56% |
| Stable Diffusion | 63% | 0% | 27% |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

Elsharif, W.; Alzubaidi, M.; She, J.; Agus, M. Visualizing Ambiguity: Analyzing Linguistic Ambiguity Resolution in Text-to-Image Models. Computers 2025, 14, 19. https://doi.org/10.3390/computers14010019
