Image Generation from Contextually-Contradictory Prompts

Anonymous authors

Text-to-image diffusion models often fail on prompts that combine concepts misaligned with their learned associations, generating semantically inaccurate images. We call this failure mode Contextual Contradiction: one concept implicitly contradicts another due to entangled priors. For example, “a bear performing a handstand” contradicts the model’s priors about typical bear poses, and “a dragon blowing water” conflicts with its learned association with fire.

To address this, we introduce Stage-Aware Prompting (SAP), a method for resolving contextual contradictions in text-to-image generation by aligning prompt information with the semantic stages of the denoising process. SAP decomposes such prompts into a sequence of proxy prompts, each tailored to a specific timestep range. These stage-aware prompts guide the generation process from coarse layout to fine details. By leveraging a large language model to identify and temporally separate contradictory concepts, SAP enables faithful and fine-grained image generation from prompts that would otherwise fail to align with the intended semantics.
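As a concrete illustration, a decomposed prompt can be represented as a list of timestep ranges paired with proxy prompts. The specific prompts and boundaries below are hypothetical, shown only to make the idea of a stage-aware schedule tangible; they are not the output of our LLM pipeline:

```python
from dataclasses import dataclass

@dataclass
class ProxyPrompt:
    """One stage of a stage-aware prompt schedule."""
    t_start: int  # noisiest timestep (inclusive) at which this prompt becomes active
    t_end: int    # least-noisy timestep (inclusive); diffusion timesteps run high -> low
    text: str     # prompt used to condition the model within [t_end, t_start]

# Hypothetical decomposition of "a bear performing a handstand":
# early steps establish an upside-down body layout using a pose the model
# knows well, and later steps reintroduce the bear and the full target prompt.
schedule = [
    ProxyPrompt(t_start=1000, t_end=701, text="a gymnast performing a handstand in a park"),
    ProxyPrompt(t_start=700,  t_end=401, text="a bear balancing upside down on its front paws"),
    ProxyPrompt(t_start=400,  t_end=0,   text="a bear performing a handstand"),
]
```

Each stage introduces only the information the model can express at that point: coarse pose first, the contradictory concept pairing last.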

Abstract

Text-to-image diffusion models excel at generating high-quality, diverse images from natural language prompts. However, they often fail to produce semantically accurate results when the prompt contains concept combinations that contradict their learned priors. We define this failure mode as contextual contradiction, where one concept implicitly negates another due to entangled associations learned during training. To address this, we propose a stage-aware prompt decomposition framework that guides the denoising process using a sequence of proxy prompts. Each proxy prompt is constructed to match the semantic content expected to emerge at a specific stage of denoising, while ensuring contextual coherence. To construct these proxy prompts, we leverage a large language model (LLM) to analyze the target prompt, identify contradictions, and generate alternative expressions that preserve the original intent while resolving contextual conflicts. By aligning prompt information with the denoising progression, our method enables fine-grained semantic control and accurate image generation in the presence of contextual contradictions. Experiments across a variety of challenging prompts show substantial improvements in alignment with the textual prompt.

Coarse-to-fine denoising

Text-to-image diffusion models generate images by progressively refining noise over a series of denoising steps. This process inherently follows a coarse-to-fine structure: early steps establish the broad layout and spatial composition, while later steps gradually add fine details. This generative structure gives rise to two key observations:

  • Irreversibility of details: At each stage of denoising, the model commits to a certain level of structural detail. Once a given attribute (e.g., layout, coarse shape) has been established, it becomes effectively fixed and cannot be revised in later steps, even if it misaligns with the prompt.
  • Flexibility in high-frequency details: In early stages, high-frequency details have not yet emerged and are therefore not yet constrained by the prompt. This allows greater flexibility in guiding the model at these stages without affecting the fine details of the final image.

Here, we illustrate this behaviour by observing the model’s x0 prediction at various denoising steps; a short sketch of how such predictions are computed follows the row descriptions below.

  • Top row: Using the full prompt from the start causes early mistakes (like night scenes) that cannot be corrected later.
  • Middle row: A contradiction-free proxy prompt avoids these early conflicts, and switching to the full prompt mid-way preserves the layout while correcting the details.
  • Bottom row: A poorly chosen proxy misguides both the structure and the lighting.
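The intermediate x0 estimates visualized above follow directly from the standard noise-prediction parameterization of DDPM-style models: given the predicted noise, the current clean-image estimate can be recovered in closed form. A minimal sketch, where the model call and the alpha-bar schedule are placeholders rather than our implementation:

```python
import torch

def predict_x0(x_t: torch.Tensor, eps_pred: torch.Tensor, alpha_bar_t: float) -> torch.Tensor:
    """Recover the model's current estimate of the clean image x0.

    Standard DDPM parameterization:
        x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps,
    solved for x0 using the predicted noise eps_pred.
    """
    return (x_t - (1.0 - alpha_bar_t) ** 0.5 * eps_pred) / (alpha_bar_t ** 0.5)

# Inside a denoising loop (model and alpha_bar are placeholders):
#   eps_pred = model(x_t, t, prompt_embeddings)
#   x0_estimate = predict_x0(x_t, eps_pred, alpha_bar[t])
# Early-step estimates show only coarse layout; late-step estimates add fine detail.
```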

These insights enable our method to steer the generation process more effectively by conditioning the model with prompt information that aligns with what the model is capable of expressing at each stage.

How does it work?

Our method guides the denoising process using time-dependent proxy prompts that adapt to the model’s internal progression from coarse to fine. A large language model (LLM) analyzes the input prompt and decomposes it into a sequence of proxy prompts, each tailored to a specific stage of the generation process. These intermediate prompts are injected into the diffusion model at predefined timestep intervals. By aligning the prompt information with the model’s evolving visual structure, this stage-aware conditioning produces an image that is semantically coherent and contextually accurate.
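A minimal sketch of this stage-aware conditioning, assuming the schedule is given as (switch timestep, prompt) pairs; the prompts, timestep values, and the encode_prompt/denoise_step calls below are placeholders standing in for the underlying diffusion model rather than our exact implementation:

```python
def active_prompt(t: int, schedule: list[tuple[int, str]]) -> str:
    """Return the proxy prompt that conditions the model at timestep t.

    `schedule` lists (start_timestep, prompt) pairs ordered from the noisiest
    stage down. Diffusion timesteps decrease from ~1000 toward 0, so the last
    entry whose start_timestep is still >= t is the active one.
    """
    chosen = schedule[0][1]
    for start_t, prompt in schedule:
        if t <= start_t:
            chosen = prompt
    return chosen

# Hypothetical schedule for "a dragon blowing water": early steps use a
# contradiction-free proxy, later steps switch to the full target prompt.
schedule = [
    (1000, "a dragon with its head tilted upward and mouth wide open"),
    (650,  "a dragon blowing a powerful stream of water from its mouth"),
]

print(active_prompt(900, schedule))  # coarse stage -> proxy prompt
print(active_prompt(300, schedule))  # detail stage -> full target prompt

# Sketch of the denoising loop (encode_prompt and denoise_step are placeholders):
#   x_t = initial_noise()
#   for t in timesteps:                      # e.g. 1000, 999, ..., 1
#       embeds = encode_prompt(active_prompt(t, schedule))
#       x_t = denoise_step(x_t, t, embeds)
```

Only the prompt conditioning changes at a switch point; the latents carry over from the previous stage, so structure committed early is preserved while later stages refine details under the new prompt.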

BibTeX