Failures in DMs and Lina's improvement. (a) Generated with prompt: "A red ball on a small mirror." Baseline models generate reflections extending beyond the mirror surface or produce texture errors. (b) Generated with Winoground prompt: "a person is close to the water and in the sand." Baseline models incorrectly place the person in the water. By calibrating the sampling dynamics without altering the pre-trained weights, Lina successfully aligns the generation with the intended causal graph while preserving the original textures.
Prompt: "a bird eats a snake". Seed = 0
Prompt: "a snake eats a bird". Seed = 0
Prompt: "there is more dirt than empty space in the jar". Seed = 0
Prompt: "there is more dirt than empty space in the jar". Seed = 1
(Drag slider to compare) Lina corrects visual priors to satisfy quantitative constraints. Using the Winoground prompt "there is more dirt than empty space in the jar", we observe a consistent failure in the baseline model across different seeds. Influenced by training priors, the baseline incorrectly generates excessive empty space. By adaptively intervening in the sampling dynamics, Lina suppresses this bias, successfully filling the jar to align with the text description.
Lina's Improvement on Video Generation. The prompt is from Winoground: “a person is close to the water and in the sand”. (Left) Baseline (Wan2.2-T2V-A14B): The model fails to capture the precise spatial preposition, incorrectly placing the person in the water throughout the video. (Right) Lina: Our method successfully guides the generation of a coherent temporal sequence. The person begins close to the water, moves towards the sand, and subsequently interacts with the sand (i.e., digging/entering), satisfying the complex causal and spatial requirements of the instruction.
Diffusion models (DMs) have achieved remarkable success in image and video generation. However, they still struggle with (1) physical alignment and (2) out-of-distribution (OOD) instruction following. We argue that these issues stem from the models' failure to learn causal directions and to disentangle causal factors for novel recombination.
We introduce the Causal Scene Graph (CSG) and the Physical Alignment Probe (PAP) dataset to enable diagnostic interventions. This analysis yields three key insights. First, DMs struggle with multi-hop reasoning for elements not explicitly determined in the prompt. Second, the prompt embedding contains disentangled representations for texture and physics. Third, visual causal structure is disproportionately established during the initial, computationally limited denoising steps.
Based on these findings, we introduce Lina (Learning INterventions Adaptively), a novel framework that learns to predict prompt-specific interventions and employs (1) targeted guidance in the prompt and visual latent spaces and (2) a reallocated, causality-aware denoising schedule. Our approach enforces both physical alignment and OOD instruction following in image and video DMs, achieving state-of-the-art performance on challenging causal generation tasks and the Winoground dataset.
To systematically diagnose and repair physical misalignment, we introduce the Causal Scene Graph (CSG). Inspired by the Causal Graphical Model (CGM) and the Scene Graph (SG), the CSG models the prompt X and the generated image Y as a directed graph. Crucially, it establishes a causal hierarchy among semantic units: the prompt X determines direct elements YD (content stated explicitly in the prompt), which in turn cause indirect elements YI (physically entailed content such as shadows and reflections).
The Limitation of the Flattened Paradigm: In the physical world, causality follows a clear direction (X → YD → YI). However, conventional Diffusion Models operate on a flattened, non-hierarchical paradigm. Consequently, the model learns texture correlations rather than the underlying causal mechanism, leading to failures in multi-hop reasoning (e.g., generating shadows without objects).
We formalize the generation task as a graph alignment problem. The user prompt X implies a Ground-Truth Target CSG (GX*). The diffusion model's sampling mapping F produces an image with a Generated CSG (GXgen). Our framework, Lina, aims to calibrate this mapping to ensure GXgen ≈ GX*, strictly enforcing physical alignment and instruction following.
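The alignment objective can be pictured as edge agreement between two small directed graphs. The sketch below is purely illustrative — representing a CSG as a set of directed edges and the names `csg_alignment` and `target_edges` are our assumptions, not the paper's formalism:

```python
# Hypothetical sketch: a Causal Scene Graph (CSG) as a set of directed edges,
# and graph alignment as the fraction of target causal edges that the
# generated image realizes. Not the paper's actual representation.

def csg_alignment(target_edges, generated_edges):
    """Fraction of target CSG edges present in the generated CSG."""
    target = set(target_edges)
    if not target:
        return 1.0
    return len(target & set(generated_edges)) / len(target)

# Target CSG for "a red ball on a small mirror":
# prompt -> direct elements (ball, mirror) -> indirect element (reflection).
target = [("prompt", "ball"), ("prompt", "mirror"), ("mirror", "reflection")]
generated = [("prompt", "ball"), ("prompt", "mirror")]  # reflection missing

score = csg_alignment(target, generated)  # 2 of 3 target edges realized
```

Perfect alignment (GXgen ≈ GX*) corresponds to a score of 1.0; a missing indirect element such as the reflection lowers it.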
To diagnose the root of DMs' causal failures, we introduce the Causal Scene Graph (CSG) to formally define the problem. Conventionally, DMs learn to denoise elements simultaneously in a flattened paradigm, ignoring the inherent causal direction (e.g., X → YD → YI). We use the Physical Alignment Probe (PAP) dataset to conduct probe experiments, localizing the failure point to indirect elements in multi-hop reasoning and the failure source to guidance miscalibration.
Visualizing Causal Failures and Mechanisms.
(Left) Multi-hop reasoning failures.
(a) Given "a small iron block in a glass full of water," the model incorrectly floats the heavy object. Intervening on the "iron" token degrades it into an ice cube, revealing that the model relies on texture replacement rather than physical laws.
(b) Given "a water drop in the space station," the liquid surface fails to assume the spherical shape required by microgravity.
(c) Even the SOTA video model (Wan-2.2) exhibits similar physical misalignments, failing to capture consistent interactions.
(Right) Causal Mechanisms.
(a) The prompt embedding exhibits disentangled representations: ablating relation tokens removes the spatial layout and physical interactions (e.g., reflection), whereas ablating object tokens destroys semantics but remarkably preserves the causal phenomenon (e.g., the reflection texture remains).
(b) Analysis of the denoising schedule reveals that the visual causal structure is established almost exclusively in the initial, computationally limited steps (steps 26 to 24).
To quantitatively measure the distortion of the mapping from prompt to image, we construct the Physical Alignment Probe (PAP) dataset, a multi-modal corpus based on the CSG for diagnosing physical reasoning and OOD generation. It comprises subsets targeting distinct failure modes, including PAP-Optics and PAP-Density for physical phenomena and PAP-OOD for out-of-distribution compositions.
We define two metrics based on the causal hierarchy: Texture Alignment (success rate for direct elements YD) and Physical Alignment (success rate for indirect elements YI).
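As a concrete reading of these two metrics, the toy sketch below computes them from per-element binary judgments (in the paper, such judgments come from an MLLM evaluator); the record format and function name are our illustrative assumptions:

```python
# Illustrative sketch (not the paper's code): Texture Alignment and Physical
# Alignment as success rates over direct (YD) and indirect (YI) elements.

def alignment_scores(records):
    """records: list of dicts with boolean lists 'direct' and 'indirect',
    one judgment per semantic element in the generated image."""
    direct = [ok for r in records for ok in r["direct"]]
    indirect = [ok for r in records for ok in r["indirect"]]
    texture = sum(direct) / len(direct)      # success rate over YD
    physical = sum(indirect) / len(indirect)  # success rate over YI
    return texture, physical

records = [
    {"direct": [True, True], "indirect": [False]},   # objects ok, shadow missing
    {"direct": [True, False], "indirect": [True]},
]
tex, phys = alignment_scores(records)  # tex = 0.75, phys = 0.50
```

A model can score high on Texture Alignment while failing Physical Alignment, which is exactly the multi-hop failure mode the probe is designed to expose.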
We investigate the representational basis of these failures within the prompt embedding space.
Disentanglement of Relations and Objects: When we intervene on relation tokens (interaction verbs like "hits" or spatial prepositions like "on"), Physical Alignment collapses, yet Texture Alignment remains high. Conversely, intervening on object tokens destroys texture but preserves the underlying causal phenomenon (e.g., a reflection implies an object exists, even if the object is missing). This suggests that physical misalignment is not a knowledge deficit within the network εθ, but a miscalibrated guidance signal.
Timeline of Structure Formation: By analyzing the intermediate latent states, we find that the causal structure is disproportionately established at the very beginning of the process. In 97.8% of successful generations, the correct structure is identifiable within the initial 2-4 iterations (steps 26–24). This implies that causal structure is set during these computationally limited steps, while subsequent steps primarily refine texture.
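This finding motivates spending more of a fixed step budget where structure forms. One simple way to realize such a reallocation, sketched below under our own assumptions (the power-law warp and the `rho` parameter are illustrative choices, not the paper's exact schedule), is to warp a uniform timestep grid so that steps cluster near the start of sampling:

```python
import numpy as np

def reallocated_schedule(num_steps, T=1000, rho=3.0):
    """Map a uniform grid u in [0, 1] to timesteps t = T * (1 - u)^(1/rho).
    With rho > 1, consecutive steps are packed densely near t = T (early
    denoising, where causal structure forms) and spaced sparsely near t = 0
    (late denoising, which mostly refines texture). Illustrative only."""
    u = np.linspace(0.0, 1.0, num_steps)
    return T * (1.0 - u) ** (1.0 / rho)

t = reallocated_schedule(10)
gaps = t[:-1] - t[1:]
# Early gaps are smaller than late gaps: more budget on structure formation.
```

With `rho = 1` the schedule reduces to the standard uniform spacing, making the reallocation a strict generalization of the baseline.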
Overview of the Lina framework, which operates in two phases.
(Left) Phase 1: (Offline) AIM Training. We identify baseline failures ("hard cases") from our PAP dataset. An MLLM evaluator performs an automated comparative search to find optimal intervention strengths (γ1*, γ2*) for these prompts. This creates a dataset Dhard, which is used to train the Adaptive Intervention Module (AIM) to predict these strengths directly from a prompt.
(Right) Phase 2: (Online) Lina-Guided Generation. For a new prompt Xnew, the pre-trained AIM predicts the intervention strengths (γ̂1, γ̂2). Lina consists of three components: (1) a Token-level Intervention (γ1) enhances relation tokens, (2) a Latent-level Intervention (γ2) introduces a contrastive guidance term, and (3) Computation Reallocation concentrates the inference budget on the initial Structure Formation phase to prioritize the establishment of causal structure.
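The first two online components can be sketched at a single denoising step as follows. This is an illustrative reconstruction, not the released implementation: the exact forms of the token-level scaling and the contrastive latent guidance are our assumptions.

```python
import numpy as np

def token_intervention(prompt_emb, relation_mask, gamma1):
    """Amplify relation tokens (verbs, spatial prepositions) by gamma1,
    leaving object tokens untouched. Assumed form of the gamma1 guidance."""
    scale = np.where(relation_mask[:, None], gamma1, 1.0)
    return prompt_emb * scale

def latent_intervention(eps_cond, eps_uncond, cfg_scale, gamma2):
    """Classifier-free guidance with an extra contrastive push of strength
    gamma2 along the conditional direction. Assumed form of the gamma2 term."""
    direction = eps_cond - eps_uncond
    return eps_uncond + (cfg_scale + gamma2) * direction

# Toy shapes: 5 tokens x 4 dims; token 2 is the relation ("close to").
emb = np.ones((5, 4))
mask = np.array([False, False, True, False, False])
emb_hat = token_intervention(emb, mask, gamma1=2.0)

eps_c, eps_u = np.full((4,), 1.0), np.zeros((4,))
eps_hat = latent_intervention(eps_c, eps_u, cfg_scale=5.0, gamma2=1.0)
```

With `gamma2 = 0` the latent intervention reduces to plain classifier-free guidance, so both interventions default to the baseline when the AIM predicts zero strength.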
To investigate the internal mechanism and effectiveness of Lina, we conduct grid searches over the intervention strengths. We analyze how the token-level intervention (γ1) and latent-level intervention (γ2) interact to guide the generation process.
Effectiveness of Lina's Intervention. We perform a grid search for the prompt "a person is close to the water and in the sand". The effects are largely disentangled and monotonic: γ2 (horizontal axis) primarily controls texture intensity (balancing water vs. sand features), while γ1 (vertical axis) calibrates the spatial layout and physical interactions. These findings validate the efficacy of our coordinate descent strategy, as the two parameters can be optimized with minimal interference.
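Because the two axes interfere so little, the search reduces to alternating one-dimensional sweeps. The sketch below illustrates that coordinate-descent idea on a toy separable objective; in the paper the score is supplied by the MLLM evaluator, and the grids and round count here are our assumptions.

```python
# Hedged sketch of the coordinate-descent search over intervention strengths:
# alternate 1D sweeps over gamma1 and gamma2, holding the other fixed.
# `score` stands in for the MLLM evaluator's alignment judgment.

def coordinate_descent(score, g1_grid, g2_grid, rounds=2):
    g1, g2 = g1_grid[0], g2_grid[0]
    for _ in range(rounds):
        g1 = max(g1_grid, key=lambda v: score(v, g2))  # sweep gamma1
        g2 = max(g2_grid, key=lambda v: score(g1, v))  # sweep gamma2
    return g1, g2

# Toy separable score peaking at (2, 1), mimicking the observed sweet spot.
def score(g1, g2):
    return -((g1 - 2) ** 2 + (g2 - 1) ** 2)

best = coordinate_descent(score, [0, 1, 2, 3], [-1, 0, 1, 2])  # (2, 1)
```

When the objective is (near-)separable, as the grids above suggest, a single round already lands on the joint optimum, which keeps the number of expensive evaluator calls linear in the grid size rather than quadratic.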
We further challenge the model by using the same initial latent (seed = 0) and identical vocabulary but semantically opposite prompts from the Winoground dataset.
Comparing the intervention grids above (drag slider to view). Despite the reversed semantics ("Bird eats Snake" vs. "Snake eats Bird"), the generated textures exhibit high consistency under the same intervention combinations (same coordinates). Critically, the core semantic action ("eats") is consistently strengthened around the same intervention sweet spot (γ1 ≈ 2, γ2 ≈ 1) in both cases. This reveals that Lina successfully disentangles physical relations from texture, allowing for robust calibration of causal structure regardless of the subject-object order.
Adapting to Biased Priors (drag slider to compare).
We compare the intervention landscapes for two semantically opposite prompts from Winoground using the same seed (0).
(Left) For "there is more dirt than empty space in the jar", SD-3.5-large exhibits a strong biased prior towards empty containers, failing to generate the correct content in the baseline. Consequently, Lina adaptively identifies and applies a stronger intervention (around γ1=0, γ2=-2) to suppress this prior and correctly fill the jar.
(Right) For "there is more empty space than dirt in the jar", the instruction inherently aligns with SD-3.5-large's training prior. While a stronger intervention (around 0, -2) would still satisfy the text, Lina's search strategy prioritizes minimal intervention. Since the baseline region (near 0, 0) is already aligned, Lina avoids unnecessary modification, demonstrating the flexibility to intervene only when required.
We conduct experiments using state-of-the-art diffusion models: SD-3.5-large (SD-3.5), FLUX.1-Krea-dev (FLUX.1), and Wan2.2-T2V-A14B (Wan2.2). We employ Qwen2.5-VL-72B as our MLLM evaluator for automated evaluation.
We evaluate SOTA closed-source editing models on their ability to correct physical violations from the Dhard test set. Baseline models are given the flawed image and the correction prompt.
| Method | Optics (↑) | Density (↑) |
|---|---|---|
| Nano Banana | 2.5% | 25.0% |
| GPT-image | 2.5% | 22.5% |
| Nano Banana + CoT | 45.0% | 70.5% |
| GPT-image + CoT | 67.5% | 82.0% |
| Lina (Ours) | 96.4% | 86.0% |
We evaluate Lina and SOTA baselines on their ability to achieve physical alignment on the PAP-Optics and PAP-Density subsets.
| Method | Optics (↑) | Density (↑) |
|---|---|---|
| SD-3.5 (Baseline) | 80.4% | 54.2% |
| FLUX.1 (Baseline) | 86.9% | 64.3% |
| LMD (SD-3.5) | 80.5% | 81.5% |
| PPAD (SD-3.5) | 91.7% | 76.2% |
| LoRA (SD-3.5) | 95.9% | 91.3% |
| Lina (SD-3.5) | 97.4% | 92.3% |
| Lina (FLUX.1) | 96.8% | 94.0% |
We evaluate Lina and SOTA baselines on their ability to follow OOD instructions from the Winoground (Wino.) subset and our PAP-OOD set.
| Method | Wino. (↑) | PAP-OOD (↑) |
|---|---|---|
| SD-3.5 (Baseline) | 54.4% | 69.3% |
| FLUX.1 (Baseline) | 65.5% | 80.6% |
| LMD (SD-3.5) | 73.1% | 75.2% |
| PPAD (SD-3.5) | 62.6% | 74.1% |
| LoRA (SD-3.5) | 57.3% | 72.0% |
| Lina (SD-3.5) | 79.5% | 84.3% |
| Lina (FLUX.1) | 83.0% | 86.1% |
Analysis of Lina's components on SD-3.5-large. We evaluate on PAP-Optics (Opt.), PAP-Density (Dens.), and Winoground (Wino.).
| Method | Opt. (↑) | Dens. (↑) | Wino. (↑) |
|---|---|---|---|
| w/o γ1 (Token) | 85.1 | 80.5 | 60.2 |
| w/o γ2 (Latent) | 81.3 | 78.0 | 74.9 |
| Fixed γ | 90.5 | 85.2 | 68.4 |
| Std. Schedule | 92.3 | 88.1 | 74.3 |
| Lina (Full) | 97.4 | 92.3 | 79.5 |
In this work, we introduce Lina, a novel framework for adaptive causal intervention in diffusion models. Based on diagnostic insights from our Causal Scene Graph (CSG), Lina learns to perform prompt-specific interventions, significantly enhancing the physical alignment and OOD instruction-following capabilities of DMs. Without relying on MLLMs during inference or requiring DM retraining, Lina demonstrates strong generalizability across both image and video DMs. Our work sets a foundation for developing generative models that can function as robust world simulators, understanding and rendering complex causal structures.
If you find our work useful in your research, please cite:
@article{yu2025lina,
  title={LINA: Learning INterventions Adaptively for Physical Alignment and Generalization in Diffusion Models},
  author={Shu Yu and Chaochao Lu},
  year={2025},
  journal={arXiv preprint arXiv:2512.13290},
  url={https://arxiv.org/abs/2512.13290},
}