Failures in DMs and Lina's improvement. (a) Generated with prompt: "A red ball on a small mirror." Baseline models generate reflections extending beyond the mirror surface or produce texture errors. (b) Generated with Winoground prompt: "a person is close to the water and in the sand." Baseline models incorrectly place the person in the water. By calibrating the sampling dynamics without altering the pre-trained weights, Lina successfully aligns the generation with the intended causal graph while preserving the original textures.
Prompt: "a bird eats a snake". Seed = 0
Prompt: "a snake eats a bird". Seed = 0
Prompt: "there is more dirt than empty space in the jar". Seed = 0
Prompt: "there is more dirt than empty space in the jar". Seed = 1
(Drag slider to compare) Lina corrects visual priors to satisfy quantitative constraints. Using the Winoground prompt "there is more dirt than empty space in the jar", we observe a consistent failure in the baseline model across different seeds. Influenced by training priors, the baseline incorrectly generates excessive empty space. By adaptively intervening in the sampling dynamics, Lina suppresses this bias, successfully filling the jar to align with the text description.
Lina's Improvement on Video Generation. The prompt is from Winoground: “a person is close to the water and in the sand”. (Left) Baseline (Wan2.2-T2V-A14B): The model fails to capture the precise spatial preposition, incorrectly placing the person in the water throughout the video. (Right) Lina: Our method successfully guides the generation of a coherent temporal sequence. The person begins close to the water, moves towards the sand, and subsequently interacts with the sand (i.e., digging/entering), satisfying the complex causal and spatial requirements of the instruction.
Diffusion models (DMs) have achieved remarkable success in image and video generation. However, they still struggle with (1) physical alignment and (2) out-of-distribution (OOD) instruction following. We argue that these issues stem from the models' failure to learn causal directions and to disentangle causal factors for novel recombination.
We introduce the Causal Scene Graph (CSG) and the Physical Alignment Probe (PAP) dataset to enable diagnostic interventions. This analysis yields three key insights. First, DMs struggle with multi-hop reasoning for elements not explicitly determined in the prompt. Second, the prompt embedding contains disentangled representations for texture and physics. Third, visual causal structure is disproportionately established during the initial, computationally limited denoising steps.
Based on these findings, we introduce Lina (Learning INterventions Adaptively), a novel framework that learns to predict prompt-specific interventions and employs (1) targeted guidance in the prompt and visual latent spaces and (2) a reallocated, causality-aware denoising schedule. Our approach enforces both physical alignment and OOD instruction following in image and video DMs, achieving state-of-the-art performance on challenging causal generation tasks and the Winoground dataset.
To systematically diagnose and repair physical misalignment, we introduce the Causal Scene Graph (CSG). Inspired by the Causal Graphical Model (CGM) and the Scene Graph (SG), the CSG models the prompt X and the generated image Y as a directed graph. Crucially, it establishes a causal hierarchy among semantic units: the prompt X determines direct elements YD (content stated explicitly in the prompt), which in turn cause indirect elements YI (physically entailed content such as shadows and reflections).
The Limitation of the Flattened Paradigm: In the physical world, causality follows a clear direction (X → YD → YI). However, conventional Diffusion Models operate on a flattened, non-hierarchical paradigm. Consequently, the model learns texture correlations rather than the underlying causal mechanism, leading to failures in multi-hop reasoning (e.g., generating shadows without objects).
We formalize the generation task as a graph alignment problem. The user prompt X implies a Ground-Truth Target CSG (GX*). The diffusion model's sampling mapping F produces an image with a Generated CSG (GXgen). Our framework, Lina, aims to calibrate this mapping to ensure GXgen ≈ GX*, strictly enforcing physical alignment and instruction following.
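The alignment objective can be pictured as edge agreement between two small directed graphs. The sketch below is purely illustrative — representing a CSG as a set of directed edges and the names `csg_alignment` and `target_edges` are our assumptions, not the paper's formalism:

```python
# Hypothetical sketch: a Causal Scene Graph (CSG) as a set of directed edges,
# and graph alignment as the fraction of target causal edges that the
# generated image realizes. Not the paper's actual representation.

def csg_alignment(target_edges, generated_edges):
    """Fraction of target CSG edges present in the generated CSG."""
    target = set(target_edges)
    if not target:
        return 1.0
    return len(target & set(generated_edges)) / len(target)

# Target CSG for "a red ball on a small mirror":
# prompt -> direct elements (ball, mirror) -> indirect element (reflection).
target = [("prompt", "ball"), ("prompt", "mirror"), ("mirror", "reflection")]
generated = [("prompt", "ball"), ("prompt", "mirror")]  # reflection missing

score = csg_alignment(target, generated)  # 2 of 3 target edges realized
```

Perfect alignment (GXgen ≈ GX*) corresponds to a score of 1.0; a missing indirect element such as the reflection lowers it.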
To diagnose the root of DMs' causal failures, we introduce the Causal Scene Graph (CSG) to formally define the problem. Conventionally, DMs learn to denoise elements simultaneously in a flattened paradigm, ignoring the inherent causal direction (e.g., X → YD → YI). We use the Physical Alignment Probe (PAP) dataset to conduct probe experiments, localizing the failure point to indirect elements in multi-hop reasoning and the failure source to guidance miscalibration.
Visualizing Causal Failures and Mechanisms.
(Left) Multi-hop reasoning failures.
(a) Given "a small iron block in a glass full of water," the model incorrectly floats the heavy object. Intervening on the "iron" token degrades it into an ice cube, revealing that the model relies on texture replacement rather than physical laws.
(b) Given "a water drop in the space station," the liquid surface fails to assume the spherical shape required by microgravity.
(c) Even the SOTA video model (Wan-2.2) exhibits similar physical misalignments, failing to capture consistent interactions.
(Right) Causal Mechanisms.
(a) The prompt embedding exhibits disentangled representations: ablating relation tokens removes the spatial layout and physical interactions (e.g., reflection), whereas ablating object tokens destroys semantics but remarkably preserves the causal phenomenon (e.g., the reflection texture remains).
(b) Analysis of the denoising schedule reveals that the visual causal structure is established almost exclusively in the initial, computationally limited steps (steps 26 to 24).
To quantitatively measure the distortion of the mapping from prompt to image, we construct the Physical Alignment Probe (PAP) dataset, a multi-modal corpus based on the CSG for diagnosing physical reasoning and OOD generation. It comprises subsets targeting distinct failure modes, including PAP-Optics and PAP-Density for physical phenomena and PAP-OOD for out-of-distribution compositions.
We define two metrics based on the causal hierarchy: Texture Alignment (success rate for direct elements YD) and Physical Alignment (success rate for indirect elements YI).
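As a concrete reading of these two metrics, the toy sketch below computes them from per-element binary judgments (in the paper, such judgments come from an MLLM evaluator); the record format and function name are our illustrative assumptions:

```python
# Illustrative sketch (not the paper's code): Texture Alignment and Physical
# Alignment as success rates over direct (YD) and indirect (YI) elements.

def alignment_scores(records):
    """records: list of dicts with boolean lists 'direct' and 'indirect',
    one judgment per semantic element in the generated image."""
    direct = [ok for r in records for ok in r["direct"]]
    indirect = [ok for r in records for ok in r["indirect"]]
    texture = sum(direct) / len(direct)      # success rate over YD
    physical = sum(indirect) / len(indirect)  # success rate over YI
    return texture, physical

records = [
    {"direct": [True, True], "indirect": [False]},   # objects ok, shadow missing
    {"direct": [True, False], "indirect": [True]},
]
tex, phys = alignment_scores(records)  # tex = 0.75, phys = 0.50
```

A model can score high on Texture Alignment while failing Physical Alignment, which is exactly the multi-hop failure mode the probe is designed to expose.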
We investigate the representational basis of these failures within the prompt embedding space.
Disentanglement of Relations and Objects: When we intervene on relation tokens (interaction verbs like "hits" or spatial prepositions like "on"), Physical Alignment collapses, yet Texture Alignment remains high. Conversely, intervening on object tokens destroys texture but preserves the underlying causal phenomenon (e.g., a reflection implies an object exists, even if the object is missing). This suggests that physical misalignment is not a knowledge deficit within the network εθ, but a miscalibrated guidance signal.
Timeline of Structure Formation: By analyzing the intermediate latent states, we find that the causal structure is disproportionately established at the very beginning of the process. In 97.8% of successful generations, the correct structure is identifiable within the initial 2-4 iterations (steps 26–24). This implies that causal structure is set during these computationally limited steps, while subsequent steps primarily refine texture.
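This finding motivates spending more of a fixed step budget where structure forms. One simple way to realize such a reallocation, sketched below under our own assumptions (the power-law warp and the `rho` parameter are illustrative choices, not the paper's exact schedule), is to warp a uniform timestep grid so that steps cluster near the start of sampling:

```python
import numpy as np

def reallocated_schedule(num_steps, T=1000, rho=3.0):
    """Map a uniform grid u in [0, 1] to timesteps t = T * (1 - u)^(1/rho).
    With rho > 1, consecutive steps are packed densely near t = T (early
    denoising, where causal structure forms) and spaced sparsely near t = 0
    (late denoising, which mostly refines texture). Illustrative only."""
    u = np.linspace(0.0, 1.0, num_steps)
    return T * (1.0 - u) ** (1.0 / rho)

t = reallocated_schedule(10)
gaps = t[:-1] - t[1:]
# Early gaps are smaller than late gaps: more budget on structure formation.
```

With `rho = 1` the schedule reduces to the standard uniform spacing, making the reallocation a strict generalization of the baseline.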
Overview of the Lina framework, which operates in two phases.
(Left) Phase 1: (Offline) AIM Training. We identify baseline failures ("hard cases") from our PAP dataset. An MLLM evaluator performs an automated comparative search to find optimal intervention strengths (γ1*, γ2*) for these prompts. This creates a dataset Dhard, which is used to train the Adaptive Intervention Module (AIM) to predict these strengths directly from a prompt.
(Right) Phase 2: (Online) Lina-Guided Generation. For a new prompt Xnew, the pre-trained AIM predicts the intervention strengths (γ̂1, γ̂2). Lina consists of three components: (1) a Token-level Intervention (γ1) enhances relation tokens, (2) a Latent-level Intervention (γ2) introduces a contrastive guidance term, and (3) Computation Reallocation concentrates the inference budget on the initial Structure Formation phase to prioritize the establishment of causal structure.
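The first two online components can be sketched at a single denoising step as follows. This is an illustrative reconstruction, not the released implementation: the exact forms of the token-level scaling and the contrastive latent guidance are our assumptions.

```python
import numpy as np

def token_intervention(prompt_emb, relation_mask, gamma1):
    """Amplify relation tokens (verbs, spatial prepositions) by gamma1,
    leaving object tokens untouched. Assumed form of the gamma1 guidance."""
    scale = np.where(relation_mask[:, None], gamma1, 1.0)
    return prompt_emb * scale

def latent_intervention(eps_cond, eps_uncond, cfg_scale, gamma2):
    """Classifier-free guidance with an extra contrastive push of strength
    gamma2 along the conditional direction. Assumed form of the gamma2 term."""
    direction = eps_cond - eps_uncond
    return eps_uncond + (cfg_scale + gamma2) * direction

# Toy shapes: 5 tokens x 4 dims; token 2 is the relation ("close to").
emb = np.ones((5, 4))
mask = np.array([False, False, True, False, False])
emb_hat = token_intervention(emb, mask, gamma1=2.0)

eps_c, eps_u = np.full((4,), 1.0), np.zeros((4,))
eps_hat = latent_intervention(eps_c, eps_u, cfg_scale=5.0, gamma2=1.0)
```

With `gamma2 = 0` the latent intervention reduces to plain classifier-free guidance, so both interventions default to the baseline when the AIM predicts zero strength.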
To investigate the internal mechanism and effectiveness of Lina, we conduct grid searches over the intervention strengths. We analyze how the token-level intervention (γ1) and latent-level intervention (γ2) interact to guide the generation process.
Effectiveness of Lina's Intervention. We perform a grid search for the prompt "a person is close to the water and in the sand". The effects are largely disentangled and monotonic: γ2 (horizontal axis) primarily controls texture intensity (balancing water vs. sand features), while γ1 (vertical axis) calibrates the spatial layout and physical interactions. These findings validate the efficacy of our coordinate descent strategy, as the two parameters can be optimized with minimal interference.
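Because the two axes interfere so little, the search reduces to alternating one-dimensional sweeps. The sketch below illustrates that coordinate-descent idea on a toy separable objective; in the paper the score is supplied by the MLLM evaluator, and the grids and round count here are our assumptions.

```python
# Hedged sketch of the coordinate-descent search over intervention strengths:
# alternate 1D sweeps over gamma1 and gamma2, holding the other fixed.
# `score` stands in for the MLLM evaluator's alignment judgment.

def coordinate_descent(score, g1_grid, g2_grid, rounds=2):
    g1, g2 = g1_grid[0], g2_grid[0]
    for _ in range(rounds):
        g1 = max(g1_grid, key=lambda v: score(v, g2))  # sweep gamma1
        g2 = max(g2_grid, key=lambda v: score(g1, v))  # sweep gamma2
    return g1, g2

# Toy separable score peaking at (2, 1), mimicking the observed sweet spot.
def score(g1, g2):
    return -((g1 - 2) ** 2 + (g2 - 1) ** 2)

best = coordinate_descent(score, [0, 1, 2, 3], [-1, 0, 1, 2])  # (2, 1)
```

When the objective is (near-)separable, as the grids above suggest, a single round already lands on the joint optimum, which keeps the number of expensive evaluator calls linear in the grid size rather than quadratic.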
We further challenge the model by using the same initial latent (seed = 0) and identical vocabulary but semantically opposite prompts from the Winoground dataset.
Comparing the intervention grids above (drag slider to view). Despite the reversed semantics ("Bird eats Snake" vs. "Snake eats Bird"), the generated textures exhibit high consistency under the same intervention combinations (same coordinates). Critically, the core semantic action ("eats") is consistently strengthened around the same intervention sweet spot (γ1 ≈ 2, γ2 ≈ 1) in both cases. This reveals that Lina successfully disentangles physical relations from texture, allowing for robust calibration of causal structure regardless of the subject-object order.
Adapting to Biased Priors (drag slider to compare).
We compare the intervention landscapes for two semantically opposite prompts from Winoground using the same seed (0).
(Left) For "there is more dirt than empty space in the jar", SD-3.5-large exhibits a strong biased prior towards empty containers, failing to generate the correct content in the baseline. Consequently, Lina adaptively identifies and applies a stronger intervention (around γ1=0, γ2=-2) to suppress this prior and correctly fill the jar.
(Right) For "there is more empty space than dirt in the jar", the instruction inherently aligns with SD-3.5-large's training prior. While a stronger intervention (around 0, -2) would still satisfy the text, Lina's search strategy prioritizes minimal intervention. Since the baseline region (near 0, 0) is already aligned, Lina avoids unnecessary modification, demonstrating the flexibility to intervene only when required.
We conduct experiments using state-of-the-art diffusion models: SD-3.5-large (SD-3.5), FLUX.1-Krea-dev (FLUX.1), and Wan2.2-T2V-A14B (Wan2.2). We employ Qwen2.5-VL-72B as our MLLM evaluator for automated evaluation.
We evaluate SOTA closed-source editing models on their ability to correct physical violations from the Dhard test set. Baseline models are given the flawed image and the correction prompt.
| Method | Optics (↑) | Density (↑) |
|---|---|---|
| Nano Banana | 2.5% | 25.0% |
| GPT-image | 2.5% | 22.5% |
| Nano Banana + CoT | 45.0% | 70.5% |
| GPT-image + CoT | 67.5% | 82.0% |
| Lina (Ours) | 96.4% | 86.0% |
We evaluate Lina and SOTA baselines on their ability to achieve physical alignment on the PAP-Optics and PAP-Density subsets.
| Method | Optics (↑) | Density (↑) |
|---|---|---|
| SD-3.5 (Baseline) | 80.4% | 54.2% |
| FLUX.1 (Baseline) | 86.9% | 64.3% |
| LMD (SD-3.5) | 80.5% | 81.5% |
| PPAD (SD-3.5) | 91.7% | 76.2% |
| LoRA (SD-3.5) | 95.9% | 91.3% |
| Lina (SD-3.5) | 97.4% | 92.3% |
| Lina (FLUX.1) | 96.8% | 94.0% |
We evaluate Lina and SOTA baselines on their ability to follow OOD instructions from the Winoground (Wino.) subset and our PAP-OOD set.
| Method | Wino. (↑) | PAP-OOD (↑) |
|---|---|---|
| SD-3.5 (Baseline) | 54.4% | 69.3% |
| FLUX.1 (Baseline) | 65.5% | 80.6% |
| LMD (SD-3.5) | 73.1% | 75.2% |
| PPAD (SD-3.5) | 62.6% | 74.1% |
| LoRA (SD-3.5) | 57.3% | 72.0% |
| Lina (SD-3.5) | 79.5% | 84.3% |
| Lina (FLUX.1) | 83.0% | 86.1% |
Analysis of Lina's components on SD-3.5-large. We evaluate on PAP-Optics (Opt.), PAP-Density (Dens.), and Winoground (Wino.).
| Method | Opt. (↑) | Dens. (↑) | Wino. (↑) |
|---|---|---|---|
| w/o γ1 (Token) | 85.1 | 80.5 | 60.2 |
| w/o γ2 (Latent) | 81.3 | 78.0 | 74.9 |
| Fixed γ | 90.5 | 85.2 | 68.4 |
| Std. Schedule | 92.3 | 88.1 | 74.3 |
| Lina (Full) | 97.4 | 92.3 | 79.5 |
In this work, we introduce Lina, a novel framework for adaptive causal intervention in diffusion models. Based on diagnostic insights from our Causal Scene Graph (CSG), Lina learns to perform prompt-specific interventions, significantly enhancing the physical alignment and OOD instruction-following capabilities of DMs. Without relying on MLLMs during inference or requiring DM retraining, Lina demonstrates strong generalizability across both image and video DMs. Our work sets a foundation for developing generative models that can function as robust world simulators, understanding and rendering complex causal structures.
If you find our work useful in your research, please cite:
@article{yu2025lina,
  title={LINA: Learning INterventions Adaptively for Physical Alignment and Generalization in Diffusion Models},
  author={Shu Yu and Chaochao Lu},
  year={2025},
  journal={arXiv preprint arXiv:2512.13290},
  url={https://arxiv.org/abs/2512.13290},
}