Introspective Diffusion Language Models

267 points • by zagwdt • yesterday at 7:57 AM • 47 comments

Comments

If I’m reading this right, this is pretty wild. They turned a Qwen autoregressor into a diffuser by using a bunch of really clever techniques, and they vastly outperform any “native diffuser,” actually being competitive with the base model they were trained from. The obvious upside here is the massive speedup in generation.

And then through a LoRA adapter, you can ground the diffuser on the base model’s distribution (essentially have it “compare” its proposals against what the base model would’ve generated), which effectively means: exact same byte-for-byte output for the same seed, just roughly twice as fast (which should improve even more for batched tasks).
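To make the "compare its proposals against the base model" idea concrete, here is a minimal toy sketch of draft-then-verify decoding in Python. It is not the paper's method or code: `base_next` and `draft_block` are hypothetical stand-ins for the base model and the diffusion drafter, decoding is greedy rather than seeded sampling, and the verification loop runs token by token where a real system would check the whole block in one parallel forward pass. It only illustrates why the output can match the base model exactly while needing fewer rounds than tokens.

```python
from typing import List, Tuple

VOCAB = 50  # toy vocabulary size (assumption)


def base_next(context: List[int]) -> int:
    """Hypothetical base model: deterministic next token from the last two tokens."""
    return (sum(context[-2:]) * 31 + 7) % VOCAB


def draft_block(context: List[int], block_size: int) -> List[int]:
    """Hypothetical drafter: proposes a block, deliberately wrong at some positions."""
    ctx = list(context)
    proposal = []
    for _ in range(block_size):
        tok = base_next(ctx)
        if len(ctx) % 4 == 0:  # inject an error every fourth position
            tok = (tok + 1) % VOCAB
        proposal.append(tok)
        ctx.append(tok)
    return proposal


def generate_reference(prompt: List[int], n: int) -> List[int]:
    """Plain one-token-at-a-time decoding with the base model."""
    ctx = list(prompt)
    for _ in range(n):
        ctx.append(base_next(ctx))
    return ctx[len(prompt):]


def generate_verified(prompt: List[int], n: int, block_size: int = 4) -> Tuple[List[int], int]:
    """Draft a block, keep the prefix the base model agrees with, fix the first miss."""
    ctx = list(prompt)
    rounds = 0
    while len(ctx) - len(prompt) < n:
        proposal = draft_block(ctx, block_size)
        rounds += 1
        for tok in proposal:
            expected = base_next(ctx)
            ctx.append(expected)  # the base model's token is always what gets kept
            if tok != expected or len(ctx) - len(prompt) >= n:
                break  # stop at the first disagreement, or once n tokens exist
    return ctx[len(prompt):len(prompt) + n], rounds


if __name__ == "__main__":
    out, rounds = generate_verified([1, 2], 20)
    print(out == generate_reference([1, 2], 20), rounds)
```

Because every kept token is the one the base model would have produced, the verified output is identical to the reference by construction; any speedup comes from accepting several drafted tokens per round.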

I’m not an expert, more of a “practicing enthusiast,” so I might be missing something, but at first glance, this reads super exciting to me.

xiphias2 • today at 3:00 AM

How does this compare to DFlash?

https://z-lab.ai/projects/dflash/

And DDTree?

https://liranringel.github.io/ddtree/

andsoitis • yesterday at 8:12 AM

Is anyone here experimenting seriously with Diffusion for text generation? I’d love to learn about your experiences!

mlmonkey • yesterday at 5:30 PM

I'm no expert (just a monkey... ;) ), but isn't Diffusion supposed to generate ALL of the output at once? From their diagram, it looks like their I-DLM model uses previously generated context to generate the next tokens (or blocks).

ilaksh • yesterday at 6:14 PM

Does this mean I should switch to sglang? How hard is it to add support for this type of model to vLLM? Or does it already handle them?

2001zhaozhao • yesterday at 11:46 PM

I always thought some kind of block-based diffusion architecture would be the future of LLMs, especially some architecture that can dynamically alter its token generation rate as well as "reason and generate at the same time", and have an opportunity to correct tokens that it has just generated. Something like the equivalent of a short term "working memory" for humans. But I have no understanding of the math. Fingers crossed.

ramon156 • yesterday at 10:09 AM

> 2025-04-12: Initial code release with training and inference support.

> 2025-04-12: Released I-DLM-8B, I-DLM-32B, and I-DLM-8B-LoRA on HuggingFace.

Is this old already? Not saying that's a bad thing, since it seems very sophisticated. Just curious if there's been an update.

keyle • today at 12:35 AM

This looks great. Can we use it yet?

simianwords • yesterday at 10:20 AM

Can diffusion models have reasoning steps where they generate a block, introspect and then generate another until the output is satisfactory?

scotty79 • yesterday at 11:48 AM

So can you just use this and have a faster Qwen32b?