Diffusion language models seem poised to smash purely autoregressive models. I'm giving it 1-2 years.
Feels like the sodium-ion vs lithium-ion battery thing, where one has theoretical benefits but the other has such a head start on commercialization that it will take a long time to catch up.
Didn't thinking tokens resolve the most problematic part of autoregressive models (the first few tokens set constraints the model can't overcome later) and give them a massive advantage over diffusion models by exposing the thinking trace? I can see a diffusion model being used as a draft model to quickly predict a bunch of tokens and let the autoregressive model decide whether to keep them or throw them away quickly, speeding it up considerably while keeping thinking traces available.
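I'd picture that draft-and-verify loop looking roughly like the sketch below (greedy-verification speculative decoding; speculative_generate, draft_block, and ar_next_token are made-up stand-ins, not any real library's API, and a real implementation would verify the whole drafted block in one batched AR forward pass rather than token by token):

    def speculative_generate(prompt, draft_block, ar_next_token, block_size=8, max_len=64):
        """Draft-and-verify decoding: a fast drafter (e.g. a diffusion model) proposes
        a block of tokens, and the autoregressive model keeps only the prefix it
        agrees with, falling back to its own token at the first disagreement."""
        out = list(prompt)
        while len(out) < max_len:
            proposal = draft_block(out, block_size)
            if not proposal:
                break
            for tok in proposal:
                ar_tok = ar_next_token(out)  # in practice: one batched forward pass over the block
                if ar_tok == tok:
                    out.append(tok)          # drafted token matches, accepted for free
                else:
                    out.append(ar_tok)       # disagreement: take the AR token, redraft from here
                    break
        return out[:max_len]

    # Toy stand-ins just to show the control flow: the "AR model" greedily continues
    # a fixed sequence, and the "drafter" guesses the same sequence but botches
    # every 4th token.
    target = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

    def ar_next_token(prefix):
        return target[len(prefix)] if len(prefix) < len(target) else 0

    def draft_block(prefix, k):
        return [target[i] if i % 4 else 99
                for i in range(len(prefix), min(len(prefix) + k, len(target)))]

    print(speculative_generate([1], draft_block, ar_next_token, block_size=4, max_len=10))
    # -> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]: same output as pure AR greedy decoding,
    #    but most tokens were accepted in drafted blocks.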
One appeal of diffusion is RL. If it ends up being a lot faster at generation, you'll be able to do a lot more RL.
If people can make RL scalable (make it so RL isn't just a final phase but something as big as the supervised stuff), then diffusion models are going to have an advantage.
If not, I think autoregressive models will still be preferred. With diffusion models, the output becomes fixed very fast; they can't actually refine it, so we're not talking about refinement along the lines of: initial idea -> better idea -> something actually sound.