Mercury 2: Fast reasoning LLM powered by diffusion

334 points • by fittingopposite • yesterday at 10:46 PM • 121 comments

Comments

It could be interesting to measure intelligence per second,

i.e., intelligence per token multiplied by tokens per second.

My current feel is that if Sonnet 4.6 were 5x faster than Opus 4.6, I'd be primarily using Sonnet 4.6. But that wasn't true for me with prior model generations; in those generations the Sonnet-class models didn't feel good enough compared to the Opus-class models. And it might shift again when I'm doing things that feel more intelligence-bottlenecked.

But fast responses have an advantage of their own: they give you faster iteration. Kind of like how I used to like OpenAI Deep Research, but then switched to o3-thinking with web search enabled after that came out, because it gave 80% of the thoroughness in 20% of the time, which tended to be better overall.
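To make the arithmetic behind this comment concrete, here is a minimal Python sketch. The function names and the notion of a numeric "quality" score are assumptions made for illustration; the only figures taken from the comment are the 80% and 20%.

```python
def quality_per_second(quality_per_token: float, tokens_per_second: float) -> float:
    """Hypothetical metric: quality per token multiplied by tokens per second."""
    return quality_per_token * tokens_per_second


def relative_throughput(quality_fraction: float, time_fraction: float) -> float:
    """Quality delivered per unit time, relative to a baseline of 1.0."""
    return quality_fraction / time_fraction


# 80% of the thoroughness in 20% of the time, as in the comment above.
ratio = relative_throughput(0.8, 0.2)
print(f"{ratio:.1f}x thoroughness per unit time")
```

Under these assumptions the faster option delivers roughly four times the thoroughness per unit of time, which is one way to read "tended to be better overall".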

Karuma • today at 5:02 PM

A simple test I just did:

Me: What are some of Maradona's most notable achievements in football?

Mercury 2 (first sentence only): Dieadona’s most notable football achievements include:

Notice the spelling of "Dieadona" instead of "Maradona". Any local 3B model can answer this question perfectly well and instantly... Mercury 2 was incredibly slow and full of these kinds of unforgivable mistakes.

DoctorOetker • today at 9:39 AM

> Mercury 2 doesn't decode sequentially. It generates responses through parallel refinement, producing multiple tokens simultaneously and converging over a small number of steps. Less typewriter, more editor revising a full draft at once.
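The quoted description can be illustrated with a toy sketch of confidence-based parallel unmasking. This is not Mercury's actual algorithm, which the announcement does not detail. `toy_model`, `TARGET`, and the fixed confidence scores are hypothetical stand-ins so that the example is deterministic and runnable.

```python
MASK = "_"
TARGET = list("parallel refinement")  # toy stand-in for what a real model would predict


def toy_model(seq):
    """Hypothetical denoiser: proposes a (token, confidence) pair for every position.

    A real model would predict from context; this toy reads TARGET and uses a
    fixed pseudo-confidence per position.
    """
    return [(TARGET[i], ((i * 7) % 11) / 11.0) for i in range(len(seq))]


def parallel_decode(length, tokens_per_step):
    """Start fully masked and commit several positions per refinement step."""
    seq = [MASK] * length
    steps = 0
    while MASK in seq:
        proposals = toy_model(seq)
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        # Commit the most confident masked positions together in one step.
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        for i in masked[:tokens_per_step]:
            seq[i] = proposals[i][0]
        steps += 1
        print(f"step {steps}: {''.join(seq)}")
    return "".join(seq), steps


text, steps = parallel_decode(len(TARGET), tokens_per_step=5)
```

The point of the sketch is only the control flow: the number of model calls scales with the number of refinement steps rather than with the number of tokens, which is where the claimed speed advantage over token-by-token decoding would come from.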

There has been quite a bit of progress unifying DDPMs and SGMs under an SDE framework:

> DDPM and Score-Based Models: The objective function of DDPMs (maximizing the ELBO) is equivalent to the score matching objectives used to train SGMs.

> SDE-based Formulation: Both DDPMs and SGMs can be unified under a single SDE framework, where the forward diffusion is an Ito SDE and the reverse process uses score functions to recover data.

> Flow Matching (Continuous-Time): Flow matching is equivalent to diffusion models when the source distribution corresponds to a Gaussian. Flow matching offers "straight" trajectories compared to the often curved paths of diffusion, but they share similar training objectives and weightings.
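For readers who want the equations behind these quotes, the SDE formulation is usually written as below. The notation follows the standard score-based SDE literature and is not taken from the Mercury announcement; $f$ is the drift, $g$ the diffusion coefficient, $w$ and $\bar{w}$ forward and reverse-time Wiener processes, and $s_\theta$ the learned score network.

```latex
% Forward (noising) process as an Ito SDE
\mathrm{d}x = f(x, t)\,\mathrm{d}t + g(t)\,\mathrm{d}w

% Reverse-time (generative) process, driven by the score of the marginal p_t
\mathrm{d}x = \left[ f(x, t) - g(t)^2 \,\nabla_x \log p_t(x) \right] \mathrm{d}t + g(t)\,\mathrm{d}\bar{w}

% The trained network approximates the score
s_\theta(x, t) \approx \nabla_x \log p_t(x)
```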

Is there a similar connection between modern transformers and diffusion?

Suppose we look at each layer, or the residual connection between layers, over the context window of tokens (typically a power of 2 in length): what is incrementally added to the embedding vectors is a function of the previous layer's outputs. If we have L layers, what is the connection between those L "steps" of a transformer and performing L denoising refinements in a diffusion model?

Does this allow fitting a diffusion model to a transformer and vice versa?
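The analogy being asked about can at least be written down side by side. The symbols $F_l$ (the attention-plus-MLP update of layer $l$) and $v$ (a velocity or denoising field) are notation assumed here for illustration, and this is a statement of the resemblance, not a claim of equivalence.

```latex
% Transformer residual stream, layers l = 0, ..., L-1
x_{l+1} = x_l + F_l(x_l)

% Euler step of an ODE dx/dt = v(x, t) with step size \Delta t,
% as used when sampling a flow-matching or probability-flow model
x_{t + \Delta t} = x_t + \Delta t \, v(x_t, t)
```

Two differences stand out: each transformer layer has its own weights while a diffusion model typically reuses one network conditioned on $t$, and the transformer update acts on hidden states while a diffusion sampler refines the output itself.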

volodia • today at 1:57 AM

Co-founder / Chief Scientist at Inception here. If helpful, I’m happy to answer technical questions about Mercury 2 or diffusion LMs more broadly.

dvt • today at 12:09 AM

What excites me most about these new four-figure tokens-per-second models is that you can essentially do multi-shot prompting (+ nudging) without the user even feeling it, potentially fixing some of the weird hallucinatory/non-deterministic behavior we sometimes end up with.
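One possible reading of "multi-shot prompting" is to sample the model several times and keep the most common answer. The sketch below assumes that reading; `generate` is a hypothetical callable standing in for a real model call, and the flaky stub exists only to make the example runnable.

```python
from collections import Counter
from typing import Callable


def majority_answer(generate: Callable[[str], str], prompt: str, shots: int = 5) -> str:
    """Call the model several times and return the most common normalized answer."""
    answers = [generate(prompt).strip().lower() for _ in range(shots)]
    return Counter(answers).most_common(1)[0][0]


def make_flaky_model() -> Callable[[str], str]:
    """Hypothetical stand-in for a model that misspells its answer on every third call."""
    calls = {"n": 0}

    def generate(prompt: str) -> str:
        calls["n"] += 1
        return "Mardona" if calls["n"] % 3 == 0 else "Maradona"

    return generate


print(majority_answer(make_flaky_model(), "Which player is the question about?"))
```

With five shots the stub answers correctly four times, so the vote returns `maradona`. At a few thousand tokens per second, several short samples like this could plausibly fit inside the latency of one slower call.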

nylonstrung • today at 1:26 AM

I'm not sold on diffusion models.

Other labs like Google have them, but they have simply trailed the Pareto frontier for the vast majority of use cases.

Here's more detail on how price/performance stacks up:

https://artificialanalysis.ai/models/mercury-2

Ross00781 • today at 6:46 PM

The diffusion-based approach is fascinating. Traditional transformer LLMs generate tokens sequentially, but diffusion models can theoretically refine the entire output space iteratively. If they've cracked the latency problem (diffusion is typically slower), this could open new architectures for reasoning tasks where quality matters more than speed. Would love to see benchmark comparisons on multi-step reasoning vs GPT-4/Claude.

findjashua • today at 7:10 PM

failed the car wash test.

i think instead of positioning it as a general-purpose reasoning model, they'd have more success focusing on a specific use case (eg coding agent) and benchmarking against the sota open models for that use case (eg qwen3-coder-next)

swiftcoder • today at 8:29 AM

Are there any open-weights diffusion LLMs I can play with on my local hardware? Curious about the performance delta of this style of model in more resource-constrained scenarios (i.e., a consumer Nvidia GPU, not H100s in the datacenter).

smusamashah • today at 5:58 AM

Does it mean that if it were embedded on a Taalas chip, it could generate ~50,000+ tokens per second?

serjester • today at 2:53 AM

There's a potentially amazing use case here around parsing PDFs to markdown. It seems like a task with insane volume requirements, low budget, and the kind of thing that doesn't benefit much from autoregression. Would be very curious if your team has explored this.

sorenjan • today at 11:55 AM

Julia Turc recently did a video about diffusion LLMs as a paid collaboration with Inception: https://www.youtube.com/watch?v=-VGeHZqOk_s

tl2do • today at 12:14 AM

Genuine question: what kinds of workloads benefit most from this speed? In my coding use, I still hit limitations even with stronger models, so I'm interested in where a much faster model changes the outcome rather than just reducing latency.

Ross00781 • today at 6:45 AM

Diffusion-based reasoning is fascinating - curious how it handles sequential dependencies vs. traditional autoregressive decoding. For complex planning tasks where step N heavily depends on steps 1 through N-1, does the parallel generation sometimes struggle with consistency? Or does the model learn to encode those dependencies in a way that works well during parallel sampling?

anshumankmr • today at 9:42 AM

It is capable of that seahorse faux pas.

'''
Is there a seahorse emoji?

Mercury 2

Thought for a minute

Well?

Mercury 2 Today at 3:06 PM

Yes – Unicode includes a seahorse emoji. It is U+1F9A0 and renders as:

(seahorse)

(If your device or browser doesn’t show the graphic, you may see a placeholder box.)

Ask

Explain

'''

vicchenai • today at 4:49 AM

The iteration speed advantage is real but context-specific. For agentic workloads where you're running loops over structured data -- say, validating outputs or exploring a dataset across many small calls -- the latency difference between a 50 tok/s model and a 1000+ tok/s one compounds fast. What would take 10 minutes wall-clock becomes under a minute, which changes how you prototype.
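The wall-clock claim above can be checked with back-of-the-envelope arithmetic. This sketch uses only the figures from the comment (50 tok/s, 1000 tok/s, 10 minutes) and assumes generation time dominates, ignoring time to first token, network overhead, and queueing.

```python
def wall_clock_seconds(total_tokens: int, tokens_per_second: float) -> float:
    """Time spent purely on generation, ignoring latency and queueing overhead."""
    return total_tokens / tokens_per_second


slow_tps, fast_tps = 50, 1000

# A workload that takes 10 minutes at the slow rate.
total_tokens = 10 * 60 * slow_tps
fast_seconds = wall_clock_seconds(total_tokens, fast_tps)

print(
    f"{total_tokens} tokens: "
    f"{wall_clock_seconds(total_tokens, slow_tps) / 60:.0f} min at {slow_tps} tok/s, "
    f"{fast_seconds:.0f} s at {fast_tps} tok/s"
)
```

Under that assumption the 10-minute job is 30,000 tokens and finishes in about 30 seconds at 1000 tok/s, consistent with "under a minute". In practice per-call overhead across many small calls would eat into that gain.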

The open question for me is whether the quality ceiling is high enough for cases where the bottleneck is actually reasoning, not iteration speed. volodia's framing of it as a "fast agent" model (comparable tier to Haiku 4.5) is honest -- for the tasks that fit that tier, the 5x speed advantage is genuinely interesting.

ilaksh • today at 12:38 AM

It seems like the chat demo is really suffering from the effect of everything going into a queue. You can't actually tell that it is fast at all. The latency is not good.

Assuming that's what is causing this. They might show some kind of feedback when it actually makes it out of the queue.

rancar2 • today at 4:36 AM

My attempt at trying one of their OOTB prompts in the demo https://chat.inceptionlabs.ai resulted in: "The server is currently overloaded. Please try again in a moment."

And a pop-up error of: "The string did not match the expected pattern."

That happened three times, then the interface stopped working.

I was hoping to see how this stacked up against the Taalas demo, which worked well and was so fast every time I've hit it this past week.

vinhnx • today at 8:38 AM

The research paper "Mercury: Ultra-Fast Language Models Based on Diffusion" from last year (2025):

https://arxiv.org/pdf/2506.17298

nowittyusername • today at 2:32 AM

Nice, I'm excited to try this for my voice agent. At worst it could be used to power the human-facing agent for latency reduction.

mlhpdx • today at 6:53 PM

> Proxylity LLC is a technology company that builds and deploys diffusion‑based large language models and multimodal AI platforms for enterprise use.

Um, no it isn’t. Presumably this is the answer to any question about a company it doesn’t know? That’s some hardcore bias baking.

mhitza • today at 1:17 AM

Comment retracted. My bad, missed some details.

herlon214 • today at 6:55 AM

This looks really nice. When will it be available on OpenRouter?

dmix • today at 4:47 AM

I tried Mercury 1 in Zed for inline completions and it was significantly slower than Cursor's autocomplete. Big reason why I switched back to Cursor (free) + Claude Code.

lprimeisafk • today at 2:13 AM

It fails the car wash test

chriskanan • today at 2:14 AM

I can see some promise with diffusion LLMs, but getting them comparable to the frontier is going to require a ton of work and these closed source solutions probably won't really invigorate the field to find breakthroughs. It is too bad that they are following the path of OpenAI with closed models without details as far as I can tell.

davistreybig • today at 3:24 AM

This is unbelievably fast

exabrial • today at 2:42 AM

I believe Jimmy Chat is still faster by an order of magnitude…

LarsDu88 • today at 2:38 PM

Imagine this type of generation with a custom Taalas-style ASIC 18 months from now, on a Sonnet-quality model, for a speedup of 5 orders of magnitude.

The future looks crazy

dhruv3006 • today at 1:47 AM

I am a little underwhelmed by anything diffusion at the moment - they didn't really deliver.

dw5ight • today at 2:24 AM

this looks awesome!!

arjie • today at 1:34 AM

Please pre-render your website on the server. Client-side JS means that my agent cannot read the press release, and that reduces the chance I am going to read it myself. Also, day-one OpenRouter availability increases the chance that someone will try it.
