Hacker News

bensyverson yesterday at 12:02 PM | 11 replies

Really fascinating how this works; it's basically context-aware decoding. From the paper:

> Code interleaves fork positions, where several continuations are genuinely plausible and may correspond to different solution approaches, with lock positions, where syntax and semantics leave little ambiguity but a low-probability distractor tail still remains… The best global decoding setting is therefore necessarily a compromise; we call this tension the precision-exploration conflict.

In other words, just like us, the model needs to shift from "exploration" in "fork" mode (divergent thinking to produce a creative solution) to "precision" in "lock" mode (producing syntactically correct code).

What this paper shows is that their simple technique (SSD) can improve the ranking of optimal tokens in both lock and fork positions, meaning the model is more likely to explore when it should be exploring, and more likely to be precise when it needs to be.
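The fork/lock distinction itself (not the paper's SSD method, which isn't reproduced here) can be illustrated with a toy entropy check over next-token distributions. All the numbers and the threshold below are made up for illustration:

```python
import math

def entropy(probs):
    """Shannon entropy (bits) of a next-token distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical next-token distributions:
fork = [0.30, 0.28, 0.22, 0.20]   # several continuations are plausible
lock = [0.97, 0.01, 0.01, 0.01]   # syntax leaves little ambiguity

def classify(probs, threshold=1.0):
    """Crude split: high-entropy positions read as forks, low as locks."""
    return "fork" if entropy(probs) > threshold else "lock"

print(classify(fork))  # fork
print(classify(lock))  # lock
```

A decoder that knew this label could sample more freely at forks and greedily at locks; the paper's point is that a single global setting can't do both, and SSD improves the token ranking at both kinds of position instead.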

I love that we're still learning the emergent properties of LLMs!


Replies

user_7832 yesterday at 1:03 PM

> I love that we're still learning the emergent properties of LLMs!

TBH, this is (very much my opinion, btw) the least surprising thing. LLMs (and especially their emergent properties) are still black boxes. Humans have been studying the human brain for millennia, and we are barely better at predicting how humans work (or, e.g., to what extent free will is a thing). Hell, the emergent properties of traffic were not understood or given proper attention, even though every researcher, as a driver, knows what a driver does. Right now, on the front page, is this post:

> 14. Claude Code Found a Linux Vulnerability Hidden for 23 Years (mtlynch.io)

So it's pretty cool that we're learning new things about LLMs, sure, but it's barely surprising that we're still learning them.

(Sorry, mini grumpy man rant over. I just wish we knew more of the world but I know that's not realistic.)

show 4 replies
vova_hn2 yesterday at 3:45 PM

I've always thought that it is kinda weird that we spend exactly the same amount of compute to calculate both "fork" tokens and "lock" tokens.

I think that with grammar-aware sampling / constrained decoding [0][1] it is possible to sometimes skip calling the model altogether if only one token is allowed by grammar and just insert it, but I don't think that any of the current, widely used combinations of models/harnesses use it. And it only skips inference in rare edge cases.

I wonder if there is a more general solution that can make models spend more compute on making important choices, while making generation of the "obvious" tokens cheaper and faster.

[0] https://github.com/ggml-org/llama.cpp/blob/master/grammars/R...

[1] https://developers.redhat.com/articles/2025/06/03/structured...
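The skip-the-model idea above can be sketched in a few lines. Everything here is a toy stand-in (the grammar is a dict of hypothetical prefixes, and `model_step` fakes an expensive forward pass); no real harness's API is being used:

```python
# Toy grammar-constrained decoding that skips inference when the grammar
# admits exactly one next token.

def allowed_tokens(grammar, prefix):
    """Return the set of tokens the grammar permits after `prefix`."""
    return grammar.get(tuple(prefix), set())

def model_step(prefix):
    """Stand-in for an expensive LLM forward pass."""
    return "x"  # pretend the model always picks "x" at genuine forks

def decode(grammar, max_len=10):
    out, model_calls = [], 0
    for _ in range(max_len):
        allowed = allowed_tokens(grammar, out)
        if not allowed:
            break
        if len(allowed) == 1:
            out.append(next(iter(allowed)))  # forced token: no inference
        else:
            model_calls += 1                 # genuine choice: call model
            out.append(model_step(out))
    return out, model_calls

# Tiny "grammar": "(" is forced at the start, then "x" or "y", then ")".
grammar = {
    (): {"("},
    ("(",): {"x", "y"},
    ("(", "x"): {")"},
    ("(", "y"): {")"},
}
print(decode(grammar))  # (['(', 'x', ')'], 1) -> only one model call
```

Out of three generated tokens, only one needed inference; the other two were forced by the grammar, which is exactly the edge case the comment describes.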

show 3 replies
khalic yesterday at 1:13 PM

Another example of the mindf@#$ these systems are: I was doing some fine-tuning on a small model, taking data fields and making a sentence out of them. I was running into mode collapse (basically, when the AI simplifies too much and always outputs the same thing).

I got unstuck by randomizing the field order for each row?!? At training time, and now I'm thinking I should do the same at inference time...
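The shuffle described above is a one-liner per record. The field names and record format here are invented for illustration:

```python
import random

def serialize(record, rng):
    """Render a record's fields in a random order so the model can't
    collapse onto one fixed template."""
    fields = list(record.items())
    rng.shuffle(fields)  # different order for each row / epoch
    return ", ".join(f"{k}: {v}" for k, v in fields)

rng = random.Random(0)
record = {"name": "Ada", "city": "London", "year": 1843}
print(serialize(record, rng))
print(serialize(record, rng))  # likely a different ordering
```

The content is unchanged; only the presentation order varies, which forces the model to attend to the field names rather than memorizing positions.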

show 3 replies
stingraycharles yesterday at 12:14 PM

Seems like this is true not just for code but for all content being generated? Code is more well-defined, but the fork/lock mechanism applies to a lot more problem domains.

show 2 replies
sdwr yesterday at 11:29 PM

What's cool is that they aren't adjusting the temperature of the model live, or predicting/labeling any of the fork/lock points.

DavidPiper yesterday at 1:15 PM

Sounds just like John Cleese's "Open Mode" and "Closed Mode" - https://www.youtube.com/watch?v=Pb5oIIPO62g

orbital-decay yesterday at 2:09 PM

One relevant thing is that these forks are unnaturally narrow in all models, and rather resemble locks (not quite, but close). From multiple possible continuations, models tend to prefer just a couple, i.e. the model is a lot less random than it should be. That's why you're seeing annoying slop in writing and instantly recognizable color schemes on vibecoded sites. Lack of diversity probably limits the usefulness of this method as well.

>I love that we're still learning the emergent properties of LLMs!

There are tons of low-hanging fruits there.

show 1 reply
nostrebored yesterday at 4:28 PM

Could we not get the same with EAFT? Maybe that's what it's doing, but this is definitely not the first work to think "let's lock in high-probability solutions."

In Nemotron, the high-perplexity solutions are selected for RL; in VLM training, a few people are looking at the entropy distributions of the training set; etc.

robocat yesterday at 9:18 PM

> In other words, just like us

I think you are implying a reverse causation. They used a metaphor from us.

michaelbuckbee yesterday at 1:02 PM

I don't really understand the internal mechanics of this, but my first thought was: why not combine this with a linter/tests, so that it produces all the forks and only keeps the syntactically correct ones?
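The filter half of that idea is straightforward to sketch. The candidate strings below stand in for samples from a model; `ast.parse` does the syntax check (a real linter or test suite could slot in at the same point):

```python
import ast

def syntactically_valid(candidates):
    """Keep only the candidate completions that parse as Python."""
    keep = []
    for code in candidates:
        try:
            ast.parse(code)  # syntax check; a linter/tests could go here
            keep.append(code)
        except SyntaxError:
            pass
    return keep

candidates = [
    "def add(a, b):\n    return a + b",  # valid
    "def add(a, b) return a + b",        # missing colon -> rejected
    "add = lambda a, b: a + b",          # valid
]
print(syntactically_valid(candidates))   # keeps the two valid ones
```

The harder part, which this sketch skips, is generating the forks in the first place: that requires branching the decoder at high-entropy positions rather than taking one sample per prompt.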

show 1 reply
TacticalCoder yesterday at 1:39 PM

> What this paper shows is that their simple technique (SSD)

"Simple Self-Distillation". We already had an acronym SSD, for Solid-State Drive. I don't know about the technique itself, but the naming sure sounds... simple?