Hacker News

Embarrassingly simple self-distillation improves code generation

536 points | by Anon84, yesterday at 10:26 AM | 163 comments

Comments

bensyverson yesterday at 12:02 PM

Really fascinating how this works; it's basically context-aware decoding. From the paper:

> Code interleaves fork positions, where several continuations are genuinely plausible and may correspond to different solution approaches, with lock positions, where syntax and semantics leave little ambiguity but a low-probability distractor tail still remains… The best global decoding setting is therefore necessarily a compromise; we call this tension the precision-exploration conflict.

In other words, just like us, the model needs to shift from "exploration" in "fork" mode (divergent thinking to produce a creative solution) to "precision" in "lock" mode (producing syntactically correct code).

What this paper shows is that their simple technique (SSD) can improve the ranking of optimal tokens in both lock and fork positions, meaning the model is more likely to explore when it should be exploring, and more likely to be precise when it needs to be.

I love that we're still learning the emergent properties of LLMs!
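The fork/lock tension described above can be made concrete with a toy calculation. The two distributions below are invented purely for illustration (they are not from the paper): a "fork" position spreads probability over several plausible continuations, while a "lock" position concentrates it on one token plus a small distractor tail.

```python
import numpy as np

def entropy(probs):
    """Shannon entropy (nats) of a next-token distribution."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def apply_temperature(probs, T):
    """Flatten (T > 1) or sharpen (T < 1) a distribution."""
    z = np.log(np.asarray(probs, dtype=float)) / T
    e = np.exp(z - z.max())
    return e / e.sum()

# Invented example distributions.
fork = [0.30, 0.25, 0.20, 0.15, 0.10]   # several genuinely plausible branches
lock = [0.97, 0.01, 0.01, 0.005, 0.005] # near-certain token + distractor tail

print(f"fork entropy: {entropy(fork):.2f} nats")  # high: worth exploring
print(f"lock entropy: {entropy(lock):.2f} nats")  # low: little ambiguity

# One global temperature must serve both regimes. A hot setting that keeps
# forks diverse also inflates the lock position's distractor tail:
tail_cold = 1 - lock[0]
tail_hot = 1 - apply_temperature(lock, 1.6)[0]
print(f"distractor tail mass: {tail_cold:.3f} at T=1.0, {tail_hot:.3f} at T=1.6")
```

This is the precision-exploration conflict in miniature: any single temperature either dulls the fork or fattens the lock's tail.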

teleforce today at 12:00 AM

It seems that self-distillation is the way to go for LLM.

Self-distillation was shown to be very efficient and effective back in January of this year by a team from MIT and ETH with their Self-Distillation Fine-Tuning (SDFT) system [1], [2].

That earlier work is also this paper's closest competitor, listed as On-Policy Self-Distillation in the comparison table.

I hope they keep the original work's real name, Self-Distillation Fine-Tuning (SDFT). Imagine a later paper citing this very paper as "cross-entropy self-distillation" instead of its own given name, Simple Self-Distillation (SSD). Admittedly it's a lousy name that collides with the common use of SSD for solid-state drive, as others have rightly pointed out.

I think they should have given proper credit to this earlier seminal work on SDFT, but apparently they just list it as one of the systems in their benchmark without explaining much of the connection and lineage, which is a big thing in research publication.

[1] Self-Distillation Enables Continual Learning:

https://arxiv.org/abs/2601.19897

[2] SDFT project page:

https://self-distillation.github.io/SDFT.html

wg0 yesterday at 12:10 PM

After TurboQuant and Gemma 4, I came across the following video [0] of Gemma running on a local machine at 50 tokens/second.

That already looks like Sonnet 3.x and 4 level capability to me: the model in question (Gemma 4) sets up a whole Python project with a UI and installs Python libraries using uv, etc.

Add this simple self-distillation to the picture and by 2028 I see cheaper coding-model providers with much more generous usage limits, with power users mostly running their own models anyway.

Anyone using these models as "non-deterministic transpilers" from natural language to code (experienced engineers who can write code themselves) would probably not be paying any AI provider.

[0] https://www.youtube.com/watch?v=-_hC-C_Drcw

try-working today at 12:32 AM

Most codebases don't have traces to train on. If you use rlm-workflow you will build up rich traceability in the form of requirements, plans, and implementation artifacts, along with worktree diffs. With these, you can then use self-distillation on models, or use autoagent to improve your harness. https://github.com/doubleuuser/rlm-workflow

zyklu5 yesterday at 7:41 PM

Their explanation for why their idea (SSD) might work, the precision-exploration conflict hypothesis, is something adaptive decoding also tries to solve.

https://ai.meta.com/research/publications/adaptive-decoding-...

uduni yesterday at 5:20 PM

It's crazy how much better you can make LLM output just by asking "is this the most elegant solution?" in a loop.

(Not fine-tuning, but interesting nonetheless. If a model can so easily find a more elegant solution, why didn't it pick that in the first place?)
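The loop itself is trivial to sketch. Here `model` is a hard-coded stand-in for a real LLM call, and both canned answers are invented for illustration; the point is only the critique-and-resample shape of the loop.

```python
def model(prompt: str) -> str:
    """Stand-in for an LLM API call; returns canned answers."""
    naive = ("def dedupe(xs):\n"
             "    out = []\n"
             "    for x in xs:\n"
             "        if x not in out:\n"
             "            out.append(x)\n"
             "    return out")
    elegant = "def dedupe(xs):\n    return list(dict.fromkeys(xs))"
    # Pretend the model only surfaces the better answer when pushed.
    return elegant if "more elegant" in prompt else naive

def refine(task: str, rounds: int = 2) -> str:
    """Ask for a solution, then repeatedly ask for a more elegant one."""
    answer = model(task)
    for _ in range(rounds):
        critique = (task + "\nIs this the most elegant solution? "
                    "If not, give a more elegant one.\n" + answer)
        answer = model(critique)
    return answer

print(refine("Write a function that removes duplicates from a list."))
```

Why the first answer isn't the elegant one is exactly the question the parent raises; the loop only exposes what the model could already rank if prompted to reconsider.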

khalic yesterday at 11:51 AM

Incredible; this will translate to better coding models in the near future.

We really need to develop better tools to understand what's happening inside these NNs. Working with high-dimensional spaces is not something we're good at; we're basically throwing stuff at the wall and seeing what sticks.

0x3f yesterday at 11:49 AM

Haven't read the paper yet, but it is interesting how seemingly simple many breakthroughs in ML are. Even transformers are like that. Maybe it's hindsight bias.

I suppose we just don't have a deeper underlying theory to lean on and help us 'design' anything.

p1esk yesterday at 3:32 PM

It’s so ironic that Apple still publishes AI research and OpenAI does not.

OxfordOutlander yesterday at 8:02 PM

So... it's like a golfer who hits thousands of balls into an open field without ever once aiming for a hole. The relentless repetition flawlessly locks in their foundational muscle memory and basic swing mechanics, so when they finally step up to a real course, they don't have to waste a single thought on how to hold the club. Their basic swing is completely automatic - they can confidently take the creative, high-risk shot required to actually sink a hole-in-one.

hooloovoo_zoo yesterday at 9:53 PM

One sentence summary: We fine-tuned a general-purpose model to produce valid benchmark code results and it got better at producing benchmark code results; we didn't bother to evaluate it on anything the model used to be good at.

ultramann yesterday at 1:55 PM

Maybe not the thing I should be focusing on, but I was surprised this paper came from Apple. I was under the impression that Apple's AI/LLM research was far behind the curve. I get that research is a rising-tide-lifts-all-boats situation; I just thought I had seen lots of negative news about Apple's progress on this front, and heuristically haven't seen many (any?) Apple research papers make it to the front page of Hacker News. Could anyone more familiar with Apple's AI research comment on this?

drdrek yesterday at 8:48 PM

This is the "factors" bonanza in finance all over again: you take a generally useful model, over-fit it to some criteria, and announce an advancement in the field; then it performs worse in real life. New infinite academic-article glitch just dropped, boys!

Lerc yesterday at 10:14 PM

This is the natural conclusion of what was really being claimed about model collapse, and indeed of natural evolution: making an imperfect copy while invoking a selection mechanism is evolution.

Some of the claims about models training on their own data, in their enthusiasm to frame it as a failure, went further and suggested that it magnified biases. I had my doubts about those conclusions. If they were true, it would be a much greater breakthrough, because the ability to magnify a property implies a way to measure a weak version of that property, which would mean they had found a way to provide a training signal to avoid bias. It would be great if that's what they had done, but I suspect there would have been more news about it.

Perhaps this paper will put to rest the notion that AI output is useless as training data. It has only ever been the case that it was useless as an indiscriminate source of data.

l5870uoo9y yesterday at 12:18 PM

> Our method, simple self-distillation (SSD), is embarrassingly simple: sample solutions from the base model with specified temperature and truncation, then fine-tune on those raw, unverified samples via standard cross-entropy loss.

So you prompt the base model for an answer and then rerun the prompt with the answer from the first run?
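Per the quoted description, the second step is fine-tuning on the sampled answers rather than re-prompting with them. A toy numpy sketch of the two-step recipe follows; the single-softmax "model" is an invented stand-in for an LLM (a real model conditions on context), but the pipeline has the same shape: sample with temperature and truncation, then fit the model to its own raw samples with cross-entropy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "model": a single softmax over a 10-token vocabulary.
V = 10
base_logits = rng.normal(size=V)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sample(logits, T=0.8, top_k=5, n=4000):
    """Step 1: sample raw, unverified 'solutions' with temperature
    scaling and top-k truncation."""
    z = logits / T
    keep = np.argsort(z)[-top_k:]
    p = np.zeros(len(z))
    p[keep] = softmax(z[keep])
    return rng.choice(len(z), size=n, p=p)

def finetune(logits, samples, lr=0.5, steps=300):
    """Step 2: fit the model to its own samples via cross-entropy.
    For a softmax, the CE gradient w.r.t. logits is (softmax - target)."""
    target = np.bincount(samples, minlength=len(logits)) / len(samples)
    z = logits.copy()
    for _ in range(steps):
        z -= lr * (softmax(z) - target)
    return z

ssd_logits = finetune(base_logits, sample(base_logits))

# The low-probability tail outside the truncated set shrinks after SSD.
keep = np.argsort(base_logits)[-5:]
tail = np.ones(V, dtype=bool)
tail[keep] = False
print(f"tail mass before: {softmax(base_logits)[tail].sum():.4f}")
print(f"tail mass after:  {softmax(ssd_logits)[tail].sum():.4f}")
```

The effect to notice: because the training targets were sampled under truncation, fine-tuning bakes the truncation into the model itself, suppressing distractor-tail mass at greedy decoding time.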

gavinray yesterday at 7:22 PM

Why have we been fed the narrative that training models on their own output progressively degrades quality?

It's the first thing anyone would think of (like a self-hosted compiler) but everything I've read said "it doesn't work."

EDIT: For context:

  > Shumailov et al. (2024) — "AI models collapse when trained on recursively generated data" (Nature, 2024)
mickdarling yesterday at 7:02 PM

I'm working on a tool to determine which portions of an LLM process can be optimized, how to measure that optimization, and whether a portion is optimizable at all. The shaping pattern they talk about here is directly relevant: it makes a whole lot more processes potentially optimizable by looking at the pattern rather than just at whether the metrics go up or down.

roger_ yesterday at 11:58 AM

Skimmed this, but I don't have an intuitive understanding of why it works or how temperature and truncation factor in.

an0malous yesterday at 1:35 PM

I'd like to understand AI research better, and I recall some posts a while back where someone collected all the key papers one should read, but I don't remember enough to find it. Does anyone know what I'm talking about and could link me to that post?

vishnugupta yesterday at 12:47 PM

Can someone please ELI5 this to a web developer friend? I read the abstract but couldn't understand much.

itmitica yesterday at 2:47 PM

It’s an interesting claim, and the reported benchmark gains are large, but it is still an April 1, 2026 arXiv preprint, so I’d treat it as promising rather than settled.

dwa3592 yesterday at 3:35 PM

Can anyone help clarify these doubts? I didn't see any information about how different the test/benchmark set is from the training set, which feels like an important gap for an ML paper to leave open. What if there is overlap between the problems in the test set and the training set? What is the decontamination strategy going from LCBv5 to LCBv6?

hackermeows yesterday at 11:32 PM

What is the big deal with Obsidian? I see a lot of people use it, but I'm more than happy giving an LLM a local SQLite table and an embedding API and asking the agent to maintain its own memory.

crustycoder yesterday at 4:33 PM

"SSD improves Qwen3-30B-Instruct from 42.4% to 55.3% pass@1 on LiveCodeBench v6"

I know virtually nothing about this area, but my naive take is that something that still only passes tests around half the time doesn't seem like a particularly big jump forward.

What am I missing?
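Part of the answer is that pass@1 is a strict metric: a single sample per problem must pass every hidden test. For reference, the standard unbiased pass@k estimator (from the HumanEval paper, Chen et al., 2021) is sketched below; the n and c values are made up to mirror the 55.3% figure, not taken from the paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations of which c are
    correct, passes all tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical: 1000 samples for a problem, 553 of them correct.
print(f"pass@1:  {pass_at_k(1000, 553, 1):.3f}")   # first-try success rate
print(f"pass@10: {pass_at_k(1000, 553, 10):.3f}")  # ten tries: near-certain
```

So a 55.3% pass@1 means slightly better than a coin flip on the very first attempt, with no retries; with even a handful of samples per problem, the effective success rate climbs much higher.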

xbmcuser yesterday at 1:33 PM

So the chances of the Singularity went up.

fooker yesterday at 2:27 PM

I'm excited for the long tail of techniques like this that will be discovered over the next several decades and eventually make this technology run on a toaster!

drooby yesterday at 12:24 PM

Fascinating...

This feels eerily similar to sleep consolidation or synaptic pruning.

hnretards yesterday at 9:38 PM

I've been doing something even better than this for years using only Mistral 7B.

My locally running Mistral 7B is 100x better at modern JavaScript than any model on the market, mainly just from RAG on my own code samples.

That's basically what they are describing with "post-training"; the TL;DR is that code, especially code of a certain style, is vastly simpler than written language.

You really don't need a huge model or data centers, etc.; you just need a small but good model like Mistral 7B and literally a few good samples.

But you guys keep doing you lol. A bunch of non-devs trying to solve code is pretty funny to watch.

augment_me yesterday at 2:42 PM

Isn't this what DeepSeek + Kimi did to Claude?

smallerize yesterday at 12:12 PM

I don't suppose they published the improved models?

4b11b4 yesterday at 1:43 PM

Self-consistency meets fine-tuning?

antirez yesterday at 1:59 PM

Another potentially useful trick is the following: based on the observation that a longer token budget improves model performance, one could generate solutions using a large thinking budget, then ask the LLM to turn the trace into a more compact one, and later SFT on that. That said, I have a feeling the paper's result will be hard to apply in practice without affecting other capabilities, and/or won't be superior to other techniques that provide a similar improvement in sampling.

robwwilliams yesterday at 3:07 PM

Very cool. An evolutionary biologist would say: welcome to the party!

Mutation-rate modulation is the AI engineers' heat, and selection does the trimming of the outliers.

Some more serious biomorphic thinking and we may get to the next big insight, courtesy of 3+ billion years of evolution: evolution that enabled a great ape species to write a paper like this and build LLMs like Gemma 4 that totally rock on a 3.5-pound MacBook Pro M5 Max with 128 GB of RAM.

porridgeraisin yesterday at 4:18 PM

There's an obvious baseline that seems to be missing.

If you sample from the base model with T=1.6, top_k=20, top_p=0.8, i.e., the decoding settings used for the distillation's ground truth, does it match the SSD'd model's performance under some decoding setting?

Their sweep is missing this and only covers "standard" decoding settings.
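For readers following along, those three settings compose like this. This is a minimal numpy sketch of the usual temperature → top-k → top-p (nucleus) pipeline with the values quoted above; the random logits are placeholders, and this is an illustration of standard truncated sampling, not code from the paper.

```python
import numpy as np

def truncated_dist(logits, T=1.6, top_k=20, top_p=0.8):
    """Next-token distribution after temperature scaling, then top-k,
    then nucleus (top-p) truncation, renormalised over survivors."""
    z = np.asarray(logits, dtype=float) / T
    p = np.exp(z - z.max())
    p /= p.sum()
    order = np.argsort(p)[::-1][:top_k]          # indices of the top-k tokens
    csum = np.cumsum(p[order])
    cut = int(np.searchsorted(csum, top_p)) + 1  # smallest nucleus >= top_p
    keep = order[:cut]
    out = np.zeros_like(p)
    out[keep] = p[keep] / p[keep].sum()
    return out

rng = np.random.default_rng(0)
dist = truncated_dist(rng.normal(size=100))
print((dist > 0).sum(), "of 100 tokens survive truncation")
```

Sampling from this distribution at decode time is the baseline the parent is asking for: the same truncated distribution the distillation targets were drawn from, applied to the base model directly.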

jofzar yesterday at 11:44 AM

> simple self-distillation (SSD):

Sorry Apple, SSD is already taken; you can't use that acronym.

politelemon yesterday at 11:55 AM

It's cringeworthy to see that the original paper itself is editorialised.

Title should be: Simple Self-Distillation Improves Code Generation

ape4 yesterday at 11:57 AM

Shouldn't a scientific paper be using metric units (like 30T) rather than 30B?

There are two distinct billions: https://en.wikipedia.org/wiki/Billion
