
ACCount37 • today at 8:48 AM • 2 replies

It's not about how big your dataset is - it's about how you use it.

I jest, but I'm also completely serious. 1T tokens from Claude can teach a model something 1T tokens scraped from the open web can't. Things like "how an LLM can problem solve effectively", or "how an LLM should use tools", or "how to construct reasoning chains", or "when to double check", or "what innate capabilities an LLM can or can't rely on".

Those are valuable things that Anthropic's own team spent a lot of effort post-training into Claude. Distillation allows them to be extracted and transferred to an otherwise unremarkable base model.

Replies

macleginn • today at 9:32 AM

An unremarkable base model will remain an unremarkable fine-tuned model that has memorised a couple thousand input-output pairings.


epolanski • today at 10:23 AM

Can you back this up with hard data and evidence?

Most research converges on the idea that RL on synthetic data makes models worse, not better.

If what you claim was anywhere near that relevant, then we would've long since achieved singularity by simply feeding increasingly better output into the training of the next model in a loop. Yet this doesn't work.

25 million turns on Claude output is a small amount, yet an expensive one (we're talking hundreds of millions of dollars), money that is better spent on compute.

There's no evidence such a process works, but I'd like to know more if I'm wrong.

