rafram • today at 2:33 PM • 5 replies • view on HN

The core of the training data is public, but the part that actually makes these models smart came from (pretty highly-paid) experts via platforms like Mercor. Claude didn't magically learn to write good code by reading all of GitHub - humans trained it in that, more or less manually.

Replies

rapind • today at 3:13 PM

If you pay me to curate a playlist of musical hits, can you now publish and charge people for access to that playlist (*including the curated material)? Can we do the same with movies? Books?

/edit Added a note to make it more obvious that the material is included in the playlist, just like the material is incorporated as part of curated AI models.

datsci_est_2015 • today at 4:30 PM

Given the breadth of LLM knowledge, I somehow doubt this. Sure, it’s probably responsible for the quality of LLM insights, but I don’t think anyone was asking experts about e.g. the complex ecological effects of invasive zebra mussels and their provenance in Lake Michigan.

visarga • today at 3:00 PM

No, they do RLVR (reinforcement learning with verifiable rewards) like everyone else. And probably use Claude data too, with a human in the loop and tool feedback.

jaen • today at 2:38 PM

...and the rest of the training data (i.e., the entire corpus of copyrighted works) was not written by experts expecting compensation? Double standards.

freejazz • today at 3:10 PM

So? What about the authors of all the works these companies stole?
