Hacker News

rafram today at 2:33 PM

The core of the training data is public, but the part that actually makes these models smart came from (pretty highly-paid) experts via platforms like Mercor. Claude didn't magically learn to write good code by reading all of GitHub - humans trained it to do that, more or less manually.


Replies

rapind today at 3:13 PM

If you pay me to curate a playlist of musical hits, can you now publish and charge people for access to that playlist (*including the curated material)? Can we do the same with movies? Books?

/edit Added a note to make it more obvious that the material is included in the playlist, just like the material is incorporated as part of curated AI models.

datsci_est_2015 today at 4:30 PM

Given the breadth of LLM knowledge, I somehow doubt this. Sure, it’s probably responsible for the quality of LLM insights, but I don’t think anyone was asking experts about e.g. the complex ecological effects of invasive zebra mussels and their provenance in Lake Michigan.

visarga today at 3:00 PM

No, they do RLVR (reinforcement learning with verifiable rewards) like everyone else. And they probably use Claude data too, with a human in the loop and tool feedback.

jaen today at 2:38 PM

...and the rest of the training data (i.e. the entire corpus of copyrighted works) was not written by experts expecting compensation? Double standards.

freejazz today at 3:10 PM

So? What about the authors of all the works these companies stole?