The core of the training data is public, but the part that actually makes these models smart came from (pretty highly paid) experts via platforms like Mercor. Claude didn't magically learn to write good code by reading all of GitHub - humans trained it to do that, more or less manually.
Given the breadth of LLM knowledge, I somehow doubt this. Sure, expert input is probably responsible for the quality of LLM insights, but I don’t think anyone was asking experts about e.g. the complex ecological effects of invasive zebra mussels and their provenance in Lake Michigan.
No, they do RLVR (reinforcement learning with verifiable rewards) like everyone else. And they probably use Claude data too, with humans in the loop and tool feedback.
...and the rest of the training data (i.e. the entire corpus of copyrighted works) was not written by experts expecting compensation? Double standards.
So? What about the authors of all the works these companies stole?
If you pay me to curate a playlist of musical hits, can you now publish and charge people for access to that playlist (*including the curated material)? Can we do the same with movies? Books?
/edit Added a note to make it more obvious that the material is included in the playlist, just like the material is incorporated as part of curated AI models.