Where are the training data and training scripts, since you are calling this open source?
Edit: it seems "open source" was edited out of the parent comment.
They are exactly open source. The training data is the internet. Don't say it's on the internet. It IS the internet.
The training scripts are in Megatron and vLLM.
Aww yes, let me push a couple petabytes to my git repo for everyone to download...
doesn't it get tiring after a while? using the same (perceived) gotcha, over and over again, for three years now?
no one is ever going to release their training data because it contains every copyrighted work in existence. everyone, even the hecking-wholesome safety-first Anthropic, is using copyrighted data without permission to train their models. there you go.