The internet is hundreds of billions of terabytes; a frontier model is maybe half a terabyte. While models are certainly capable of some verbatim recitation, this isn't just a matter of teasing out the compressed C compiler written in Rust that's already on the internet (where?) and stored inside the model.
This seems related: it may not be a codebase, but researchers were able to extract near-verbatim books out of Claude 3.7 Sonnet.
https://arxiv.org/pdf/2601.02671
> For Claude 3.7 Sonnet, we were able to extract four whole books near-verbatim, including two books under copyright in the U.S.: Harry Potter and the Sorcerer’s Stone and 1984 (Section 4).
> The internet is hundreds of billions of terabytes; a frontier model is maybe half a terabyte.
The lesson here is that the Internet compresses pretty well.
(I'm not needlessly nitpicking, as I think it matters for this discussion)
A frontier model (e.g. the latest Gemini or GPT) is likely several-to-many times larger than 500GB. Even DeepSeek-V3 was around 700GB.
But your overall point still stands, regardless.
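For a rough sense of where figures like these come from, here is a back-of-the-envelope sketch. It assumes a weights-only checkpoint whose size is simply parameter count times bytes per parameter; the 671B total parameter count for DeepSeek-V3 is the publicly reported figure, and the helper name `checkpoint_size_gb` is made up for illustration.

```python
def checkpoint_size_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate weights-only checkpoint size in GB (10^9 bytes).

    Assumption: every parameter is stored at the same precision and
    nothing else (optimizer state, metadata) is counted.
    """
    return num_params * bytes_per_param / 1e9


# Publicly reported total parameter count for DeepSeek-V3.
DEEPSEEK_V3_PARAMS = 671e9

# Size at two common storage precisions.
for label, bytes_per_param in [("FP8", 1), ("BF16", 2)]:
    size = checkpoint_size_gb(DEEPSEEK_V3_PARAMS, bytes_per_param)
    print(f"{label}: ~{size:.0f} GB")

# Conversely: how many BF16 parameters fit in "half a terabyte"?
params_in_500gb = 500e9 / 2
print(f"500 GB at BF16 holds ~{params_in_500gb / 1e9:.0f}B parameters")
```

At one byte per parameter this lands close to the "around 700GB" figure, and it also shows that half a terabyte at BF16 would correspond to a model of roughly 250B parameters.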
You got a source on frontier models being maybe half a terabyte? That doesn't pass the sniff test.
> this isn't just a matter of teasing out the compressed C compiler written in Rust that's already on the internet (where?)
A quick search brings up several C compilers written in Rust. I'm not claiming they are necessarily in Claude's training data, but they do exist.
https://github.com/PhilippRados/wrecc (unfinished)
https://github.com/ClementTsang/rustcc
https://codeberg.org/notgull/dozer (unfinished)
https://github.com/jyn514/saltwater
I would also like to add that as language models improve (in the sense of decreasing loss on the training set), they in fact become better at compressing their training data ("the Internet"), so that a model that is "half a terabyte" could represent many times more concepts in the same amount of space. Comparing only the relative sizes of the internet and a model may not make this clear.
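To make the loss-as-compression link concrete, here is a minimal sketch. It relies on the standard result that a predictor with average cross-entropy of L nats per token can, via arithmetic coding, encode text in roughly L / ln 2 bits per token. The token count, the bytes-per-token figure, and the loss values below are hypothetical, chosen only for illustration, and the result is the size of the encoded text given the model, not the size of the model itself.

```python
import math


def compressed_size_bytes(num_tokens: float, loss_nats_per_token: float) -> float:
    """Approximate size of text encoded with a model as the predictor.

    Cross-entropy in nats is converted to bits (divide by ln 2),
    then multiplied by the token count and converted to bytes.
    """
    bits_per_token = loss_nats_per_token / math.log(2)
    return num_tokens * bits_per_token / 8


NUM_TOKENS = 1e12          # hypothetical corpus size in tokens
RAW_BYTES_PER_TOKEN = 4.0  # hypothetical average of raw text bytes per token

raw_size = NUM_TOKENS * RAW_BYTES_PER_TOKEN

# Hypothetical losses: lower loss means fewer bits per token.
for loss in (3.0, 2.0, 1.5):
    size = compressed_size_bytes(NUM_TOKENS, loss)
    print(f"loss={loss:.1f} nats/token -> {size / 1e12:.2f} TB "
          f"(ratio {raw_size / size:.1f}x)")
```

Under these assumptions, halving the loss halves the encoded size, which is the sense in which a better model is a better compressor of the same data.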