Hacker News

RyanCavanaugh · yesterday at 8:18 PM

The internet is hundreds of billions of terabytes; a frontier model is maybe half a terabyte. While models are certainly capable of some verbatim recitation, this isn't just a matter of teasing out a compressed C compiler written in Rust that's already on the internet (where?) and stored inside the model.
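
For scale, a quick back-of-envelope sketch (Python; both figures are just the order-of-magnitude guesses from the comment above, not measurements):

```python
# Implied "compression ratio" if the model literally stored the internet.
internet_bytes = 100e9 * 1e12   # "hundreds of billions of terabytes" ~ 1e23 bytes
model_bytes = 0.5 * 1e12        # "maybe half a terabyte"

ratio = internet_bytes / model_bytes
print(f"ratio ~ {ratio:.1e} : 1")  # ~ 2.0e+11 : 1
```

Roughly eleven orders of magnitude, which is the point: lossless storage of everything is off the table.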


Replies

silver_sun · today at 3:31 AM

> this isn't just a matter of teasing out the compressed C compiler written in Rust that's already on the internet (where?)

A quick search brings up several C compilers written in Rust. I'm not claiming they are necessarily in Claude's training data, but they do exist.

https://github.com/PhilippRados/wrecc (unfinished)

https://github.com/ClementTsang/rustcc

https://codeberg.org/notgull/dozer (unfinished)

https://github.com/jyn514/saltwater

I would also add that as language models improve (in the sense of decreasing loss on the training set), they become better at compressing their training data ("the Internet"), so a model that is "half a terabyte" can represent far more content in the same amount of space. Comparing only the raw size of the internet to the size of a model obscures this.
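
To make the loss-to-compression link concrete, here is a minimal sketch (Python; the token probabilities and token lengths are invented for illustration). By the source-coding bound, an arithmetic coder driven by a model's next-token probabilities spends about -log2(p) bits per token, which is exactly the model's cross-entropy loss measured in bits:

```python
import math

# Hypothetical probabilities a model assigns to each observed next token;
# a better-trained model (lower loss) assigns higher probabilities on average.
token_probs = [0.42, 0.09, 0.71, 0.18, 0.55]
token_bytes = [4, 6, 3, 5, 4]  # bytes of text covered by each token (illustrative)

# Ideal code length under the model: -log2(p) bits per token.
bits = sum(-math.log2(p) for p in token_probs)
bpb = bits / sum(token_bytes)
print(f"{bits:.1f} bits total, {bpb:.2f} bits per byte "
      f"(vs 8 bits/byte uncompressed)")
```

As the loss drops, those probabilities rise and the bits-per-byte figure falls, so the same half-terabyte of weights effectively "covers" more of the training distribution.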

philipportner · yesterday at 9:46 PM

This seems related: it's not a codebase, but they were able to extract near-verbatim books out of Claude Sonnet.

https://arxiv.org/pdf/2601.02671

> For Claude 3.7 Sonnet, we were able to extract four whole books near-verbatim, including two books under copyright in the U.S.: Harry Potter and the Sorcerer’s Stone and 1984 (Section 4).

seba_dos1 · yesterday at 10:32 PM

> The internet is hundreds of billions of terabytes; a frontier model is maybe half a terabyte.

The lesson here is that the Internet compresses pretty well.

mft_ · yesterday at 10:15 PM

(This isn't needless nitpicking; I think it matters for this discussion.)

A frontier model (e.g. the latest Gemini or GPT) is likely several to many times larger than 500 GB. Even DeepSeek V3 was around 700 GB.

But your overall point still stands.

uywykjdskn · yesterday at 11:45 PM

You got a source on frontier models being maybe half a terabyte? That doesn't pass the sniff test.