Hacker News

halxc, yesterday at 8:14 PM (4 replies)

We all saw verbatim copies in the early LLMs. They "fixed" it by implementing filters that trigger rewrites on blatant copyright infringement.

It is a research topic for heaven's sake:

https://arxiv.org/abs/2504.16046


Replies

RyanCavanaugh, yesterday at 8:18 PM

The internet is hundreds of billions of terabytes; a frontier model is maybe half a terabyte. While they are certainly capable of doing some verbatim recitations, this isn't just a matter of teasing out the compressed C compiler written in Rust that's already on the internet (where?) and stored inside the model.

ben_w, yesterday at 8:24 PM

We saw partial copies of large or rare documents, and full copies of smaller, widely reproduced documents, not full copies of everything. A 1-trillion-parameter model, for example, is not a lossless copy of a ten-petabyte slice of plain text from the internet.
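
To put rough numbers on that size argument, here is a back-of-the-envelope sketch. The parameter count and corpus size come from the comment; the 2 bytes per parameter (16-bit weights) is an assumption added for illustration, and the function name is hypothetical.

```python
# Back-of-the-envelope check of the "model is not a lossless copy" argument.
# PARAMS and CORPUS_BYTES come from the comment above; BYTES_PER_PARAM is an
# assumption (16-bit weights), not something stated in the thread.
PARAMS = 1_000_000_000_000      # 1 trillion parameters
BYTES_PER_PARAM = 2             # assumed: 16-bit weights
CORPUS_BYTES = 10 * 10**15      # ten petabytes, decimal units


def required_compression_ratio(corpus_bytes, params, bytes_per_param):
    """Ratio by which the corpus would have to shrink to fit in the weights."""
    model_bytes = params * bytes_per_param
    return corpus_bytes / model_bytes


ratio = required_compression_ratio(CORPUS_BYTES, PARAMS, BYTES_PER_PARAM)
print(f"model size: {PARAMS * BYTES_PER_PARAM / 10**12:.1f} TB")
print(f"required lossless compression ratio: {ratio:,.0f}x")
```

Under these assumptions the weights come to about 2 TB, so storing the whole slice verbatim would need roughly 5,000x lossless compression, far beyond what general-purpose text compressors achieve.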

The distinction may not have mattered for copyright laws if things had gone down differently, but the gap between "blurry JPEG of the internet" and "learned stuff" is more obviously important when it comes to e.g. "can it make a working compiler?"

Aurornis, yesterday at 10:39 PM

Simple logic will demonstrate that you can't fit every document in the training set into the parameters of an LLM.

Citing a random arXiv paper from 2025 doesn't mean "they" used this technique. It was someone's paper that they uploaded to arXiv, which anyone can do.

soulofmischief, yesterday at 9:13 PM

The point is that it's a probabilistic knowledge manifold, not a database.
