Hacker News

ndriscoll · yesterday at 2:07 PM · 2 replies

How exactly is it different? All the model itself is, is a probability distribution for the next token given the input, fitted to a giant corpus, i.e. a description of statistical properties. On its own it doesn't even "do" anything, and even if you wrap it in a text generator and feed it literal gcc source code fragments as input context, the output will quickly diverge, because it's not a copy of gcc. It doesn't contain a copy of gcc. It's a description of what language is common in code in general.

In fact, we could make this concrete: use the model as the prediction stage in a compressor and compress gcc with it. The residual is the extent to which the model does not contain gcc.
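The idea above can be sketched in miniature. A predictor turned into a compressor needs, in the ideal case, -log2 P(symbol) bits per symbol (arithmetic coding gets within a couple of bits of this), so the total code length of a text under the model is exactly the "residual" the comment describes. As a toy stand-in for an LLM's next-token distribution, this sketch uses a smoothed character-unigram model fitted on a small corpus; the corpus and test strings are made up for illustration:

```python
import math
from collections import Counter

def fit_unigram(corpus):
    # Toy stand-in for an LLM's next-token distribution:
    # character unigram frequencies with add-one smoothing
    # over a 256-symbol byte alphabet.
    counts = Counter(corpus)
    total = sum(counts.values())
    vocab = 256
    return lambda ch: (counts.get(ch, 0) + 1) / (total + vocab)

def residual_bits(model, text):
    # Ideal code length under the model: -log2 P(symbol),
    # summed over the text. This is the size the text would
    # compress to with this model as the prediction stage.
    return sum(-math.log2(model(ch)) for ch in text)

# Hypothetical training corpus: repetitive C-like text.
corpus = "int main(void) { return 0; }" * 100
model = fit_unigram(corpus)

seen = "int main(void) { return 0; }"   # resembles the training data
novel = "zqx jkl vbn wmp"               # unlike the training data

# Per-character cost is low for text the model "expects" and high
# for text it has never seen; the gap is what the model does NOT contain.
print(residual_bits(model, seen) / len(seen))
print(residual_bits(model, novel) / len(novel))
```

A real LLM-based compressor conditions on the full preceding context rather than using unigram frequencies, but the accounting is the same: a large residual for gcc would mean the model stores little of gcc's actual bytes.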


Replies

jacquesm · yesterday at 2:16 PM

There have already been multiple documented cases of LLMs spitting out fairly large chunks of their input corpus. There have been experiments to get models to replicate the entirety of 'Moby Dick', with some success for one model but less for others, most likely due to output filtering meant to prevent the generation of such texts. But that doesn't mean the texts aren't in there in some form. And how could they not be? It is just a lossy compression mechanism; the degree of loss is not really relevant to the discussion.

gus_massa · yesterday at 3:39 PM

For an infographic, you can perhaps claim fair use. I think it makes a lot of sense, but IANAL.

For a fan fiction episode that is different from all official episodes, you may cross your fingers.

For a remake of one of the episodes with a different camera angle and similar dialog, I expect that you will get into trouble.
