realusername • yesterday at 8:18 AM • 2 replies

If using the books is fair use, then distilling the model, which is just a derived product of those books, is also fair use.

These companies are trying to have their cake and eat it too.

Replies

drdaeman • yesterday at 9:29 AM

Hmm, training on a book’s text smears the content all over the weights, merging it with all the other texts. The original text isn’t meant to be reproducible in any large part (although IIRC models have been able to emit fairly large chunks verbatim).

In contrast, training on a model’s behavior purportedly replicates that behavior, at least approximately. It gets replicated intentionally, as a whole.

IANAL, but I see a significant difference here: intentionally copying a significant part as a whole into a competing product surely shouldn’t fit under the legal concept of fair use, regardless of whether scanning books for LLM training does.

Whether such things (behaviors) are copyrightable - and whether they should be - is another interesting question. They aren’t algorithms or databases (things clearly and explicitly covered in many copyright laws); they are models of human expectations, something like how we train animals or teach our own.

ascorbic • yesterday at 9:17 AM

Probably, yes. It's likely just a breach of their terms of service. You'll note that they're not suing them; they're trying to get the government to do their work for them.
