I have no idea how an LLM company can make any argument that its use of content to train its models is allowed that doesn't apply equally to distillers training on that LLM's output.
"The distilled LLM isn't stealing the content from the 'parent' LLM, it is learning from the content just as a human would, surely that can't be illegal!"...
Because the terms set by each provider are different.
American models train on public data that doesn't carry a "do not use this without permission" clause.
Chinese models train on the output of models whose terms do include a "you will not reverse engineer" clause.
The argument is that converting static text into an LLM is sufficiently transformative to qualify for fair use, while distilling one LLM's output to create another LLM is not. Whether you buy that or not is up to you, but I think that's the fundamental difference.
When you buy, or pirate, a book, you haven't entered into a business relationship with the author that specifically forbids you from using the text to train models. When you get tokens from one of these providers, you sort of have.
I think it's a pretty weak distinction. By separating the concerns (one company collects a corpus and then "illegally" sells it for training), you can pretty much exactly reproduce the acquire-books-and-train-on-them scenario. But in the simplest case, the EULA does actually make it slightly different.
Like, if a publisher pays an author to write a book, with the contract specifically saying the publisher isn't allowed to train on that text, and then the publisher trains on it anyway, that's clearly worse than someone just buying a book and training on it, right?