
harshreality · yesterday at 10:53 PM

Only because that quote is famous.

As you said, it's lossy. Try it with any other distinctive but non-famous passage, and you won't get a correct prediction for the immediately following clause, much less for multiple sentences or paragraphs.

That's the case even when an LLM correctly identifies which book the prompted text is from. It still won't accurately continue on from some arbitrary passage. By the time you ask it to reproduce hundreds of words, you're into brand new book territory. Even when it's slop content, it's distinct slop.

The exceptions are cases where a significant number of humans would also know a particular quote from memory. Then, chances are, a frontier LLM will too.

You know how else you can reproduce a quote? Search for it on Google and check the top hits; if it's a significant quote, multiple people have probably quoted it -- legally. You can also search a pirate library for the actual book, then search the book for the quote; while illegal, it's trivially easy to do. So unless you propose to make the free and open internet illegal, I'd suggest that banning LLMs for being "derivative work" creation engines is not so different from destroying the internet.

> I predict, no pun intended, that a time is coming when the idea that it's not a derived work will be challenged in mainstream law.

If judges have any sense whatsoever, LLM generations (without specific prompt crafting to mimic existing works) will be judged to not be derived works and therefore not be violating copyright, in the same sense that you can live and breathe Taylor Swift's music, create new music in the same style, and still not be violating copyright.

The Stability AI case, and how Judge Orrick deals with it, will be interesting and uninteresting at the same time. It deals primarily with the fact that after specific prompting, an image-generation AI can generate something fairly close to existing copyrighted images. That doesn't say anything more about whether LLMs are inherently producers of [only or primarily] derivative works, just as the fact that a human can violate copyright doesn't say anything about whether humans primarily or exclusively output derivative works.

More likely, perhaps, is that everything will become so infused with LLM output that copyright ceases to be relevant, or that copyright law is forced to be rewritten from the ground up.

Copyright requirements, even prior to LLMs, weren't well-specified. There's no objective threshold for how close something has to be to a previous work before the new one violates copyright. It's whatever a judge thinks, referring to the four-factor test but ultimately making subjective judgments about each of those prongs. It's all a house of cards, and LLMs may just be what topples it.