I'm not sure how LLMs count as fair use. Is it just that we can't show HOW the works have been encoded in the model, so it's fair use? Or that statistical representations are fair use? Or is it the generation aspect? I can't sell you a Harry Potter book, but I can sell you some service that lets you generate it yourself?
I feel like this has really blown a hole in copyright.
Same. If I invented a novel way of encoding video and used it to pack a bunch of movies into a single file, I would fully expect to be sued if I tried distributing that file, and equally so if I ran a website that let people extract individual videos from it. Why should text be treated differently?
One should also keep in mind the countless people who got much of their education from pirated books.
I'm inclined to agree here. LLMs do not use just a paragraph here and there, as fair use contemplates; they are trained on the entire body of a work.
Or am I misunderstanding something about LLMs?
The judge is claiming that because the use of the books is “so transformative,” using them to train an LLM is fair use.
I’m not familiar with the facts of the case and IANAL, and it’s late, but how did the plaintiffs determine their books were used to train the LLM? Was the model spitting out language similar or identical to their works?
The word “transformative” dates from a time of manual transformative processes, like painting something similar to what you saw in another artist's painting, with all the implied limitations that entails: the time it took you to study that painting, and the time it takes you to create the new one. That has nothing to do with the way LLMs operate. An honest assessment would have found that the word was meant for a wildly different use case, and therefore that it required a bigger and more nuanced discussion.
> Is it just that we can't show HOW the works have been encoded in the model, so it's fair use?
Describing training as “encoding them in the model” doesn’t seem like an accurate description of what is happening. We know for certain that a typical copyrighted work in the training set is not contained within the model: it’s simply not possible to represent the entirety of the training data within a model of that size in any meaningful way. There are also papers showing that memorisation plateaus at a reasonably low rate determined by the size of the model. Training on more works doesn’t result in more memorisation; it results in more generalisation. So arguments based on the idea that those works are copied into the model don’t seem to be founded in fact.
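To put rough numbers on that capacity argument, here's a back-of-envelope sketch in Python. The parameter count, token count, and bytes-per-token figures are illustrative assumptions in the ballpark reported for large models, not measurements of any particular one:

    # Back-of-envelope: could a model losslessly store its training corpus?
    # All figures are illustrative assumptions.
    params = 70e9                      # assume a 70B-parameter model
    model_bits = params * 16           # 16-bit weights

    tokens = 10e12                     # assume ~10 trillion training tokens
    corpus_bits = tokens * 4 * 8       # ~4 bytes of raw text per token

    bits_per_token = model_bits / tokens
    print(f"model capacity: {model_bits:.3g} bits")
    print(f"corpus size:    {corpus_bits:.3g} bits")
    print(f"weight bits available per training token: {bits_per_token:.3f}")
    # ~0.1 bits per token here, versus roughly 4 bits per token even for a
    # strong text compressor (~1 bit per character), so verbatim storage of
    # the whole corpus is off by more than an order of magnitude.

Under these assumptions, even if every weight bit were dedicated to storage, the model falls far short of the capacity needed to retain its training data verbatim, which is consistent with the memorisation-plateau results mentioned above.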
> I can't sell you a Harry Potter book, but I can sell you some service that lets you generate it yourself?
That’s why cases like this are doomed to fail: no model can output any of the Harry Potter books. Memorisation doesn’t happen at that scale; at best, they can output snippets, and that’s clearly below the proportionality threshold for copyright to matter.