People keep throwing this idea around haphazardly, but U.S. courts have pretty consistently decided that training on copyrighted works falls under fair use. You may not like it, but that doesn't make it "illegal".
Have they? Because as far as I can tell, those cases keep getting settled out of court before a legal precedent can be set.
For record-breaking amounts, too.
The courts have never said piracy, which is how the training sets were originally built, is legal. There are several court cases still ongoing over this.
> that training on [lawfully obtained] copyrighted works falls under fair use
Fixed that for you.
> U.S. courts have pretty consistently decided that training on copyrighted works falls under fair use.
I don't believe that this has been resolved at all, and there are quite a few pending lawsuits about it at this very moment.
Right, so it seems that distilling an AI model is legal too, then. At least it is somewhat similar.
You have to admit that "downloading every book ever written for free from a repository of books that is itself illegal to compile and to run, in order to write a text generation tool" being legal is at least unintuitive, to put it mildly.