Hacker News

ascorbic today at 8:15 AM

Using them was allowed as fair use – it was the downloading of the pirated copies that was infringement. That's why Anthropic switched to scanning paper books.


Replies

maccard today at 8:22 AM

> That's why Anthropic switched to scanning paper books.

After they threw away all the tainted data from the pirated books, right?

pera today at 10:34 AM

> Using them was allowed as fair use

That is only relevant in the US, and even there it is still not clear-cut whether the fair use doctrine applies to all these scenarios. Outside the US the situation is quite different: for example, take a look at the recent ruling in GEMA vs OpenAI in Germany.

The reality is that the copyright issue with generative AI is very complex, and reaching anything resembling a conclusion will take much more than a few paragraphs of opinion from an American district judge.

kykeonaut today at 8:51 AM

Isn't scanning also a form of copyright infringement? You are making a digital copy of a book, which is the same thing as downloading a book from the internet...

olalonde today at 8:52 AM

> That's why Anthropic switched to scanning paper books.

Could they not just subscribe to the academic publishers like universities do? Or buy eBooks? I don't understand how the "scanning" part is relevant here, other than perhaps used physical books being cheaper.

realusername today at 8:18 AM

If using the books is fair use, then distilling the model, which is just a derived product of those books, is also fair use.

These companies are trying to have their cake and eat it too.

nicce today at 8:23 AM

In a different world it is not fair use. The benefits of a crime should always be taken away. If you isolate the training from the pirating, you may say that it was fair, but that completely misses the point. The sole purpose of the pirating (aka the crime) was to train the models.
