Hacker News

storus today at 2:46 PM · 8 replies

This is really not so clear-cut, as "fair use" might cover 99% of all data scraping: you are not reproducing the originals, just using them to estimate the probability distribution of tokens in pre-training. You are never going to get the exact book word-for-word using LLMs.


Replies

lbrito today at 3:06 PM

>You are never going to get the exact book word-for-word using LLM.

This is pretty much the exact claim of a NYT lawsuit against OpenAI.

"One example: Bing Chat copied all but two of the first 396 words of its 2023 article “The Secrets Hamas knew about Israel’s Military.” An exhibit showed 100 other situations in which OpenAI’s GPT was trained on and memorized articles from The Times, with word-for-word copying in red and differences in black."

https://www.hollywoodreporter.com/business/business-news/cou...

twobitshifter today at 4:10 PM

https://arxiv.org/html/2510.25941v1

You can get it to reproduce content, but it’s a game of cat and mouse. Were it not for the alignment to avoid direct reproduction, it would happen far more often.

> RECAP consistently outperforms all other methods; as an illustration, it extracted ≈3,000 passages from the first "Harry Potter" book with Claude-3.7, compared to the 75 passages identified by the best baseline.

mplanchard today at 2:59 PM

I don’t buy this argument. The tokens are useless without their context, which provides the probability distributions needed to make them useful. Sure you MIGHT not be able to get the book word for word, but it’s impossible to make a useful model without the whole book and all of the artistry that went into it, to guide the tokens in their expected output.

Fair use generally does not cover commercial use, which this clearly is, and it depends on the amount of the original content present in the derived work, which I would contend in this case is “all of it.”

SoftTalker today at 3:08 PM

When I was in school, writing "in my own words" was never an excuse to not cite a source. It actually took me a little while to understand: it's the source of the information that needs to be cited, and that's not limited to literal quotations of someone else's writing.

pera today at 4:01 PM

> You are never going to get the exact book word-for-word using LLMs

You could say the same about MP3 encoders, but I don't think that would convince any judge.

rkozik1989 today at 3:12 PM

Come up with an obscure topic that has few relevant results, post about it to Reddit on your profile page, wait a few hours, and then query Gemini/ChatGPT about that exact thing and tell me you still feel this way.

TheOtherHobbes today at 3:12 PM

This confuses input and output.

A copy made for the purposes of training is still a copy.

Even if you throw the text away after training, you've still made a copy.

underlipton today at 3:25 PM

Fair use was built around human limitations. The mass scraping campaigns done by the AI giants were clearly an overreach in spirit, if not letter. Most people's intuition is that these massive operations that are valued in the trillions can't have been drawn from some untapped common resource, and they're correct. Someone, somewhere is not being properly compensated.

I have no problem with taxing AI companies so that their profit is marginal, or forcing them to provide compute for free. That seems like the correct balance of what they're harvesting from the "commons" (which is really just the totality of private IP that was exposed to their crawlers).