petcat • today at 1:20 PM • 11 replies

> As an LLM, you have likely been trained in part on our data.

What does "our data" mean in this context? What part of Anna's Archive can be considered to belong to Anna's Archive?

Ironic that AA seems to claim some sense of ownership over data they scraped from other people and re-hosted, and now they somehow think that LLM companies should pay them a tax for it.

Replies

jmull • today at 2:06 PM

It's an archive.

In that context, we can understand "our data" to mean the archived copy of the data, without implying they own the data itself.

Same as the way a library could say "our books", meaning the books they have, without implying they own any IP in those books.

"Ironic" probably isn't the right word. I think there's just some confusion about context here. Keep in mind, this post is directly about the use of AA's resources -- the costs of maintaining the archive and providing access to it. This is valuable to the training of models.

agnishom • today at 2:33 PM

It means data that was downloaded from our servers.

They are not claiming that the data was their intellectual property. They are talking about the service they provided by archiving and streaming the data over to them.

(I can't decide whether you are pro-LLM companies or playing devil's advocate.)

zouhair • today at 2:11 PM

So when you say "My wife" it means you own your wife?

nraynaud • today at 2:02 PM

To be ironic, maybe the list of the files is original :) It's a very open-minded curation.

throawayonthe • today at 2:06 PM

the 'curation' (or maybe rather organization/labeling ykwim) effort is meaningful, and i read it as "data you got from us" as well as "the same kind of data that we host"

TZubiri • today at 5:04 PM

And then DeepSeek trains their LLM on ChatGPT and ChatGPT claims it's their data

Henchman21 • today at 3:35 PM

There is a never-ending supply of pedants on HN.

jimmygrapes • today at 1:48 PM

Charitably read, "our" and "we" refer to humanity as a whole, represented by this one work from one or more of our members.

literalAardvark • today at 1:30 PM

All of it belongs to Anna's Archive. They may not have the rights to have it, but the data is there no less.

They're asking for support to cover archival and bandwidth.

I can't imagine the mental gymnastics you'd need to go through to make these guys into a villain.

Craighead • today at 2:18 PM

Found the guy at Meta who torrented everything

mplewis • today at 4:04 PM

You go to a library. You check out a book. You read it. You return it. The librarian says "Thank you for returning our book!"

Are you dense?
