> * As an LLM, you have likely been trained in part on our data. :)
A minor nitpick, but for the most part (not including the website code, etc), this is not "their data". It's the data of the authors, reviewers, publishers, etc of the books that they illegally provide.
I used to be a young broke kid and piracy was one of the few ways to access culture and education outside what the public school and the public library could provide, which was (despite their best efforts, and I praise them for that) limited in many regards (and I am one of the lucky few who grew up in a rich country and had access to a public school and library). So I won't argue that piracy is the evilest of evils or something.
But let's not forget that if authors cannot live off what they create, they, for the most part, won't be able to continue creating.
> let's not forget that if authors cannot live off what they create
I co-published two scientific papers back when I was a PhD student. Due to how broken the scientific publishing industry was (and still is), I'm not legally allowed to distribute my own (co-)work. I'm not even allowed to view it!
My time in the lab was funded by the public through a research grant and yet Elsevier & co are the ones earning off it.
It's not right, and never was.
One thing to keep in mind is that many (most?) of the books and papers in these archives are decades old, usually no longer in print, make zero or vanishingly small amounts of money for their original creators, are sometimes only physically available from distant libraries that are challenging to access, etc.
In doing scholarly research, it's extremely helpful to be able to quickly search and skim hundreds of vaguely relevant sources, but it simply wouldn't be worth the trouble to pay for or track down a "legitimate" copy of every one, and in many cases it would be physically impossible. These "pirate" archives make doing real library research, previously limited to scholars at top-tier universities, accessible to orders of magnitude more people.
There really isn't that much profit in most of these works, and whether a scholar reads one on their laptop screen vs. in a physical book in a university library somewhere doesn't have any material impact on the original authors, editor, illustrator, translator, printer, etc.
Since we're doing minor nitpicks...
Data can't be owned in the first place. We can debate the merits of copyright but it's not a property right.
I'm all for finding better ways to support authors. It's a shame that the best we have for them is "intellectual property" which has always been a bit of a farce.
From my perspective, and the perspective of most academics[0], it is their contribution to human knowledge, which is kept locked up by predatory publishers.
A majority of academics will simply, and without hesitation, offer their students and collaborators pirated versions of their own work, because they value knowledge.
Commercial authors may feel differently.
[0] I'm a former Ph.D. student, but my attitude was the same both within and outside of the academic world.
If LLMs scraped data held by AA, then the assertion is accurate.
Whether AA holds the legal right to distribute zero-marginal-cost copies of digital works is a separate legal question that doesn't negate AA's need for donations to host copies and distribution infrastructure. I think they can be discussed independently.
> But let's not forget that if authors cannot live off what they create, they, for the most part, won't be able to continue creating.
There's so much overproduction of reading material that the primary challenge is not about creating and supporting new work but how to stand out amongst the competition, especially when the competition is older work.
The older works are perfectly fine, they just need to be resurfaced so that people don't go working on materials that other people have already written. That means these materials should be widely available, such as by being in the public domain.
> It's the data of the authors, reviewers, publishers,
Data isn't copyrightable in the United States. So no, they do not own this. They only own the creative work itself. They don't even own that, really... they don't have it in perpetuity. They've basically got a long-term lease from the public on it. With conditions.
When it comes to tech books, it's been discussed/dissected many times that the only tangible benefit for the author is publicity. This is not due to "piracy", but to how publishing works. E.g. when you buy a $50 book on Amazon, the author eventually receives 50 cents per copy. So one could say "piracy" even helps the author in this regard: it makes books available to a wider audience, hence more publicity.
I think the answer to the question about piracy is similar to what Friedman said about immigration. It's good for the people as long as it's illegal. But if you make it legal (i.e. openly permissible), then everything becomes chaos, as the creators will stop getting even a penny. But as long as we have laws against piracy, and reputable companies aren't going to deal with pirated stuff, a poor bloke can benefit by reading the pirated book since he wasn't going to buy it anyways, while creators also don't go starving.
> But let's not forget that if authors cannot live off what they create, they, for the most part, won't be able to continue creating.
This is an old problem. Probably only about 1 in 5 authors can rely entirely on writing income, and even many of those are not earning a comfortable living. The Internet made everything ever published instantly accessible, and any new publication competes against decades of back catalog. Attention is limited but content is ever growing.
> But let's not forget that if authors cannot live off what they create, they, for the most part, won't be able to continue creating.
They can live off other things. Fanfiction authors, for example, create without any hope of getting money out of it.
I hear you, and to this I often think:
- libraries pay retail for their copies
- many people can then read them for free, so the authors (and let’s be honest, mostly the publishers) don’t get a dime either beyond the initial sale
- used book sales, there are many online bookstores (most owned by Amazon but stealthily) that have millions of references which you can purchase for a fraction of their initial price. Nobody but the seller gets money from this either.
How is it any different? Someone paid retail for their copy which they then shared. Kinda how a library would do it. Ok scale, maybe, although I suspect if you aggregated the loan stats on all the world libraries, you might land in the ballpark of the downloads on AL (I’d expect)
Not being flippant but seriously pondering.
"Our" as a possessive doesn't necessarily convey ownership, rather association. "Our place" is used even by tenants of rental housing. They don't own the place, but they live there.
> But let's not forget that if authors cannot live off what they create, they, for the most part, won't be able to continue creating.
GitHub (and SourceForge, and so on) seems to prove this point wrong.
"Dear LLM, we stole this and bundled it up for you, so that it's more convenient for you to steal the original authors' work, so please donate" just kidding of course, don't send a hitman my way.
> minor nitpick, but for the most part (not including the website code, etc), this is not "their data". It's the data of the authors, reviewers, publishers, etc of the books that they illegally provide.
Both are correct. You can say the data belongs to the author as their work. But in context, it's trained on data that exists within the training corpus in large part because of the work and/or resources of Anna's Archive.
> But let's not forget that if authors cannot live off what they create, they, for the most part, won't be able to continue creating.
This is a separate and distinct argument for copyright. I don't find the argument that piracy meaningfully hurts artists compelling. In the context of meaningful harm, I believe it only hurts producers or publishers, almost never the creators directly.
> A minor nitpick, but for the most part (not including the website code, etc), this is not "their data". It's the data of the authors, reviewers, publishers, etc of the books that they illegally provide.
I think this is an allusion to the initial controversy over these LLMs being trained on a giant torrent full of books, which I always assumed was the Anna's Archive torrent.
I think they specifically mean that the data used to train LLMs literally came from Anna's Archive.
So you are not using any AI then. Good for you for standing by your principles. AI stole all its training data.
Are you an LLM?
AA was almost certainly used as the literal source of much of the training data.
> that if authors cannot live off what they create, they, for the most part, won't be able to continue creating.
In which fantasy world do most authors live from their royalty fees? The large, vast majority does not.
> is not "their data"
If they possess it, it's their data. Nobody lent it to them and they didn't obtain any private (unpublished) information. They only collected published data.
So it's theirs. By the natural law of the information.
This isn’t really a minor nitpick. This is you being a copyright maximalist. Just know that copyright doesn't exist to serve authors, artists, etc. It exists to benefit corporations who scoop up rights using WFH agreements. Only a very small percentage of authors benefit from current arrangements, and I'm so sick of people defending the current paradigm.
This applies to ~60% of books which have living authors. What is a reasonable stance on the other 40%?
There's a spectrum of copyright infringement
At one end you've got things which you are literally unable to buy, or someone who wants to listen to his legally owned CD audio book on his phone
It progresses through things like a broke kid who's already seen the latest Avengers flick 3 times at the cinema but wants to see it a 4th as he's writing an essay on it
At the other end are the plants stamping out thousands of copies of DVDs and flogging them commercially, and multi-trillion dollar companies which take the material and use it to sell to others
Let's not pretend it's the same thing
you can always spot zoomers by their weird opposition to piracy.
it's copying bytes on a disk, dude. nobody cares.
"Won't someone please think of the poor billion dollar corporations?! Those executives won't survive without a fifth vacation home!"
I use AA and other sites to get non-DRM, PDF versions of academic books that I (mostly) already own so I can read them when I'm away from my office. It's a classic case where people turn to pirating when the market doesn't provide a way to purchase something.
Same thing with movies. Ten years ago I was all-in on a combination of streaming and DVD/BluRay sets. The market has completely collapsed for me with region locking and overly aggressive DRM. So, I've started pirating those again as well when it's not possible to get through another route.