Hacker News

AI is just unauthorised plagiarism at a bigger scale

529 points by speckx today at 1:38 PM | 389 comments

Comments

danorama today at 4:24 PM

There’s a fallacy that gets used a whole lot to justify things like this (not just with LLMs), and I see it in many of the comments here: If it’s OK (or at least negligible on a small scale), then it must be OK on a large scale.

It usually goes something like: If I can make money by learning something from a web page, why does a computer making money by learning everything from everyone upset people so? It’s the same thing!

It’s like if I go to Golden Gate Park and pick one flower, I shouldn’t do that, but no one cares. But if I build a machine to automatically cut every flower in the park because I want to sell them, that’s different.

“You say I can pick one flower, but you get upset when I take a bunch. That’s inconsistent. Check and mate.”

But quantitative changes in an activity produce qualitative changes. Everyone knows this, but sometimes they seem to find it inconvenient to admit it. Not that effects of the qualitative change are always bad, but they are often different, and worth considering rather than dismissing.

dvduval today at 2:09 PM

The broader problem of original sources not being given credit in a way that rewards them remains. Website owners are paying to host their content so that spiders can come and crawl it and index it into the AI, and then if they’re lucky they might get a citation, but otherwise there’s very little reward for being a provider of content. And of course, this is getting worse and worse. Why look at a website when it’s all in AI? The counter to that is that maybe we need to start closing websites to crawlers and putting everything behind a login.

deaton today at 2:09 PM

"Steal an apple and you're a thief. Steal a kingdom and you're a statesman." - Literal Disney villain

tancop today at 2:46 PM

if theres just one good thing coming out of ai its breaking copyright law forever. no one should be able to "own" ideas. royalties for commercial use is another thing and i support it but what we know as (non commercial) piracy and unlicensed fan art should be 100% legal

pluc today at 2:15 PM

Seriously how is this surprising? We all know AI companies stole troves of data to train their models, why do you think they'll stop? Have they faced consequences for the mass theft of copyrighted data?

You can't steal or profit off of that data, but it's fine for them for whatever reason. I guess because they're a force for good in the world and are pushing humanity forward eh?

storus today at 2:46 PM

This is really not so clear-cut, as "fair use" might cover 99% of all data scraping; you are not reproducing the originals, just using them to estimate the probability distribution of tokens in pre-training. You are never going to get the exact book word-for-word using LLMs.

MontyCarloHall today at 2:33 PM

Did You Say “Intellectual Property”? It's a Seductive Mirage. [0]

[0] https://www.gnu.org/philosophy/not-ipr.html

kstenerud today at 1:55 PM

> their article contains links to my actual website, with the exact link text (?!)

I'm having a hard time understanding what's wrong here? Unless the link text is very long, why would someone linking to your article use different words for the link text?

ggillas today at 2:32 PM

IP attorney here and actively working on this problem.

nla: if you create content online (public repo code, blog, podcast, YouTube, publishing) the smartest thing you can do is to file a US copyright, even if you have a hobby blog.

Anthropic paid $1.5B in a class settlement to authors because it was piracy of copyrighted works. If we as a HN community had our works protected, there are potentially huge statutory damages for scraping by any and all llms. I work with hundreds of writers and publishers and am forming a coalition to protect and license what they're creating.

dominicrose today at 4:14 PM

Talking about a bigger scale may be confusing because some of the information AI can train on comes from niches.

I wouldn't mind if an AI trained on old Disney movies (or new ones for that matter), but exploiting niches (like local newspapers) seems bad.

adamzwasserman today at 2:12 PM

People need to cope with the fact that no thought is original. Even Newton and Leibniz were having the same thoughts at the same time. Get over it.

rastrojero2000 today at 4:06 PM

It's not though, that's just the business case, where the perverse business incentives lie.

LLMs are really cool text generators and it turns out we can generate a bunch of things from text they generate.

Problem is, several of those things can be horrendous for the continued survival of the species and those happen to make the people running those AIs a ton of money, and, in perverted societies, thus also clout.

hparadiz today at 2:24 PM

You guys have fun arguing. I'm gonna be building cool stuff.

erelong today at 4:15 PM

"intellectual property" is something of a legal fiction

andai today at 2:38 PM

There's two aspects to this.

The pretraining (common crawl, i.e. the entire internet. Also books and papers, mostly pirated), and the realtime web scraping.

The article appears to be about the latter.

Though the two are kind of similar, since they keep updating the training data with new web pages. The difference is that, with the web search version, it's more likely to plagiarize a single article, rather than the kind of "blending" that happens if the article was just part of trillions of web pages in the training data.

There's this old quote: "If you steal from one artist, they say oh, he is the next so-and-so. If you steal from many, they say, how original!"

damnesian today at 3:58 PM

Not the first time I've had the thought that massive lawsuits could be in every AI company's future. Surely they realize they are living on borrowed time simply by being the current trendy tech.

oytmeal today at 2:37 PM

Isn't plagiarism inherently unauthorized?

tptacek today at 1:55 PM

People were effectively copying websites (especially ecommerce tutorials) and beating the original authors at SEO decades before ChatGPT 2.

baq today at 2:18 PM

turns out plagiarism at scale can solve Erdos problems

adamtaylor_13 today at 3:57 PM

I read the article, but I disagree. People are angry, and that's completely understandable. I believe it's a justifiable response to the huge upheaval happening. But being angry about LLMs does not magically transmute their output into "plagiarism".

It has always been possible to take someone's public work, put a twist on it, and then sell it as unique. (I'm not making a moral/ethical argument, only a legal one.) I have yet to see any evidence that LLMs are fundamentally different from that approach.

frankest today at 3:31 PM

You are going to see the same thing that happened with newspapers. Those who want to train the AI with their content (advertisers, PR) will push out more content for AI in the open. Those who have quality content that gives you an advantage will try to lock out AI or get pricy subscription APIs for humans and even pricier for AI.

isoprophlex today at 3:04 PM

> Is this what the pinnacle of human is? Lazy and greedy?

Yes. At least it is what the currently prevailing economic system of "value extraction and capital concentration at all cost" incentivises us towards.

jeisc today at 3:14 PM

AI is an organized intellectual property rip-off in the name of advancing human learning, but the commercialization of the products seems like a legal license to steal.

saghm today at 2:21 PM

It's basically the same thing as the old joke "if you owe the bank a million dollars, you have a problem; if you owe the bank a billion dollars, they have a problem". IP law seems to always be disproportionately wielded against smaller players, and the ones who are big enough get away with it.

ironman1478 today at 4:04 PM

People keep saying open source is an example of how copyright doesn't quite matter. However, many of the biggest open source projects are contributed to by massive corporations. Linux has lots of contributions from all the FAANGs, Red Hat, etc. Yes, it's not protected by copyright, but also the way it's produced is wholly different from how an artistic work is produced. Contributing to Linux is nothing on the balance sheet of Google for example, whereas producing art for an independent person or a whole company whose purpose is to create art can be very expensive.

Artists are taking risks and need legal protection if they want to make art for a living. If artists were making FAANG engineer compensations or all worked at institutions like universities (with all their protections) then maybe they wouldn't care about copyright, but that isn't the living situation for every artist.

You could say an artist shouldn't rely on making art for a living, but that's actually a different discussion.

barnabee today at 3:15 PM

The war on copying is like the war on drugs: unwinnable, and socially useless.

Let information be free for personal and recreational uses[0], and vote for governments that will fund the arts. The corporations will be just fine.

[0] The AI companies and big tech vs publishers, music labels, etc. can fight to the death in the courts over who owes who what, for all I care.

sublinear today at 4:19 PM

At the very least, we see there is minimal practical value for LLMs for any serious work. This is sort of good news. The effort to build this type of "AI" is all in the training data and navigating politics.

That leaves two possibilities: either another AI winter comes as people fail to capture long term value, or we get less swampy models that are much more useful and trained the correct way.

cryptocod3 today at 1:53 PM

There's authorized plagiarism?

dspillett today at 3:38 PM

More like “GenAI enables plagiarism at a bigger scale”.

People copying through GenAI would have done so before if they had a tool that so easily allowed them that facility.

hmokiguess today at 3:07 PM

It's so wild, I can't even think what the end path will look like. Will there be a major settlement? Will this abolish some form of copyright as a precedent? Something else? My brain hurts just to try and reason about it, yet, the fact remains it's now ubiquitous and change is inevitable.

mindcandy today at 3:42 PM

> AI takes in all the input, whether the original authors have consented or not, and do some "learning"

What would it mean for authors who publish content publicly to the web, without access restrictions, to provide consent for learning from it?

"EULA: Most people are allowed to learn from this text. If you work in an AI-related field, even though you can clearly see this page because you are reading this text right now, you are not permitted to learn anything from it. Bob Stanton, you are an a-hole. I do not consent to you learning from this web page. Dave Simmons, you are annoying. But, I'll give you a pass. For now... Also: plumbers. I do not like plumbers for reasons I will not elaborate. No plumbers may learn from my writing in any way."

ecommerceguy today at 2:39 PM

I remember playing around with Writesonic in my days of spammy seo tactics (some of my products weren't allowed on marketplaces & advertising platforms due to hazmat products so..). Often times I would see my own product descriptions nearly verbatim in the output.

100% creators should get compensated by ai platforms for their work.

Further, I can see a day where someone like Reddit will close off or license their data to llms. No doubt they are losing traffic right now.

ProllyInfamous today at 2:20 PM

>>"The underlying purpose of AI is to allow wealth to access skill while removing from the skilled the ability to access wealth." @jeffowski (first I read it, not sure if author)

Bezos' admission, recently, that the bottom 50% of current taxpayers ought'a NOT pay any taxes... is just preparing us for the inevitable UBI'd masses.

: own nothing, be happy!

pull_my_finger today at 2:43 PM

What gets me is when this was brought up, they said "requiring explicit permission will kill the AI industry"[1]. No shit! Why do you think all the rest of us didn't build a business/"industry" around stealing shit? They could have done it at a slower pace while respecting copyright laws, but they were too greedy to be first to market and secure a hold.

[1]: https://www.theverge.com/news/674366/nick-clegg-uk-ai-artist...

motbus3 today at 2:17 PM

It allows data to be compressed into the weights, and the mere coincidence of certain strings of a book will make it spit out the full book

biscuits1 today at 2:59 PM

"Is this what the pinnacle of human is? Lazy and greedy?"

Selfishness, too. But if I follow the logic, and citations are added, how would one enforce a copyright claim if the creator is amorphous and all-knowing?

hiroto_lemon today at 2:40 PM

Worth noting what changed isn't AI itself — copying always existed. LLM just made per-article rewrites a 5-second job. Detection didn't get the same speedup; that's the actual break.

kingleopold today at 2:39 PM

with this logic, business is also just unauthorised plagiarism at a bigger scale, because all the products/services get copied and not all of them have patents etc.?

schwartzworld today at 2:27 PM

Let this sink in: I wanted to open source a package at work and needed approval from legal and other teams to make sure I wasn't leaking anything proprietary. The same executives that worried about proprietary, copyrighted code being leaked 10 years ago are now mandating using the plagiarism machine.

The whole AI bubble is The Emperor's New Clothes, and it feels like more people are finally admitting it.

illiac786 today at 3:39 PM

Isn’t it rather authorized plagiarism?

I_am_tiberius today at 3:45 PM

It's essentially a new napster.

peterbell_nyc today at 2:15 PM

I do just want to highlight that this is also what humans do. We read a bunch of content online and then use it in our work product. The vast majority of the value that I provide comes from copyrighted information that I have ingested - either directly with a payment to the creator (bought and read the book, paid for and attended the seminar) or indirectly via third party blog posts or summaries where I did not then pay the originator of the materials.

I think there are real questions around motivations for creation of novel, high quality valuable content (I think they still exist but move to indirect monetization for some content and paywalls for high value materials).

I don't inherently have any problems with agents (or humans) ingesting content and using it in work product. I think we just need to accept that the landscape is changing and ensure we think through the reasons why and how content is created and monetized.

muldvarp today at 2:54 PM

I agree but AI is a) owned by rich people and b) (sadly) too useful for this to matter.

mrbluecoat today at 2:10 PM

> AI ... do some "learning"

Is AI plural or is that a typo?

iloveoof today at 2:35 PM

I don’t know if this author supports OSS but I’ll share this because HN generally is full of people with that mindset.

It’s deeply ironic that if you forget about LLMs and look only at the outcome (we’ve found a way to legally circumvent copyright and the siloing of coding knowledge, making it so you can build on top of (almost) the whole of human coding knowledge without needing to pay a rent or ask for permission), it sounds like the dream of open source software has been realized.

But this doesn’t feel like a win for the philosophy of OSS because a corporation broke down the gates. It turns out for a lot of people, OSS is an aesthetic and not an outcome, it’s a vibe against corporate use or control of software, not for democratized access to knowledge.

waffletower today at 4:03 PM

Use of the word "plagiarism" is plagiarism itself. Culture and thought are deeply shared phenomena. Using a common language, such as English, to communicate is equally an act of plagiarism. You didn't invent these words -- you use them without attribution and without payment. To decry and malign the collective training of all available digitally represented thought and discourse by large language models as simple binary plagiarism is deeply ironic -- where did you pay for your own thoughts? I don't want to live in your pay-per-thought society. I want to live with the ethos "information wants to be free". En garde!

jorisw today at 2:45 PM

> X is just Y but

Can't recall the last time a compelling argument started out like this

energy123 today at 2:59 PM

It's a problem with only one practical solution: taxation.

tiahura today at 2:04 PM

To answer the author's question: Yes, progress IS largely built on the shoulders of those who came before.
