Scraping static content from a website at near-zero marginal cost to its server and scraping an expensive LLM service provided for free are different things.
The former relies on fairly controversial ideas about copyright and fair use to qualify as abuse, whereas the latter is direct financial damage – by your own direct competitors no less.
It's fun to poke at a seeming hypocrisy of the big bad, but the similarity in this case is quite superficial.
I understand why OpenAI is trying to reduce its costs, but it simply isn't true that AI crawlers aren't creating very significant load, especially those crawlers that ignore robots.txt and hide their identities. This is direct financial damage and it's particularly hard on nonprofit sites that have been around a long time.
Interesting how other people's cost is "near-zero marginal cost" while yours is "an expensive LLM service". Also, others' rights are "fairly controversial ideas about copyright and fair use" while yours is "direct financial damage". I like how you frame this.
That is ridiculous.
You imply that "an expensive LLM service" is harmed by abuse, but every other service is not? Because their websites are "static" and "near-zero marginal cost"?
You have no clue what you are talking about.
Let's not try to qualify the wrongs by picking a metric and evaluating just one side of it. A static website owner could be running on a very small budget, and scraping from bots can bring down their business too. The chances of a static website owner burning through their own life savings are probably higher.
Have you not seen the multiple posts that have reached the front page of HN with people taking self-hosted Git repos offline or having their personal blogs hammered to hell? Cause if you haven't, they definitely exist and get voted up by the community.
It's not like those models are expensive because of the usefulness they extracted from scraping others without permission, right? You're not even scratching the surface of the hypocrisy.
Getting scraped by abusive bots that bring down the website because they overload the DB with unique queries is not marginal. I spent a good half of last year adding extra layers of caching, Cloudflare, you name it, because our little hobby website kept getting DDoS'd by bots scraping the web for training data.
Never in 15 years of running the website did we have such issues, and you can be sure cache layers were already in place for it to last this long.
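The "extra layer of caching" is nothing exotic, by the way. A minimal sketch (the decorator name, TTL value, and `expensive_page` function are all illustrative, not our actual code): memoize rendered pages for a short window so a thousand scraper hits cost one DB query instead of a thousand.

```python
import time
from functools import wraps

def ttl_cache(ttl_seconds=300):
    """Cache results so repeated scraper hits don't each reach the DB."""
    def decorator(fn):
        store = {}
        @wraps(fn)
        def wrapper(*args):
            now = time.time()
            hit = store.get(args)
            if hit is not None and now - hit[1] < ttl_seconds:
                return hit[0]  # serve from cache, skip the query
            result = fn(*args)
            store[args] = (result, now)
            return result
        return wrapper
    return decorator

calls = {"n": 0}

@ttl_cache(ttl_seconds=60)
def expensive_page(slug):
    calls["n"] += 1  # stand-in for the unique DB query a scraper triggers
    return f"rendered:{slug}"
```

The catch, as others point out downthread, is that training crawlers tend to walk every unique URL exactly once, which is the worst possible access pattern for any cache.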
It's more ironic because without all the scraping openai has done, there would have been no ChatGPT.
Also, it's not just the cost of the bandwidth and processing. Information has value too. Otherwise they wouldn't bother scraping it in the first place. They compete directly with the websites featuring their training data and thus they are taking away value from them just as the bots do from ChatGPT.
In fact the more I think of it, I think it's exactly the same thing.
"near-zero marginal costs". For whom exactly????
https://drewdevault.com/2025/03/17/2025-03-17-Stop-externali...
I don't think a rule along the lines of "doing $FOO to a corporation is forbidden, but doing $FOO to a charitable initiative is fine" is at all fair.
What "$FOO" actually is, is irrelevant. I'm curious how you would convince people that this sort of rule is fair.
The corp can always ban users who break ToS, after all. They don't need any help. The charitable initiative can't actually do that, can they?
You’re describing the tragedy of the commons. No single raindrop thinks it’s responsible for the flood.
> Scraping static content from a website at near-zero marginal cost to its server
It's not possible to know in advance what is static and what is not. I have some rather stubborn bots making several requests per second to my server, completely ignoring robots.txt and rel="nofollow", using residential IPs and browser user-agents. It's just a mild annoyance for me, although I did try to block them, but I can imagine it might be a real problem for some people.
I'm not against my website getting scraped, I believe being able to do that is an important part of what the web is, but please have some decency.
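For reference, the opt-out these bots are ignoring is just a plain text file at the site root. GPTBot is OpenAI's published crawler token; the Crawl-delay value here is illustrative, and not every crawler honors that directive even when it does read the file:

```text
# /robots.txt
User-agent: GPTBot
Disallow: /

User-agent: *
Crawl-delay: 10
```

The whole system is honor-based, which is exactly the problem when crawlers spoof browser user-agents from residential IPs.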
60% of our traffic is bot, on average. Sometimes almost 100%.
It is direct financial damage if my server's not on an unmetered connection — after years of bills coming in around $3/mo, I got a surprise >$800 bill on a site nobody on earth appears to care about besides AI scrapers.
It hasn’t even been updated in years, so hell if I know why it needs to be fetched constantly and aggressively. But fuck every single one of these companies now whining about bots scraping and victimizing them; here’s my violin.
AI providers also claim to have small marginal costs. The price of a token supposedly bakes in the cost of model training, so it's not that different from, e.g., your server costs being low while the content production costs are high. And in many cases AI companies are direct competitors (of artists, musicians, etc.)
(TBH it's not clear to me that their marginal costs are low. They seem to pick based on narrative.)
My website serving git that only works from Plan 9 is serving about a terabyte of web traffic monthly. Each page load is about 10 to 30 kilobytes. Do you think there's enough organic, non-scraper interest in the site that scrapers are a near-zero part of the cost?
> near-zero marginal cost
Lol, you single-handedly created a market for Anubis, and in the past 3 years the Cloudflare captchas have multiplied at least 10-fold; now they are even on websites that were very vocal against them. Many websites are still drowning — the GNU family of sites is regularly only accessible through the Wayback Machine. Spare me your tears.
You are, of course, ignoring the production costs of the static content that OpenAI is stealing.
Stop justifying their anti-social behavior because it lines your pockets.
> Scraping static content
How do you know the content is static?
The cost is so marginal that many, many websites have been forced to add cloudflare captchas or PoW checks before letting anyone access them, because the server would slow to a crawl from 1000 scrapers hitting it at once otherwise.
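Those PoW checks work roughly like this (a toy sketch of the general hashcash-style idea, not Anubis's actual scheme; the difficulty and challenge format are made up): the server hands the client a challenge, the client must brute-force a nonce whose hash has N leading zero bits, and the server verifies with a single hash. Cheap once per human visitor, expensive at scraper scale.

```python
import hashlib

def meets_difficulty(digest: bytes, bits: int) -> bool:
    """True if the hash starts with `bits` zero bits."""
    value = int.from_bytes(digest, "big")
    return value >> (len(digest) * 8 - bits) == 0

def solve(challenge: str, bits: int) -> int:
    """Client side: brute-force a nonce. Cheap for one page view,
    ruinous when multiplied across millions of scraped URLs."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if meets_difficulty(digest, bits):
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int, bits: int) -> bool:
    """Server side: one hash to check, so verification stays near-free."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return meets_difficulty(digest, bits)
```

The asymmetry is the whole point: the server does one SHA-256 per visitor while the client does hundreds or thousands, inverting the usual economics where the scraper's request is free and the origin's response is expensive.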
I'm sure the copyright holders would consider your use of their content as direct financial damage
I think this also explains why the checks are moving up the stack.
If the real cost is in actually running the app or the model, then just verifying a browser isn’t enough anymore. You need to verify that the expensive part actually happened.
Otherwise you’re basically protecting the cheapest layer while the expensive one is still exposed.
And yet I have to pay in my time and cash to handle the constant DDoSes from the constant LLM scraping.
It’s not for techbros to decide at what threshold of theft it’s actually theft. “My GPU time is more valuable than your CPU time” isn’t a thing, and Wikipedia’s latest numbers on scraping show that marginal costs at scale are a valid concern.
Are they, actually?
Absolutely not, the former relies on controversial ideas to qualify as legal.
Stealing content from the whole planet and actively reducing the incentive to visit the sites, without financial restitution, is pretty bad.
Because you say it is?
I obviously disagree. I mean, on top of this we are talking about not-open OpenAI.
Speak for yourself.
I don’t know what world you live in but it’s not this one.
Bait or genuine techbro? Hard to say
> Scraping static content from a website at near-zero marginal cost to its server
The gall. https://weirdgloop.org/blog/clankers
The issue is that there are so many awful webmasters whose websites take hundreds of milliseconds to generate a page and get brought down by a couple of requests a second.
> Scraping static content from a website at near-zero marginal cost to its server, vs scraping an expensive LLM service provided for free, are different things.
I bet people being fucking DDOSed by AI bots disagree
Also the fucking ignorance of assuming it's "static content" and not something that needs code running.