
simonw • yesterday at 3:49 AM • 3 replies • view on HN

I would love to understand this.

Just a few years ago badly behaved scrapers were rare enough not to be worth worrying about. Today they are such a menace that hooking any dynamic site up to a pay-to-scale hosting platform like Vercel or Cloud Run can trigger terrifying bills on very short notice.

"It's for AI" feels like lazy reasoning to me... but what IS it for?

One guess: maybe there's enough of a market now for buying freshly updated scrapes of the web that it's worth a bunch of chancers running a scrape. But who are the customers?

Replies

SCHiM • yesterday at 2:42 PM

The bar to ingest unstructured data into something usable was lowered, causing more people to start doing it.

It used to be that you needed to implement some papers to do sentiment analysis, a reasonably high bar to entry. Now anyone can do it. The result: more people scraping, and with less competent scrapers too.

zerocrates • yesterday at 10:43 PM

I would say there are a couple of aspects.

The crawlers for the big famous names in AI are all less well behaved and more voracious than, say, Googlebot. This is somewhat muddied by the fact that the companies that ran the former "good" crawlers are all also in the AI business, and sometimes tried to piggyback on people having allowed or whitelisted their search-crawling User-Agent. Mostly this has settled a little, in that they're now separating Googlebot from GoogleOther, facebookexternalhit from meta-externalagent, etc. This was an earlier "wave" of increased crawling that was obviously attributable to AI development. In some cases it's still problematic, but it's generally more manageable.
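For the self-identifying crawlers, the split between tokens can be acted on with a simple substring check on the User-Agent header. The following is a minimal sketch, not anyone's actual configuration: the four tokens come from the comment above, while the category labels, the `CRAWLER_TOKENS` table, and the `classify_user_agent` helper are hypothetical names chosen for illustration. User-Agent strings are self-reported, so this only helps with crawlers that admit who they are.

```python
# Hypothetical mapping from User-Agent token to a category label.
# The tokens are the ones named in the thread; the labels are assumptions.
CRAWLER_TOKENS = {
    "googlebot": "search",
    "googleother": "other",
    "facebookexternalhit": "link-preview",
    "meta-externalagent": "other",
}


def classify_user_agent(user_agent: str) -> str:
    """Return the label of the first known token found in the header."""
    ua = user_agent.lower()
    for token, label in CRAWLER_TOKENS.items():
        if token in ua:
            return label
    return "unknown"


if __name__ == "__main__":
    sample = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    print(classify_user_agent(sample))
```

A site could then allow the "search" and "link-preview" labels while throttling or blocking "other", which is roughly the manageable case described here.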

The other stuff, the ones that use every User-Agent under the sun, a zillion datacenter and residential IPs, and constantly rotate their requests so all your naive and formerly-ok rate-based blocking is useless... that stuff is definitely being tagged as "for AI" on the basis of circumstantial evidence. But from the timing of when it seemed to start and the amount of traffic and addresses, I don't have any problem guessing with pretty high confidence that this is AI. To your question of "who are the customers"... who's got all the money in the world sloshing around at their fingertips and could use a whole bunch of scraped pages about ~everything? Call it lazy reasoning if you'd like.
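To make the "naive rate-based blocking is useless" point concrete, here is a rough sketch of a per-IP sliding-window limiter and how address rotation sidesteps it. Everything in it is an assumption for illustration: the `PerIpRateLimiter` class, the limit of 10 requests per 60 seconds, and the documentation-range IP addresses are invented, not taken from any real deployment.

```python
from collections import defaultdict, deque


class PerIpRateLimiter:
    """Naive sliding-window limiter keyed only on the client IP."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of allowed requests

    def allow(self, ip: str, now: float) -> bool:
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


def count_allowed(limiter, requests):
    """Count how many (ip, timestamp) requests the limiter lets through."""
    return sum(1 for ip, t in requests if limiter.allow(ip, t))


# 100 requests in 10 seconds from one address...
single = [("203.0.113.7", i * 0.1) for i in range(100)]
# ...versus the same 100 requests spread over 100 addresses.
rotating = [(f"198.51.100.{i}", i * 0.1) for i in range(100)]

single_allowed = count_allowed(PerIpRateLimiter(10, 60.0), single)
rotating_allowed = count_allowed(PerIpRateLimiter(10, 60.0), rotating)

if __name__ == "__main__":
    print(single_allowed, rotating_allowed)
```

The single-address scraper is cut off after its tenth request, while the rotating one never trips the limit because no individual address exceeds it. The total load on the site is identical in both cases.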

How much this traces back ultimately to the big familiar brand names vs. would-be upstarts, I don't know. But a lot of sites are blocking their crawlers that admit who they are, so would I be surprised to see that they're also paying some shady subcontractors for scrapes and don't particularly care about the methods? Not really.

devsda • yesterday at 4:05 AM

For whatever reason, legislation is lax right now if you claim the purpose of scraping is AI training, even for copyrighted material.

Maybe everyone is trying to take advantage of the situation before the law eventually catches up.

