Hacker News

simonw | today at 5:39 PM | 3 replies

> These bots are almost certainly scraping data for AI training; normal bad actors don't have funding for millions of unique IPs thrown at a page. They probably belong to several different companies. Perhaps they sell their scraped data to AI companies, or they are AI companies themselves. We can't tell, but we can guess since there aren't all that many large AI corporations out there.

Is the theory here that OpenAI, Anthropic, Gemini, xAI, Qwen, Z.ai etc are all either running bad scrapers via domestic proxies in Indonesia, or are buying data from companies that run those scrapers?

I want to know for sure. Who is paying for this activity? What does the marketplace for scraped data look like?


Replies

marginalia_nu | today at 7:28 PM

I agree it's more than a bit handwavy. The common consensus seems to be that AI companies are driving this, but it's really hard to conclusively prove who or what is behind the attacks.

Weird part #1 is that the traffic, for the most part, isn't shaped like crawler traffic. It's incredibly bursty and heavily redundant, missing even the most obvious low-hanging-fruit optimizations.
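To illustrate what's missing: the single most basic optimization any real crawler implements is deduplicating its URL frontier so each page is fetched once per pass. A minimal sketch (hypothetical code, not from any actual crawler):

```python
def dedup_frontier(urls):
    """Yield each URL once, in order, dropping redundant repeats.

    Any competently built crawler does at least this before fetching;
    the bot traffic described above apparently skips even this step,
    hitting the same page dozens of times.
    """
    seen = set()
    for url in urls:
        if url not in seen:
            seen.add(url)
            yield url
```

That this trivial dedup step is absent is part of what makes the traffic look so unlike a normal crawler.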

Could be someone is using residential proxies to wrap AI agents' web traffic, but even so, there are a lot of pieces that don't really make sense, like why the traffic pattern is like being hit by a shotgun. It isn't just one request, but anywhere between 40 and 100 redundant requests.

A popular theory is that this comes down to sloppy coding, and that AI companies are too rich to care, but that doesn't really add up either. This isn't just a minor inefficiency: if it is "just" bad coding, they stand to gain monumental efficiency improvements by fixing the issues, in the sense of getting the data much faster, which would be a clear competitive edge.

Really weird.

My unsubstantiated guess is that the residential proxy/botnet is very unreliable, and that's why they fire so many requests. It makes sense if it's sold as a service.
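That guess is easy to make concrete. If each rented residential exit succeeds only a small fraction of the time, a proxy service that promises near-certain delivery has to fan the same URL out to dozens of exits at once, which would look exactly like the 40-100 redundant requests described above. A hypothetical sketch (all numbers and names are assumptions, not measurements):

```python
import random

PROXY_SUCCESS_RATE = 0.05  # assumed reliability of a single residential exit

def fetch_via_proxy(url: str, rng: random.Random) -> bool:
    """Stand-in for one proxied HTTP GET; fails most of the time."""
    return rng.random() < PROXY_SUCCESS_RATE

def fanout_fetch(url: str, fanout: int, seed: int = 0) -> int:
    """Fire `fanout` redundant requests; return how many succeeded."""
    rng = random.Random(seed)
    return sum(fetch_via_proxy(url, rng) for _ in range(fanout))

# With 60 redundant requests, P(at least one success) = 1 - 0.95^60,
# roughly 95%: redundancy buys reliability from unreliable exits.
p_success = 1 - (1 - PROXY_SUCCESS_RATE) ** 60
```

Under those assumed numbers, the shotgun pattern stops looking like a bug and starts looking like the cheapest way to hit a delivery SLA over a flaky botnet.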

oasisbob | today at 6:51 PM

I want more data too.

The root sources of the traffic from residential proxies get murky very quickly.

It's easy to follow the chain partway for some traffic, e.g. "Why are we receiving all this traffic from Digital Ocean? ... oh, it's their hero client Firecrawl, using a deceptive User-Agent" ... but it still leaves the obvious question of who the Firecrawl client is.

Residential proxy traffic is insane these days. There are also plenty of grey-market snowshoe IPs available for the right price, from a handful of ASNs. I regularly see unified crawling missions by unknown agents using 1000+ "clean" IP addresses an hour.
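Spotting that kind of mission in access logs usually comes down to counting distinct client IPs per time window. A hypothetical log-analysis sketch (the tuple layout and threshold are assumptions, not any particular server's format):

```python
from collections import defaultdict
from datetime import datetime

def distinct_ips_per_hour(log_lines):
    """Count distinct client IPs per hour bucket.

    log_lines: iterable of (timestamp, ip, path) tuples.
    A sudden hour with 1000+ distinct "clean" IPs, each making only a
    handful of requests, is the signature described above.
    """
    buckets = defaultdict(set)
    for ts, ip, path in log_lines:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour].add(ip)
    return {hour: len(ips) for hour, ips in buckets.items()}
```

No single IP in such a mission trips a rate limit, which is exactly why per-IP throttling fails against it and aggregate views like this are needed.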

ghywertelling | today at 6:12 PM

https://parallel.ai/

I bet a lot of companies want to provide search results to AI agents.