Hacker News

snowhale, today at 5:14 PM

The anti-bot stuff mentioned upthread is real, but at this scale per-domain politeness queuing also becomes a genuine headache. You end up needing to track crawl-delay directives per domain, rate-limit your outbound queues by host, and handle DNS TTL properly to avoid hammering a CDN edge that's mapping thousands of domains to the same IPs. Most crawlers that work fine at 100M pages break somewhere in that machinery at 1B+.
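To make the moving parts concrete, here is a minimal single-process sketch of per-host politeness queuing. It is not taken from any particular crawler: `PolitenessScheduler`, `set_crawl_delay`, and the 1-second default delay are hypothetical names and values chosen for illustration, and the caller passes in the current time. It keys on hostname only; a fuller version would probably also share a limit across hosts that resolve to the same IP, which is where the DNS TTL handling comes in.

```python
import heapq
from collections import deque
from urllib.parse import urlsplit


class PolitenessScheduler:
    """Hands out URLs so no host is fetched more often than its crawl delay."""

    def __init__(self, default_delay=1.0):
        self.default_delay = default_delay  # assumed fallback, in seconds
        self.delays = {}        # host -> crawl delay (e.g. from robots.txt)
        self.queues = {}        # host -> deque of pending URLs
        self.next_allowed = {}  # host -> earliest time of the next fetch
        self.ready_heap = []    # (time the host becomes fetchable, host)

    def set_crawl_delay(self, host, seconds):
        self.delays[host] = seconds

    def add(self, url, now):
        host = urlsplit(url).hostname
        if host not in self.queues:
            self.queues[host] = deque()
            # Respect the delay left over from an earlier fetch of this host.
            ready_at = max(now, self.next_allowed.get(host, now))
            heapq.heappush(self.ready_heap, (ready_at, host))
        self.queues[host].append(url)

    def next_url(self, now):
        """Return a URL whose host may be fetched at `now`, else None."""
        if not self.ready_heap or self.ready_heap[0][0] > now:
            return None
        _, host = heapq.heappop(self.ready_heap)
        queue = self.queues[host]
        url = queue.popleft()
        delay = self.delays.get(host, self.default_delay)
        self.next_allowed[host] = now + delay
        if queue:
            heapq.heappush(self.ready_heap, (self.next_allowed[host], host))
        else:
            del self.queues[host]
        return url
```

The heap keeps "which host is fetchable next" cheap to answer even with millions of hosts. In this sketch `next_allowed` grows without bound, so at the scale discussed here it would need eviction or sharding.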


Replies

overfeed, today at 8:40 PM

> this scale per-domain politeness queuing also becomes a genuine headache

Not really a headache - if you've ever implemented resource-based, server-side rate limiting (per-endpoint, with client-ID and/or IP buckets), that's all the logic that's required, adapted for the client side. One could wrap rate-limiting libraries designed for server-side usage and call it a day.
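As a rough sketch of what that adaptation could look like: a token bucket of the kind used server-side, keyed by target host instead of client ID. This is hand-rolled rather than a wrapper around any real library, and `TokenBucket`, `HostRateLimiter`, and the default rate and capacity are made-up names and values for illustration.

```python
import time
from urllib.parse import urlsplit


class TokenBucket:
    """Classic token bucket: refills at `rate` tokens/sec up to `capacity`."""

    def __init__(self, rate, capacity, now):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = now

    def try_acquire(self, now):
        # Refill for the time elapsed since the last call, then spend a token.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class HostRateLimiter:
    """Client-side limiter: one bucket per target host."""

    def __init__(self, rate=1.0, capacity=1.0, clock=time.monotonic):
        self.rate = rate          # assumed default: 1 request/sec per host
        self.capacity = capacity  # assumed default: no bursting
        self.clock = clock        # injectable so it can be tested
        self.buckets = {}         # host -> TokenBucket

    def allow(self, url):
        host = urlsplit(url).hostname
        now = self.clock()
        bucket = self.buckets.get(host)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, now)
            self.buckets[host] = bucket
        return bucket.try_acquire(now)
```

A fetch loop would call `limiter.allow(url)` before each request and requeue the URL when it returns `False`. Setting the per-host rate from a crawl-delay directive would be a small extension of the same structure.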

I hate how people who are bad at parallelizing their user-agents across the internet are causing needless pain and giving scrapers a bad name. They are also causing blowback on the more well-behaved scrapers.