Hacker News

Crawling a billion web pages in just over 24 hours, in 2025

148 points | by pseudolus today at 3:54 AM | 50 comments

Comments

bndr today at 1:26 PM

I run a small startup called SEOJuice, where I need to crawl a lot of pages all the time, and I can say that the biggest issue with crawling is the blocking part: how much you need to invest to circumvent Cloudflare and similar just to get access to any website. Bandwidth and storage are the smallest cost factors.

Even though, in my case, users add their own domains, it still took me quite a bit of time to reach a 99% success rate crawling a website — with a mix of residential proxies, captcha solvers, rotating user-agents, and stealth Chrome binaries. Otherwise I would get a 403 immediately, with no HTML being served.
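The rotation part of that mix can be sketched in a few lines. This is a minimal illustration, not the commenter's actual stack: the user-agent strings and proxy URLs below are made-up placeholders, and real setups draw from much larger, frequently refreshed pools.

```python
import random

# Hypothetical pools -- real crawlers use far larger lists, and the
# proxies would come from a residential proxy provider.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]
PROXIES = [
    "http://proxy-a.example:8080",
    "http://proxy-b.example:8080",
]

def request_profile():
    """Pick a fresh user-agent / proxy pair for each outgoing request,
    so consecutive fetches don't present an identical fingerprint."""
    return {
        "headers": {"User-Agent": random.choice(USER_AGENTS)},
        "proxies": {
            "http": random.choice(PROXIES),
            "https": random.choice(PROXIES),
        },
    }
```

The returned dict is shaped so it can be splatted straight into a `requests.get(url, **request_profile())` call; header and proxy rotation alone won't beat serious bot detection, which is where the captcha solvers and stealth browser binaries come in.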

throwaway77385 today at 1:10 PM

> spinning disks have been replaced by NVMe solid state drives with near-RAM I/O bandwidth

Am I missing something here? Even Optane is an order of magnitude slower than RAM.

Yes, under ideal conditions, SSDs can have very fast linear reads, but IOPS / latency have barely improved in recent years. And that's what really makes a difference.

Of course, compared to spinning disks, they are much faster, but the comparison to RAM seems wrong.

In fact, for applications like AI, even using system RAM is often considered too slow, simply because of the distance to the GPU, so VRAM needs to be used. That's how latency-sensitive some applications have become.

finnlab today at 7:49 AM

Nice work, but I feel like AWS isn't required for this. There are small hosting companies with specialized servers (50 Gbit shared medium for under $10); you could probably do this for under $100 with some optimization.

snowhale today at 5:14 PM

The anti-bot stuff mentioned upthread is real, but at this scale per-domain politeness queuing also becomes a genuine headache. You end up needing to track crawl-delay directives per domain, rate-limit your outbound queues by host, and handle DNS TTL properly to avoid hammering a CDN edge that's mapping thousands of domains to the same IPs. Most crawlers that work fine at 100M pages break somewhere in that machinery at 1B+.
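That per-domain machinery can be sketched as a small scheduler. This is a minimal single-process illustration, not anything from the article: the class and its heap-based design are assumptions, and a production crawler would also resolve DNS and coalesce hosts that share CDN IPs, as the comment notes.

```python
import heapq
import time
from urllib.parse import urlsplit

class PolitenessQueue:
    """Schedule URLs so each host is fetched no more often than its
    crawl-delay allows. Minimal sketch: one process, no DNS grouping."""

    def __init__(self, default_delay=1.0):
        self.default_delay = default_delay
        self.delays = {}   # host -> crawl-delay parsed from robots.txt
        self.next_ok = {}  # host -> earliest time the next fetch may start
        self.heap = []     # (ready_time, seq, url); seq breaks ties stably
        self.seq = 0

    def set_crawl_delay(self, host, seconds):
        self.delays[host] = seconds

    def push(self, url, now=None):
        now = time.monotonic() if now is None else now
        host = urlsplit(url).hostname
        # Schedule this URL at the host's next free slot, then advance
        # the slot by the host's crawl-delay for the URL after it.
        ready = max(now, self.next_ok.get(host, now))
        self.next_ok[host] = ready + self.delays.get(host, self.default_delay)
        heapq.heappush(self.heap, (ready, self.seq, url))
        self.seq += 1

    def pop_ready(self, now=None):
        """Return the next URL whose slot has arrived, else None."""
        now = time.monotonic() if now is None else now
        if self.heap and self.heap[0][0] <= now:
            return heapq.heappop(self.heap)[2]
        return None
```

Two URLs pushed for the same host come back spaced by that host's delay, while URLs for other hosts interleave freely in between — which is exactly the behavior that tends to break first when a frontier grows past a few hundred million entries.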

dangoodmanUT today at 1:35 PM

> because redis began to hit 120 ops/sec and I’d read that any more would cause issues

Suspicious. I don’t think I’ve ever read anything that says Redis taps out below tens of thousands of ops…

thefounder today at 12:44 PM

Well, the most important part seems to be glossed over, and that’s the IP addresses. Many websites simply block (or want to block) anything that isn’t Google and isn’t a “real user”.

ph4rsikal today at 12:41 PM

When I read this, I realize how small Google makes the Internet.

sunpolice today at 2:02 PM

I was able to get 35k req/sec on a single node with Rust (custom HTTP stack, custom HTML parser, custom queue, custom KV database) with obsessive optimization. It's possible to scrape a Bing-size index (say 100B docs) each month with only 10 nodes, for under $15k.

Thought about making it public but probably no one would use it.
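The claim above checks out on a napkin. Using only the numbers in the comment (35k req/sec per node, 10 nodes, a 30-day month):

```python
# Back-of-envelope check of the comment's throughput claim.
req_per_sec_per_node = 35_000
nodes = 10
seconds_per_month = 30 * 24 * 3600  # ~2.59M seconds

pages_per_month = req_per_sec_per_node * nodes * seconds_per_month
print(f"{pages_per_month / 1e9:.0f}B pages/month")  # prints "907B pages/month"
```

At full tilt that is roughly 907B fetches a month, so a 100B-document index fits with about 9x headroom for retries, politeness delays, and nodes running below peak.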

handfuloflight today at 12:51 PM

There was a time when being able to do this meant you were on the path to becoming a (m)(b)illionaire. Still is, I think.

corv today at 5:51 PM

Python is obviously too slow for web-scale crawling.

gethly today at 4:53 PM

> I also truncated page content to 250KB before passing it to the parser.

WTF did I just read?
