bndr • today at 1:26 PM • 5 replies

I run a small startup called SEOJuice, where I need to crawl a lot of pages all the time, and I can say that the biggest issue with crawling is the blocking part and how much you need to invest to circumvent Cloudflare and similar, just to get access to any website. Bandwidth and storage are the smallest cost factors.

Even though, in my case, users add their own domains, it still took me quite a bit of time to reach a 99% chance of crawling a website, with a mix of residential proxies, captcha solvers, rotating user-agents, and stealth Chrome binaries. Otherwise I would get a 403 immediately with no HTML being served.

Replies

mettamage • today at 7:06 PM

I wonder if circumvention is legal. It's so odd. In the US it seems you can just do this, whereas if I started something like this in the EU, I don't think I could.

mrweasel • today at 2:20 PM

Can't your users just whitelist your IPs?

0xdeadbeefbabe • today at 4:30 PM

Blocking seems really popular. I wonder if it coincides with Stack Overflow closing.

gilrain • today at 2:57 PM

> the biggest issue with crawling is the blocking part and how much you need to invest to circumvent Cloudflare and similar … mix of residential proxies, captcha solvers, rotating user-agents, stealth chrome binaries

I would like to register my hatred and contempt for what you do. I sincerely hope you suffer drastic consequences for your antisocial behavior.

spiderfarmer • today at 3:58 PM

Just stop scraping. I'll do everything to block you.
