Hacker News

bndr today at 1:26 PM (5 replies)

I run a small startup called SEOJuice, where I need to crawl a lot of pages all the time, and I can say that the biggest issue with crawling is the blocking part and how much you need to invest to circumvent Cloudflare and similar, just to get access to any website. The bandwidth and storage are the smallest cost factor.

Even though, in my case, users add their own domains, it still took me quite a bit of time to reach a 99% chance of successfully crawling a website, with a mix of residential proxies, captcha solvers, rotating user-agents, stealth chrome binaries. Otherwise I would get a 403 immediately, with no HTML being served.


Replies

mettamage today at 7:06 PM

I wonder if circumvention is legal. It's so odd. In the US it seems you can just do this, whereas if I started something like this in the EU, I don't think I could.

mrweasel today at 2:20 PM

Can't your users just whitelist your IPs?

0xdeadbeefbabe today at 4:30 PM

Blocking seems really popular. I wonder if it coincides with Stack Overflow closing.

gilrain today at 2:57 PM

> the biggest issue with crawling is the blocking part and how much you need to invest to circumvent Cloudflare and similar … mix of residential proxies, captcha solvers, rotating user-agents, stealth chrome binaries

I would like to register my hatred and contempt for what you do. I sincerely hope you suffer drastic consequences for your antisocial behavior.

spiderfarmer today at 3:58 PM

Just stop scraping. I'll do everything to block you.
