I run a small startup called SEOJuice, where I need to crawl a lot of pages all the time, and I can say that the biggest issue with crawling is the blocking part and how much you need to invest to circumvent Cloudflare and similar, just to get access to any website. The bandwith and storage are the smallest cost factor.
Even though, in my case, users add their own domains, it's still took me quite a bit of time to reach 99% chance to crawl a website — with a mix of residential proxies, captcha solvers, rotating user-agents, stealth chrome binaries, otherwise I would get 403 immediately with no HTML being served.
Blocking seems really popular. I wonder if it coincides with stack overflow closing.
> the biggest issue with crawling is the blocking part and how much you need to invest to circumvent Cloudflare and similar … mix of residential proxies, captcha solvers, rotating user-agents, stealth chrome binaries
I would like to register my hatred and contempt for what you do. I sincerely hope you suffer drastic consequences for your antisocial behavior.
I wonder if circumvention is legal. It's so odd. In the US it seems you can just do this whereas if I'd start something like this in the EU, I don't think I could.