Hacker News

hsbauauvhabzb, today at 6:57 AM

Does mass scraping need Google for content discovery? Surely most sites publish a sitemap or index that effectively lets them self-enumerate once you know the domain, which is more often than not publicly disclosed?
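To make the self-enumeration point concrete, here is a rough sketch of how a crawler could discover pages from a known domain alone. It assumes the site follows the sitemaps.org protocol and declares its sitemap with a `Sitemap:` line in robots.txt; the function names are made up for illustration, and fetching the files over HTTP is left out.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Return sitemap URLs declared via 'Sitemap:' lines in robots.txt."""
    urls = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the URL's own "https:" survives.
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


def parse_sitemap(xml_text: str) -> tuple[list[str], list[str]]:
    """Return (page_urls, child_sitemap_urls) for a sitemap or sitemap index."""
    root = ET.fromstring(xml_text)
    locs = [el.text.strip() for el in root.iter(SITEMAP_NS + "loc") if el.text]
    if root.tag == SITEMAP_NS + "sitemapindex":
        # An index lists further sitemaps rather than pages.
        return [], locs
    return locs, []
```

A crawler would call `sitemaps_from_robots` on the domain's robots.txt, then apply `parse_sitemap` to each result, following any child sitemaps it returns. Sites that publish neither file would still need some other discovery route.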


Replies

rvz, today at 11:27 AM

What matters is when websites put this new version of reCAPTCHA on their site, just like archive.is has done. Then the scrapers will have a hard time getting around that.