Hacker News

theamk yesterday at 11:56 PM

no? it takes 10 seconds to check:

> The /crawl endpoint respects the directives of robots.txt files, including crawl-delay. All URLs that /crawl is directed not to crawl are listed in the response with "status": "disallowed".

You don't need any scraping countermeasures for crawlers like those.
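For illustration, here is a minimal sketch of how a compliant crawler could produce that kind of per-URL status. It uses Python's standard `urllib.robotparser`; the robots.txt contents, the `ExampleBot` name, the URLs, and the `classify` helper are all made up for the example, and this is not the actual /crawl implementation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""


def classify(urls, user_agent="ExampleBot"):
    """Label each URL as allowed or disallowed and report the crawl delay."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    results = []
    for url in urls:
        allowed = parser.can_fetch(user_agent, url)
        results.append({"url": url, "status": "allowed" if allowed else "disallowed"})
    # crawl_delay() returns None when no Crawl-delay applies to this agent.
    return results, parser.crawl_delay(user_agent)


if __name__ == "__main__":
    statuses, delay = classify([
        "https://example.com/public",
        "https://example.com/private/report",
    ])
    for entry in statuses:
        print(entry)
    print("crawl delay (seconds):", delay)
```

A polite crawler would then wait at least the reported delay between requests and skip every URL marked disallowed.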


Replies

Machatoday at 1:10 AM

So what’s the user agent for their bot? The docs don’t seem to specify a default, and it looks like it’s user-configurable. So it’s yet another opt-out bot that you can only block by having your web server match on its special behaviour.
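To make the concern concrete, here is a small sketch (again with Python's standard `urllib.robotparser`) of how a robots.txt rule written for one named user agent stops matching once the crawler identifies itself differently. `ExampleCrawler`, `RenamedCrawler`, and the `is_blocked` helper are hypothetical names, not anything taken from the docs.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule that targets a single named crawler.
NAMED_RULE = """\
User-agent: ExampleCrawler
Disallow: /
"""


def is_blocked(user_agent, url="https://example.com/page"):
    """Return True if the named rule forbids this user agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(NAMED_RULE.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleCrawler", "RenamedCrawler"):
        print(agent, "blocked:", is_blocked(agent))
```

Only a `User-agent: *` rule covers every name, which is why a configurable user agent makes per-bot opt-outs hard to rely on.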

PeterStuer today at 7:50 AM

As they explain in the docs, their crawler will respect user-agents disallowed in robots.txt. That's covered right after the section that explains how to change your user-agent.