
snorremd · yesterday at 3:06 PM

I've recently been setting up web servers like Forgejo and Mattermost to serve my own and friends' needs. I ended up setting up Crowdsec to parse and analyse Traefik's access logs and block bad actors that way: when an IP produces a bunch of 4xx responses in a short timeframe, I assume it is malicious and ban it for a couple of hours. That seems to deter a lot of random scraping. It doesn't stop well-behaved crawlers, though, which should only produce 200 responses.

I'm actually not sure how I would go about stopping AI crawlers that are otherwise reasonably well behaved, considering they apparently don't identify themselves correctly and will ignore robots.txt.
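
For what it's worth, the 4xx-rate heuristic above boils down to something like the sketch below. This is only an illustration of the idea, not how Crowdsec actually implements it; the log regex, threshold, window and ban length are made-up example values.

    # Sketch: ban IPs that produce too many 4xx responses in a short window.
    # Illustration only -- not Crowdsec's real scenario engine; all values
    # (log format, threshold, window, ban length) are example assumptions.
    import re
    import time
    from collections import defaultdict, deque

    WINDOW = 60             # seconds to look back
    THRESHOLD = 10          # 4xx responses tolerated per window
    BAN_SECONDS = 2 * 3600  # "a couple of hours"

    # Assumes a common-log-style access log: client IP first, then the
    # quoted request line, then the status code.
    LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3})')

    recent = defaultdict(deque)  # ip -> timestamps of recent 4xx hits
    banned = {}                  # ip -> time the ban expires

    def process(line, now=None):
        """Return the IP if this log line pushes it over the threshold."""
        now = now if now is not None else time.time()
        m = LINE_RE.match(line)
        if not m:
            return None
        ip, status = m.group(1), int(m.group(2))
        if not 400 <= status < 500:
            return None
        q = recent[ip]
        q.append(now)
        while q and now - q[0] > WINDOW:
            q.popleft()
        if len(q) >= THRESHOLD and ip not in banned:
            banned[ip] = now + BAN_SECONDS
            return ip  # a real setup would hand this to a firewall/bouncer
        return None

Crowdsec's scenario/bouncer split does the same job with more care (persistence, shared blocklists), which is the main reason to prefer it over a hand-rolled script like this.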


Replies

lowdude · yesterday at 5:04 PM

There was a comment in a different thread suggesting they may respect robots.txt for the most part but ignore wildcards: https://news.ycombinator.com/item?id=46975726

Maybe this is worth trying out first, if you are currently having issues.
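
If it really is wildcard rules that get ignored, one low-effort thing to try is spelling out an explicit block per crawler token instead of relying on patterns. A hypothetical robots.txt along these lines (the agent names are just examples; check each crawler's documented token):

    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: *
    Disallow: /private/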

V__ · yesterday at 3:12 PM

If possible, I would block by country first. Even on public websites I block Russia and China by default, and that has reduced port scans and the like.

On "private" services where I or my friends are the only users, I block everything except my country.