I "solved" this by adding a fail2ban rule for everyone accessing specific commits (no one does that 3 times in a row) and then blocking the following ASs completely (just too many IPs coming from those, feel free to look them up yourself): 136907 23724 9808 4808 37963 45102. And after that: sweet silence.
How to block ASes? Just write a small script that queries all of their subnets once (even if their announcements change, it's not by enough to matter) and add them to an nft set (nft will take care of aggregating these into contiguous blocks). Then just make nft reject requests from this set.
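A minimal sketch of such a script, under my own assumptions (it pulls announced prefixes from the RIPEstat API and loads them into a pre-created nftables set named asn_block in table inet filter; those names and the API choice are mine, not the commenter's exact setup):

    #!/usr/bin/env python3
    # Sketch: fetch each AS's announced prefixes and load them into an nft set.
    # Assumes the set and rule already exist, e.g.:
    #   nft add table inet filter
    #   nft add set inet filter asn_block '{ type ipv4_addr; flags interval; auto-merge; }'
    #   nft add rule inet filter input ip saddr @asn_block reject
    import json
    import subprocess
    import urllib.request

    ASNS = ["136907", "23724", "9808", "4808", "37963", "45102"]

    def announced_prefixes(asn):
        """Return the prefixes currently announced by an AS (RIPEstat data API)."""
        url = f"https://stat.ripe.net/data/announced-prefixes/data.json?resource=AS{asn}"
        with urllib.request.urlopen(url, timeout=30) as resp:
            data = json.load(resp)
        return [p["prefix"] for p in data["data"]["prefixes"]]

    def main():
        # IPv4 only here; a second ipv6_addr set would handle the rest.
        v4 = [p for asn in ASNS for p in announced_prefixes(asn) if ":" not in p]
        # auto-merge makes nft collapse overlapping/adjacent ranges on insert.
        subprocess.run(
            ["nft", "add", "element", "inet", "filter", "asn_block",
             "{ " + ", ".join(v4) + " }"],
            check=True,
        )

    if __name__ == "__main__":
        main()

Run it once (or from an occasional cron job); the announcements don't change fast enough to matter.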
Is there a way to block it by shibboleth? Curious, since the recent Google hack where you add -(n-word) to the end of your query so the AI automatically shuts down works like a charm.
This kind of thing can be mitigated by not publishing a page/download for every single branch, commit and diff in a repo.
Make only the HEAD of each branch available. Anyone who wants more detail has to clone it and view it with their favourite git client.
For example https://mitxela.com/projects/web-git-sum (https://git.mitxela.com/)
Does anyone know what's the deal with these scrapers, or why they're attributed to AI?
I would assume any halfway competent LLM-driven scraper would see a mass of 404s and stop. If they're just collecting data to train LLMs, these seem like exceptionally poorly written and abusive scrapers built the conventional way, just by more bad actors.
Are we seeing these scrapers using LLMs to bypass auth or run more sophisticated flows? I have not worked on bot detection the last few years, but it was very common for residential proxy based scrapers to hammer sites for years, so I'm wondering what's different.
I have a self-hosted Gitea instance behind a Cloudflare Tunnel protected by Cloudflare Access. Zero issues. Obviously not "public", but it is accessible from the internet with a simple login.
Ugh, exposing it with cgit is why.
Put it all behind an OAuth login using something like Keycloak, and integrate that into something like GitLab, Forgejo, or Gitea if you must.
However. To host git, all you need is a user and ssh. You don’t need a web ui. You don’t need port 443 or 80.
So, what's up with these bots, and why am I hearing about them so often lately? I mean, DDoS attacks aren't a new thing, and, honestly, this is pretty much the reason Cloudflare even exists, but I'd expect OpenAI bots (or whatever this is now) to be a little bit easier to deal with, no? Like, simply having a reasonably aggressive fail2ban policy? Or do they really behave like a botnet, where each request comes from a different IP on a different network? How? Why? What is this thing?
Oh poor soul :) I had the same problem. And I solved it easily. I pulled my stuff off the Internet, keeping only a VPN overlay network.
The future is dark, I mean.. darknets.. For people, by people, where you can deal with bad actors.. Wake up and start networking :)
Fail2ban has decent jails for Apache httpd. And writing a rule that matches requests to nonexistent resources is very easy -- one-liners plus a time-based threshold. Basically you could ban differently according to the HTTP errors they cause (e.g. bots hitting migrated resources show up as many 404s within a minute; Slowloris is visible as a lot of 408s).
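Something along these lines, for example (an illustrative sketch only, not a drop-in config; the regex, thresholds and log format will need adjusting for your setup):

    # /etc/fail2ban/filter.d/apache-404.conf
    [Definition]
    failregex = ^<HOST> .* "(GET|POST|HEAD) .*" 404
    ignoreregex =

    # /etc/fail2ban/jail.d/apache-404.local
    [apache-404]
    enabled  = true
    port     = http,https
    filter   = apache-404
    logpath  = /var/log/apache2/access.log
    findtime = 60
    maxretry = 20
    bantime  = 3600

A second filter keyed on 408 instead of 404 (with a lower maxretry) would cover the Slowloris case the same way.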
At this point, I think we should look at implementing filters that send a different response when AI bots are detected or when clients are abusive. Not just a simple response code, but one that poisons their training data. Preferably text that elaborates on the anti-consumer practices of tech companies.
If there is a common text pool used across sites, maybe that will get the attention of bot developers and automatically force them to back down when they see such responses.
I've recently been setting up web servers like Forgejo and Mattermost to service my own and friends' needs. I ended up setting up Crowdsec to parse and analyse access logs from Traefik and block bad actors that way. So when someone produces a bunch of 4XX codes in a short timeframe, I assume that IP is malicious and it can be banned for a couple of hours. Seems to deter a lot of random scraping. Doesn't stop well-behaved crawlers, though, which should only produce 200 codes.
I'm actually not sure how I would go about stopping AI crawlers that are reasonably well behaved considering they apparently don't identify themselves correctly and will ignore robots.txt.
I don't understand. Your HTTPS server was being hammered so you stopped serving Git? That doesn't make any sense at all, if it's a private server, why not just turn off the web frontend?
Scrapers are relentless but not DDoS levels in my experience.
Make sure your caches are warm and responses take no more than 5ms to construct.
The Chinese AI scrapers/bots are killing quite a bit of the regular web now. YisouSpider absolutely pummeled my open source project's hosting for weeks. Like all Chinese AI scrapers, it ignores robots.txt, so forget about it respecting a Crawl-delay. If you block the user agent, it would calm down for a bit, then it would just come back again using a generic browser user agent from the same IP addresses. It does this across tens of thousands of IPs.
I presume people have logs that indicate the source for them to place blame on AI scrapers. Is anybody making these available for analysis so we can see exactly who is doing this?
hello,
as always: imho. (!)
idk ... i just put a http basic-auth in front of my gitweb instance years ago.
if i really ever want to put git-repositories into the open web again i either push them to some portal - github, gitlab, ... - or start thinking about how to solve this ;))
just my 0.02€
Maybe put the git repos on radicle?
Some run git over ssh, plus a domain login for the https:// side, permission management, etc.
Also, spider traps and 42TB zip-of-death pages work well on poorly written scrapers that ignore robots.txt =3
I use a private GitLab that was set up by Claude, have my own runners and everything. It's fine. I have my own little home cluster; networking, storage, and compute came to around 2.5k. Go NUCs, cluster, don't look back.
Can we not charge for access? If I have a link that says "By clicking this link you agree to pay $10 for each access", could I then send the bill?
Just another example of AI and its DoSaaS ruining things for everyone. The AI bros just won't accept "NO" for an answer.
You could put it behind Cloudflare and block all AI.
Does this author have a big pre-established audience or something? Struggling to understand why this is front-page worthy.
The author of this post could solve their problem with Cloudflare or any of its numerous competitors.
Cloudflare will even do it for free.
I cut traffic to my Forgejo server from about 600K requests per day to about 1000: https://honeypot.net/2025/12/22/i-read-yann-espositos-blog.h...
1. Anubis is a miracle.
2. Because most scrapers suck, I require all requests to include a shibboleth cookie, and if they don’t, I set it and use JavaScript to tell them to reload the page. Real browsers don’t bat an eye at this. Most scrapers can’t manage it. (This wasn’t my idea; I link to the inspiration for it. I just included my Caddy-specific instructions for implementing it.)
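For anyone not on Caddy, the trick itself is tiny. Here is a generic sketch of the same idea as Python WSGI middleware (not the author's actual setup; the cookie name and lifetime are arbitrary):

    # If the marker cookie is missing, set it and serve a page whose JavaScript
    # reloads; real browsers come straight back with the cookie, most scrapers don't.
    COOKIE_NAME = "shibboleth"  # arbitrary name, pick your own
    CHALLENGE_PAGE = b"<!doctype html><script>location.reload()</script>"

    def require_shibboleth(app):
        """Wrap a WSGI app so cookie-less requests get the set-and-reload page."""
        def wrapped(environ, start_response):
            if COOKIE_NAME + "=" in environ.get("HTTP_COOKIE", ""):
                return app(environ, start_response)
            start_response("200 OK", [
                ("Content-Type", "text/html"),
                ("Set-Cookie", COOKIE_NAME + "=1; Path=/; Max-Age=31536000"),
            ])
            return [CHALLENGE_PAGE]
        return wrapped

The same check works at any layer that can read a Cookie header and set one.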