I just threw up a public Forjego instance for some lightweight collaboration. About 2 minutes after the certificate was created, I'm guessing they picked up the instance from the transparency logs for certificates, and started going through every commit and so on from the two repositories I had added.
Watched it for a while, thinking eventually it'd end. It didn't, seemed like Claudebot and GPTBot (which was the only two I saw, but could have been forged) went over the same URLs over and over again. They tried a bunch of search queries too at the same time.
The day after I got tired of seeing it so added a robot.txt forbidding any indexing. Waited a few hours, saw that they were still doing the same thing, so threw up basic authentication with `wiki:wiki` as the username:password basically, wrote the credentials on the page where I linked it and as expected they stopped trying after that.
They don't seem to try to bypass anything, whatever you put in front will basically defeat them except blocking them by user-agent, then they just switch to a browser-like user-agent instead, which is why I went the "trivial basic authentication" path instead.
Wasn't really an issue, just annoying when they try to masquerade as normal users. Had the same issue with a wiki instance, added rate limits and eventually they seemingly backed off more than my limits were set too, so I guess they eventually got it. Just checked the logs and seems they've stopped trying completely.
Seemingly it seems like people who are paying for their hosting by usage (which never made sense to me) is the ones hard hit by this. I'm hosting my stuff on a VPS, and don't understand what the big issue is, worst case scenario I'd add more aggressive caching and it wouldn't be an issue anymore.
Since you had the logs for this, can you confirm the IP ranges they were operating from? You mention "Claudebot and GPTBot" but I'm guessing this is based off of the user-agent presented by the scrapers and could easily be faked to shift blame. I genuinely doubt Anthropic and such would be running scrapers that are this badly written/implemented, it doesnt make economic sense. I'd love to see some of the web logs from this if you'd be willing to share! I feel like this is just some of the old scraper bots now advertising themselves as AI bots to shift blame into the AI companies.
Huh, I had a gitea instance in the public web on one of my netcup vps's. I didn't set any logs and was using cloudflare tunnels (with a custom bash script which makes cf tunnels expose PORT SUBDOMAIN).
Maybe its time for me to go ahead and start it again with logs to see if there are any logs.
I will maybe test it in all three 1) With CF tunnels + AI Block, 2) Only CF tunnels, 3) On a static IP directly. Maybe you can try the experiment too and we can compare our findings (also saying because I am lazy and I had misconfigured that cf tunnel so when it quit, I was too lazy to restart the vps given I just use it as a playground and just wanted to play around self hosting but maybe I will do it again now)
I had the same issue when I first put up my gitea instance. The bots found the domain through cert registration in minutes, before there were any backlinks. GPTbot, ClaudeBot, PerplexityBot, and others.
I added a robots.txt with explicit UAs for known scrapers (they seem to ignore wildcards), and after a few days the traffic died down completely and I've had no problem since.
Git frontends are basically a tarpit so are uniquely vulnerable to this, but I wonder if these folks actually tried a good robots.txt? I know it's wrong that they ignore wildcards, but it does seem to solve the issue