I "solved" this by adding a fail2ban rule for everyone accessing specific commits (no one does that 3 times in a row) and then blocking the following ASs completely (just too many IPs coming from those, feel free to look them up yourself): 136907 23724 9808 4808 37963 45102. And after that: sweet silence.
How to block ASes? Just write a small script that queries all of their subnets once (even if their announcements change, it's not by enough to matter) and add them to an nft set (nft will take care of aggregating these into contiguous blocks). Then just make nft reject requests from this set.
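A minimal sketch of such a script, under my own assumptions (it pulls announced prefixes from the RIPEstat API and loads them into a pre-created nftables set named asn_block in table inet filter; those names and the API choice are mine, not the commenter's exact setup):

    #!/usr/bin/env python3
    # Sketch: fetch each AS's announced prefixes and load them into an nft set.
    # Assumes the set and rule already exist, e.g.:
    #   nft add table inet filter
    #   nft add set inet filter asn_block '{ type ipv4_addr; flags interval; auto-merge; }'
    #   nft add rule inet filter input ip saddr @asn_block reject
    import json
    import subprocess
    import urllib.request

    ASNS = ["136907", "23724", "9808", "4808", "37963", "45102"]

    def announced_prefixes(asn):
        """Return the prefixes currently announced by an AS (RIPEstat data API)."""
        url = f"https://stat.ripe.net/data/announced-prefixes/data.json?resource=AS{asn}"
        with urllib.request.urlopen(url, timeout=30) as resp:
            data = json.load(resp)
        return [p["prefix"] for p in data["data"]["prefixes"]]

    def main():
        # IPv4 only here; a second ipv6_addr set would handle the rest.
        v4 = [p for asn in ASNS for p in announced_prefixes(asn) if ":" not in p]
        # auto-merge makes nft collapse overlapping/adjacent ranges on insert.
        subprocess.run(
            ["nft", "add", "element", "inet", "filter", "asn_block",
             "{ " + ", ".join(v4) + " }"],
            check=True,
        )

    if __name__ == "__main__":
        main()

Run it once (or from an occasional cron job); the announcements don't change fast enough to matter.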
Is there a way to block it by shibboleth? Curious, since the recent Google hack where you add -(n-word) to the end of your query so the AI automatically shuts down works like a charm.
This kind of thing can be mitigated by not publishing a page/download for every single branch, commit and diff in a repo.
Make only the HEAD of each branch available. Anyone who wants more detail has to clone it and view it with their favourite git client.
For example https://mitxela.com/projects/web-git-sum (https://git.mitxela.com/)
Does anyone know what's the deal with these scrapers, or why they're attributed to AI?
I would assume any halfway competent LLM-driven scraper would see a mass of 404s and stop. If they're just collecting data to train LLMs, these seem like exceptionally poorly written and abusive scrapers built the conventional way, just by more bad actors.
Are we seeing these scrapers using LLMs to bypass auth or run more sophisticated flows? I have not worked on bot detection the last few years, but it was very common for residential proxy based scrapers to hammer sites for years, so I'm wondering what's different.
I have a self-hosted Gitea instance behind a Cloudflare Tunnel protected by Cloudflare Access. Zero issues. Obviously not "public", but it is accessible from the internet with a simple login.
Ugh, exposing it with cgit is why.
Put it all behind an OAuth login using something like Keycloak, and integrate that into something like GitLab, Forgejo, or Gitea if you must.
However. To host git, all you need is a user and ssh. You don’t need a web ui. You don’t need port 443 or 80.
So, what's up with these bots, and why am I hearing about them so often lately? I mean, DDoS attacks aren't a new thing, and, honestly, this is pretty much the reason Cloudflare even exists, but I'd expect OpenAI bots (or whatever this is now) to be a little bit easier to deal with, no? Like, simply having a reasonably aggressive fail2ban policy? Or do they really behave like a botnet, where each request comes from a different IP on a different network? How? Why? What is this thing?
Oh poor soul :) I had the same problem. And I solved it easily. I pulled my stuff off the Internet, keeping only a VPN overlay network.
The future is dark, I mean.. darknets.. For people, by people, where you can deal with bad actors.. Wake up and start networking :)
Fail2ban has decent jails for Apache httpd. And writing a rule that matches requests to nonexistent resources is very easy -- one-liners plus a time-based threshold. Basically you could ban differently according to the HTTP errors they cause (e.g. bots hitting migrated resources show up as many 404s within a minute; Slowloris is visible as a lot of 408s).
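Something along these lines, for example (an illustrative sketch only, not a drop-in config; the regex, thresholds and log format will need adjusting for your setup):

    # /etc/fail2ban/filter.d/apache-404.conf
    [Definition]
    failregex = ^<HOST> .* "(GET|POST|HEAD) .*" 404
    ignoreregex =

    # /etc/fail2ban/jail.d/apache-404.local
    [apache-404]
    enabled  = true
    port     = http,https
    filter   = apache-404
    logpath  = /var/log/apache2/access.log
    findtime = 60
    maxretry = 20
    bantime  = 3600

A second filter keyed on 408 instead of 404 (with a lower maxretry) would cover the Slowloris case the same way.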
At this point, I think we should look at implementing filters that send a different response when AI bots are detected or when clients are abusive. Not just a simple response code, but one that poisons their training data. Preferably text that elaborates on the anti-consumer practices of tech companies.
If there is a common text pool used across sites, maybe that will get the attention of bot developers and automatically force them to back down when they see such responses.
I've recently been setting up web servers like Forgejo and Mattermost to service my own and friends' needs. I ended up setting up Crowdsec to parse and analyse access logs from Traefik and block bad actors that way. So when someone produces a bunch of 4XX codes in a short timeframe, I assume that IP is malicious and it can be banned for a couple of hours. Seems to deter a lot of random scraping. Doesn't stop well-behaved crawlers, though, which should only produce 200 codes.
I'm actually not sure how I would go about stopping AI crawlers that are reasonably well behaved considering they apparently don't identify themselves correctly and will ignore robots.txt.
I don't understand. Your HTTPS server was being hammered so you stopped serving Git? That doesn't make any sense at all, if it's a private server, why not just turn off the web frontend?
Scrapers are relentless but not DDoS levels in my experience.
Make sure your caches are warm and responses take no more than 5ms to construct.
The Chinese AI scrapers/bots are killing quite a bit of the regular web now. YisouSpider absolutely pummeled my open source project's hosting for weeks. Like all Chinese AI scrapers, it ignores robots.txt, so forget about it respecting a Crawl-delay. If you block the user agent, it would calm down for a bit, then it would just come back again using a generic browser user agent from the same IP addresses. It does this across tens of thousands of IPs.
I presume people have logs that indicate the source for them to place blame on AI scrapers. Is anybody making these available for analysis so we can see exactly who is doing this?
hello,
as always: imho. (!)
idk ... i just put a http basic-auth in front of my gitweb instance years ago.
if i really ever want to put git-repositories into the open web again i either push them to some portal - github, gitlab, ... - or start thinking about how to solve this ;))
just my 0.02€
Maybe put the git repos on radicle?
Some run git over ssh, plus a domain login for the https:// side, permission management, etc.
Also, spider traps and 42TB zip-of-death pages work well on poorly written scrapers that ignore robots.txt =3
I use a private GitLab that was set up by Claude, have my own runners and everything. It's fine. I have my own little home cluster; networking, storage, and compute came to around 2.5k. Go NUCs, cluster, don't look back.
Can we not charge for access? If I have a link that says "By clicking this link you agree to pay $10 for each access", could I then send the bill?
Just another example of AI and its DoSaaS ruining things for everyone. The AI bros just won't accept "NO" for an answer.
You could put it behind Cloudflare and block all AI.
Does this author have a big pre-established audience or something? Struggling to understand why this is front-page worthy.
The author of this post could solve their problem with Cloudflare or any of its numerous competitors.
Cloudflare will even do it for free.
I cut traffic to my Forgejo server from about 600K requests per day to about 1000: https://honeypot.net/2025/12/22/i-read-yann-espositos-blog.h...
1. Anubis is a miracle.
2. Because most scrapers suck, I require all requests to include a shibboleth cookie, and if they don’t, I set it and use JavaScript to tell them to reload the page. Real browsers don’t bat an eye at this. Most scrapers can’t manage it. (This wasn’t my idea; I link to the inspiration for it. I just included my Caddy-specific instructions for implementing it.)
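For anyone not on Caddy, the trick itself is tiny. Here is a generic sketch of the same idea as Python WSGI middleware (not the author's actual setup; the cookie name and lifetime are arbitrary):

    # If the marker cookie is missing, set it and serve a page whose JavaScript
    # reloads; real browsers come straight back with the cookie, most scrapers don't.
    COOKIE_NAME = "shibboleth"  # arbitrary name, pick your own
    CHALLENGE_PAGE = b"<!doctype html><script>location.reload()</script>"

    def require_shibboleth(app):
        """Wrap a WSGI app so cookie-less requests get the set-and-reload page."""
        def wrapped(environ, start_response):
            if COOKIE_NAME + "=" in environ.get("HTTP_COOKIE", ""):
                return app(environ, start_response)
            start_response("200 OK", [
                ("Content-Type", "text/html"),
                ("Set-Cookie", COOKIE_NAME + "=1; Path=/; Max-Age=31536000"),
            ])
            return [CHALLENGE_PAGE]
        return wrapped

The same check works at any layer that can read a Cookie header and set one.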