this post was submitted on 08 Apr 2026

Forgejo


This is a community dedicated to Forgejo.

My instance is getting pummeled by scrapers crawling nonsense. Like issue and pull searches with every single variant of label combinations.

Everything's coming from a shitload of different residential IPs at a very fast cadence.

There's just not that much content on my instance to warrant this traffic. If it were legitimate, everything could be scraped in a minute or two.

[–] Kissaki@programming.dev 8 points 15 hours ago

Possibly AI company crawlers. When they came up there was a lot of bad publicity and reports of actively malicious and toxic crawling behavior, including ban evasion.

You could think about locking some URL paths behind valid login sessions, or using a proof-of-work proxy guard.
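For the login-session route, Forgejo inherits Gitea's setting to hide everything from anonymous visitors. A minimal sketch of the relevant `app.ini` fragment (verify the section and key against your Forgejo version's docs before relying on it):

```ini
; app.ini — require a signed-in session to view any page,
; which also hides issue/PR search URLs from anonymous crawlers
[service]
REQUIRE_SIGNIN_VIEW = true
```

The blunt trade-off is that anonymous visitors can no longer browse your repos at all, so it only fits if the instance doesn't need to be a public showcase.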

Anubis is the popular tool for that. I've seen maybe three alternatives, one of which is from Cloudflare.
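The core idea behind a proof-of-work guard like Anubis is asymmetric cost: the client must burn CPU to find a nonce whose hash meets a difficulty target, while the server verifies with a single hash. A rough sketch of that mechanism (not Anubis's actual protocol; the function names and difficulty are illustrative):

```python
import hashlib
import itertools

def solve(challenge: str, bits: int = 16) -> int:
    """Brute-force a nonce so sha256(challenge + nonce) falls below
    a difficulty target. This is the expensive step the client does."""
    target = 1 << (256 - bits)
    for nonce in itertools.count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce

def verify(challenge: str, nonce: int, bits: int = 16) -> bool:
    """The cheap server-side check: one hash, regardless of difficulty."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - bits))
```

A real browser pays this cost once per session and barely notices; a crawler requesting thousands of label-combination URLs pays it over and over, which is exactly what makes this kind of scraping uneconomical.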

See also related Codeberg ticket (Forgejo instance) https://codeberg.org/forgejo/discussions/issues/319

If you search, you can find various blog posts about these issues, and not just about Forgejo.

[–] treadful@lemmy.zip 4 points 15 hours ago

> Possibly AI company crawlers. When they came up there was a lot of bad publicity and reports of actively malicious and toxic crawling behavior, including ban evasion.

That was kind of what I was thinking, but if that's true, they're wasting so much bandwidth and compute. Going through every combination of issue labels does not get them any useful code to hoover up. They could've just cloned my repos and been done with it.

> You can think about locking some url paths behind valid login sessions, or use a proof of work proxy guard.

> Anubis is the popular tool for that. I've seen maybe three alternatives, one of which from Cloudflare.

Really don't want to use Cloudflare, but Anubis is interesting. If I can't shake these bots, maybe I'll consider it. Thanks.

[–] Eezyville@sh.itjust.works 1 point 14 hours ago

If you think it's AI then maybe you can get another AI to write bad code and poison their training data.

[–] dajoho@sh.itjust.works 2 points 14 hours ago

Yes! Exactly as you describe. They were going through certain repos and parsing every commit. I couldn't block them because there were loads of different residential IPs and random user-agents. :-(

[–] treadful@lemmy.zip 1 point 14 hours ago

Well, at least it doesn't seem targeted, then. Did you do anything to remedy the situation?

[–] Marthirial@lemmy.world 1 point 13 hours ago

Why do you need a self-hosted instance open to the world? Mine is behind a Cloudflare rule that allows connections only from a list of IPs, like my self-hosted WireGuard instance.

[–] treadful@lemmy.zip 1 point 13 hours ago

> Why do you need a self hosted instance open to the World?

Because I can and I want to?

[–] Marthirial@lemmy.world 2 points 13 hours ago

Leaving the "know how" part for last, I see.

[–] treadful@lemmy.zip 1 point 13 hours ago

Imagine never learning through trying things. Also, you're on Lemmy arguing against self-hosting.

[–] reluctant_squidd@lemmy.ca 1 point 15 hours ago

I try not to expose mine to the internet for this reason. I have it on a central server that connects via WireGuard to a VPS overseas. When I need access from the net, I tunnel it to my home server through a random port, then block it again. All my machines sync with the central server this way, either through the VPN tunnel or directly on my LAN, depending on where I am.
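The home-server side of a tunnel like that could look roughly like this (every name, key, and address here is a placeholder; adapt it to your own topology):

```ini
; /etc/wireguard/wg0.conf on the home server — hypothetical sketch
[Interface]
PrivateKey = <home-server-private-key>
Address = 10.0.0.2/24

[Peer]
; the overseas VPS acting as the only public endpoint
PublicKey = <vps-public-key>
Endpoint = vps.example.com:51820
AllowedIPs = 10.0.0.0/24
; keep the NAT mapping alive so the VPS can reach back in
PersistentKeepalive = 25
```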

Unless you need to showcase your code, I wouldn't recommend exposing your instance to the internet at all. And if you have to, maybe reverse proxy it and add some monitoring and blocking software to help, like fail2ban or the like. Good luck.
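For the fail2ban route, one approach is a custom filter keyed to the nonsense URL pattern itself rather than to error codes. A sketch, assuming an nginx combined access log in front of Forgejo (the filter name is made up, and the regex must be adjusted to your actual log format before use):

```ini
; /etc/fail2ban/filter.d/forgejo-scrapers.local (hypothetical filter)
[Definition]
; Ban hosts hammering issue/PR searches with label query strings
failregex = ^<HOST> .*"GET /[^"]*/(issues|pulls)\?[^"]*labels=[^"]* HTTP

; /etc/fail2ban/jail.d/forgejo-scrapers.local
[forgejo-scrapers]
enabled  = true
port     = http,https
filter   = forgejo-scrapers
logpath  = /var/log/nginx/access.log
maxretry = 30
findtime = 60
bantime  = 3600
```

The caveat, as noted elsewhere in the thread, is that per-IP banning loses its teeth when the traffic is spread across huge residential proxy pools.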

[–] treadful@lemmy.zip 1 point 15 hours ago

Having a private instance isn't exactly conducive to open source, so I don't think that's the way I want to take it. I'd probably move to Codeberg or even GitHub before hosting the entire thing on a private network.

I also don't think monitoring and blocking are going to help here. This traffic came from so many different IPs that it would be almost impossible to detect and block them all without blocking legitimate traffic. I also really don't want to hook up a Cloudflare-like centralized challenge system to deal with this if I can avoid it.

[–] reluctant_squidd@lemmy.ca 2 points 15 hours ago

It sounds to me like you're at the mercy of the bots, then, unfortunately. I've had literally empty websites up just to see what the bots do, and within a few hours the sites were hammered with crazy bot traffic trying everything from MySQL connections to SSH, WordPress sniffing, and XSS attacks, you name it. They don't even seem to care that the site is 403 Forbidden or just a blank page.

That's the World Wide Web we live in nowadays, in my experience.

[–] treadful@lemmy.zip 3 points 15 hours ago

> [...] crazy bot traffic trying everything from MySQL connections, ssh, Wordpress sniffing, xss attacks, you name it.

Oh yeah, I see that on everything. But I'm not so worried about those vuln scanners as about this overwhelming nonsense traffic I'm seeing now. This is different, and seemingly pointless.