this post was submitted on 17 Aug 2025

667 points (99.7% liked)

Codeberg: an army of AI crawlers is severely slowing us down; AI crawlers have learned how to solve the Anubis challenges. (infosec.pub)

submitted 1 month ago* (last edited 1 month ago) by Pro@programming.dev to c/Technology@programming.dev

113 comments

Comments

Lemmy;
Hackernews.

Source.

[–] Probius@sopuli.xyz 209 points 1 month ago (3 children)

This type of large-scale crawling should be considered a DDoS and the people behind it should be charged with cyber crimes and sent to prison.

[–] FauxLiving@lemmy.world 72 points 1 month ago (1 children)

If it’s disrupting their site, it is a crime already. The problem is finding the people behind it. This won’t be some guy on his dorm PC, and they’ll likely be in places Interpol can’t reach.

[–] finitebanjo@lemmy.world 22 points 1 month ago

Huawei

[–] isolatedscotch@discuss.tchncs.de 28 points 1 month ago (1 children)

good luck with that! not only is a company doing it, which means no individual person will go to prison, but it's a Chinese company with no regard for any laws that might get passed

[–] humanspiral@lemmy.ca 12 points 1 month ago

The people determining US legislation have said, "How can we achieve Skynet if our tech trillionaire company sponsors can't evade copyright or content licensing?" But they also say, "If we don't spend every penny you have on achieving US-controlled Skynet, then China wins."

Speculating that a "Huawei network can solve this" doesn't mean that all the bots are Chinese, but it does confirm that China has a lot of AI research, and that Huawei GPUs/NPUs are being used and are successfully solving this particular "I am not a robot" challenge.

It's really hard to call an "amateur coding challenge" competition website a national security threat, but if you hype Huawei enough, then surely the US will give up on AI like it gave up on solar, and maybe EVs. "If we don't adopt Luddite politics and all become Amish, then China wins" is a "promising" new loser perspective on media manipulation.

[–] eah@programming.dev 18 points 1 month ago (2 children)

Applying the Computer Fraud and Abuse Act to corporations? Sign me up! Hey, they're also people, aren't they?

[–] Gullible@sh.itjust.works 96 points 1 month ago (2 children)

I really feel like scrapers should have been outlawed or actioned at some point.

[–] floofloof@lemmy.ca 74 points 1 month ago (1 children)

But they bring profits to tech billionaires. No action will be taken.

[–] BodilessGaze@sh.itjust.works 11 points 1 month ago (4 children)

No, the reason no action will be taken is because Huawei is a Chinese company. I work for a major US company that's dealing with the same problem, and the problematic scrapers are usually from China. US companies like OpenAI rarely cause serious problems because they know we can sue them if they do. There's nothing we can do legally about Chinese scrapers.

[–] programmer_belch@lemmy.dbzer0.com 37 points 1 month ago (11 children)

I use a tool that downloads a website to check for new chapters of series every day, then creates an RSS feed with the contents. Would this be considered a harmful scraper?

The problem with AI scrapers and bots is their scale: thousands of requests to webpages that the server behind them cannot handle, resulting in slow traffic.

[–] S7rauss@discuss.tchncs.de 29 points 1 month ago (2 children)

Does your tool respect the site’s robots.txt?
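For a small tool like the one described above, checking robots.txt before fetching could look roughly like this. It is only a sketch using Python's standard `urllib.robotparser`; the robots.txt content, the `chapter-checker` user agent, and the `is_allowed` helper are made-up examples, not anything from the tool in question.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real tool would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed(ROBOTS_TXT, "chapter-checker", "https://example.org/series/1"))
    print(is_allowed(ROBOTS_TXT, "chapter-checker", "https://example.org/private/x"))
```

This only tells the tool which paths are off limits; it says nothing about how often it may ask.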

[–] who@feddit.org 18 points 1 month ago* (last edited 1 month ago) (3 children)

Unfortunately, robots.txt cannot express rate limits, so it would be an overly blunt instrument for things like what GP describes. HTTP 429 would be a better fit.
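On the client side, honoring HTTP 429 (Too Many Requests) mostly means reading the `Retry-After` header, which is either a number of seconds or an HTTP date. A minimal sketch of that parsing, assuming a hypothetical helper name `retry_after_seconds` and leaving malformed headers unhandled:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value: str, now: datetime) -> float:
    """Turn a Retry-After header value into a number of seconds to wait.

    `now` must be a timezone-aware datetime. Malformed values raise an
    exception; a real client would want a fallback delay instead.
    """
    value = header_value.strip()
    if value.isdigit():
        # Form 1: delay in seconds, e.g. "120".
        return float(value)
    # Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
    retry_at = parsedate_to_datetime(value)
    return max(0.0, (retry_at - now).total_seconds())


if __name__ == "__main__":
    print(retry_after_seconds("120", datetime.now(timezone.utc)))
```

A scraper that sleeps for this long before its next request is doing what the server asked; the crawlers discussed in this thread evidently do not.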

[–] gressen@lemmy.zip 74 points 1 month ago (3 children)

Write TOS that state that crawlers automatically accept a service fee and then send invoices to every crawler owner.

[–] BodilessGaze@sh.itjust.works 40 points 1 month ago (6 children)

Huawei is Chinese. There's literally zero chance a European company like Codeberg is going to successfully collect from a company in China over a TOS violation.

[–] wischi@programming.dev 14 points 1 month ago

It's not even a company. It's a non-profit "eingetragener Verein" (registered association). They have very limited resources, especially money, because they live purely on membership fees and donations.

[–] wischi@programming.dev 33 points 1 month ago (2 children)

They typically don't include a billing address in the User Agent when crawling 🤣

[–] Kissaki@feddit.org 9 points 1 month ago

Cloudflare had a similar idea: Introducing pay per crawl: Enabling content owners to charge AI crawlers for access

[–] cecilkorik@lemmy.ca 59 points 1 month ago

Begun, the information wars have.

[–] folken@lemmy.world 38 points 1 month ago* (last edited 1 month ago) (2 children)

When you realize that you live in a cyberpunk novel. The AI is cracking the ICE. https://cyberpunk.fandom.com/wiki/Black_ICE

[–] Regrettable_incident@lemmy.world 15 points 1 month ago (1 children)

I love seeing how much influence William Gibson had on cyberpunk.

[–] ThePyroPython@lemmy.world 16 points 1 month ago

It's not intentional but the chap ended up writing works that defined both the Cyberpunk (Neuromancer) and Steampunk (The Difference Engine) genres.

Can't deny that influence.

[–] cadekat@pawb.social 35 points 1 month ago

Huh, why does Anubis use SHA256? It's been optimized to all hell and back.

Ah, they're looking into it: https://github.com/TecharoHQ/anubis/issues/94
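For anyone wondering what a SHA-256 proof-of-work challenge looks like in general, here is a generic hashcash-style sketch. It is not Anubis's actual protocol; the challenge string, the "leading zero hex digits" rule, and the `solve`/`verify` names are assumptions for illustration.

```python
import hashlib


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Check that sha256(challenge + nonce) starts with `difficulty` zero hex digits."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)


def solve(challenge: str, difficulty: int) -> int:
    """Brute-force the smallest nonce that passes verify()."""
    nonce = 0
    while not verify(challenge, nonce, difficulty):
        nonce += 1
    return nonce


if __name__ == "__main__":
    nonce = solve("example-challenge", 3)
    print(nonce, verify("example-challenge", nonce, 3))
```

Verifying costs one hash while solving costs about 16^difficulty hashes on average, which is the whole point. It also shows the weakness mentioned above: SHA-256 is so heavily optimized that a well-resourced crawler pays far less per challenge than a browser running the same loop in JavaScript.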

[–] 0_o7@lemmy.dbzer0.com 33 points 1 month ago (3 children)

I blocked almost all the big players in hosting, plus China, Russia, and Vietnam, and now they're bombarding my site with residential IP addresses from all over the world. They must be using compromised smart home devices or phones with malware.

Soon everything on the internet will be behind a wall.

[–] irelephant@programming.dev 10 points 1 month ago (1 children)

This isn't sustainable for the AI companies; when the bubble pops, it will stop.

[–] aev_software@programming.dev 18 points 1 month ago (1 children)

In the meantime, sites are getting DDoS-ed by scrapers. One way to stop your site from getting scraped is having it be inaccessible... which is what the scrapers are causing.

Normally I would assume DDoS-ing is performed in order to take a site offline. But AI scrapers require the opposite. They need their targets online and willing. One would think they'd be a bit more careful about the damage they cause.

But they aren't, because capitalism.

[–] chicken@lemmy.dbzer0.com 30 points 1 month ago (1 children)

Seems like such a massive waste of bandwidth since it's the same work being repeated by many different actors to piece together the same dataset bit by bit.

[–] chuckleslord@lemmy.world 38 points 1 month ago

Ah Capitalism! Truly the king of efficiency /s

[–] sp3ctr4l@lemmy.dbzer0.com 26 points 1 month ago (6 children)

Do we all want the fucking Blackwall from Cyberpunk 2077?

Fucking NetWatch?

Because this is how we end up with them.

....excuse me, I need to go buy a digital pack of cigarettes for the angry voice in my head.

[–] Blackmist@feddit.uk 23 points 1 month ago (6 children)

Business idea: AWS, but hosted entirely within the computing power of AI web crawlers.

[–] Kissaki@feddit.org 9 points 1 month ago (5 children)

Reminds me of the "store data inside slow network requests for the in-transit duration". It was a fun article to read.

[–] ryanvade@lemmy.world 22 points 1 month ago

It's being investigated at least, hopefully a solution can be found. This will probably end up in a constantly escalating battle with the AI companies. https://github.com/TecharoHQ/anubis/issues/978

[–] LiveLM@lemmy.zip 22 points 1 month ago

Uuughhh I knew it'd always be a cat-and-mouse game. Sincerely hope the Anubis devs figure out how to fuck up the AI crawlers again.

[–] tal@lemmy.today 19 points 1 month ago (1 children)

If someone just wants to download code from Codeberg for training, it seems like it'd be way more efficient to just clone the git repositories or even just download tarballs of the most-recent releases for software hosted on Codeberg than to even touch the Web UI at all.

I mean, maybe you need the Web UI to get a list of git repos, but I'd think that that'd be about it.

[–] witten@lemmy.world 26 points 1 month ago (1 children)

Then they'd have to bother understanding the content and downloading it as appropriate. And you'd think if anyone could understand and parse websites in realtime to make download decisions, it'd be giant AI companies. But ironically they're only interested in hoovering up everything as plain web pages to feed into their raw training data.

[–] Natanael 16 points 1 month ago

The same morons scrape Wikipedia instead of downloading the archive files, which can trivially be rendered as web pages locally.

[–] Kolanaki@pawb.social 16 points 1 month ago* (last edited 1 month ago) (2 children)

I don't understand how challenging an AI by asking it to do some heavy computational stuff even makes sense... A computer is literally made to do computations, and AI is just a computer. 🤨

Wouldn't it make more sense to challenge the AI with a Voight-Kampff test? Ask it about baseball.

[–] purplemonkeymad@programming.dev 39 points 1 month ago

The scrapers are not actually AI; they are just dumb scrapers there to get as much textual information as possible.

If they have to do Anubis tests, that is going to take more time to get the data they scrape. I suspect that they are probably paid per page they provide, so more time per page is less money for them.

[–] BodilessGaze@sh.itjust.works 27 points 1 month ago

The point is to make scraping expensive enough that it isn't worth the trouble. The only reason AI scrapers are trying to get this data is because it's cheaper than the alternatives (e.g. generating synthetic data). Once it stops being cheaper, the smart scrapers will stop. The dumb scrapers don't matter because they don't have the talent to devise these kinds of workarounds.

[–] MonkderVierte@lemmy.zip 15 points 1 month ago* (last edited 1 month ago) (12 children)

I just thought that having a client-side proof-of-work (or even only a delay) bound to the IP might push the AI companies to behave instead (because single-visit-per-IP crawlers get too expensive/slow and you can just block normal abusive crawlers). But they already have mind-blowing computing and money resources and only want your data.

But if there was a simple-to-use integrated solution and every single webpage used this approach?

[–] witten@lemmy.world 11 points 1 month ago

Believe me, these AI corporations have way too many IPs to make this feasible. I've tried per-IP rate limiting. It doesn't work on these crawlers.
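To make that concrete, here is a minimal sliding-window per-IP limiter. It is a sketch only: the `PerIPRateLimiter` class, the limits, and the documentation-range IP addresses are invented for the example and not taken from any real deployment.

```python
from collections import defaultdict


class PerIPRateLimiter:
    """Allow at most max_requests per IP within a sliding time window."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(list)  # ip -> timestamps of allowed requests

    def allow(self, ip: str, now: float) -> bool:
        cutoff = now - self.window_seconds
        # Keep only the hits that are still inside the window.
        recent = [t for t in self._hits[ip] if t > cutoff]
        allowed = len(recent) < self.max_requests
        if allowed:
            recent.append(now)
        self._hits[ip] = recent
        return allowed


if __name__ == "__main__":
    limiter = PerIPRateLimiter(max_requests=2, window_seconds=60)
    # One address hammering the server is throttled on its third request.
    print([limiter.allow("203.0.113.1", now=0.0) for _ in range(3)])
    # The same number of requests spread over different addresses all pass.
    print([limiter.allow(f"203.0.113.{n}", now=0.0) for n in range(10, 13)])
```

The second loop is the problem: a crawler that spreads its requests across enough addresses never trips the per-IP counter, so each address looks like a polite visitor.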

[–] rozodru@lemmy.world 15 points 1 month ago (4 children)

I run my own Gitea instance on my own server and within the past week or so I've noticed it just getting absolutely nailed. One repo in particular, a Wayland WM I built. Just keeps getting hammered over and over by IPs in China.

[–] ZILtoid1991@lemmy.world 10 points 1 month ago

Just keeps getting hammered over and over by IPs in China.

Simple solution: Block Chinese IPs!

[–] metacolon@lemmy.blahaj.zone 11 points 1 month ago (1 children)

Are those blocklists publicly available somewhere?

[–] Taldan@lemmy.world 11 points 1 month ago (1 children)

I would hope not. Kinda pointless if they become public

[–] daniskarma@lemmy.dbzer0.com 28 points 1 month ago (5 children)

On the contrary. Open, community-based blocklists can be very effective. Everyone can contribute to them and asphyxiate people with malicious intent.

If you're thinking something like "if the blocklist is available, then malicious agents simply won't use those IPs", I don't think that makes a lot of sense, as a malicious agent will learn that any of their IPs is blocked as soon as they use it.
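Consuming a shared blocklist is also cheap on the server side. A rough sketch with Python's standard `ipaddress` module, assuming a hypothetical plain-text list format of one CIDR range per line with `#` comments (real community lists vary):

```python
import ipaddress


def load_blocklist(lines):
    """Parse CIDR ranges, skipping blank lines and '#' comments."""
    networks = []
    for line in lines:
        entry = line.split("#", 1)[0].strip()
        if entry:
            # strict=False tolerates entries with host bits set, e.g. 198.51.100.7/24.
            networks.append(ipaddress.ip_network(entry, strict=False))
    return networks


def is_blocked(ip: str, networks) -> bool:
    """Return True if ip falls inside any listed network."""
    addr = ipaddress.ip_address(ip)
    # An IPv4 address is never "in" an IPv6 network and vice versa.
    return any(addr in net for net in networks)


if __name__ == "__main__":
    nets = load_blocklist(["198.51.100.0/24  # example range", "", "2001:db8::/32"])
    print(is_blocked("198.51.100.7", nets), is_blocked("203.0.113.5", nets))
```

The linear scan is fine for a few thousand ranges; a large shared list would want a prefix tree or the firewall's own set type instead.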
