The original post: /r/datahoarder by /u/Melodic-Network4374 on 2025-06-19 18:12:59.

Yeah, I know about the wiki; it links to a bunch of stuff, but I'm interested in hearing your workflow.

I've used wget in the past to mirror sites, which is fine for just getting the files. But ideally I'd like something that can also make WARCs, SingleFile dumps from headless Chrome, and the like. My dream would be something that can handle (mostly) everything, including website-specific handlers like yt-dlp: just a web interface where I can paste a link, choose whether to grab recursively, and decide whether it follows outside links.
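For concreteness, here's a rough sketch of the pieces I mean, stitched together by hand (the example.com URLs and output names are just placeholders):

    # Recursive mirror of one site that also records a WARC as it goes
    # (wget writes the raw request/response pairs to example-site.warc.gz).
    wget --recursive --level=inf \
         --page-requisites --adjust-extension \
         --no-parent --wait=1 \
         --warc-file=example-site --warc-cdx \
         https://example.com/

    # Single self-contained HTML snapshot via headless Chrome
    # (assumes the single-file-cli npm package is installed).
    single-file https://example.com/page example-page.html

    # Site-specific media handler; --download-archive makes re-runs
    # skip anything already fetched.
    yt-dlp --download-archive downloaded.txt \
           --write-info-json --write-thumbnail \
           -o '%(uploader)s/%(title)s [%(id)s].%(ext)s' \
           'https://example.com/playlist'

Gluing those together per site is exactly the manual step I'm hoping one tool with a web UI can replace.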

I was looking at ArchiveBox yesterday and was quite excited about it. I set it up, and it's soooo close to what I want, but there's no way to do recursive mirroring (wget -m style). So I can't really grab a whole site with it, which really limits its usefulness to me.
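The closest I could get was one hop of link-following (assuming a stock install; as far as I can tell, add only accepts --depth=0 or --depth=1):

    # Snapshot the page plus everything it links to, one hop deep.
    # There's no higher --depth value, so no wget -m style full-site crawl.
    archivebox add --depth=1 'https://example.com/'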

So, yeah: what's your workflow, and do you have any tools to recommend that would check these boxes?
