It's A Digital Disease!

This is a community that aims to bring data hoarders together to share their passion with like-minded people.

The original post: /r/datahoarder by /u/Doom4535 on 2024-08-17 19:45:33.

How do you all clone websites, especially ones that reference files on other domains, without ending up copying half the internet? I'm trying to clone a few websites about some old LEGO robots before they disappear, but I'm struggling. The SourceForge ones have been the hardest, since they host the downloads at a different URL (and I think a few also use a '.io' address). I've been trying wget, wget2, curl, and httrack, but none have worked well (wget2 has been the best overall, though one site did better with plain wget). They all miss most of the actual external downloads, so I end up downloading the files manually and fixing the internal links by hand with awk/sed. I like the idea of httrack, but I've had no luck with it.
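
For reference, this is roughly the kind of invocation I've been experimenting with, as a rough sketch rather than something that fully works: the extra hosts in --domains are guesses for one of the SourceForge sites and would need to be confirmed per site (e.g. from the browser's network tab).

    # Mirror one site and follow page assets onto a short whitelist of extra hosts.
    # The host list below is a guess -- check where the downloads really come from
    # before relying on it.
    wget --mirror --page-requisites --convert-links --adjust-extension \
         --span-hosts --domains=brickos.sourceforge.net,sourceforge.net,downloads.sourceforge.net \
         --wait=1 \
         https://brickos.sourceforge.net/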

Has anyone tried to back up similar sites, and what tools did you use (and, better yet, how)? Manually editing and reviewing the content has not scaled well for me...
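
On the link-rewriting side, this is roughly the sort of thing I've been doing by hand with sed (GNU sed assumed; the download URL pattern and the mirror/ and files/ paths are placeholders, not the real ones):

    # Rewrite absolute SourceForge download links in the mirrored HTML so they
    # point at a local files/ directory instead (placeholder pattern).
    find mirror/ -name '*.html' -exec \
        sed -i 's|https\?://downloads\.sourceforge\.net/[^"]*/\([^"/]*\)|files/\1|g' {} +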

The list of sites I'm trying to back up is:

  1. https://brickos.sourceforge.net/
  2. https://bricxcc.sourceforge.net/
  3. http://enchanting.robotclub.ab.ca/
  4. https://lejos-osek.sourceforge.net/
  5. https://philohome.com/
  6. https://www.ev3dev.org/
  7. https://www.instructables.com/Making-Mindstorms-RCX-Work-Again/
  8. https://www.johnholbrook.us/