It's A Digital Disease!


This is a sub that aims at bringing data hoarders together to share their passion with like minded people.

126
 
 
The original post: /r/datahoarder by /u/wobblydee on 2025-07-29 23:52:55+00:00.

Windows 11 mini PC

Ran wget with this command:

wget --mirror --convert-links --adjust-extension --page-requisites --no-parent http://example.com/

That's what I found online somewhere to use.

The website I saved is speedhunters.com, an EA-owned car magazine site that's going away.

It seems to have worked completely, but only a handful of images are present on the saved pages, with >95% of articles missing their photos.

Because of the way wget laid out its files, each page is saved as a Firefox HTML file, so I can't yet tell whether there's a folder of images somewhere that I could check.

Did I mess up the command, or is this down to how the website is constructed?

I initially tried HTTrack on my gaming computer, but after 8 hours I decided to buy a mini PC locally for 20 bucks to run it and save power, and that's when I switched to wget. I did notice HTTrack was saving photos, but I couldn't click links to other pages, though I may just need to let it run its course.

Is there something to fix in wget while I let HTTrack run its course too?

Edit: copying a comment reply with a potential fix, in case it gets deleted:

You need to span hosts; I just had this issue recently.

/u/wobblydee, check the image domain and put it in the allowed-domains list along with the main domain.

Edit to add, now that I'm back at my computer: the command should be something like this. -H spans hosts, and the domain list keeps it from grabbing the entire internet; img.example.com should be whatever domain the images are actually served from:

wget --mirror --convert-links --adjust-extension --page-requisites --no-parent -H --domains=img.example.com,example.com,www.example.com http://example.com/

Yes, you probably want both example.com and www.example.com.

Edit 2: I didn't see that you gave the real site, so the full command is:

wget --mirror --convert-links --adjust-extension --page-requisites --no-parent -H --domains=s3.amazonaws.com,speedhunters.com,www.speedhunters.com www.speedhunters.com

127
 
 
The original post: /r/datahoarder by /u/KryptoLouie on 2025-07-29 23:11:59+00:00.

A word of warning on using this service. Data can be silently dropped with GDrive.

About a year ago, I uploaded files to my paid Google Drive. All seemed fine, but I started noticing that not all files were accounted for (96 files in the folder when I had uploaded 100). No errors. No warnings. No retries. I have since stopped using the mobile app as a reliable way to upload files and only use the service to share files when needed.

Fast forward to today: I wanted to download a few folders to my computer. I selected 5 folders on my Google Drive and clicked download. Upon unzipping the archive, only 3 folders showed up in the zip file. Again, no errors, no warnings, no retries, nor any indication that something had gone wrong. WTF.

Unreliable garbage.

128
 
 
The original post: /r/datahoarder by /u/One-Poet7900 on 2025-07-29 19:49:49+00:00.

Yahoo recently cut their email storage drastically, from 1 TB to 20 GB. I am far beyond the limit. What I would like to do is:

  1. Periodically archive all emails offline
  2. Periodically delete emails over a certain age from the server
  3. Have a browser-based app to search & view my email archive
  4. Synchronize the email archive to some kind of other cloud based storage (e.g. Backblaze) for backup purposes

Ideally, I'd like this all to run on my Linux server, using components deployed in Docker. I do not want to host a full-fledged email server, if possible.

I've put the below together with the help of ChatGPT. I really dislike the need to host a mail server. However, netviel looks dead and doesn't have an official Docker container. What do you think of this setup? Has anyone attempted something similar?

| Component | Purpose | Tooling Options |
| --- | --- | --- |
| 1. IMAP → Local Archive | One-way sync from Yahoo IMAP into a local Maildir, preserving flags & folder structure. | imapsync |
| 2. Off-site Backup | Mirror the local Maildir to cloud storage (e.g. Backblaze B2) for redundancy. | rclone |
| 3. Simple IMAP Server (optional) | Expose the archive as a single-user IMAP endpoint for desktop mail clients (e.g. Thunderbird). | Dovecot, configured to point at the mounted Maildir |
| 4. Webmail UI (IMAP client) | Full-featured, browser-based IMAP client to read/search the archive without desktop software. | Roundcube |
| 5. Lightweight Web Viewer | Single-user search UI directly over the Maildir (no IMAP server required). | netviel or notmuch-web |
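To make steps 1 and 2 concrete, here is a minimal sketch (mine, not from the post) of the archive and off-site legs, assuming the Dovecot container from step 3 is reachable on localhost and an rclone remote named b2 has already been configured; hostnames, users, passwords, and paths are placeholders:

# Step 1: pull mail from Yahoo into the local archive via the Dovecot IMAP endpoint.
imapsync --host1 imap.mail.yahoo.com --ssl1 --user1 me@yahoo.com --password1 'app-password' \
         --host2 localhost --user2 archive --password2 'local-password'

# Step 2: mirror the Maildir that Dovecot stores to Backblaze B2.
rclone sync /srv/mail-archive/Maildir b2:my-mail-archive --transfers 8 --fast-list

Both commands can be rerun safely, so a daily cron job (or a small sidecar container) could handle the periodic archiving and backup.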

129
 
 
The original post: /r/datahoarder by /u/treezoob on 2025-07-29 17:17:03+00:00.

I have an ancient DS414 that still works. I also have an OptiPlex 7060. I would like to connect the DS414 to the OptiPlex so that the newer system can manage services and function as a NAS. I would like to avoid running anything through the Intel Atom CPU in the DS414. My ideal solution would be connecting the DS414's backplane directly to the OptiPlex, but it appears to use a PCIe-style connector for both data and power.

I like having a nice clean disk enclosure, as the OptiPlex doesn't have as much HDD space as I would like.

Is this doable? If it is, is it a stupid thing to do? All advice is very much appreciated

130
 
 
The original post: /r/datahoarder by /u/andreas0069 on 2025-07-29 15:11:06+00:00.

Hey fellow DataHoarders,

As someone with over 0.5 PB of deployed storage, I’m always hunting for better disk deals—and I wasn’t satisfied with the tools out there. That’s why I built a lightweight tool to track SSD and HDD prices and highlight good deals.

I'd really appreciate your thoughts before I polish it up further:

  • What parts feel smooth or helpful so far?

  • Does anything feel confusing or awkward?

  • What filters or features would you add?

I’m the sole developer behind this side project, so I’ve tried to keep it simple and user-focused—but I’d love to know what would make it genuinely useful for you. You can check it out below, but more than anything I’d welcome feedback—on Reddit or via the email on the contact page.

The data is updated constantly, so not every disk out there may be listed yet, but daily fetch jobs across many Amazon and eBay regions are running at the moment.

Thanks in advance!

HG Software

https://hgsoftware.dk/diskdeal

131
 
 
The original post: /r/datahoarder by /u/Reasonable_Sport_754 on 2025-07-29 14:59:41+00:00.

I've been thinking about this, and I wanted to hear your thoughts on pros, cons, use-cases, anything you feel is relevant, etc.

I found this repo: https://github.com/ambv/bitrot . Its single feature is to recursively hash every file in a directory tree and store the hashes in a SQLite DB. If both the mtime and the file content have changed, it updates the hash; otherwise it alerts the user that the file has changed (bit rot or other problems). It got me thinking: what does Snapraid bring to the table that this doesn't?

AFAIK, Snapraid can recreate a failed drive from the parity information, which a DIY method couldn't (without recreating Snapraid, at which point, just use Snapraid).

But Snapraid requires a dedicated parity drive, using up a drive you could otherwise fill with data (of course the hash DB would take up space too). Also, you could back up the hash DB from a DIY method.

Going DIY would mean if a file does bit rot, you would have to go to a backup to get a non-corrupt copy.

The repo I linked hasn't been updated in 2 years, and SHA1 may be overkill (wouldn't MD5 suffice?). So I'm asking in a general sense, not specifically this exact repo.

It also depends on the data in question: a photo collection is much more static than a database server. Since Snapraid only suits more static data, let's focus on that use case.
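For what the DIY approach looks like in practice, here is a minimal sketch (mine, not from the linked repo) using plain sha256sum instead of a SQLite DB; the paths are placeholders, and unlike the repo's mtime-aware logic this naive version flags any change, so legitimate edits have to be re-baselined by hand:

# First run: record a checksum for every file in the archive.
# Later runs: re-hash everything and report files whose checksum no longer matches.
ARCHIVE=/mnt/photos
MANIFEST=/var/lib/integrity/manifest.sha256

mkdir -p "$(dirname "$MANIFEST")"
if [ ! -f "$MANIFEST" ]; then
    find "$ARCHIVE" -type f -print0 | xargs -0 sha256sum > "$MANIFEST"
else
    sha256sum --check --quiet "$MANIFEST" || echo "Mismatch found: check mtimes before blaming bit rot."
fi

Like the repo, this only detects corruption; recovery still means going to a backup, which is exactly the gap Snapraid's parity fills.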

132
 
 
The original post: /r/datahoarder by /u/Horror_Ad1740 on 2025-07-29 12:20:08+00:00.
133
 
 
The original post: /r/datahoarder by /u/Flimsy_Tomatillo4874 on 2025-07-29 12:14:59+00:00.

I'm from Asia and working on my thesis alone. My research is focused on cinema marketing strategies in the Philippines, and I’m having a hard time gathering secondary data, especially financial data. I’ve already tried emailing several government agencies, but they told me the data isn't available.

I found what I need on Statista, but it requires a professional account. I really wish I had one right now 😭

If anyone could help me access this data, I’d be so grateful:

https://www.statista.com/outlook/amo/media/cinema/philippines

Thank you so much in advance. I can send my email if needed.

134
 
 
The original post: /r/datahoarder by /u/7dsfalkd on 2025-07-29 11:25:29+00:00.

Some time ago, CMC changed the formulation of their M-DISC BD-Rs. The material was visually different, and the media IDs also changed. It generated some controversy, including here on Reddit.

To find out how reliable these discs are, I took two standard BD-Rs (CMCMAGBA5), two M-DISC BD-Rs (VERBATIMe), and two DVD+Rs (MCC 004), burned data onto them (Pioneer BDR-UD03), and left them outside, exposed to the elements, for about four months.

The result was that the DVD+Rs and standard BD-Rs were literally physically destroyed; the carrier material just vanished.

The M-DISCs looked better, but unfortunately none of them could be read anymore; the drive reported an "unknown media" error.

That experiment really made me reconsider my backup strategy, and I can no longer really trust optical media. What are your thoughts/backup strategies?

You can read more about the experiment, including some pictures, here: https://umij.wordpress.com/

135
 
 
The original post: /r/datahoarder by /u/Miloldr on 2025-07-29 10:17:46+00:00.

If I use api.php with action=parse or action=expandtemplates, the output still contains a lot of incomplete markup, and if I try to download the HTML and convert it to Markdown, that doesn't work out great either.
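For reference, one hedged pipeline (my sketch, not something from the post) is to let the wiki render the page with prop=text, which expands templates server-side, and then convert the resulting HTML; the endpoint and page title below are placeholders, and it assumes curl, jq, and pandoc are installed:

# Fetch the fully rendered HTML of one page via the MediaWiki API, then convert it to Markdown.
WIKI="https://wiki.example.org/w/api.php"
PAGE="Example_page"

curl -s "$WIKI?action=parse&page=$PAGE&prop=text&formatversion=2&format=json" \
  | jq -r '.parse.text' \
  | pandoc -f html -t gfm -o "$PAGE.md"

The result usually still needs cleanup, but it avoids re-implementing template expansion yourself.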

136
 
 
The original post: /r/datahoarder by /u/Not_a_Moose_Man on 2025-07-29 10:02:06+00:00.

Currently I have my NAS mirrored to another computer across the country at a friend's place, just in case. I'd also like to have a copy on some cloud storage medium. I'm only using 11 TB of data out of 24, so I want some suggestions. My setup right now is one copy local and another at my friend's place, so in the end I want a third copy in the cloud.

137
 
 
The original post: /r/datahoarder by /u/Pale_Broccoli_5997 on 2025-07-29 09:10:16+00:00.

Recently I wanted to resize a partition, but I was dumb and didn't check how much free space was left; I think my resize request exceeded the available free space. After the error message, the software restarted my PC into chkdsk, and after the repair finished, the partition became inaccessible. I've fucked up. (Can't show you the screen because the drive has been unplugged.)

138
 
 
The original post: /r/datahoarder by /u/Ostromilski on 2025-07-29 08:23:18+00:00.

Hello, it's a bit of a weird ask, but I'm worried about the recent enforcement of age verification laws in the UK, which is coming soon to the EU and maybe even the US as well. From my perspective, it looks like the internet is getting locked down globally, and there will soon be very few safe havens available. But I'm not here to argue about that; feel free to just call me crazy and that can be that if you'd like :)

I've got my own homelab setup and a good 20TB of free space. What I'm looking for is a collection of media/articles/data, something like a microscopic snapshot of the internet with the most important things included. The purpose for this is obvious, since I'm afraid of censorship of the internet, I'd like to extract as much valuable data right now before it all gets shut down, and use it from my local setup in the future. I can imagine in the future this "snapshot" can be updated by passing around physical media, like people have done in countries like Cuba in the past.

So does anyone know of the existence of such a repository of data, or is this something I'll have to put in the effort to assemble myself? Thanks in advance :)

P.S. I did try searching reddit and online, but I don't know what search terms to even use for this. The things I tried didn't produce any worthwhile results

139
 
 
The original post: /r/datahoarder by /u/Powerful-Ad3561 on 2025-07-29 19:56:12+00:00.

Is there a self-hosted or API-capable alternative to archive.is for bypassing paywalls? 12ft.io and archive.org can't get past the paywalls on the websites I need; only archive.is (and .today, .ph, and so on) can.

140
 
 
The original post: /r/datahoarder by /u/lifeisahighway2023 on 2025-07-29 19:18:42+00:00.

Looking to pick up 2-3 external Hard Drives that would be hooked up to notebooks/mini-pcs. They will be storing a variety of media and will be on 24/7/365 and have constant activity, including file erasures using a program such as Eraser or file shredder.

In the past we had fantastic luck with Seagate Backup Plus drives but they are no longer available. The ones we picked up some 8 yrs ago are all still operating but essentially full, and now "old".

I am a bit behind on what is what in external drives, and I know some have cheap drives inside that are only rated for limited power-on hours per year and are not intended for constant I/O activity.

Has to be available from a seller in America (or Canada).

Appreciate all suggestions. 16 TB to 24 TB is the size goal.

141
 
 
The original post: /r/datahoarder by /u/Future-Raisin3781 on 2025-07-29 18:22:33+00:00.

I need a way to back up my Synology NAS. For a while I was using a 14 TB drive and Hyper Backup, but I've outgrown that approach.

Eventually I'll want to build a second NAS and keep it off-site, but for the medium-term I'm getting antsy about not having a complete backup of my system. Money is a bit tight, so the less I need to spend, the better.

The things that seem the easiest to me currently are:

  1. A multi-bay enclosure with a few disks in some kind of array to make a single volume. It would mostly be used as cold backup that I'd plug directly into the NAS to run an incremental backup from time to time.
  2. Same idea, but with a couple of disks in my PC (running Windows 10 currently). This idea seems... less good, but maybe cheaper and more convenient, since I wouldn't have to buy the enclosure and I'd be able to run incremental backups more frequently/automatically over my home network.

Are there solutions I'm not thinking of? If not, I'm thinking #1 is probably the better way to go. Thoughts? Recommendations for hardware/configuration?

EDIT:

Follow-up question: If/when I get a second NAS setup, does it matter if the second one is Synology? I'm hesitant to buy any more Synology gear, since they seem to be extremely hostile towards consumers lately.

142
 
 
The original post: /r/datahoarder by /u/katanez on 2025-07-29 18:09:35+00:00.

Hi, I'm trying to clone a 500 GB HDD with around 300 GB on it, and I've been stuck at "less than a minute" for the past 8 hours; it took over 6 hours just to get to that point. I'm not sure what I've done wrong, or should I just wait longer and see if it works?

143
 
 
The original post: /r/datahoarder by /u/Dramatic_Profession7 on 2025-07-29 18:06:02+00:00.

Hey all, I know this isn't the typical sort of hard drive question asked here but I found threads on the Xbox and the Gamestop subs and wanted to get the opinion from people more focused on data and hard drives, rather than gaming.

The Xbox Series X/S has a special expansion slot on the back for extra storage. To use it you have to buy one of the purpose-built drives from either Seagate or WD. I don't know all of the technical details, but these are what you have to use if you want to expand storage for Series X/S games; for older games you can use an external (USB-connected) drive.

With the context out of the way, my question is whether a refurbished one of these drives would be fine, or if there are concerns with not buying new. The rough price for new ones is ~$150 for 1 TB or ~$225 for 2 TB. GameStop sells a refurbished 2 TB for $180, so it is a solid savings. In all of the threads I found on r/Gamestop and r/Xbox, people say to just buy new and not risk a refurbished one, but I'm wondering what you think.

Thanks in advance for any help, I know this isn't normally the type of hard drives discussed here.

144
 
 
The original post: /r/datahoarder by /u/fenrirofdarkness on 2025-07-29 11:53:48+00:00.

So I have some data I've kept around for a long while already, almost 1 TB of it, so I'm thinking of either upgrading to 2 TB or maybe going SSD?

The assorted data is mostly documents, PowerPoints, images, and videos.

I was thinking of getting another HDD, but my friend recommended an SSD instead since they are supposedly more durable/hardy. I'm not sure, though, since I read that SSDs need to be powered on regularly, and I would plug it in at most once a year, more likely only once every few years.

I also don't have much money right now as income is tight, so I can't pick both. (Right now I'm leaning toward a 1 TB SSD from Seagate, either the Ultra Compact or the One Touch version.)

145
 
 
The original post: /r/datahoarder by /u/MarcellusDrum on 2025-07-29 11:35:42+00:00.

It took me a couple of years to find a disc of the game by reaching out to a guy on the developer team.

The game is protected by custom DRM; he said it can only be decrypted by his own PC from 2007 (which he no longer has). I have his explicit permission to try and crack it, as even he no longer has a digital copy (and only 2 physical copies, one of which he gave me).

Trying to create an ISO took more than 6 hours to reach around 33%, and it got stuck there.

Any way to actually preserve this thing? It was never released digitally, and you can't even buy it anywhere as far as I know.

The game is Rodwan Operation. An FPS game released by Hezbollah about the Israeli/Lebanese war.

146
 
 
The original post: /r/datahoarder by /u/TheDuke2300 on 2025-07-29 05:34:29+00:00.

New 22 TB IronWolf Pro drives always seem to be out of stock, while 18 TB and 24 TB models seem easier to get hold of.

What’s the deal, any ideas?

147
 
 
The original post: /r/datahoarder by /u/luckyrunner on 2025-07-29 05:15:57+00:00.

I'm considering buying this drive (link to Canadian Amazon). Currently, the price for the 26TB model sits at CA$414 (around CA$16/TB). The primary use-case would be for storing a Plex library of movies and shows, as well as personal photos and videos.

I've never used an external hard drive before -- always stuck with internal drives as I've been told that they are faster and more reliable. But I'm not sure if that's the case anymore, as USB speeds may exceed SATA by now? Plus I just haven't found any internal drives of similar sizes for similar prices.

So, overall, just wondering if this is a good deal or if folks might recommend an alternative setup for a similar price?

148
 
 
The original post: /r/datahoarder by /u/RewanDemontay on 2025-07-29 03:44:54+00:00.

As per the title, I'm wondering what good/decent/not-terrible external hard drives exist. I'm thinking of something simple to hold a main copy, a backup, and a backup of the backup. I think 1/2/3 TB would be ample since I don't have all that much. Something I can keep stowed and take out/connect easily as needed, and that I can easily transfer my data to, delete from, and shuffle copies around on. All in all, I want something I can use with whatever computer or laptop I might switch to.

General advice/recommendations are the idea, please. I'm not going to interrogate anyone on the details; I'm just seeking leads to start with from those far more knowledgeable than me.

149
 
 
The original post: /r/datahoarder by /u/Various_Candidate325 on 2025-07-29 01:01:40+00:00.

Managing cold storage for a research lab's genomics data. Currently 500 TB, growing 20 TB/month. Debating the architecture for the next 5 years.

Currently we run RAID-60 on-prem, but we're hitting MTBF concerns with 100+ drives. Considering S3-compatible object storage (a MinIO cluster) for better durability.

The requirements are 11-nines durability, occasional full-dataset reads for reanalysis, and POSIX mount capability for legacy pipelines. Budget: $50K initial, $5K/month operational.

RAID gives predictable performance, but rebuild times terrify me. Object storage handles bit rot better, but I'm concerned about egress costs when researchers need full datasets.

Anyone architected similar scale for write-once-read-rarely data? How do you balance cost, durability, and occasional high-bandwidth access needs?
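For the POSIX-mount requirement specifically, one hedged option (my sketch, not from the post) is exposing an S3-compatible bucket to legacy pipelines through rclone's FUSE mount; the remote name, bucket, and mount point below are placeholders, and the remote must be configured first with rclone config:

# Give legacy POSIX pipelines read access to an S3-compatible bucket (e.g. a MinIO cluster).
rclone mount lab-minio:genomics-archive /mnt/genomics \
    --read-only \
    --vfs-cache-mode full \
    --vfs-cache-max-size 200G \
    --daemon

This keeps the object store as the durable tier while legacy tools see an ordinary directory, at the cost of FUSE-level throughput during those full-dataset reanalysis reads.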

150
 
 
The original post: /r/datahoarder by /u/didyousayboop on 2025-07-29 00:37:24+00:00.

Archive Team is a collective of volunteer digital archivists.

Currently, Archive Team is running a project to archive billions of goo.gl links before Google shuts down the link shortener on August 25, 2025.

You can contribute by running a program called ArchiveTeam Warrior on your computer. Similar to folding@home, SETI@home, or BOINC, ArchiveTeam Warrior is a distributed computing project that lets anyone join in on a project.

For this project, you should have at least 150 GB of free disk space and no bandwidth caps to worry about. You will be continuously downloading 1-3 MB/s and will need to temporarily store a chunk of data on your computer. For me, that chunk has gotten as large as ~90 GB and that's only what I happened to spot.

Here's how to install and run ArchiveTeam Warrior.

Step 1. Download Oracle VirtualBox: https://www.virtualbox.org/wiki/Downloads

Step 2. Install it.

Step 3. Download the ArchiveTeam Warrior appliance: https://warriorhq.archiveteam.org/downloads/warrior4/archiveteam-warrior-v4.1-20240906.ova (Note: The latest version is 4.1. Some Archive Team webpages are out of date and will point you toward downloading version 3.2.)

Step 4. Run Oracle VirtualBox. Select "File" → "Import Appliance..." and select the .ova file you downloaded in Step 3.

Step 5. Click "Next" and "Finish". The default settings are fine.

Step 6. Click on "archiveteam-warrior-4.1" and click the "Start" button. (Note: If you get an error message when attempting to start the Warrior, restarting your computer might fix the problem. Seriously.)

Step 7. Wait a few moments for the ArchiveTeam Warrior software to boot up. When it's ready, it will display a message telling you to go to a certain address in your web browser. (It will be a bunch of numbers.)

Step 8. Go to that address in your web browser or you can just try going to http://localhost:8001/

Step 9. Choose a nickname (it could be your Reddit username or any other name).

Step 10. Select your project. Next to "goo.gl", click "Work on this project". You can also select "ArchiveTeam’s Choice" and it should assign you to the goo.gl project anyway.

Step 11. Confirm that things are happening by clicking on "Current project" and seeing that a bunch of inscrutable log messages are filling up the screen.
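If you'd rather not run VirtualBox, Archive Team also distributes the Warrior as a Docker image. A minimal sketch follows; the image path is the one listed on the Archive Team wiki at the time of writing, so double-check it there before running:

# Hedged alternative to the VirtualBox appliance: run the Warrior in Docker.
docker run -d \
    --name archiveteam-warrior \
    --restart=unless-stopped \
    --publish 8001:8001 \
    atdr.meo.ws/archiveteam/warrior-dockerfile

Then open http://localhost:8001/ to pick a project, the same as Step 8 above.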
