It's A Digital Disease!


This is a sub that aims at bringing data hoarders together to share their passion with like minded people.

3926
 
 
The original post: /r/datahoarder by /u/HoardingBitByBit on 2025-02-02 19:21:01.

I have roughly 50 TB of movies and TV shows on a RAID 6 NAS that runs Plex. Up until now, I haven't put much thought into backing that data up, mainly because it's ... well ... just movies, and I can probably get it back in case of a disaster, but it would certainly be annoying and time-consuming. Looking at other posts here, some share that sentiment (e.g. https://www.reddit.com/r/DataHoarder/comments/1es4ry2/do_you_guys_backup_your_movies/ ).

I have looked into multiple backup solutions like Borg and others, but they all lack the one feature I'm after and provide much more than I need for this kind of data. I'm considering writing a tool for this use case, but before I start I wanted to ask here whether something like it already exists.

My thinking was: why not just buy a few 20 TB Exos HDDs, store the data on them, and keep the disks mostly offline. From time to time, run a script / tool that makes compressed archives of individual movie directories and stores them on a disk. The backup disk would not be online all the time, only while backing up stuff. In case my main NAS goes to hell beyond my RAID 6 fault tolerance, at least I have a single backup of everything. I could build a database, stored on the NAS, that keeps track of file hashes and which disk each item is backed up to. In case a directory changes because of updates (TV shows, or improved encoding), the backup can be replaced or updated.

To be clear: this should not become a full-fledged backup solution for business-critical data with a 3-2-1 scheme. I don't need data deduplication or complex versioning. I just want to make sure that I have a backup of each movie / TV show somewhere. I just want to attach an HDD to the NAS and run a backup that compresses new files/folders and keeps track of where they are stored. If the disk is full, I will just buy another disk and continue from there. The metadata of what is stored where, when, and with what fingerprint can stay on the NAS, or I can back it up to a cloud host, since it wouldn't be much data.

Is there already something around that does that or something similar? Otherwise I'll probably write something in the near future.
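The workflow described above is small enough that a short script plus a SQLite catalog covers it. Here is a minimal sketch of that idea, not a finished tool: the paths (/mnt/media, /mnt/backupdisk, the catalog location) are placeholders, and directories are fingerprinted by file names and sizes rather than full content hashes to keep the example short.

```python
"""Minimal sketch of the "compressed archive per movie directory" workflow.
Paths below are placeholders, not from the original post."""
import hashlib
import sqlite3
import tarfile
from pathlib import Path

MEDIA_ROOT = Path("/mnt/media")           # media library on the NAS (placeholder)
BACKUP_DISK = Path("/mnt/backupdisk")     # currently attached offline disk (placeholder)
CATALOG = "/mnt/nas/backup_catalog.db"    # metadata stays on the NAS (placeholder)

def dir_fingerprint(directory: Path) -> str:
    """Hash file names and sizes so changed or updated directories are detected."""
    h = hashlib.sha256()
    for f in sorted(directory.rglob("*")):
        if f.is_file():
            h.update(f"{f.relative_to(directory)}:{f.stat().st_size}".encode())
    return h.hexdigest()

def main() -> None:
    db = sqlite3.connect(CATALOG)
    db.execute("""CREATE TABLE IF NOT EXISTS backups
                  (name TEXT PRIMARY KEY, fingerprint TEXT, disk TEXT, archive TEXT)""")
    disk_label = BACKUP_DISK.name
    for movie_dir in sorted(p for p in MEDIA_ROOT.iterdir() if p.is_dir()):
        fp = dir_fingerprint(movie_dir)
        row = db.execute("SELECT fingerprint FROM backups WHERE name = ?",
                         (movie_dir.name,)).fetchone()
        if row and row[0] == fp:
            continue  # already backed up and unchanged
        archive = BACKUP_DISK / f"{movie_dir.name}.tar.gz"
        with tarfile.open(archive, "w:gz") as tar:
            tar.add(movie_dir, arcname=movie_dir.name)
        db.execute("INSERT OR REPLACE INTO backups VALUES (?, ?, ?, ?)",
                   (movie_dir.name, fp, disk_label, str(archive)))
        db.commit()
        print(f"archived {movie_dir.name} -> {archive}")

if __name__ == "__main__":
    main()
```

Since video is already compressed, a plain tar ("w:" mode) would be nearly as small and much faster than gzip; gzip appears here only because the post asks for compressed archives.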

3927
 
 
The original post: /r/datahoarder by /u/Honest_Cheetah8458 on 2025-02-03 00:24:41.

I'm a relatively paranoid person. With all the .gov sites being taken away, I want to ensure I at least have a copy of relevant information. I don't have much downloaded, just pretty much some movies, albums, and the Kiwix Wikipedia file. I'm mainly concerned with CDC info and Climate reporting. Can y'all help me at the start of this journey?

Thank you so much, you all seem to be excellent people for excellent causes

3928
 
 
The original post: /r/datahoarder by /u/a_vanilla_malted on 2025-02-02 23:50:55.

Hi all - first off, thank you for making it harder for our government to conduct virtual book burnings. I'm not a regular on this sub, but I am a health sciences researcher. I was wondering about bioRxiv and medRxiv, the preprint servers. With the latest banned terms from the US government and the demands to retract papers even after they've been accepted to journals, posting to preprint servers would be one way to have a copy before papers are officially published or retracted. Would it be helpful to back things up there, or are bioRxiv and medRxiv considered sufficiently backed up themselves?

3929
 
 
The original post: /r/datahoarder by /u/Independent_Echo_363 on 2025-02-02 22:52:29.

Summary: Recently I saw the ORICO-9858T3, a 5-bay DAS with Thunderbolt 3 connectivity. Its advertisement images say Orico's labs achieved 800 MB/s transfer rates, and I believe that speed is limited by the number of hard drives, even though the connection is 40 Gbps. However, it has two 40 Gbps ports, allowing another DAS to be daisy-chained. My question: if I connect two units of this DAS to each other, can I set up a single RAID across those 10 disks (5 from each), so that I can really take advantage of Thunderbolt 3?

Note: I am aware that it does not have a hardware RAID controller; the ad itself mentions that RAID needs to be done in software. In my case, I usually use OWC's SoftRAID to create my RAIDs because I find it more efficient and secure, and it offers extensive monitoring of my drives.


Complete explanation:

I work with video editing and VFX, so I need large storage capacity with good speed. My dream setup would be to buy an OWC ThunderBay 8 because, with its Thunderbolt connection and using RAID 5 across its 8 bays, I could achieve transfer speeds of around 1,300 MB/s, with the ability to connect up to 8 more of these devices in a Daisy Chain Connection. However, I live in Brazil, and this hardware is not available here, making it unfeasible to import from the U.S.

The solutions I have, then, are imports from China via AliExpress. There are two models there that catch my attention:

TerraMaster D8 Hybrid - Based on my research, this is probably the one I'll buy. It's not as fast as the OWC Thunderbolt, but it's one of the few DAS devices I've found with USB 3.2 - 10 Gbps.

This is a hybrid enclosure; in addition to its 4 HDD bays, it has 4 M.2 NVMe slots, which, although limited by the 10 Gbps bandwidth, can still provide the maximum transfer speed of the machine in a single slot (approximately 980 MB/s). This is interesting because, in RAID 5 with just 4 HDD bays, I would only reach a maximum of 600 MB/s, which would be insufficient for my daily work. So, I'm thinking of using the SSDs in parity to work directly on them safely, as I would always have a backup, and after finishing the project, I would transfer it to the HDD RAID, which serves as my long-term storage.

These 10 Gbps are sufficient for my current editing needs; practically any resolution and video codec can run at this speed. However, it's already somewhat outdated considering the size of files being generated nowadays. So, thinking about storage solutions designed for the long term, something with higher speeds would be more interesting.

But beyond speed, which would already be sufficient, my biggest concern with this TerraMaster is its expandability. It only has one connection, which is direct to the PC, so if one day I want to use daisy chaining, it won't be possible to connect anything to it. What I could do in the future is buy another module that allows daisy chaining and then connect this TerraMaster D8 to it, but in doing so, I would lose the functionality of the 4 NVMe slots. For this reason, I would prefer a Thunderbolt solution with only HDDs and the possibility of expansion. This way, I could achieve the desired speeds using just the HDDs and have peace of mind to expand my pool in the future.

ORICO-9858T3 - This one seemed like my solution since it is Thunderbolt 3 and allows for expansion. But I'm not sure if the Daisy Chain Connection will work 100% on Thunderbolt's 40 Gbps bandwidth, so that if I connected two of these devices and created a RAID 5 across their 10 disks (5 bays each), I could achieve speeds of over 1,500 MB/s.
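As a rough sanity check on that last paragraph: sequential RAID 5 throughput scales roughly with the number of data disks, so the real question is whether ten spinning disks even approach Thunderbolt 3's usable PCIe bandwidth. A back-of-the-envelope sketch, where the per-disk rate is an assumption rather than a measured figure:

```python
# Rough estimate for the 10-disk RAID 5 idea: sequential reads scale with ~N-1 disks.
per_disk_mb_s = 180          # assumed sequential rate of a large HDD, not measured
disks = 10                   # 5 bays per enclosure, two enclosures daisy-chained
raid5_read = (disks - 1) * per_disk_mb_s
tb3_usable = 2800            # roughly 22 Gbps of TB3's 40 Gbps is usable for PCIe data
print(f"estimated RAID 5 sequential read: {raid5_read} MB/s "
      f"(Thunderbolt 3 ceiling about {tb3_usable} MB/s)")
```

Whether a daisy-chained second enclosure can actually push its five disks' traffic through the first unit's controller without becoming the bottleneck is the part only Orico's documentation or real-world testing can confirm.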

3930
 
 
The original post: /r/datahoarder by /u/94ArcadeEdition on 2025-02-02 22:48:43.

New to the subreddit, so I apologize if this has been asked already. I'm scratching my head over this issue:

I have a Twitter account with well over 50K bookmarks; I've been there for nearly 20 years at this point.

Point notwithstanding: with all of the political turmoil that's been developing, I'd like to delete said account.

However, I use those bookmarks for work, personal research, and the like. Is there a program I can download that can extract all of those bookmarks for me? The data I've collected there is too precious for me to just ax the account.

Is there a Firefox Add-On/Chrome Extension? Something simple? Or is this process complicated? Any and all feedback is wholly appreciated, thank you.
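One scripted route is the X/Twitter API v2 bookmarks endpoint, which pages through a user's bookmarks. A rough sketch follows; it assumes you have an OAuth 2.0 user-context token with the bookmark.read scope and an API access tier that still allows this endpoint, and USER_ID and TOKEN are placeholders.

```python
"""Sketch: export bookmarks via the Twitter/X API v2 bookmarks endpoint.
Assumes OAuth 2.0 user-context auth with the bookmark.read scope."""
import json
import requests

USER_ID = "YOUR_NUMERIC_USER_ID"   # placeholder
TOKEN = "YOUR_OAUTH2_USER_TOKEN"   # placeholder

url = f"https://api.twitter.com/2/users/{USER_ID}/bookmarks"
headers = {"Authorization": f"Bearer {TOKEN}"}
params = {"max_results": 100, "tweet.fields": "created_at,author_id,text"}
bookmarks = []

while True:
    resp = requests.get(url, headers=headers, params=params, timeout=30)
    resp.raise_for_status()
    page = resp.json()
    bookmarks.extend(page.get("data", []))
    next_token = page.get("meta", {}).get("next_token")
    if not next_token:
        break  # no more pages
    params["pagination_token"] = next_token

with open("bookmarks.json", "w", encoding="utf-8") as f:
    json.dump(bookmarks, f, ensure_ascii=False, indent=2)
print(f"saved {len(bookmarks)} bookmarks")
```

Each bookmarked tweet's ID, text, and author ID end up in bookmarks.json, so the data survives even after the account is deleted.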

3931
 
 
The original post: /r/datahoarder by /u/Daconby on 2025-02-02 22:29:35.

Any way of monitoring the temperature of an LSI SAS card while it's running? It's a 9305 on Debian if it matters.

3932
 
 
The original post: /r/datahoarder by /u/aqsgames on 2025-02-02 21:55:51.

This is a small subreddit so few will know what you guys are doing. But on behalf of the many who don’t know, thank you, thank you, thank you. You are doing a wonderful thing

3933
 
 
The original post: /r/datahoarder by /u/surfingstoic on 2025-02-02 21:44:46.
3934
 
 
The original post: /r/datahoarder by /u/WishboneIntrepid3088 on 2025-02-02 21:36:56.

The mobo will have a PCIe 3.0 x16 slot running at x4.

I'm curious whether the SAS9300-16i will run and have all ports usable: https://amzn.asia/d/c6FEGl0 Also, is this a good deal? Is it a good card?

Idk, I'm new to this stuff. Midway through this year I'll get my main PC upgraded and use the current PC as a NAS/server (Minecraft server, Unraid probably, maybe some other games, Plex, torrenting).

To start with I'll get 3-4 drives, but I will need more drives in the future, so I've been researching ways to do this.
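On the bandwidth question: the SAS9300-16i is a PCIe 3.0 x8 card, and in a slot wired at x4 it should still work and expose all 16 ports, just capped at roughly four lanes of host bandwidth. A rough check, with the per-drive speed an assumption:

```python
# Rough ceiling for a SAS9300-16i in a PCIe 3.0 slot electrically wired as x4.
lane_mb_s = 985                     # PCIe 3.0 payload rate per lane
link_mb_s = 4 * lane_mb_s           # x4 link
hdd_mb_s = 180                      # assumed sequential rate per HDD
for drives in (4, 8, 16):
    need = drives * hdd_mb_s
    verdict = "fine" if need <= link_mb_s else "link-limited"
    print(f"{drives:2d} drives need ~{need} MB/s vs link ~{link_mb_s} MB/s ({verdict})")
```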

3935
 
 
The original post: /r/datahoarder by /u/oromis95 on 2025-02-02 21:35:52.

Already posted in the Internet Archive subreddit, but thought I'd share here too.

What if there could be a backup of the internet archive hosted by volunteers?

  • It would have to be different from traditional torrenting, more similar to BOINC, where data is stored in blocks rather than files. Volunteers should have control over the subject matter of the content, but not the individual files, so they aren't liable in case of piracy claims. The default configuration would be for each volunteer to store the next block that hasn't been backed up yet (a toy sketch of such a block registry follows this list).

  • In my mind the project would back-up the whole archive, then start over to increase availability of data. Yes, I am aware the project is over 50PB, I still think it's doable.

  • Scientific data, content at risk due to censorship, and data over 50 years old could be prioritized. This would occur democratically.
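A toy sketch of the block-assignment idea referenced in the first bullet; the block size, IDs, and in-memory registry are purely illustrative, and a real system would need persistence, verification, and handling of volunteers who disappear.

```python
"""Toy sketch: a registry that always hands out the least-replicated block."""
import heapq
from collections import defaultdict

BLOCK_SIZE = 1 << 30  # 1 GiB blocks (illustrative)

class BlockRegistry:
    def __init__(self, total_blocks: int):
        # min-heap of (replica_count, block_id): least-replicated block comes first
        self.heap = [(0, block_id) for block_id in range(total_blocks)]
        heapq.heapify(self.heap)
        self.holders = defaultdict(set)

    def assign_next(self, volunteer: str) -> int:
        """Give a volunteer the block with the fewest replicas so far."""
        count, block_id = heapq.heappop(self.heap)
        self.holders[block_id].add(volunteer)
        heapq.heappush(self.heap, (count + 1, block_id))
        return block_id

registry = BlockRegistry(total_blocks=8)
for volunteer in ("alice", "bob", "carol"):
    print(volunteer, "->", registry.assign_next(volunteer))
```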

3936
 
 
The original post: /r/datahoarder by /u/ProfessionalSolid692 on 2025-02-02 21:17:27.
3937
 
 
The original post: /r/datahoarder by /u/signalwarrant on 2025-02-02 21:12:25.

Would like to help archiving, what do we think needs to be archived ahead of more dumbassery?

3938
 
 
The original post: /r/datahoarder by /u/madhatton on 2025-02-02 20:20:25.
3939
 
 
The original post: /r/datahoarder by /u/Life_Memory_5754 on 2025-02-02 19:17:05.

I'm not a technical person but was curious if anyone is thinking about how the administration might manipulate historical Consumer Price Index data? I imagine they may want to alter the narrative around the impact of their upcoming tariffs against Mexico, China, and Canada.

3940
 
 
The original post: /r/datahoarder by /u/Lelo_B on 2025-02-02 18:11:54.

I am tech illiterate, but I work in public health.

I've seen many sources here, like EOTW and u/VeryConsciousWater archiving all of these pages, but when I click on them I just see random files and text. It feels like I'm looking into the Matrix. I just don't have the eyes or brain to make sense of all of this.

I specifically want to find every CDC webpage for the HIV/Sexual and Reproductive Health site, the Injury Prevention site, and the School & Adolescent Health site. There are probably a dozen or two pages associated with each site.

How could I find a site map (with all associated pages) of each CDC site from Jan. 31 or earlier? I figure if I get a list of URLs, I can find them all in Wayback Machine.
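If a list of URLs is the goal, the Wayback Machine's CDX API can produce exactly that: every captured URL under a path prefix, filtered to snapshots from before a given date. A sketch follows; the cdc.gov/hiv/ prefix is an example, and the other sites' paths can be swapped in the same way.

```python
"""Sketch: list archived CDC URLs via the Wayback Machine CDX API."""
import requests

def archived_urls(prefix: str, before: str = "20250131") -> list[str]:
    params = {
        "url": prefix,
        "matchType": "prefix",      # everything under this path
        "to": before,               # snapshots captured on or before this date
        "output": "json",
        "fl": "original,timestamp",
        "collapse": "urlkey",       # one row per distinct URL
        "filter": "statuscode:200",
    }
    rows = requests.get("https://web.archive.org/cdx/search/cdx",
                        params=params, timeout=60).json()
    # first row is the header; build Wayback links for the rest
    return [f"https://web.archive.org/web/{ts}/{url}" for url, ts in rows[1:]]

for link in archived_urls("cdc.gov/hiv/")[:20]:
    print(link)
```

Each printed link opens the snapshot directly in the Wayback Machine, so once the list exists there is no further command-line work.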

3941
 
 
The original post: /r/datahoarder by /u/aequitssaint on 2025-02-02 15:18:46.

I have a few TB I'm willing to spare for archiving and public info availability. I already have the CDC database that was posted here recently, but what else should I be adding?

I don't currently have the ability to host a mirror, yet, but I am more than happy to torrent just about anything.

3942
 
 
The original post: /r/datahoarder by /u/cptfraulein on 2025-02-02 15:16:25.

tl;dr: can we archive the National Library of Medicine and/or PubMed?

Hi folks, unfortunately I am completely unversed in data hoarding and am not a techie but I am in public health and the recent set of purges has affected myself and colleagues. A huge shout out and a million thanks to all of you for being prescient and saving our publicly available datasets/sites. I don't think it's overstating to say that all of you may very well have saved our field and future, not to mention countless lives given the downstream effects of our work.

Since I don't (yet) know how to do things like archive, I wanted to flag/ask for help in terms of archiving the National Library of Medicine. I know myself and colleagues use PubMed and PubMed Central every day and I worry about articles and pdfs being pulled or unsearchable in the coming days. This includes stuff like MMWRs, which are crucial for clinical medicine and outbreak alerts.

Does anyone have an archive of either NLM or PubMed yet? If not, is anyone able to do so? Is it even possible? In my limited Googling, the only thing I kept finding was that I could scrape for specific keywords but the library is so broad that doesn't feel tenable. Thanks in advance for your help and comments. Y'all rock, so much.
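On the "is it even possible" question: NLM already distributes the entire PubMed citation database as bulk gzipped XML (an annual "baseline" plus daily update files), and PMC publishes bulk packages of its open-access subset, so mirroring the metadata is mostly a download job rather than a keyword-scraping job. A sketch for the baseline files; verify the directory URL and NLM's usage guidance before running a full mirror.

```python
"""Sketch: mirror the PubMed annual baseline (gzipped XML) from NLM's public server."""
import re
from pathlib import Path

import requests

BASELINE = "https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/"
OUT = Path("pubmed_baseline")   # local output directory (placeholder)
OUT.mkdir(exist_ok=True)

# scrape the directory listing for the gzipped XML file names
index = requests.get(BASELINE, timeout=60).text
files = sorted(set(re.findall(r'href="(pubmed\w+\.xml\.gz)"', index)))

for name in files:
    target = OUT / name
    if target.exists():
        continue  # resume-friendly: skip files already downloaded
    with requests.get(BASELINE + name, stream=True, timeout=300) as resp:
        resp.raise_for_status()
        with open(target, "wb") as fh:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                fh.write(chunk)
    print("fetched", name)
```

This covers citations and abstracts; full text outside the PMC open-access subset is a much harder, licensing-constrained problem.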

3943
 
 
The original post: /r/datahoarder by /u/superfastturtlepower on 2025-02-02 13:13:32.

I am a disaster researcher who uses a combination of FOIA requests and public-facing data to research FEMA assistance denial rates and reasons in disaster declarations. For instance, one disaster I am working on from 2020 in rural America had a whopping 76% denial rate for Individual Housing Assistance. Let the repercussions of that sink in.

Knowing why denials for federal help happen, and where, is important. And it's under threat.

Please, I beg you, help scrape and archive FEMA data.
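Much of this data is exposed through FEMA's OpenFEMA API, which serves whole datasets as paged JSON, so archiving can be as simple as walking the pages. A sketch follows; the dataset name, API version, and response key are assumptions based on OpenFEMA's public documentation, so check the dataset catalog on fema.gov first.

```python
"""Sketch: pull an OpenFEMA dataset page by page and save it locally."""
import json

import requests

DATASET = "DisasterDeclarationsSummaries"   # assumed OpenFEMA dataset name
BASE = f"https://www.fema.gov/api/open/v2/{DATASET}"
PAGE = 1000
records, skip = [], 0

while True:
    resp = requests.get(BASE, params={"$top": PAGE, "$skip": skip}, timeout=60)
    resp.raise_for_status()
    # OpenFEMA responses carry the rows under a key named after the dataset (assumed)
    batch = resp.json().get(DATASET, [])
    if not batch:
        break
    records.extend(batch)
    skip += PAGE
    print(f"fetched {len(records)} records so far")

with open(f"{DATASET}.json", "w", encoding="utf-8") as fh:
    json.dump(records, fh)
```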

3944
 
 
The original post: /r/datahoarder by /u/HumanButterscotch854 on 2025-02-02 13:13:01.

When my grandpa passed, he left multiple cases of 35 mm film slides he took during his time stationed overseas.

Are there any products or mail-in services you guys would recommend?

3945
 
 
The original post: /r/datahoarder by /u/Corsaer on 2025-02-02 13:03:08.
3946
 
 
The original post: /r/datahoarder by /u/ParticularNebula4663 on 2025-02-02 12:52:02.

Hi there,

I would love to digitise our family's old print photos - for safety more than anything. I don't live in a particularly high-risk place, but the LA fires made me realise they are the only material thing I'd be devastated to lose.

I've seen the Epson V600 recommended, but I was also looking at a Canon PIXMA TS5151 all-in-one wireless inkjet printer, which has a scan resolution of 1200-2400 dpi. I like that it also prints, as I'm sorely lacking a printer at home.

The main intention is to archive them but it would be nice to scan them at such a quality that they would make decent prints as well.
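For the "good enough to reprint" part, the arithmetic is simple: the scan resolution needed is roughly the output print resolution (about 300 dpi) times the enlargement factor you want. A quick back-of-the-envelope:

```python
# Required scan dpi = print dpi (about 300) x enlargement factor.
print_dpi = 300          # typical photo-print output resolution
original_in = (4, 6)     # source print size in inches (example)
for scan_dpi in (600, 1200, 2400):
    enlargement = scan_dpi / print_dpi
    w, h = (round(s * enlargement) for s in original_in)
    print(f"{scan_dpi} dpi scan of a 4x6 -> reprints up to about {w}x{h} inches at 300 dpi")
```

By that yardstick, a 600 dpi scan of a 4x6 already supports a 2x enlargement, so either device's resolution is enough on paper; optics and colour handling are where dedicated photo scanners tend to differ.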

What do you all think? And please explain like I'm a child - thank you!

3947
 
 
The original post: /r/datahoarder by /u/LAXBASED on 2025-02-02 12:50:14.

Examples: from Project Gutenberg to datasets of CDC info to full archives of Game Informer magazines and such, what do you think is worth sharing a collection / full archive of?

3948
 
 
The original post: /r/datahoarder by /u/mottenkug3l on 2025-02-02 11:14:16.

In light of the current scrub of valuable information from US government websites, I'm wondering whether there are communities archiving, e.g., German (government) websites?

3949
 
 
The original post: /r/datahoarder by /u/causal_triangulation on 2025-02-02 11:05:55.

I'm downloading my favourites.

https://x.com/opensauceai/status/1885483639611531704

3950
 
 
The original post: /r/datahoarder by /u/matefeedkill on 2025-02-02 04:31:04.

https://ntrs.nasa.gov/. The corpus is about 6TB.
