It's A Digital Disease!


This is a sub that aims to bring data hoarders together to share their passion with like-minded people.

7676
 
 
The original post: /r/datahoarder by /u/Soundwave_47 on 2024-08-05 19:52:38.

Slack messages from inside a channel the company set up for the project show employees using an open-source YouTube video downloader called yt-dlp, combined with virtual machines that refresh IP addresses to avoid being blocked by YouTube. According to the messages, they were attempting to download full-length videos from a variety of sources including Netflix, but were focused on YouTube videos. Emails viewed by 404 Media show project managers discussing using 20 to 30 virtual machines in Amazon Web Services to download 80 years' worth of video per day.

“We are finalizing the v1 data pipeline and securing the necessary computing resources to build a video data factory that can yield a human lifetime visual experience worth of training data per day,” Ming-Yu Liu, vice president of Research at Nvidia and a Cosmos project leader, said in an email in May.

The article discusses their methods for many other sources as well: http://archive.is/Zu6RI
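The downloader invocation at the core of such a pipeline is simple; a minimal sketch (the helper function and output template here are illustrative, not from the article):

```python
import subprocess

def ytdlp_cmd(url, outdir):
    """Build a yt-dlp invocation that saves a video as <video id>.<ext>."""
    return ["yt-dlp", "--output", f"{outdir}/%(id)s.%(ext)s", url]

# Running the command requires yt-dlp on PATH, e.g.:
# subprocess.run(ytdlp_cmd("https://www.youtube.com/watch?v=...", "videos"), check=True)
```

The IP rotation described in the messages happens outside the downloader itself: each VM gets a fresh address, so yt-dlp on that VM is just a plain download.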

7677
 
 
The original post: /r/datahoarder by /u/SfanatiK on 2024-08-05 19:40:46.

I want to start hoarding media I find on the internet. Mainly video games, music, images and shows. I already have ~15TB of stuff spread out across different HDDs/SSDs/USB sticks. I am kind of second-guessing which path I should take: build a dedicated file-storage PC, or just go for a simple DAS.

PC Build:

    • Easily run Unraid, TrueNAS or Proxmox.
    • Upgradeable for other usage, like Plex or hosting VMs if I ever want to use those.
    • Can add more HDD storage by just buying SAS HBA cards
    • ECC
    • Can get really expensive (~$800+)
    • Used parts might not work out of the box
    • Limited documentation to troubleshoot problems
    • Higher power consumption than a DAS, and it needs a separate UPS.
    • Bigger footprint means it will be under my desk which is a dust trap.

DAS Route:

    • Much cheaper than building a new PC (~$150)
    • Can be turned off when not needed, so drives aren't on all the time, reducing wear and tear.
    • Small form and can fit on my desk
    • (Pro and con) An enclosure with a backplane means using my motherboard's SATA ports; it's limited to 4 drives but avoids a USB connection.
    • USB connection, which I read is not recommended with SnapRAID
    • More HDD bays get expensive and can reach the same prices as building a used PC.
    • No ECC

Which would you guys think is better for me?

I do not need fast transfer speed. I do not need 24/7 access to my files. However I do want data resiliency; a song I downloaded now should still be the same 5+ years in the future. Kind of leaning towards a DAS setup because I really, really don't want to deal with the headache and the cost of building a PC with used parts.

If I go the DAS route, has anyone had any problems using SnapRAID and DrivePool over a USB-C/B connection? I would probably buy a 5-bay enclosure: 1 parity and 4 raw storage drives. Or is it better to find one with a backplane and use my motherboard's SATA ports rather than USB, with 1 parity and 3 raw storage drives? 10-bay enclosures exist, but those cost almost the same as building a PC, so there's no reason for me to get those. Some of these DAS units also have some form of hardware RAID, so I need to avoid those.
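For what it's worth, the 1-parity/4-data layout in the question maps directly onto SnapRAID's config file; a minimal sketch, with hypothetical mount points:

```
# snapraid.conf sketch: one parity drive, four data drives (paths are examples)
parity /mnt/parity1/snapraid.parity

# content files (SnapRAID recommends keeping copies on several drives)
content /mnt/disk1/snapraid.content
content /mnt/disk2/snapraid.content

data d1 /mnt/disk1
data d2 /mnt/disk2
data d3 /mnt/disk3
data d4 /mnt/disk4
```

After editing the config, `snapraid sync` builds parity and a periodic `snapraid scrub` checks for silent corruption, which addresses the bitrot concern regardless of whether the drives sit behind USB or SATA.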

What are some DAS you guys would recommend? I also read that some DAS enclosures don't play well with others: removing HDDs and plugging them into a different enclosure, or directly into the motherboard's SATA ports, doesn't show the drives. I am not sure if this is true or a common thing.

Or just bite the bullet and build a PC, and deal with the constant noise and heat (I read that turning a server PC on and off repeatedly can wear it down faster), along with higher electricity consumption, which leads to a higher electricity bill.

Lastly, my backup solution is pretty rudimentary: just files on a different HDD in cold storage at the moment. There's no RAID applied to them, so they're not protected against bitrot and other silent corruption. But once I get either the DAS or the PC, I'll just make a second copy in the same format (DrivePool and SnapRAID) and put it in cold storage.

7678
 
 
The original post: /r/datahoarder by /u/C96Alia on 2024-08-05 18:52:29.

Hello all! I'm looking to back up a very obscure FTP archive of old BeOS software. Unfortunately, only obscure archives are left (for the most part), and I don't want the data to become dead links, as such archives can.

There's no contact information for the archive owner: the FTP and its portal page have none listed, and they've deleted their Reddit account. Otherwise, I would just ask for the data.

Is there any chance I can do this with an automatic tool, even command line? I need something that will work on Linux, if such a tool exists. It would be tedious and impractical to go through what I suspect to be a thousand or more files on the FTP, download them, and sort them manually.
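wget can mirror an FTP tree in one command (`wget -m --no-parent ftp://host/path/`). For more control, a stdlib-only sketch using ftplib; the host name is a placeholder, and the directory-detection heuristic (try to cwd into each entry) is an assumption that works on most servers:

```python
import os
from ftplib import FTP

def is_dir(ftp, name):
    """Heuristic: an entry is a directory if we can cwd into it."""
    try:
        ftp.cwd(name)
        ftp.cwd("..")
        return True
    except Exception:
        return False

def mirror(ftp, local_dir):
    """Recursively download the current remote directory into local_dir."""
    os.makedirs(local_dir, exist_ok=True)
    for name in ftp.nlst():
        if is_dir(ftp, name):
            ftp.cwd(name)
            mirror(ftp, os.path.join(local_dir, name))
            ftp.cwd("..")
        else:
            with open(os.path.join(local_dir, name), "wb") as f:
                ftp.retrbinary("RETR " + name, f.write)

# Usage (hypothetical host):
# ftp = FTP("ftp.example.org"); ftp.login(); ftp.cwd("/pub/beos"); mirror(ftp, "beos-archive")
```

Note some servers return full paths from NLST; lftp's `mirror` command is the more battle-tested option if the sketch trips over a quirky server.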

7679
 
 
The original post: /r/datahoarder by /u/ReagentX on 2024-08-05 17:29:19.
7680
 
 
The original post: /r/datahoarder by /u/TerraWhoo on 2024-08-05 16:34:33.

New to enclosures and backing up large amounts of data, I keep seeing USB-C or USB 3.0. All the product images show the enclosures wired to Macs, so USB-C.

I have USB-A ports on my PC, I'd rather not use an adapter if not needed (tell me what's good if one is needed), and am interested in a 4-bay metal enclosure. Any suggestions?

7681
 
 
The original post: /r/datahoarder by /u/Gixxerfool on 2024-08-05 14:39:48.

I am currently running an older Mac mini as a computer on my network hosting Plex; I will be adding other functions sooner rather than later.

I have three external drives totaling 15TB connected, storing my media. I am nearing completion of my rips and have decided I don't ever want to do this again. So I am looking at adding a DAS when I am done, specifically the Terramaster D3-400. I want to put in a minimum of four 10TB drives, but if I can afford more I will go for higher capacity.

One of the videos on the Terramaster suggested adding one SSD to the array for higher access speeds. My mini runs dual internal SSDs, about 1.5TB configured in a Fusion setup; is the SSD in the DAS needed?

I'm looking at the recertified WD drives on serverpartdeals at 7200 RPM.

Are there better options?

For clarity, I don't want to spend the extra money on a NAS and don't want to set it up or manage it. I am going to put my externals offsite and maybe subscribe to Backblaze too. That's for future me to figure out.

Thanks for the help.

7682
 
 
The original post: /r/datahoarder by /u/Alexander_Alexis on 2024-08-05 14:25:16.

Is it possible to find urls of things that are kind of like this?

https://domain.com/wp-content/uploads/2024/06/

https://domain.com/wp-content/uploads/

??? for example

https://domain.com/wp-content/uploads/2024/06/Meowmeow.mp4

I'm trying to scrape a website to find stuff like these.
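WordPress uploads are conventionally organized by year and month under `wp-content/uploads/YYYY/MM/`, so one hedged approach is to enumerate those directories and probe each one. A sketch (the domain is a placeholder, and whether a directory listing is actually returned depends entirely on the server's configuration; many return 403):

```python
from itertools import product

def upload_dirs(base, years, months=range(1, 13)):
    """Candidate WordPress upload directories for the given years."""
    return [f"{base}/wp-content/uploads/{y}/{m:02d}/" for y, m in product(years, months)]

# Each candidate can then be fetched (e.g. with urllib.request) to see whether
# the server exposes a directory index; file names inside are not guessable.
```

If directory indexes are disabled, the media URLs can often still be recovered from the site's sitemap.xml or from its pages' HTML rather than by guessing filenames.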

7683
 
 
The original post: /r/datahoarder by /u/ArgyleDiamonds on 2024-08-05 14:01:03.

Question: Is an Additional Local Backup of Photos on My MacBook Pro Necessary?

Hi everyone,

I have a MacBook Pro, iPad Pro, and iPhone Pro, all synced with iCloud for my photos. I keep the original photos downloaded on all devices for offline use (I don't optimize storage).

I'm considering creating an additional local backup of my photos on my MacBook Pro, separate from the current local downloads. Essentially, I want to copy all the photos to another location on my MacBook because I have ample storage space.

Is it redundant to create this second local backup on my MacBook Pro if I already:

  • Store originals in iCloud with photos synced to all my devices.
  • Keep originals downloaded for offline use on each device.

While I have plenty of storage, I'm unsure if this additional backup is necessary or just overkill.

Thanks for your help!


Edit: The backup would be on the internal drive of my MacBook, not an external drive.

7684
 
 
The original post: /r/datahoarder by /u/CiriloTI on 2024-08-05 13:12:32.

I want to build a NAS with 6 of these for a customer, and I would like to know if this model is a reliable one.

https://www.kingstonstore.com.br/products/skc600-256g-ssd-de-256gb-sata-iii-sff-2-5-serie-kc600-para-desktop-notebook

7685
 
 
The original post: /r/datahoarder by /u/Matt_Bigmonster on 2024-08-05 13:06:38.

Just finished encrypting the drives on my PC and my 2 backups, both portable SSDs. One is to be kept with me, the other to go somewhere offsite (this one will be updated every few months). Now, where to keep it? Friends? Work? An abandoned cabin in the woods?

Please can we not talk about network servers and cloud (I use those for important documents and data anyway).

What is a good location for one of your backups?

7686
 
 
The original post: /r/datahoarder by /u/ReclusiveEagle on 2024-08-05 09:06:09.

So just as a general question, is there a better way to store or link related files and data together on Windows?

For example, let's say I have 2 folders.

  1. iPhone Photos
  2. Trips

The iPhone Photos folder will obviously store the photos taken with my phone. The Trips folder will store all photos, videos, text files: anything related to a specific trip. Let's say I go to America.

So the question becomes:

  • Do I only store iPhone Photos from America in the iPhone Photos folder?
  • Do I only store iPhone Photos from America in the Trips folder?
  • Do I duplicate iPhone Photos from America and add them to both folders?

Is there no better way to store data? SQL has relational databases, so you can link Customers to Rooms, for example. And programs like Notion, Obsidian, TiddlyWiki, or any other markdown-based document renderer can link multiple text documents together, regardless of where they are stored, and intelligently update all existing links if a file or folder is moved or renamed. Is there a way to do this with files and data on a hard drive?

Why do folders need to be exclusive? If you want to link data, you have to duplicate it or do something archaic like creating a folder shortcut. But then you have to remember how many shortcuts there are and where they go: can certain data be linked to another folder? Does a new folder link to old data?

Is there a way for me to associate iPhone Photos taken in America to my Trips folder without duplicating images? Or any better way to store data on a hard drive?

This is just a basic example but imagine if you had 1TB of data and a lot of the data could be related to other data in a way that's not simply "This thing is exactly like these other things, so store them only here".

The solution should also not be tied to one file type. Obviously, in this example I could create a Lightroom database, have iPhone Photos tagged "America Trip", and just search for the keywords or tags related to that trip. But then I'd only find images related to the trip and no other documents. I'd also be reliant on paying Adobe [insert amount] per month just to maintain the database.

Is there no way to organize and store data to do this natively with all file types and data?
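On NTFS, one native partial answer is hard links: the same file can appear under both folders without duplicating its data, and every name stays valid until the last one is deleted. (Caveats: links can't span drives, and they carry no "intelligent update" of the kind Obsidian does.) A minimal sketch, with the folder names taken from the example above:

```python
import os

def link_into(src_path, dest_dir):
    """Create a hard link to src_path inside dest_dir (same volume only).

    On Windows this requires NTFS; the linked name uses no extra disk space.
    """
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, os.path.basename(src_path))
    os.link(src_path, dest)
    return dest

# e.g. link_into(r"iPhone Photos\IMG_0001.jpg", r"Trips\America")
```

From cmd, `mklink /H` does the same for files and `mklink /J` creates a directory junction for the folder-level case.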

7687
 
 
The original post: /r/datahoarder by /u/Mashic on 2024-08-05 08:50:50.

Sorry, I don't know where else to ask. I'm uploading a disc ISO item to archive.org, and it's creating mp4 files for it. I don't want it to create any extra mp4s or thumbnail images; how do I make it keep the raw files only?

7688
 
 
The original post: /r/datahoarder by /u/AT28BA on 2024-08-05 04:38:32.

So, I've been looking to make myself a NAS. So far I'll be using a Raspberry Pi 4B, as it's kind of affordable and perfect for what I'm trying to do. But here's the problem: I don't really want to buy new external drives because I already have some HDDs, but I need a (preferably) 4-bay docking station to connect them to the RPi over USB, and I can't find good ones under CAD$100. There are those Chinese brands with some sketchy reviews: https://a.co/d/cM6hKPQ . It seems like a good deal, but I'm not sure about the quality of the internal components.

Please help me if you can, or just advise me. That's a project I really want to do.

7689
 
 
The original post: /r/datahoarder by /u/Wizard_of_Od on 2024-08-05 03:43:21.

Dezoomify-rs is a great program, but it always seems to re-encode. I was wondering if it is possible to grab all of the tiles at the maximum zoom level and losslessly join them together (like you can losslessly crop or rotate JPEGs).

In the past I grabbed a few of the little tiles from the browser cache, and IrfanView told me they were encoded at JPEG quality 84 with chroma subsampling (the default in Dezoomify is JPEG 80 with no chroma subsampling; it doesn't have an option for subsampling). From what I have read, if you have to re-encode, it is best to re-encode with the settings the file was originally created at. However, today I pulled fragments out of the browser cache (using MZCacheView) and noticed a problem. The image tiles had a jfif extension (apparently a subset of JPEG), but nothing could parse them. IrfanView at least gives me an error message, "bogus Huffman table definition", which doesn't mean much to me (I'm not a coder). There is a JFIF marker near the start of the file, and the tiles from one of the images had some metadata, e.g. `<rdf:li>Creekside Digital</rdf:li>`, after which I assume the image data begins.

I managed to get Dezoomify-rs to download the tiles by putting something like DirectoryName.iiif after the URL, but the tiles all seem to have been reencoded.

Also, there doesn't seem to be a way to force Dezoomify-rs to download as lossless (Png) without specifying a Filename followed by the .png extension. I want to maintain the automatic file naming functionality so I can download a batch in one sitting without having to specify filenames one by one.

I tried 2 Python scripts but they didn't work for me.

If anyone is able download without reencoding from GAC, could you tell me what tool you are using and exactly what syntax.

Update 1: I just remembered that JPEG has a dimensional limit (just over 64,000 x 64,000 pixels; in comparison, the newer WebP is only about 16K by 16K). GAC images at zoom levels 7 and 8 (only relatively few are that large) could not be reconstituted as a single file.

I was able to get Dezoomify-rs to download the raw tiles to a specified directory by suffixing `-c DirectoryName`. I'm not sure what to do with them, though. They have names like https_lh3.googleusercontent.com_ci_AL18g_SP6cLRt0FWKGWHxH_TRSc-uHNzi6LmyDPGx3NjZWx6cuXfwkmSGDlq1ANqscwbsyR93EZUdw=x1-y13-z5-tG2Gy1wuJECyG1JJMtpEX9j4DxJk . If I change the humongous extension to jfif, I can open them in an image viewer (IrfanView told me the tile in question was 'JPEG, Adobe RGB (1998), quality: 89, subsampling O'). I'm not sure all of the tiles for every image on GAC have exactly these same parameters, but Dezoomify definitely seems to be throwing away the Adobe RGB colorspace information when it resaves.

Even if I can find a tool to losslessly reassemble the full images, I will still have to manually rename each image. There is no descriptive metadata in the tiles that I can see (e.g. Artist / Title / Date); with most audio files I can rename them from the metadata if need be, using something like Mp3tag.
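For reassembly, the x/y/z position appears to be embedded in the tile name itself (the `x1-y13-z5` fragment in the filename above). A small parser for that pattern, assuming all GAC cache tiles follow it, would at least let a stitching tool place each tile on the grid:

```python
import re

_COORDS = re.compile(r"x(\d+)-y(\d+)-z(\d+)")

def tile_coords(name):
    """Extract (column, row, zoom) from a cached GAC tile filename, or None."""
    m = _COORDS.search(name)
    return tuple(map(int, m.groups())) if m else None
```

A gridding tool such as ImageMagick's `montage` can then assemble the tiles, though montage re-encodes; truly lossless JPEG joining would need a tool that concatenates at the DCT-block level, and tile dimensions must align to the JPEG MCU grid for that to work.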

7690
 
 
The original post: /r/datahoarder by /u/Gadetron on 2024-08-05 01:32:55.

I grabbed the collection files that have all the stuff they have in one go; however, a good number of the files have some sort of __MACOSX folder inside the cbr/cbz archive, which prevents a thumbnail from being made. When I delete the __MACOSX folder, the thumbnail loads normally. Is there a way to bulk delete a folder inside cbz/cbr archives? They will all have the same name.
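For the cbz files (which are just zip archives; cbr is RAR and needs different tooling), the stdlib can rewrite each archive without the `__MACOSX` entries. A sketch, assuming the junk all lives under that one folder name:

```python
import os
import zipfile

def strip_macosx(path):
    """Rewrite a .cbz in place, dropping any __MACOSX entries."""
    tmp = path + ".tmp"
    with zipfile.ZipFile(path) as src, zipfile.ZipFile(tmp, "w") as dst:
        for info in src.infolist():
            if "__MACOSX" in info.filename.split("/"):
                continue  # skip macOS resource-fork junk
            dst.writestr(info, src.read(info.filename))
    os.replace(tmp, path)

# Bulk usage over a collection:
# import pathlib
# for f in pathlib.Path(".").rglob("*.cbz"):
#     strip_macosx(str(f))
```

If Info-ZIP is installed, `zip -d file.cbz "__MACOSX/*"` is the one-liner equivalent for a single file.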

7691
 
 
The original post: /r/datahoarder by /u/Glass-Fix-4624 on 2024-08-05 01:22:58.

Currently, I have about 8 TB of data to store, split between files for backup (4-6 TB) and files in active use. My setup includes three external HDDs (2x 2TB and 1x 4TB) and a 4TB internal HDD in my PC.

Due to past HDD failures, I subscribed to an 8TB cloud service, but it's costly and hard to manage. I'm considering a refurbished 12TB HDD to consolidate my backups and cancel the cloud service, but I'm concerned about data loss.

I've also looked into NAS/DAS systems but need advice on whether they're necessary for my use case.

I need a solution that provides redundancy and is cost-effective, possibly utilizing my existing drives.

Thanks for your time and help! Any suggestions will be appreciated!

7692
 
 
The original post: /r/datahoarder by /u/outm on 2024-08-05 01:08:03.

Hello,

First, I know what you're thinking: "NAND is the worst for backups! do LTO!" but I have a specific use case.

USE CASE: I already have my multimedia (movies, shows) hosted with redundancy on a server, and it doesn't worry me. The movies I really like I even have physical copies of (Blu-ray), so I'm OK with it.

But I have a bit of data (about 300GB or so, I think): just documents, emails, family pics/vids, business data, and personal (health, financial...) data that I really, really want to protect. If the multimedia data one day disappears I won't cry (example: a house fire), but for this data? I certainly would.

My current approach is having all the data in "the cloud" (Google Drive), some of it encrypted with Cryptomator because of sensitive info. Then it's kept synced with my laptop (GDrive for Windows) and with the multimedia server (rclone), which keeps a local copy (decrypting the data, just to have it if Cryptomator corrupts it one day) plus a Backblaze B2 incremental backup (Duplicacy, encrypted). Also, I burn a Blu-ray with the most precious data if I feel like it.

But I want to keep a copy with me at all times; I think it would be nice. For example, when traveling light, at work, or wherever, I could access the data even without internet. My setup would be 3 partitions on that disk: 1 for clear data if I have to share something with another person without having to give passwords, 1 encrypted (BitLocker) with business data, and 1 encrypted (BitLocker, another password) for all the personal and family data (so if I need it at work, I can use it without opening personal files).

Problem is I don't know what is my best option:

  • Samsung Fit/Bar Plus 256/512GB pendrive (45€/$50 for 256GB; I think I could limit myself to 200GB)? I see they are the best pendrives out there; Samsung claims they are physically tough (water resistant, X-rays, hits, whatever) and have 400MB/s read and 100MB/s write speeds, which reviews tend to confirm. What worries me is: wouldn't those pendrives be unreliable? I usually read here that pendrives can't be trusted, and even if they use TLC memory, they don't have a proper controller, so they just go on until they fail: no TRIM, no wear leveling... I worry it would be a bad investment. BUT it would be the smallest option, to keep in a wallet for example.
  • Transcend ESD310C 512GB (64€/$70)? I see it's a bit bigger, but still pendrive-sized, and is a full SSD. It has a Silicon Motion SM2320 controller and Kioxia BiCS5 112L 3D TLC NAND, up to 10Gbps. IDK if it would be easily carried in a wallet (maybe? seems small enough), but it also has a dual interface, which is nice (USB-A and USB-C), and an aluminum body (even if I don't like the plastic parts covering the ports). But is Kioxia's NAND reliable? Would this option be better than the Samsung pendrives?
  • Samsung T7 Shield 1TB (100€/$110 for 1TB, 150€/$163 for 2TB)? This would be surrendering the idea of going for a small-size option and just going for what could be considered the most reliable of these three. Also the most expensive.

In the end, my setup would be: working with files in the cloud, syncing daily with the home server, which will keep 1-3 months of incremental backups (with sensitive folders unencrypted) and upload a copy of these local incremental backups to Backblaze B2 (an encrypted copy).

And, regularly, making a manual copy of all that data (with the Cryptomator files unencrypted) to the pendrive/SSD chosen from the 3 options I described.

What do you think? Tips? Also, if you think I'm overthinking anything tell me. For example, I was thinking of dropping the BackBlaze B2 backup copy, thinking that having the cloud -> local server -> SSD/HDD would be fine.

Also, I wonder if I'm just overthinking the encryption thing and should drop Cryptomator for Google Drive, since it's just some spreadsheets with financial info and family data. That way, I protect myself from another weak point (Cryptomator failing without me noticing) and simplify the backup process.

Thank you all for all your tips/help/POV

7693
 
 
The original post: /r/datahoarder by /u/spicy_rice_cakes on 2024-08-05 00:46:24.

I am the developer of a file manager called Ritt (https://ritt.app), a tag-centric file manager built for Windows 10/11. It combines the simplicity of a file manager with the power of a Digital Asset Management solution. It is designed to organize large volumes of images, videos and documents. It has the following features:

  • Fast and batch tagging of files
  • Create tags from File Explorer tags (keywords)
  • Auto Tag image files with AI
  • Searchable notes for files and folders
  • Option to tag folders instead of directly tagging files
  • Retrieval with tag intersection and exclusion
  • Advanced search by combining tags using logical operations
  • Link related files and folders
  • Powerful built-in previewer, hover over video to scrub
  • Sync tags across machines
  • Scripting support

Ritt has free and paid tiers. Please check it out if this interests you!

7694
 
 
The original post: /r/datahoarder by /u/Minighost244 on 2024-08-04 23:59:53.

The situation is that I have a home server (Installed with Windows, currently) and I want to put Immich on it. I don't really want to pay for a cloud solution, as the whole point of the home server was to avoid paying for Google Photos.

Right now, I have the server set up with 4 1TB HDDs. The 3-2-1 rule specifies that a copy of the data should live on a separate machine, but would copying the data to one of the HDDs, then unplugging it, be enough? Or is having a cloud destination / another machine simply unavoidable?

If copying the data, then unplugging the HDD, is enough, could I automate the process by unmounting the drive instead? Or is unmounting not close enough to unplugging the drive?

Thanks for any replies! Sorry if these questions are kind of silly, I just couldn't find straight answers to them anywhere.

7695
 
 
The original post: /r/datahoarder by /u/gxvicyxkxa on 2024-08-04 21:14:08.

Like everyone here, I've trawled serverpartdeals and diskprices and the prices are good and acceptable, etc.

However, I'm not in the states and shipping and customs to my country are hiking the prices up between 50% and 70%.

I've come across a local seller who has a few that I could possibly haggle down. Seller says they've been running a couple of years, all seems good, and he's willing to provide SMART data for each if requested (I haven't seen any yet, though).

Prices for various drives are between €10 and €12 per terabyte.

I'm in a bit of a panic because my two largest drives just failed under scrutiny (both passed smartctl, though). The idea is to have the important stuff (docs, backups, photos, which are also 3-2-1 backed up) on my good drives, and the more expendable, easily replaced stuff (movies, TV, games) on the less favourable ones. So it's not the end of the world if they fail.

Is there a percentage discount any of you would accept under these conditions? Assuming smart data of the seller's drives come back relatively clean, I can ideally pick and choose by serial number.

Or is it just a bad idea all around?

7696
 
 
The original post: /r/datahoarder by /u/nothingveryobvious on 2024-08-04 18:30:00.

Is anyone using gallery-dl and successfully scraping Instagram without getting warnings for automated behavior? If so, could you please share your sleep settings and any other important settings? I use the following, but I still get warnings about automated behavior:

"sleep-request": [30.0, 60.0], "sleep-429": [60.0, 90.0], "sleep": [30.0, 60.0], "sleep-extractor": [30.0, 60.0],

I'd appreciate any help. Thank you!

Or if you use some other software, please let me know. I used Instaloader before this but I got warnings. I thought gallery-dl would be better because you could edit these settings, but it's still not helping. Thanks again.

7697
 
 
The original post: /r/datahoarder by /u/Haenni01 on 2024-08-04 17:39:49.

Hey everybody,

I have a slightly different question. The thing is not that I'm trying to start hoarding; the thing is that I've lost control.

Not in terms of numbers: my data consists of around 4 TB of pure data, but the most important stuff (around 400 GB of personal pictures, memories, and documents) is out of control. It is backed up in different versions, encrypted on different drives, and the originals are spread across different locations and PCs. I'm trying to get back to a centralized system, but I do not know where to start or how to sort the data.

The data should stay encrypted the whole time, and preferably not be sent over a network connection, so a NAS as the central organizing point is not an option.

Also, my biggest drive is 2 TB, and the versioning of the backups leads to around 10 TB of data (those 400 GB spread across all devices).

How and where would you start? Thanks in advance for your inspirations and ideas :)

7698
 
 
The original post: /r/datahoarder by /u/Hayrianil on 2024-08-04 17:20:53.

Hi, I have a WD My Cloud Expert EX4100 and a 6TB WD Elements. I am thinking about shucking the Elements and putting the white-label 6TB drive inside my EX4100. I am using JBOD, not a RAID configuration. Will this work?

7699
 
 
The original post: /r/datahoarder by /u/Sarnuxe0 on 2024-08-04 17:00:21.

I just bought a new laptop, and I want to download all the games I want; I also have a lot of documents, plus I will do video editing, so I am in search of a good SSD.

I would like a good SSD that will last the lifespan of this laptop and then some. I have seen deals on DRAM-less SSDs (Crucial P3 - 200 EUR, Lexar NM790 - 250 EUR), but the SSDs with DRAM (WD SN850X - 295 EUR) are quite a bit more expensive than the P3 (95 EUR more).

I really want good lifespan, but the price difference is big (an extra half of the P3's price). Is it worth it? What will the lifespan of the SSD with DRAM be versus the DRAM-less one?

What's going to be the difference in TBW?

Btw - I am looking at amazon.de prices...

7700
 
 
The original post: /r/datahoarder by /u/aglareb on 2024-08-04 20:31:55.

Hi everybody,

I recently (2 days ago) got into homelab and data hoarding by purchasing a used Intel S2600GX blade server and a Compellent SC2 JBOD.

I've been having trouble figuring out the specs of the JBOD; it does not have any product labels that I can discern. It does have two Compellent SC2 EMM controllers in it, and while there are plenty for sale second-hand, I cannot find any manuals for them online.

Are these related to the Compellent SC220 or SC200 controllers? Does anyone have a manual for them? Is there a minimum number of drives I need to put in this thing (it has 24 slots)? Is there a maximum amount of storage it can address?

I'd appreciate any advice you guys have.

All the best,

aglareb
