It's A Digital Disease!


This is a sub that aims at bringing data hoarders together to share their passion with like minded people.

3601
 
 
The original post: /r/datahoarder by /u/BlueeWaater on 2025-02-11 05:13:41.

Hey there, does anyone know of working tools or repos to scrape entire subreddits? Please let me know <3
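One common approach (a sketch, not an endorsement of any specific repo) is paging through Reddit's public JSON listing endpoint. Note that a listing only reaches back roughly the newest 1000 posts, so full historical archives usually rely on tools like bdfr or the Pushshift data dumps instead:

```python
import json
import time
import urllib.request

USER_AGENT = "sub-archiver/0.1 (example)"  # Reddit expects a descriptive User-Agent

def listing_url(subreddit, after=None, limit=100):
    """Build the public JSON listing URL for a subreddit's /new feed."""
    url = f"https://www.reddit.com/r/{subreddit}/new.json?limit={limit}"
    if after:
        url += f"&after={after}"
    return url

def scrape_subreddit(subreddit, max_pages=10):
    """Yield post dicts page by page, following the 'after' cursor."""
    after = None
    for _ in range(max_pages):
        req = urllib.request.Request(listing_url(subreddit, after),
                                     headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(req) as resp:
            data = json.load(resp)["data"]
        for child in data["children"]:
            yield child["data"]          # title, selftext, author, created_utc, ...
        after = data["after"]
        if after is None:                # no more pages
            break
        time.sleep(2)                    # stay well under rate limits

if __name__ == "__main__":
    for post in scrape_subreddit("DataHoarder", max_pages=1):
        print(post["title"])
```

The `max_pages` and sleep values are arbitrary placeholders; tune them to the API's current rate limits.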

3602
 
 
The original post: /r/datahoarder by /u/Slouchingtowardsbeth on 2025-02-11 05:03:54.

I am interested in putting together an offline educational sandbox so my kids can access the internet, but only educational stuff. It would be especially useful when we are overseas in places where it's difficult to access the internet anyway. What would you suggest I add to this? Wikipedia, KA Lite, Gutenberg, what else? Thanks for any ideas.

3603
 
 
The original post: /r/datahoarder by /u/users626 on 2025-02-11 04:47:48.

I purchased an aluminum Orico enclosure for the 20TB Seagate IronWolf drive I just got, to start digitizing my physical movie library. I've been having issues where MakeMKV tells me writing has failed, writing has timed out, or it just won't work. I've attributed it to the external drive, since writing to the SSD inside my PC works fine. When transferring files from the internal SSD to the external drive, it sometimes takes minutes, sometimes more than an hour, and sometimes doesn't finish at all. A lot of the time writing maxes out at 5 MB/s. The disk reads as healthy, so I'm left with trying a new enclosure, but everything I'm seeing on Amazon is some no-name brand that comes with a warning that "this item is frequently returned". They all seem shoddy, like I'll hit the same issue and have to go through a repeat cycle of return and rebuy. I can't justify a QNAP TR-004 at the moment, although I think I would eventually get one after I hit three drives. I only have one drive, but it feels like that's also the only real option.

What is a reliable drive enclosure you can recommend, so I can replace this one and not go through this repeatedly?

3604
 
 
The original post: /r/datahoarder by /u/-ThatGingerKid- on 2025-02-11 03:54:17.

I've got roughly 30,000 images of my wife's from the last several years that I'm trying to sort through so I can put the photos on our Immich server. Problem is, the naming scheme for the memes she's downloaded or screenshotted over the years is so similar to the naming scheme for the photos on the various devices she's used, I have no idea how to simplify the process of separating the two. Any ideas?
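One cheap first pass is sorting by filename pattern, since camera photos tend to follow device naming schemes while screenshots announce themselves. The patterns below are assumptions to check against a sample of the real filenames; EXIF presence (camera photos carry Make/Model tags, downloaded memes usually do not) makes a more reliable second pass, but needs Pillow or exiftool:

```python
import re

# Hypothetical patterns -- verify them against a sample of the actual filenames first.
CAMERA_PATTERNS = [
    re.compile(r"^IMG_\d{4}", re.IGNORECASE),        # iPhone and many Androids
    re.compile(r"^PXL_\d{8}_\d+", re.IGNORECASE),    # Google Pixel
    re.compile(r"^DSC[_]?\d{4,}", re.IGNORECASE),    # many standalone cameras
    re.compile(r"^\d{8}_\d{6}"),                     # 20230115_134501.jpg style
]
SCREENSHOT_PATTERNS = [
    re.compile(r"^Screenshot", re.IGNORECASE),
    re.compile(r"^Screen[ _]?Shot", re.IGNORECASE),
]

def classify(filename):
    """Return 'camera', 'screenshot', or 'unknown' based on filename alone."""
    for pat in SCREENSHOT_PATTERNS:
        if pat.match(filename):
            return "screenshot"
    for pat in CAMERA_PATTERNS:
        if pat.match(filename):
            return "camera"
    return "unknown"
```

Anything left as `unknown` (randomly named downloads, renamed memes) can then be triaged by hand or by EXIF.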

3605
 
 
The original post: /r/datahoarder by /u/Euphoric-World6339 on 2025-02-11 03:35:47.

Context: I am losing faith in the Wayback Machine, as in my experience it's not saving images the way it used to. For example, it's hard to find deleted DeviantArt works, because they won't load in the "new" versions of the DeviantArt site (anything after 2021; I can't see the artworks uploaded after that year), and I've noticed they're in some legal trouble right now. So, as a precaution: is there a way to save the archived sites locally, so they work without an internet connection?
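For pulling archived copies down locally, the Wayback Machine's CDX API lists every capture of a URL, and each capture can then be fetched raw via the `id_` URL flag (tools like wayback-machine-downloader automate the whole loop). A minimal sketch of the two URL shapes involved:

```python
import urllib.parse

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def cdx_query_url(url, limit=100):
    """URL that lists Wayback captures of `url` as JSON rows of
    (urlkey, timestamp, original, mimetype, statuscode, digest, length)."""
    params = urllib.parse.urlencode({"url": url, "output": "json", "limit": limit})
    return f"{CDX_ENDPOINT}?{params}"

def snapshot_url(timestamp, original_url):
    """Direct URL for one capture; the `id_` suffix asks for the raw,
    unrewritten bytes, which is what you want for a local mirror."""
    return f"https://web.archive.org/web/{timestamp}id_/{original_url}"
```

Fetch the CDX listing, then download each `(timestamp, original)` pair via `snapshot_url` and save it to disk; the result browses fine offline.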

3606
 
 
The original post: /r/datahoarder by /u/fmillion on 2025-02-11 03:34:19.

Title is the gist of it, but here are a few specifics:

  • Ideal size is 4 bays. 6 is also OK. But 8 and beyond will probably make the case bigger than I'm hoping for.
  • Bays should allow SAS drives. I plan to wire the trays to an LSI card using SFF to 4x "SATA" cable(s). Some NAS enclosure bays don't have the notch punched out, so SAS drives can't be physically installed.
  • Trayless design is strongly preferred. I partly want to use this as a portable multipurpose NAS so being able to swap drives quickly without needing to unscrew/screw trays would be very useful.
    • A suitable alternative would be tool-less trays where the drives can be swapped without screwdrivers.
  • Support a Mini ITX board with a heatsink/fan.
  • Ideally, an internal 2.5" SSD bay for the boot drive. I can use an internal USB header in a pinch though.
  • PSU should be able to easily handle all 4 drives.
  • No need for GPU support, I'm using the single PCIe slot for the SAS HBA.
  • Price - ideally no more than $100 but would go higher if it's got enough cool features.

Thoughts?

3607
 
 
The original post: /r/datahoarder by /u/the_auti on 2025-02-11 03:04:47.

So I know there are Ceph/Ozone/MinIO/Gluster/Garage/etc. out there.

I have used them all, and they all seem to fall short for an SMB production or homelab application.

I have started developing a simple object store that implements the core required functionality without the complexities of Ceph (since it is the only one that actually works).

Would anyone be interested in something like this?

Please see my implementation plan and progress.

# Distributed S3-Compatible Storage Implementation Plan

## Phase 1: Core Infrastructure Setup

### 1.1 Project Setup

- [x] Initialize Go project structure

- [x] Set up dependency management (go modules)

- [x] Create project documentation

- [x] Set up logging framework

- [x] Configure development environment

### 1.2 Gateway Service Implementation

- [x] Create basic service structure

- [x] Implement health checking

- [x] Create S3-compatible API endpoints

- [x] Basic operations (GET, PUT, DELETE)

- [x] Metadata operations

- [x] Data storage/retrieval with proper ETag generation

- [x] HeadObject operation

- [x] Multipart upload support

- [x] Bucket operations

- [x] Bucket creation

- [x] Bucket deletion verification

- [x] Implement request routing

- [x] Router integration with retries and failover

- [x] Placement strategy for data distribution

- [x] Parallel replication with configurable MinWrite

- [x] Add authentication system

- [x] Basic AWS v4 credential validation

- [x] Complete AWS v4 signature verification

- [x] Create connection pool management
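The "Complete AWS v4 signature verification" item above boils down to the HMAC-SHA256 key-derivation chain defined by the SigV4 spec. The project is Go, but the chain is small enough to sketch in Python; the logic ports over directly:

```python
import hashlib
import hmac

def _hmac(key, msg):
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()

def derive_signing_key(secret_key, date_stamp, region, service):
    """SigV4 key derivation: chained HMACs over date (YYYYMMDD),
    region, service, and the fixed 'aws4_request' terminator."""
    k_date = _hmac(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _hmac(k_date, region)
    k_service = _hmac(k_region, service)
    return _hmac(k_service, "aws4_request")

def sign(signing_key, string_to_sign):
    """Final signature: lowercase hex HMAC of the canonical string-to-sign."""
    return hmac.new(signing_key, string_to_sign.encode("utf-8"),
                    hashlib.sha256).hexdigest()
```

Verification on the gateway side is the same computation: rebuild the canonical request and string-to-sign from the incoming headers, derive the key from the stored secret, and compare signatures with a constant-time compare.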

### 1.3 Metadata Service

- [x] Design metadata schema

- [x] Implement basic CRUD operations

- [x] Add cluster state management

- [x] Create node registry system

- [x] Set up etcd integration

- [x] Cluster configuration

- [x] Connection management

## Phase 2: Data Node Implementation

### 2.1 Storage Management

- [x] Create drive management system

- [x] Drive discovery

- [x] Space allocation

- [x] Health monitoring

- [x] Actual data storage implementation

- [x] Implement data chunking

- [x] Chunk size optimization (8MB)

- [x] Data validation with SHA-256 checksums

- [x] Actual chunking implementation with manifest files

- [x] Add basic failure handling

- [x] Drive failure detection

- [x] State persistence and recovery

- [x] Error handling for storage operations

- [x] Data recovery procedures

### 2.2 Data Node Service

- [x] Implement node API structure

- [x] Health reporting

- [x] Data transfer endpoints

- [x] Management operations

- [x] Add storage statistics

- [x] Basic metrics

- [x] Detailed storage reporting

- [x] Create maintenance operations

- [x] Implement integrity checking

### 2.3 Replication System

- [x] Create replication manager structure

- [x] Task queue system

- [x] Synchronous 2-node replication

- [x] Asynchronous 3rd node replication

- [x] Implement replication queue

- [x] Add failure recovery

- [x] Recovery manager with exponential backoff

- [x] Parallel recovery with worker pools

- [x] Error handling and logging

- [x] Create consistency checker

- [x] Periodic consistency verification

- [x] Checksum-based validation

- [x] Automatic repair scheduling

## Phase 3: Distribution and Routing

### 3.1 Data Distribution

- [x] Implement consistent hashing

- [x] Virtual nodes for better distribution

- [x] Node addition/removal handling

- [x] Key-based node selection

- [x] Create placement strategy

- [x] Initial data placement

- [x] Replica placement with configurable factor

- [x] Write validation with minCopy support

- [x] Add rebalancing logic

- [x] Data distribution optimization

- [x] Capacity checking

- [x] Metadata updates

- [x] Implement node scaling

- [x] Basic node addition

- [x] Basic node removal

- [x] Dynamic scaling with data rebalancing

- [x] Create data migration tools

- [x] Efficient streaming transfers

- [x] Checksum verification

- [x] Progress tracking

- [x] Failure handling
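The consistent-hashing items above (virtual nodes, node addition/removal, key-based selection) can be sketched like this; again Python for brevity rather than the project's Go:

```python
import bisect
import hashlib

class HashRing:
    """Consistent hash ring with virtual nodes. A key maps to the first
    node point clockwise from the key's hash, so adding or removing a
    node only remaps the keys in that node's arcs."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._points = []          # sorted list of (hash, node)
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def add_node(self, node):
        for i in range(self.vnodes):   # one point per virtual node
            bisect.insort(self._points, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._points = [(h, n) for h, n in self._points if n != node]

    def get_node(self, key):
        h = self._hash(key)
        idx = bisect.bisect(self._points, (h,)) % len(self._points)
        return self._points[idx][1]
```

Replica placement then walks clockwise from `get_node` to the next distinct nodes until the replication factor is met.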

### 3.2 Request Routing

- [x] Implement routing logic

- [x] Route requests based on placement strategy

- [x] Handle read/write request routing differently

- [x] Support for bulk operations

- [x] Add load balancing

- [x] Monitor node load metrics

- [x] Dynamic request distribution

- [x] Backpressure handling

- [x] Create failure detection

- [x] Health check system

- [x] Timeout handling

- [x] Error categorization

- [x] Add automatic failover

- [x] Node failure handling

- [x] Request redirection

- [x] Recovery coordination

- [x] Implement retry mechanisms

- [x] Configurable retry policies

- [x] Circuit breaker pattern

- [x] Fallback strategies
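The retry items above (configurable policies, circuit breaker, fallback) combine roughly as in this minimal Python sketch; the threshold and cooldown values are arbitrary placeholders:

```python
import time

class CircuitBreaker:
    """After `threshold` consecutive failures the circuit opens and calls
    fail fast until `cooldown` seconds pass, then one trial is allowed."""

    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        return self.clock() - self.opened_at >= self.cooldown  # half-open trial

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()

def call_with_retry(fn, breaker, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Retry with exponential backoff, failing fast while the circuit is open."""
    for attempt in range(attempts):
        if not breaker.allow():
            raise RuntimeError("circuit open")
        try:
            result = fn()
            breaker.record_success()
            return result
        except Exception:
            breaker.record_failure()
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

Injecting `clock` and `sleep` keeps the policy deterministic under test, which matters once chaos testing enters the picture.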

## Phase 4: Consistency and Recovery

### 4.1 Consistency Implementation

- [x] Set up quorum operations

- [x] Implement eventual consistency

- [x] Add version tracking

- [x] Create conflict resolution

- [x] Add repair mechanisms

### 4.2 Recovery Systems

- [x] Implement node recovery

- [x] Create data repair tools

- [x] Add consistency verification

- [x] Implement backup systems

- [x] Create disaster recovery procedures

## Phase 5: Management and Monitoring

### 5.1 Administration Interface

- [x] Create management API

- [x] Implement cluster operations

- [x] Add node management

- [x] Create user management

- [x] Add policy management

### 5.2 Monitoring System

- [x] Set up metrics collection

- [x] Performance metrics

- [x] Health metrics

- [x] Usage metrics

- [x] Implement alerting

- [x] Create monitoring dashboard

- [x] Add audit logging

## Phase 6: Testing and Deployment

### 6.1 Testing Implementation

- [x] Create initial unit tests for storage

- [-] Create remaining unit tests

- [x] Router tests (router_test.go)

- [x] Distribution tests (hash_ring_test.go, placement_test.go)

- [x] Storage pool tests (pool_test.go)

- [x] Metadata store tests (store_test.go)

- [x] Replication manager tests (manager_test.go)

- [x] Admin handlers tests (handlers_test.go)

- [x] Config package tests (config_test.go, types_test.go, credentials_test.go)

- [x] Monitoring package tests

- [x] Metrics tests (metrics_test.go)

- [x] Health check tests (health_test.go)

- [x] Usage statistics tests (usage_test.go)

- [x] Alert management tests (alerts_test.go)

- [x] Dashboard configuration tests (dashboard_test.go)

- [x] Monitoring system tests (monitoring_test.go)

- [x] Gateway package tests

- [x] Authentication tests (auth_test.go)

- [x] Core gateway tests (gateway_test.go)

- [x] Test helpers and mocks (test_helpers.go)

- [ ] Implement integration tests

- [ ] Add performance tests

- [ ] Create chaos testing

- [ ] Implement load testing

### 6.2 Deployment

- [x] Create Makefile for building and running

- [x] Add configuration management

- [ ] Implement CI/CD pipeline

- [ ] Create container images

- [x] Write deployment documentation

## Phase 7: Documentation and Optimization

### 7.1 Documentation

- [x] Create initial README

- [x] Write basic deployment guides

- [ ] Create API documentation

- [ ] Add troubleshooting guides

- [x] Create architecture documentation

- [ ] Write detailed user guides

### 7.2 Optimization

- [ ] Perform performance tuning

- [ ] Optimize resource usage

- [ ] Improve error handling

- [ ] Enhance security

- [ ] Add performance monitoring

## Technical Specifications

### Storage Requirements

- Total Capacity: 150TB+

- Object Size Range: 4MB - 250MB

- Replication Factor: 3x

- Write Confirmation: 2/3 nodes

- Nodes: 3 initial (1 remote)

- Drives per Node: 10

### API Requirements

- S3-compatible API

- Support for standard S3 operations

- Authentication/Authorization

- Multipart upload support

### Performance Goals

- Write latency: Confirmation after 2/3 nodes

- Read consistency: Eventually consistent

- Scalability: Support for node addition/removal

- Availability: Tolerant to single node failure
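The 2-of-3 write confirmation above can be sketched like this; a sequential Python toy, whereas the plan replicates in parallel with the third copy caught up asynchronously:

```python
def quorum_write(replica_writers, min_write=2):
    """Attempt the write on each replica; confirm to the client once
    `min_write` acknowledgements arrive (the plan's 2-of-3 rule).
    Replicas that missed the write are repaired asynchronously."""
    acks, failures = 0, []
    for write in replica_writers:
        try:
            write()
            acks += 1
        except Exception as exc:
            failures.append(exc)
        if acks >= min_write:
            return True          # durable enough to acknowledge
    return False                 # quorum not reached; surface an error
```

This is also why the read side can be eventually consistent: a read may land on the one replica that has not yet caught up.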

Feel free to tear me apart and tell me I am stupid, or, as I would prefer, provide some constructive feedback.

3608
 
 
The original post: /r/datahoarder by /u/Zuluuk1 on 2025-02-11 02:16:57.

I ran out of space on my 3x12TB cluster. I need to buy something that's 12TB or bigger, and I can't seem to find anything from a reputable company. I tried eBay, but I really want to avoid it if I can; the drives sometimes carry no warranty and are priced similarly to stores that offer 2-3 years of warranty.

I was considering taking my parity drive and turning it into a data drive just to have that extra space. It's such a bad idea though.

Are 12TB refurbished drives running out? Should I wait a bit longer for something bigger to be retired from the data centers?

The American market has plenty of places that sell refurbished drives.

What are you all doing?

I live in Ireland, and most if not all sellers charge a €30 premium for delivery.

Please share any decent store that offers a decent warranty and price.

3609
 
 
The original post: /r/datahoarder by /u/SuperWog7 on 2025-02-11 00:21:08.

Hello everyone. I know several people have already asked questions like this, but I actually tried for an hour and didn't find any way of extracting the 3D GLB of this ring: https://www.bulgari.com/ar-ae/AN859006.html. I looked at the network tab and such, but found nothing except a CORS restriction on the link that may actually contain the 3D GLB file. Am I doing something wrong?

3610
 
 
The original post: /r/datahoarder by /u/JaVelin-X- on 2025-02-11 00:14:30.

Nothing groundbreaking here, but I'm worried it doesn't work like I think it does. I generate a lot of files these days, but I only need to keep them for a few years. I've been buying small 1 or 2TB drives to offload my computers onto. So I decided to give Google Drive a try (I already use it, but not local folders, or rather Drive for desktop, until now).

I have a working folder locally that syncs with Google Drive. When I think it's time, I move all the files on Google Drive (the actual synced folder) to a new folder (for longer-term storage), delete all the local files, and start over again.

Today I did exactly that, and I'm not sure what's going on: the (Drive) files I moved to a new folder started winding up in the trash folder (in Drive). I think it happened before the move operation was done cooking. Is that right? I thought the move operation would be almost instantaneous.

I panicked a bit and started restoring some of the trashed files, but I'm taking a breath. It unnerves me that there isn't a progress indicator to show what is happening.

I'm leery of the do-it-all-for-you backup programs. I'm much more a pick-the-apple-up-and-put-it-in-a-new-box-myself person.

I figured you folks on here would really know how this works.

Oh... it looks like Sync is restoring my local drive now, to boot :/

3611
 
 
The original post: /r/datahoarder by /u/Misaria on 2025-02-11 00:04:47.

I'm amazed how good VHS looks after all these years; didn't expect that!

Seems like my tapes are still in good condition because I was expecting something blurry and distorted.

Though I need some help if anyone can clear it up for me.

I'm using VirtualDub2 and it defaults to capturing PAL in 50fps.

I read that you should capture in 25fps and then deinterlace it by doubling the frames.

Now I read that you should capture in 50fps and deinterlace it down to 25fps.

Which one is it?

I started capturing in 50fps, captured a couple of tapes, and today I deleted the results because I thought I was doing it wrong.

I've now recaptured one of the tapes and two others in 25fps but maybe I've messed up.

3612
 
 
The original post: /r/datahoarder by /u/Blackwater_7 on 2025-02-10 22:24:09.

Lowkey I was having discomfort with my low remaining space, but now I've cleared some trash and wow, it feels like I bought a new 8TB drive, lol. Now I'm thinking about what I can download next.

I know hoarding feels good, but sometimes you just need to take out the trash. You will feel better, trust me.

However, if your content is 100% curated and important, of course this doesn't apply to you.

3613
 
 
The original post: /r/datahoarder by /u/Rich-Junket4755 on 2025-02-10 21:48:26.

I mainly use Windows PC for my gaming, photography, video editing, etc.

I don't carry my Macbook Air anymore.

I take large file videos when I travel. I plan to bring my iPad so I can transfer from SD Card, to iPad Pro (256 GB), then to my WD External Hard Drive.

Is there a new hard drive format I should be using that I'm not aware of? Or an alternative solution?

I have an Insta360, so I take very long videos. I only have 2x128GB and 1x64GB cards, so I anticipate I'll run out of room fast and will need to transfer during my travels. I'm not worried about photography; I should have lots of space.

Thanks in advance.

3614
 
 
The original post: /r/datahoarder by /u/ZucchiniBeautiful493 on 2025-02-10 21:20:55.
3615
 
 
The original post: /r/datahoarder by /u/JLJFan9499 on 2025-02-10 21:16:23.

Started with buying an external 4TB USB hard drive, and now I've ordered a NAS and a 6TB hard drive for starters, for 377.38 euros. I had been told before that whatever you post online stays there, and I've now realized that isn't true. I'm mainly going to collect various media: games, movies, PDF files, music, etc. Stuff that I care about. I'm also going to preserve my own creative output so that it will be accessible in the future.

Never imagined I would start doing this but anything is possible.

3616
 
 
The original post: /r/datahoarder by /u/Raven_Drakeaurd on 2025-02-10 21:12:03.

I know it can be rather impractical, but I have a specific project in mind that would require such a thing.

Any and all advice is appreciated!

3617
 
 
The original post: /r/datahoarder by /u/Ind1goJoe on 2025-02-10 20:40:24.

I'm looking to upgrade my storage solution for photography and videography. Currently I'm at a mix of cloud storage and a single Seagate 1TB external HDD (I know it's horrible, but that's why I'm here).

My ideal workflow would be to get an external SSD, currently looking at a Samsung T7 Shield 2TB, that I can edit from and bring on the go, then offload it to a storage solution at home with some level of backup/redundancy. I know I want at least 2 backups that aren't cloud based, and I don't mind physically plugging in to offload my files when I need to.

I do want to keep the cost reasonable, but I also want it to be automated to some degree. I don't want to have to plug and unplug multiple drives and physically manage all of the backups if I can avoid it. And I don't necessarily need a NAS, as I will never really need to access files from outside my home or be in a situation where a DAS solution would be impossible. In a perfect world I would sit at my desk, connect my laptop and my SSD, and let some software copy everything to at least 2 independent locations, and that would be it. Then I wipe my portable drive and rinse and repeat.

So what would be my best solution for this? I'm hoping to keep the cost, aside from the portable SSD, around $300 or less, but if spending a little more is worthwhile, it's not totally out of the question. The ability to upgrade in the future would be nice as well, but my main concern is just getting SOMETHING for now that's better than what I'm using.

3618
 
 
The original post: /r/datahoarder by /u/manzurfahim on 2025-02-10 20:35:14.

I always buy drives from ServerPartDeals and they always ship the drive very securely; I've never had a problem. Recently I do not see many drives from them, but a few from the other companies mentioned above.

Just wondering if you have purchased drives from any of the sellers other than ServerPartDeals, and how their packaging is? I don't live in the US, so I buy drives online and get them shipped to a courier company, which then ships them to me in Asia, so good packaging is necessary.

3619
 
 
The original post: /r/datahoarder by /u/BetterProphet5585 on 2025-02-10 20:11:35.

I have around 20TB of photos, nested inside folders based on year and month of acquisition, while hoarding them I didn't really pay attention if they were duplicates.

I would like something local and free, possibly open-source - I have basic programming skills and know how to run stuff from a terminal, in case.

I only know or heard of:

  • dupeGuru
  • Czkawka

But I never used them.

Note that since the photos come from different devices and drives, their metadata might have gotten skewed, so the tool would have to be able to spot duplicates based on image content and not metadata.

My main concerns:

  • tool not based only on metadata
  • tool able to go through nested folders (YearFolder/MonthFolder/photo.jpg)
  • tool able to go through different formats, .HEIC included (in case this is impossible I would just convert all the photos with another tool)

Do you know a tool that can help me?

3620
 
 
The original post: /r/datahoarder by /u/No_Gur_7422 on 2025-02-10 20:10:56.

I want to extract some photographs of old documents from the website of the National Library of Ireland, but I can't make Dezoomify do it. How should I go about it?

3621
 
 
The original post: /r/datahoarder by /u/ArchonOSX on 2025-02-10 19:56:26.

Good article from Mother Jones by way of Grist original post:

https://www.motherjones.com/politics/2025/02/federal-researchers-science-archive-critical-climate-data-trump-war-dei-resist/

https://grist.org/politics/the-scramble-to-save-critical-climate-data-from-trumps-war-on-dei/

Datahoarders are assisting this effort.

Thank you for your efforts supporting democracy.

Happy Day!

3622
 
 
The original post: /r/datahoarder by /u/MaybeARunnerTomorrow on 2025-02-10 18:06:49.

Pretty much the title!

I have inherited probably 20+ boxes of family photos - of all different shapes and sizes. I have the storage space sorted out for it, but looking for some feedback or advice on what scanners are decent?

I was looking at the Epson FastFoto-FF-680W - but it does have rollers and I've seen people complain about it leaving marks or residue on some images? My local photo lab does use this for their uploads and storage customers too.

I do already have a flatbed scanner, and plan on using that for some older images (and newspaper articles), but wasn't sure if there were better options out there.

3623
 
 
The original post: /r/datahoarder by /u/Daemonix00 on 2025-02-10 17:51:59.

Why would you not choose the WD? The specs and warranty seem very similar, no?

It's going to be two of them (the 2TB model) in a mirror, serving 24/7: OS, VMs, and containers on an all-SSD/HBA PC build. These disks are not going to be abused, but they need to live a loooong time :)

https://preview.redd.it/lv0886vsncie1.png?width=862&format=png&auto=webp&s=340938a80c3329200ab8327d660e347e87f6ab4d

https://preview.redd.it/995q35vsncie1.png?width=808&format=png&auto=webp&s=a9f699400248f5a3b96ee3b50ec11f3e3a171736

3624
 
 
The original post: /r/datahoarder by /u/Sea_Mud5315 on 2025-02-10 17:46:25.

I will be taking Amtrak with a suitcase and a backpack to my parents home to retrieve my server and bring it back to my apartment in a different state. I figure it’s best to remove the drives from the case and package each of them individually. I was thinking of just using bubble wrap and tape to package them, and then throwing all of them into my book bag to store in footwell or in my lap for the ride, while placing the case with the other components into the suitcase. Any thoughts/suggestions?

3625
 
 
The original post: /r/datahoarder by /u/FamilyFuneralFun on 2025-02-10 16:21:51.

I've seen quite a few debates between the two. I'm sure this is because both are budget friendly, but what is your take in regard to specs and overall performance?
