The original post: /r/datahoarder by /u/BigMickDo on 2024-09-01 15:49:24.

I've been scraping certain pages daily for the past year or so. Nothing too crazy, but right now I have about 50 GB and over a million text files (JSON + HTML).

I've been lazy and haven't done anything with the data, but given the annoying number of files at this point, I think I need to zip them up for archiving (reducing both the file count and the total size).

The only thing I need to keep in order to continue is the names of the files I've already successfully downloaded.

I'm thinking about writing that info to a text file, then having the files themselves automatically added to a zip.
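Something like this minimal sketch is what I have in mind (the directory, archive, and manifest names are just placeholders):

```python
import os
import zipfile

SCRAPE_DIR = "scraped"   # placeholder: folder the scraper writes into
ARCHIVE = "archive.zip"  # placeholder: zip that accumulates the files
MANIFEST = "done.txt"    # placeholder: one file name per line, for resuming

# names already archived, so the scraper can skip them on the next run
done = set()
if os.path.exists(MANIFEST):
    with open(MANIFEST) as f:
        done = {line.strip() for line in f if line.strip()}

# append new files to the zip, record their names, then delete the originals
with zipfile.ZipFile(ARCHIVE, "a", compression=zipfile.ZIP_DEFLATED) as zf, \
        open(MANIFEST, "a") as manifest:
    for name in sorted(os.listdir(SCRAPE_DIR)):
        if name in done:
            continue
        zf.write(os.path.join(SCRAPE_DIR, name), arcname=name)
        manifest.write(name + "\n")
        os.remove(os.path.join(SCRAPE_DIR, name))
```

Appending to the same zip keeps everything in one container, though deleting or updating individual entries later isn't really possible without rewriting the whole archive.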

Looking for suggestions.

Things I'm considering (rough sketches of a couple of these below):

Storing file name + file content as text in DuckDB.

Storing in Parquet.

Storing in an SQLite Archive.

Just a zip file as compression and container.
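For the DuckDB/Parquet route, I picture something like this (the database, table, and output names are placeholders; I'm assuming DuckDB's own columnar compression plus a compressed Parquet export would keep the size down):

```python
import duckdb

con = duckdb.connect("pages.duckdb")  # placeholder database file
con.execute("""
    CREATE TABLE IF NOT EXISTS pages (
        name    TEXT PRIMARY KEY,  -- original file name
        content TEXT               -- raw JSON/HTML as text
    )
""")

def store(name: str, text: str) -> None:
    # insert one scraped file; the name doubles as the "already downloaded" record
    con.execute("INSERT INTO pages VALUES (?, ?)", [name, text])

# names already stored, so the scraper knows what to skip
done = {row[0] for row in con.execute("SELECT name FROM pages").fetchall()}

# optional one-off export to Parquet with zstd compression
con.execute("COPY pages TO 'pages.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)")
```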

SQLite seems like a good solution, but how good is its compression? I know there are add-ons like https://sqlite.org/com/zipvfs.html
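If the built-in storage doesn't compress enough, I could also compress each file myself before inserting, which I believe is roughly what the SQLite Archive format does with zlib. A rough sketch (the table and column names are made up):

```python
import sqlite3
import zlib

con = sqlite3.connect("pages.db")  # placeholder database file
con.execute("""
    CREATE TABLE IF NOT EXISTS pages (
        name     TEXT PRIMARY KEY,  -- original file name
        raw_size INTEGER,           -- uncompressed size in bytes
        data     BLOB               -- zlib-compressed file content
    )
""")

def store(name: str, text: str) -> None:
    raw = text.encode("utf-8")
    con.execute("INSERT OR REPLACE INTO pages VALUES (?, ?, ?)",
                (name, len(raw), zlib.compress(raw, 9)))
    con.commit()

def load(name: str) -> str:
    row = con.execute("SELECT data FROM pages WHERE name = ?", (name,)).fetchone()
    return zlib.decompress(row[0]).decode("utf-8")
```

Compressing each file individually won't be as tight as compressing the whole set together, but it keeps every page independently readable with a single query.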
