The original post: /r/datahoarder by /u/BigMickDo on 2024-08-30 21:25:30.

Hey, I have a large number of JSON and HTML pages (close to a million) that I've scraped and don't feel like processing right now, but keeping them all in raw form is starting to become a problem: there are a lot of files and it's space-inefficient.

I'm looking for ideas. Things I've considered:

  1. Store the few metadata fields I need (like page ID) and just zip the files until I decide to parse them (sketched just after this list).
  2. Append them to a Parquet file monthly to compact them (also sketched below).
  3. Throw them in a DB with the important metadata modeled as columns, but store the raw JSON/HTML as text so it can be compressed. I'd want a DB that's storage-efficient and doesn't corrupt easily.
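
For option 1, a minimal sketch in Python: write a small CSV manifest of the metadata (page ID, here just taken from the filename) alongside a deflate-compressed zip of the raw files. The directory and file names are made up for illustration.

```python
import csv
import pathlib
import zipfile

RAW_DIR = pathlib.Path("raw_pages")            # hypothetical dump directory
ARCHIVE = "pages-2024-08.zip"
MANIFEST = "pages-2024-08.csv"

paths = sorted(RAW_DIR.glob("*.json")) + sorted(RAW_DIR.glob("*.html"))

with open(MANIFEST, "w", newline="") as f, \
        zipfile.ZipFile(ARCHIVE, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    writer = csv.writer(f)
    writer.writerow(["page_id", "archived_name"])   # metadata kept outside the zip
    for p in paths:
        zf.write(p, arcname=p.name)                 # store flat inside the archive
        writer.writerow([p.stem, p.name])           # stem doubles as page ID here
```

One caveat: zip compresses each entry independently, so a million small pages won't compress as well as a solid archive (e.g. tar + zstd) that compresses across files.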
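
For option 2, note that a Parquet file can't be appended to in place; the usual pattern is to write one new file per month, streaming row groups out in batches. A minimal sketch assuming pyarrow, with made-up paths and a guessed schema:

```python
import pathlib

import pyarrow as pa
import pyarrow.parquet as pq

RAW_DIR = pathlib.Path("raw_pages")            # hypothetical dump directory
OUT = "pages-2024-08.parquet"                  # one file per month

schema = pa.schema([
    ("page_id", pa.string()),
    ("kind", pa.string()),                     # 'json' or 'html'
    ("body", pa.string()),                     # raw page kept verbatim
])

def batches(paths, size=10_000):
    # Yield record batches so the whole corpus never sits in memory at once.
    rows = []
    for p in paths:
        rows.append({
            "page_id": p.stem,                 # filename doubles as page ID here
            "kind": p.suffix.lstrip("."),
            "body": p.read_text(errors="replace"),
        })
        if len(rows) >= size:
            yield pa.RecordBatch.from_pylist(rows, schema=schema)
            rows = []
    if rows:
        yield pa.RecordBatch.from_pylist(rows, schema=schema)

paths = sorted(RAW_DIR.glob("*.json")) + sorted(RAW_DIR.glob("*.html"))
with pq.ParquetWriter(OUT, schema, compression="zstd") as writer:
    for batch in batches(paths):
        writer.write_batch(batch)
```

Columnar storage helps here: all the page bodies land in one column, so zstd compresses across many similar pages in each row group.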

For the third option I could maybe use something like a SQLite archive (sqlar), DuckDB, or Postgres (for its TOAST compression); a sketch of the SQLite route follows.
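
A minimal sketch of that third option, using SQLite with metadata columns plus a zlib-compressed BLOB for the raw page (SQLite's sqlar format does essentially this). Table, column, and helper names here are my own invention:

```python
import sqlite3
import zlib

con = sqlite3.connect("pages.db")
con.execute("""
    CREATE TABLE IF NOT EXISTS pages (
        page_id    TEXT PRIMARY KEY,
        fetched_at TEXT,
        kind       TEXT,    -- 'json' or 'html'
        body       BLOB     -- zlib-compressed raw page
    )
""")

def put(page_id, fetched_at, kind, raw_text):
    # Compress on the way in; the metadata columns stay queryable.
    con.execute(
        "INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?)",
        (page_id, fetched_at, kind, zlib.compress(raw_text.encode("utf-8"), level=9)),
    )

def get(page_id):
    # Decompress on the way out.
    row = con.execute("SELECT body FROM pages WHERE page_id = ?", (page_id,)).fetchone()
    return zlib.decompress(row[0]).decode("utf-8") if row else None

put("abc123", "2024-08-30", "html", "<html>...</html>")
con.commit()
print(get("abc123"))
```

A single-file SQLite DB is easy to checksum and back up, and WAL mode (PRAGMA journal_mode=WAL) is the commonly recommended setting for better crash resilience and concurrent reads.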
