The original post: /r/datahoarder by /u/BigMickDo on 2024-08-30 21:25:30.

Hey, I have a large number of JSON and HTML pages (close to a million) that I've scraped and don't feel like processing right now, but keeping them all in raw form is starting to become a problem: there are a lot of files and it's space-inefficient.

I'm looking for ideas. Things I've considered:

  1. Store the few metadata fields I need (like page ID) and just zip the files until I decide to parse them (sketched just after this list).
  2. Append them to a Parquet file monthly to compact them (also sketched below).
  3. Throw them in a DB with the important metadata modeled as columns, but store the raw JSON/HTML as text so it can be compressed. I'd want a DB that's storage-efficient and doesn't corrupt easily.
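
For option 1, a minimal sketch in Python: write a small CSV manifest of the metadata (page ID, here just taken from the filename) alongside a deflate-compressed zip of the raw files. The directory and file names are made up for illustration.

```python
import csv
import pathlib
import zipfile

RAW_DIR = pathlib.Path("raw_pages")            # hypothetical dump directory
ARCHIVE = "pages-2024-08.zip"
MANIFEST = "pages-2024-08.csv"

paths = sorted(RAW_DIR.glob("*.json")) + sorted(RAW_DIR.glob("*.html"))

with open(MANIFEST, "w", newline="") as f, \
        zipfile.ZipFile(ARCHIVE, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    writer = csv.writer(f)
    writer.writerow(["page_id", "archived_name"])   # metadata kept outside the zip
    for p in paths:
        zf.write(p, arcname=p.name)                 # store flat inside the archive
        writer.writerow([p.stem, p.name])           # stem doubles as page ID here
```

One caveat: zip compresses each entry independently, so a million small pages won't compress as well as a solid archive (e.g. tar + zstd) that compresses across files.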
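
For option 2, note that a Parquet file can't be appended to in place; the usual pattern is to write one new file per month, streaming row groups out in batches. A minimal sketch assuming pyarrow, with made-up paths and a guessed schema:

```python
import pathlib

import pyarrow as pa
import pyarrow.parquet as pq

RAW_DIR = pathlib.Path("raw_pages")            # hypothetical dump directory
OUT = "pages-2024-08.parquet"                  # one file per month

schema = pa.schema([
    ("page_id", pa.string()),
    ("kind", pa.string()),                     # 'json' or 'html'
    ("body", pa.string()),                     # raw page kept verbatim
])

def batches(paths, size=10_000):
    # Yield record batches so the whole corpus never sits in memory at once.
    rows = []
    for p in paths:
        rows.append({
            "page_id": p.stem,                 # filename doubles as page ID here
            "kind": p.suffix.lstrip("."),
            "body": p.read_text(errors="replace"),
        })
        if len(rows) >= size:
            yield pa.RecordBatch.from_pylist(rows, schema=schema)
            rows = []
    if rows:
        yield pa.RecordBatch.from_pylist(rows, schema=schema)

paths = sorted(RAW_DIR.glob("*.json")) + sorted(RAW_DIR.glob("*.html"))
with pq.ParquetWriter(OUT, schema, compression="zstd") as writer:
    for batch in batches(paths):
        writer.write_batch(batch)
```

Columnar storage helps here: all the page bodies land in one column, so zstd compresses across many similar pages in each row group.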

For the third option I could maybe use something like a SQLite archive (sqlar), DuckDB, or Postgres (for its TOAST compression); a sketch of the SQLite route follows.
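
A minimal sketch of that third option, using SQLite with metadata columns plus a zlib-compressed BLOB for the raw page (SQLite's sqlar format does essentially this). Table, column, and helper names here are my own invention:

```python
import sqlite3
import zlib

con = sqlite3.connect("pages.db")
con.execute("""
    CREATE TABLE IF NOT EXISTS pages (
        page_id    TEXT PRIMARY KEY,
        fetched_at TEXT,
        kind       TEXT,    -- 'json' or 'html'
        body       BLOB     -- zlib-compressed raw page
    )
""")

def put(page_id, fetched_at, kind, raw_text):
    # Compress on the way in; the metadata columns stay queryable.
    con.execute(
        "INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?)",
        (page_id, fetched_at, kind, zlib.compress(raw_text.encode("utf-8"), level=9)),
    )

def get(page_id):
    # Decompress on the way out.
    row = con.execute("SELECT body FROM pages WHERE page_id = ?", (page_id,)).fetchone()
    return zlib.decompress(row[0]).decode("utf-8") if row else None

put("abc123", "2024-08-30", "html", "<html>...</html>")
con.commit()
print(get("abc123"))
```

A single-file SQLite DB is easy to checksum and back up, and WAL mode (PRAGMA journal_mode=WAL) is the commonly recommended setting for better crash resilience and concurrent reads.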
