I've been scraping certain pages daily for the past year or so, nothing too crazy, but by now I have around 50 GB and over a million text files (JSON + HTML).
I've been lazy and haven't done anything with the data yet, but given the annoying number of files at this point, I think I need to zip them up for archiving (cutting down both the file count and the total size).
The only thing I need to keep the scraper going is the file names of the pages I've already successfully downloaded.
I'm thinking of writing those names to a text file and then automatically adding the files themselves to a zip.
Looking for suggestions.
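Roughly what I'm picturing for the text-file + zip approach, as a Python sketch (the folder and file names are placeholders, not my actual setup):

    import zipfile
    from pathlib import Path

    SCRAPE_DIR = Path("scraped")        # placeholder: folder with the downloaded JSON/HTML
    ARCHIVE = Path("pages.zip")         # placeholder: the archive to append to
    MANIFEST = Path("downloaded.txt")   # running list of names the scraper should skip

    def archive_batch():
        names = []
        # "a" appends to an existing zip instead of overwriting it
        with zipfile.ZipFile(ARCHIVE, "a", compression=zipfile.ZIP_DEFLATED) as zf:
            for path in SCRAPE_DIR.iterdir():
                if path.is_file():
                    zf.write(path, arcname=path.name)
                    names.append(path.name)
                    path.unlink()  # drop the original once it's safely in the zip

        # record the names so the scraper can skip them on the next run
        with MANIFEST.open("a", encoding="utf-8") as f:
            f.writelines(name + "\n" for name in names)

    if __name__ == "__main__":
        archive_batch()

I'd probably roll over to a new zip every month or so rather than keep appending to one giant archive.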
Things I'm considering:
storing file name + file content as text in DuckDB (rough sketch after the list)
storing in Parquet
storing in a SQLite Archive
just a zip file as both compression and container
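For the DuckDB/Parquet options, something like this is what I mean (database, table, and file names are made up for the example):

    import duckdb
    from pathlib import Path

    SCRAPE_DIR = Path("scraped")            # placeholder: folder with the downloaded files

    con = duckdb.connect("pages.duckdb")    # placeholder database name
    con.execute(
        "CREATE TABLE IF NOT EXISTS pages (name TEXT PRIMARY KEY, content TEXT)"
    )

    # one row per file: the name plus the raw text
    for path in SCRAPE_DIR.iterdir():
        if path.is_file():
            con.execute(
                "INSERT OR IGNORE INTO pages VALUES (?, ?)",
                [path.name, path.read_text(errors="replace")],
            )

    # the Parquet variant is basically the same data exported to one columnar file
    con.execute("COPY pages TO 'pages.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)")

The nice part of either of those is that the million files collapse into one, and the names I need for dedup are just a SELECT away.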
SQLite seems like a good solution, but how good is its compression? I know there are add-ons like ZIPVFS (https://sqlite.org/com/zipvfs.html).
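From what I understand, a SQLite Archive is just a single sqlar table where each file's content is individually zlib/deflate-compressed, so without ZIPVFS the compression should land in roughly the same ballpark as a zip. A sketch of writing into that format directly (paths are placeholders; the sqlite3 CLI can also build these archives with its -A options):

    import sqlite3
    import zlib
    from pathlib import Path

    SCRAPE_DIR = Path("scraped")            # placeholder: folder with the downloaded files

    con = sqlite3.connect("pages.sqlar")    # placeholder archive name
    # table layout used by the SQLite Archive format
    con.execute(
        "CREATE TABLE IF NOT EXISTS sqlar ("
        "name TEXT PRIMARY KEY, mode INT, mtime INT, sz INT, data BLOB)"
    )

    for path in SCRAPE_DIR.iterdir():
        if not path.is_file():
            continue
        raw = path.read_bytes()
        compressed = zlib.compress(raw, 9)
        # the format stores the compressed blob only when it is actually smaller
        data = compressed if len(compressed) < len(raw) else raw
        st = path.stat()
        con.execute(
            "INSERT OR REPLACE INTO sqlar VALUES (?, ?, ?, ?, ?)",
            (path.name, st.st_mode, int(st.st_mtime), len(raw), data),
        )

    con.commit()
    con.close()

That would also cover the "keep going" part, since the already-downloaded names are just SELECT name FROM sqlar.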