I've been scraping certain pages daily for the past year or so, nothing too crazy, but by now I have around 50 GB and over a million text files (JSON + HTML).
I've been lazy and haven't done anything with the data yet, but given the annoying number of files at this point, I think I need to zip them up for archiving (cutting down both the file count and the total size).
The only thing I need to keep the scraper going is the file names of the pages I've already successfully downloaded.
I'm thinking of writing those names to a text file and then automatically adding the files themselves to a zip.
Looking for suggestions.
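Roughly what I'm picturing for the text-file + zip approach, as a Python sketch (the folder and file names are placeholders, not my actual setup):

    import zipfile
    from pathlib import Path

    SCRAPE_DIR = Path("scraped")        # placeholder: folder with the downloaded JSON/HTML
    ARCHIVE = Path("pages.zip")         # placeholder: the archive to append to
    MANIFEST = Path("downloaded.txt")   # running list of names the scraper should skip

    def archive_batch():
        names = []
        # "a" appends to an existing zip instead of overwriting it
        with zipfile.ZipFile(ARCHIVE, "a", compression=zipfile.ZIP_DEFLATED) as zf:
            for path in SCRAPE_DIR.iterdir():
                if path.is_file():
                    zf.write(path, arcname=path.name)
                    names.append(path.name)
                    path.unlink()  # drop the original once it's safely in the zip

        # record the names so the scraper can skip them on the next run
        with MANIFEST.open("a", encoding="utf-8") as f:
            f.writelines(name + "\n" for name in names)

    if __name__ == "__main__":
        archive_batch()

I'd probably roll over to a new zip every month or so rather than keep appending to one giant archive.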
Things I'm considering:
storing file name + file content as text in DuckDB (rough sketch after the list)
storing in Parquet
storing in a SQLite Archive
just a zip file as both compression and container
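For the DuckDB/Parquet options, something like this is what I mean (database, table, and file names are made up for the example):

    import duckdb
    from pathlib import Path

    SCRAPE_DIR = Path("scraped")            # placeholder: folder with the downloaded files

    con = duckdb.connect("pages.duckdb")    # placeholder database name
    con.execute(
        "CREATE TABLE IF NOT EXISTS pages (name TEXT PRIMARY KEY, content TEXT)"
    )

    # one row per file: the name plus the raw text
    for path in SCRAPE_DIR.iterdir():
        if path.is_file():
            con.execute(
                "INSERT OR IGNORE INTO pages VALUES (?, ?)",
                [path.name, path.read_text(errors="replace")],
            )

    # the Parquet variant is basically the same data exported to one columnar file
    con.execute("COPY pages TO 'pages.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)")

The nice part of either of those is that the million files collapse into one, and the names I need for dedup are just a SELECT away.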
SQLite seems like a good solution, but how good is its compression? I know there are add-ons like ZIPVFS (https://sqlite.org/com/zipvfs.html).
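From what I understand, a SQLite Archive is just a single sqlar table where each file's content is individually zlib/deflate-compressed, so without ZIPVFS the compression should land in roughly the same ballpark as a zip. A sketch of writing into that format directly (paths are placeholders; the sqlite3 CLI can also build these archives with its -A options):

    import sqlite3
    import zlib
    from pathlib import Path

    SCRAPE_DIR = Path("scraped")            # placeholder: folder with the downloaded files

    con = sqlite3.connect("pages.sqlar")    # placeholder archive name
    # table layout used by the SQLite Archive format
    con.execute(
        "CREATE TABLE IF NOT EXISTS sqlar ("
        "name TEXT PRIMARY KEY, mode INT, mtime INT, sz INT, data BLOB)"
    )

    for path in SCRAPE_DIR.iterdir():
        if not path.is_file():
            continue
        raw = path.read_bytes()
        compressed = zlib.compress(raw, 9)
        # the format stores the compressed blob only when it is actually smaller
        data = compressed if len(compressed) < len(raw) else raw
        st = path.stat()
        con.execute(
            "INSERT OR REPLACE INTO sqlar VALUES (?, ?, ?, ?, ?)",
            (path.name, st.st_mode, int(st.st_mtime), len(raw), data),
        )

    con.commit()
    con.close()

That would also cover the "keep going" part, since the already-downloaded names are just SELECT name FROM sqlar.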