this post was submitted on 24 Jan 2025
The original post: /r/datahoarder by /u/zyzhu2000 on 2025-01-23 15:12:07.

I collect a ton of structured and unstructured data. It falls into several cases:

  1. Web pages and PDF files I saved from subscription services (completely unstructured).
  2. Data that I periodically scrape, parse, and extract from web pages. It is mostly structured, but fields occasionally change. An example is real estate info.
  3. Data I downloaded from APIs I purchased. These are typically JSON files, each describing one record. They are very structured, but the fields can still change when the API changes versions.
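To make the field-change problem in cases 2 and 3 concrete, here is a minimal Python sketch. The records and field names (`price`, `list_price`, `year_built`) are made up for illustration and do not come from any real API. It only reports which fields are not present in every record; it is not a proposed archive format.

```python
import json

# Hypothetical records from two versions of the same API (field names invented).
raw_records = [
    '{"id": 1, "price": 350000, "sqft": 1200}',
    '{"id": 2, "list_price": 410000, "sqft": 1500, "year_built": 1998}',
]

def field_drift(lines):
    """Return (all fields seen, fields missing from at least one record)."""
    records = [json.loads(line) for line in lines]
    if not records:
        return [], []
    key_sets = [set(r) for r in records]
    all_fields = set().union(*key_sets)
    common = set.intersection(*key_sets)
    return sorted(all_fields), sorted(all_fields - common)

all_fields, unstable = field_drift(raw_records)
print(all_fields)  # every field name seen across the records
print(unstable)    # fields that appear in some records but not all
```

Running something like this over each new batch would at least flag when a scraper or API version has started emitting different fields.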

My questions are:

  1. For long-term archiving, should I keep the raw format (i.e., the downloaded web pages as is) or the extracted data?
  2. How do I deal with the occasional field changes when I archive data?
  3. In what file format should I archive? Parquet, SQLite, CSV, or a tarball of JSON files?

It’s a bit like I need to build a personal data lake.
