I collect a ton of structured and unstructured data. There are several cases:
- Web pages and PDF files I saved from subscription services (completely unstructured).
- Data I periodically scrape, parse, and extract from web pages. This is mostly structured, but fields occasionally change; real estate listings are an example.
- Data downloaded from APIs I pay for. These are typically JSON files, each describing one record. They are very structured, but fields can still change when the API moves to a new version.
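To make the field-change problem concrete, here is a hypothetical example (all field names made up) of the same API record before and after a version bump, plus a quick diff of the keys:

```python
# Hypothetical records for the same listing from two API versions.
record_v1 = {"id": 101, "price": 350000, "zip": "94110"}
record_v2 = {"id": 101, "list_price": 350000, "zipcode": "94110", "hoa_fee": 250}

# Which fields appeared or disappeared between versions?
added = set(record_v2) - set(record_v1)
removed = set(record_v1) - set(record_v2)
print(sorted(added))    # ['hoa_fee', 'list_price', 'zipcode']
print(sorted(removed))  # ['price', 'zip']
```

This is the kind of drift I hit: renames (`price` → `list_price`) plus entirely new fields, with no notice from the provider.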
My questions are:
- For a long-term archive, should I keep the raw format (i.e., the downloaded web pages as-is) or the extracted data?
- How do I deal with the occasional field changes when I archive data?
- In what file format should I archive: Parquet, SQLite, CSV, a tarball of JSON files?
It’s a bit like I need to create a personal data lake.
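For concreteness, here is a rough sketch of the kind of thing I'm imagining: append each raw record unmodified to a tarball, alongside a small manifest entry recording the fetch time and which schema version it came from (all names here are my own invention, not from any library):

```python
import hashlib
import io
import json
import tarfile
import time


def archive_record(tar_path, raw_bytes, source_url, schema_version):
    """Append one raw record plus a manifest entry to an uncompressed tarball.

    The raw bytes are stored untouched; the manifest captures provenance
    (source, fetch time, schema version) so field changes can be tracked later.
    """
    digest = hashlib.sha256(raw_bytes).hexdigest()
    manifest = {
        "source": source_url,
        "fetched_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "schema_version": schema_version,
        "sha256": digest,
    }
    members = [
        (f"{digest}.json", raw_bytes),
        (f"{digest}.meta.json", json.dumps(manifest).encode()),
    ]
    # Mode "a" appends to an existing tar, creating the file if needed.
    with tarfile.open(tar_path, "a") as tar:
        for name, data in members:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
```

I don't know if this per-record-manifest idea is sound at scale, which is partly why I'm asking.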