The original post: /r/datahoarder by /u/zzswol on 2025-01-27 00:52:30.
The open-source AI community is releasing powerful models. Things are moving fast. You might not have the hardware, expertise, or attention to take proper advantage of them in the moment. Many people are in this position. The future is uncertain. I believe it is important to preserve the moment. Maybe we get AGI and It becomes ashamed of its infantile forms, maybe user AI becomes illegal, etc. (humor me).
What appears to be lacking: distribution mechanisms that privilege archival.
I don't know what's going on, but I want to download stuff. What training data should I download? Validation data? Which models do I download? Which quantizations? In the future, to understand the present moment, we will want all of it. How do we support this?
I am imagining a place where people of all sorts can go to find various prepared distributions:
prepper package: (high storage, low compute) - save all "small" models, distillations, etc. (see the sketch after this list)
tech enthusiast package: (medium storage, medium compute) - save all major base models with scripts to reproduce published quantizations, fine-tunes, etc.? [An archeologist will want the closest access to what was commonly deployed at any given time]
rich guy package: (high storage, high compute) - no work needed here? just download ~everything~
alien archeologist package: ("minimal" storage, high compute) - a complete, non-redundant set of training data and source code for all pipelines? something a particularly dedicated and resourceful person might choose to laser etch into a giant artificial diamond and launch into space
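For the prepper package, here's a minimal sketch of what the tooling might look like, assuming you're pulling from Hugging Face with the `huggingface_hub` library. The model list, folder layout, and date tag are just placeholders I made up; a real package would ship a curated, versioned manifest instead.

```python
# Minimal sketch of a "prepper package" fetcher using huggingface_hub.
# The MANIFEST below is illustrative only -- a real distribution would
# ship a curated, pinned manifest rather than a hard-coded list.
from pathlib import Path
from huggingface_hub import snapshot_download

# Hypothetical manifest: small, self-contained models worth preserving.
MANIFEST = [
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "microsoft/phi-2",
    "Qwen/Qwen2.5-0.5B",
]

ARCHIVE_ROOT = Path("ai-archive/2025-01")  # assumed local layout

for repo_id in MANIFEST:
    target = ARCHIVE_ROOT / repo_id.replace("/", "__")
    print(f"Fetching {repo_id} -> {target}")
    # snapshot_download pulls every file in the repo at a given revision,
    # which is what you want for archival (configs and tokenizers, not just weights).
    snapshot_download(
        repo_id=repo_id,
        local_dir=target,
        revision="main",  # pin a commit hash instead for true reproducibility
    )
```

The hard part isn't the download loop, it's agreeing on the manifest and keeping it versioned over time.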
Does this exist already?