It's A Digital Disease!


This is a sub that aims to bring data hoarders together to share their passion with like-minded people.

founded 2 years ago
3976
 
 
The original post: /r/datahoarder by /u/djnron on 2025-02-01 21:05:16.

https://zinebakery.com/assets/homemade-zines/bakeshop-zines/DIYWebArchiving-DombrowskiKijasKreymerWalshVisconti-V4.pdf

Yeah, so this is probably known here already: it's kind of a manual for archiving. Anyway, maybe it is helpful for some folks.

3977
 
 
The original post: /r/datahoarder by /u/paperedbones on 2025-02-01 20:17:49.

First-time poster, long-time lurker. I recently read an article about Reddit deteriorating, eroded by a fresh influx of bots. This may be the usual doomsaying hysteria, but it did lead me to consider - amid all the other hijinks afoot within the US government - that it would be prudent to have a backup method by which the talented & knowledgeable individuals on this subreddit can share their skills with one another in the event of "something happening" to Reddit, eventually.

Basically, suspecting that the enshittification and censorship of the internet is soon to reach new levels of intensity, how can this community & its knowledgebase be backed up?

So this is the question: is there an active Discord server? Does anyone here recommend any other communities where this kind of knowledge is shared?

Personally, I'm not big on small talk and find most of the chatter in most Discord servers inane and needless, but I recognize the usefulness of having a network of intelligent, skillful people as a sort of brain trust. Haha, maybe the idea is self-defeating: if a server exists, it needs to be active, but if there isn't anything urgent to say or ask, a lot of the activity will generally be rubbish chitchat, and if there's too much rubbish chitchat, most people valuing quality exchanges will eventually just leave the server. But maybe I'm mistaken.

I imagine many of you feel similarly, and it would be a loss to all of us if our major means of idea exchange (i.e. this subreddit?) ever collapsed into oblivion. Anyway... your thoughts?

3978
 
 
The original post: /r/datahoarder by /u/Blood_Wraith7777 on 2025-02-01 19:43:59.
3979
 
 
The original post: /r/datahoarder by /u/VeryConsciousWater on 2025-02-01 19:32:44.

Good morning r/DataHoarder,

Many of you have probably seen me working on the CDC datasets archive, but those threads have gotten a bit cluttered and I have a lot of people to notify, so I'm making this a new post.

Over the past several days I've been archiving and uploading a copy of all public datasets formerly available at data.cdc.gov, as of 2025-01-28. This does not include webpages themselves, as those have already largely been archived by projects like EOTArchive and the Wayback Machine.

This upload is now complete and available at https://archive.org/details/20250128-cdc-datasets. For seeders, use the file "full-20250128-cdc-datasets-USETHIS.torrent" included in the files, or the magnet link at the end of this post.

For more context have a look at this post and this post.

Thank you to everyone who requested this important data, and particularly to those who have offered to mirror it. I'll ping everyone who has requested notice ~~in a comment~~, unless you DMed me requesting notice in which case I'll respond to your message.

Happy hoarding everyone!

Brief ETA: Reddit is really not a fan of bulk pinging apparently, so I'll have to go back through the thread to notify everyone. That'll take some time, so apologies for that.

Torrent mirror:

magnet:?xt=urn:btih:3bf9d780d838b6bbc977e9cc6a9530e70ec49732&dn=20250128-cdc-datasets&tr=udp%3A%2F%2Ftracker.0x7c0.com%3A6969%2Fannounce&tr=udp%3A%2F%2Fexodus.desync.com%3A6969%2Fannounce&tr=udp%3A%2F%2Fexplodie.org%3A6969%2Fannounce&tr=udp%3A%2F%2Fopen.free-tracker.ga%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.qu.ax%3A6969%2Fannounce&tr=http%3A%2F%2Fopen.tracker.cl%3A1337%2Fannounce&tr=udp%3A%2F%2Fns-1.x-fins.com%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.bittor.pw%3A1337%2Fannounce&tr=udp%3A%2F%2Ftracker-udp.gbitt.info%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.ololosh.space%3A6969%2Fannounce&tr=udp%3A%2F%2Fopen.demonii.com%3A1337%2Fannounce&tr=udp%3A%2F%2Ftracker.tiny-vps.com%3A6969%2Fannounce&tr=udp%3A%2F%2Fopen.stealth.si%3A80%2Fannounce&tr=udp%3A%2F%2Fopen.dstud.io%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.dler.org%3A6969%2Fannounce&tr=udp%3A%2F%2Fopentracker.io%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce&tr=udp%3A%2F%2Ftracker.dump.cl%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.theoks.net%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.torrent.eu.org%3A451%2Fannounce
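Before adding a long magnet link like the one above to a client, it can help to split it into its parts and check them. The following is a minimal sketch using only the Python standard library; the function name `parse_magnet` is made up for illustration, and the example URI is a shortened copy of the link above with a single tracker.

```python
from urllib.parse import urlparse, parse_qs


def parse_magnet(uri: str) -> dict:
    """Split a BitTorrent magnet URI into info hash, display name, and trackers.

    Hypothetical helper for illustration; it only handles 'urn:btih:' links.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "magnet":
        raise ValueError("not a magnet URI")

    # parse_qs percent-decodes values and collects repeated keys (tr=...) in lists.
    params = parse_qs(parsed.query)

    xt = params.get("xt", [""])[0]
    prefix = "urn:btih:"
    if not xt.startswith(prefix):
        raise ValueError("no BitTorrent info hash (xt=urn:btih:...) found")

    return {
        "info_hash": xt[len(prefix):].lower(),
        "name": params.get("dn", [None])[0],
        "trackers": params.get("tr", []),
    }


if __name__ == "__main__":
    # Shortened version of the magnet link in the post (one tracker only).
    example = (
        "magnet:?xt=urn:btih:3bf9d780d838b6bbc977e9cc6a9530e70ec49732"
        "&dn=20250128-cdc-datasets"
        "&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce"
    )
    info = parse_magnet(example)
    print(info["info_hash"])
    print(info["name"])
    for tracker in info["trackers"]:
        print(tracker)
```

Comparing the printed info hash against the one shown by your torrent client is a quick way to confirm you are seeding the same torrent as everyone else.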

3980
 
 
The original post: /r/datahoarder by /u/PickleGambino on 2025-02-01 19:15:23.

I have a video that I saved to the Internet Archive using RecoverMyVideo. I saw a Reddit post with this same question 6 years ago, but the link that someone posted to this tool for saving videos didn't work anymore.

3981
 
 
The original post: /r/datahoarder by /u/gummytoejam on 2025-02-01 18:09:28.
3982
 
 
The original post: /r/datahoarder by /u/itscalledabelgiandip on 2025-02-01 17:44:22.

I've been increasingly concerned about things getting deleted from the National Archives Catalog so I made a series of python scripts for scraping and monitoring changes. The tool scrapes the Catalog API, parses the returned JSON, writes the metadata to a PostgreSQL DB, and compares the newly scraped data against the previously scraped data for changes. It does not scrape the actual files (I don't have that much free disk space!) but it does scrape the S3 object URLs so you could add another step to download them as well.

I run this as a flow in a Windmill docker container along with a separate docker container for PostgreSQL 17. Windmill lets you schedule the Python scripts to run in order, stops if there's an error, and can send error messages to your chosen notification tool. But you could tweak the Python scripts to run manually without Windmill.

If you're more interested in bulk data you can get a snapshot directly from the AWS Registry of Open Data and read more about the snapshot here. You can also directly get the digital objects from the public S3 bucket.

This is my first time creating a GitHub repository so I'm open to any and all feedback!

https://github.com/registraroversight/national-archives-catalog-change-monitor
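For anyone curious what the "compare new scrape against previous scrape" step in the post above might look like, here is a minimal sketch. It is not taken from the linked repository: it assumes each snapshot has already been loaded into a dict keyed by record ID, and the names `fingerprint` and `diff_snapshots` are made up for illustration. The real tool stores metadata in PostgreSQL, which this sketch leaves out.

```python
import hashlib
import json


def fingerprint(record: dict) -> str:
    """Return a stable SHA-256 hash of a JSON-serializable metadata record."""
    # sort_keys makes the hash independent of key order in the scraped JSON.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_snapshots(previous: dict, current: dict) -> dict:
    """Compare two snapshots (record ID -> metadata dict).

    Returns sorted lists of record IDs that were added, removed, or changed.
    """
    prev_ids, curr_ids = set(previous), set(current)
    changed = [
        rid
        for rid in prev_ids & curr_ids
        if fingerprint(previous[rid]) != fingerprint(current[rid])
    ]
    return {
        "added": sorted(curr_ids - prev_ids),
        "removed": sorted(prev_ids - curr_ids),
        "changed": sorted(changed),
    }


if __name__ == "__main__":
    # Made-up records, only to show the shape of the result.
    old = {"1": {"title": "A"}, "2": {"title": "B"}, "3": {"title": "C"}}
    new = {"1": {"title": "A"}, "2": {"title": "B (edited)"}, "4": {"title": "D"}}
    print(diff_snapshots(old, new))
```

The "removed" list is the part that matters most for spotting deletions; storing only the fingerprints from each run keeps the comparison cheap even for large catalogs.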

3983
 
 
The original post: /r/datahoarder by /u/storytracer on 2025-02-01 16:55:27.

I'm currently mirroring all FTP and HTTP file servers of the US federal government I can find. Here's the current status of all downloads. Please let me know if you come across any other sites and I will add them to the download list! I have 150TB of storage available and can get more if necessary.

3984
 
 
The original post: /r/datahoarder by /u/verticalfuzz on 2025-02-01 13:30:44.

Has anyone archived the data at https://webbook.nist.gov/chemistry/ ?

Can someone help me figure out how, or preferably, do it and share it? I have some storage space but no idea how to archive stuff. This data is very important for research and the chemical/engineering/water/pharma industries.

I believe this may be the same data: https://catalog.data.gov/dataset/nist-chemistry-webbook-srd-69-de237

3985
 
 
The original post: /r/datahoarder by /u/igmkjp1 on 2025-02-01 13:29:38.

See title.

3986
 
 
The original post: /r/datahoarder by /u/juliacakes on 2025-02-01 13:23:49.

I’m checking on mobile in Chrome and Safari.

3987
 
 
The original post: /r/datahoarder by /u/Scienceyall on 2025-02-01 12:29:28.

My algorithm found you. I feel better knowing you exist. Your efforts will not be for nothing, Winston Smith.

3988
 
 
The original post: /r/datahoarder by /u/galamsmsmsm on 2025-02-01 09:51:10.

With the way things are going, I wouldn't be surprised if Internet Archive became a target for censorship. Does anyone know if there are backups hosted in other countries or plans to move their data?

In a 2016 blog post, they mentioned that they were planning to host a copy of the archive in Canada and that they have partial copies hosted in Egypt and the Netherlands. Is that still relevant information?

3989
 
 
The original post: /r/datahoarder by /u/quinyd on 2025-02-01 08:40:05.

https://brickshelf.com/ is shutting down March 1st.

I’m not well versed in scraping, but it would be sad to see so many Lego albums deleted, and there are lots of custom instructions on there too.

3990
 
 
The original post: /r/datahoarder by /u/MeepMeep2000 on 2025-02-01 08:21:55.

Hey,

I currently run 2x12TB and want to add more storage.

My main options are:

Buy 2x16TB and make two mirrors, for a total of 28TB

OR

Buy 2x12TB and make a raidz1 for a total of 36TB

Obviously the second option is not only cheaper but also provides more storage.

The problem is that the second option locks me further into 12TB drives, while the first lets me expand more easily with 16TB drives in the future.

Is it still worth it to go with 12TB drives or will prices of higher capacity drives drop quickly enough to already start with a 16TB array?
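The two totals in the post can be checked with a rough calculation. This is only a sketch under simplifying assumptions: a mirror is treated as giving the capacity of its smallest drive, a raidz1 vdev as giving (number of drives - 1) times the smallest drive, and ZFS overhead and the TB/TiB difference are ignored. The 36TB figure assumes all four 12TB drives end up in a single raidz1 vdev. The function names are made up for illustration.

```python
def mirror_capacity_tb(drive_sizes_tb: list[int]) -> int:
    """Approximate usable capacity of one mirror vdev: the smallest drive."""
    return min(drive_sizes_tb)


def raidz1_capacity_tb(drive_sizes_tb: list[int]) -> int:
    """Approximate usable capacity of one raidz1 vdev: one drive's worth of parity."""
    return (len(drive_sizes_tb) - 1) * min(drive_sizes_tb)


# Option 1: keep the existing 12TB mirror and add a second mirror of 2x16TB.
option_1 = mirror_capacity_tb([12, 12]) + mirror_capacity_tb([16, 16])

# Option 2: four 12TB drives in a single raidz1 vdev.
option_2 = raidz1_capacity_tb([12, 12, 12, 12])

print(f"Two mirrors (2x12TB + 2x16TB): {option_1} TB")  # 28 TB
print(f"raidz1 with 4x12TB:            {option_2} TB")  # 36 TB
```

The same two helpers can be reused to compare later upgrade paths, for example adding a third mirror of larger drives to the first option.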

3991
 
 
The original post: /r/datahoarder by /u/ladycaviar on 2025-02-01 07:50:52.

Never thought I'd have to think this, much less say it, but to all those of you who save humanity's data, I salute you

you all are heroes in a super weird world

3992
 
 
The original post: /r/datahoarder by /u/CiaIsMyWaifu on 2025-02-01 06:46:52.

Growing up, I always heard that storage was really expensive and that, with mechanical drives, higher capacities were more likely to give out under heavy use. How are storage prices and failure rates these days? I'm still using about 4TB between two drives.

3993
 
 
The original post: /r/datahoarder by /u/DangDoood on 2025-02-01 04:18:39.

Y’all probably feel so justified right now… it’s like being a survivalist/doomsday packer and the zombie apocalypse just happens.

Appreciate y’all

(And of course this is ignoring the genuine fear, insecurity, and worries people are experiencing)

3994
 
 
The original post: /r/datahoarder by /u/Narrow-Task on 2025-02-01 04:00:46.

Hi fellow hoarders, I noticed the detailed data downloads from the Census Bureau (the FTP site) are down right now. Is this a coincidence or just routine maintenance?

https://www2.census.gov/geo/tiger/TIGER2024/

I would like to save all of this down as I use it for a lot of personal and professional work. And it's just cool.

3995
 
 
The original post: /r/datahoarder by /u/future__fires on 2025-02-01 03:16:30.

First post here. I’ve been lurking for a while until I had enough money saved to build a serious setup, but with the CDC website going down I guess I’ve run out of time. Climate data is extremely important to me and I don’t even know where to start archiving or what is important, but I expect information on climate change will be sufficiently inconvenient to the Trump admin that it’ll come down soon as well. I’ve also considered the fact that a lot of climate data is kept by universities, and that will be harder for the White House to remove. I feel overwhelmed. I’d appreciate it if anyone could give me ideas on where to start, or tell me whether climate data is stored in enough places and by enough different entities that it will be around for a while. Also, just generally, what do I do? I don’t have the money for terabytes of storage space. I’ve got a desktop PC with about 1TB and a laptop.

3996
 
 
The original post: /r/datahoarder by /u/DROP_DAT_DURKA_DURK on 2025-02-01 03:09:40.
3997
 
 
The original post: /r/datahoarder by /u/Emotional_Bunch_799 on 2025-02-01 02:58:32.

I worked in the infectious diseases field, and I think the following sites are at high risk of being scrubbed.

We need help archiving the following. They require a large amount of storage space due to all their databases:

NIH National Library of Medicine: https://www.ncbi.nlm.nih.gov/

NIH National Institute of Allergy and Infectious Diseases: https://www.niaid.nih.gov/

FDA and their databases: https://www.fda.gov/

The FDA site has been noticeably slower and some pages are unresponsive.

Thank you and I'll donate to organizations that are fighting this!✊

3998
 
 
The original post: /r/datahoarder by /u/uinstitches on 2025-02-01 02:24:14.

the last 1080p story I saved was January 6, and all 35 stories I've ripped since then are 720p. very disappointing, as I would have screen recorded if I'd known. has Instagram blocked apps from ripping stories at max bitrate?

what apps or websites are u guys using?

3999
 
 
The original post: /r/datahoarder by /u/Megathreadd on 2025-02-01 01:04:02.
4000
 
 
The original post: /r/datahoarder by /u/dominionman on 2025-02-01 01:03:46.

Is there any group organizing an effort to create a shadow instance of "vital sites and information"? I would be willing to bet that many of us have at least some spare space and the ability to host things like cdc.screwfascists.com or whatever to make sure that things are continued. Maybe this could be the beginning of a trusted decentralized register of scientific and historical data. Not to step on Wikipedia's toes.
