It's A Digital Disease!


This is a sub that aims at bringing data hoarders together to share their passion with like minded people.

3751
The original post: /r/datahoarder by /u/deja_geek on 2025-02-07 07:13:34.

I'm going to start by saying I know these are refurbished/renewed/recertified drives. I'm not here to debate whether buying refurb drives is a good purchase or not.

If you are looking into buying MDD drives from GoHardDrive, the listing on their site shows an example of the hard drive. The 12TB Enterprise listing shows what appears to be a relabeled Seagate Exos drive. I've ordered 6 of these 12TB drives. Despite having the same model number, 5 of them are relabeled Seagate Barracuda Pros and 1 of them is a Seagate Exos X16. The Exos reports 0.01TB less than the Barracuda Pros. This can be problematic if you are setting up or replacing a disk in a ZFS pool.
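
Since a ZFS replace requires the new device to be at least as large as the one it stands in for, it's worth comparing reported byte counts (from `lsblk -b` or `smartctl -i`) before swapping drives. A minimal sketch of that check; the byte counts below are hypothetical placeholders, not the real capacities of these models:

```python
# Hedged sketch: `zpool replace` requires the replacement device to be at
# least as large as the old one, so a drive that reports ~0.01TB less
# cannot replace its larger twin. The sizes below are hypothetical.

def replacement_fits(old_bytes: int, new_bytes: int) -> bool:
    """True if a drive of new_bytes can replace one of old_bytes in a ZFS vdev."""
    return new_bytes >= old_bytes

exos_x16 = 12_000_000_000_000                  # hypothetical reported size
barracuda_pro = exos_x16 + 10_000_000_000      # ~0.01TB larger, hypothetical

print(replacement_fits(exos_x16, barracuda_pro))   # True: larger drive fits
print(replacement_fits(barracuda_pro, exos_x16))   # False: smaller drive fails
```

So in a pool built from the (larger) Barracuda Pros, the slightly smaller Exos could not serve as a replacement; the reverse swap would be fine.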

While I'll personally continue to buy from GoHardDrive, I find it a bit of a problem when they give the same model number to two very different drives.

X16 on the left, Barracuda on the right


3752
The original post: /r/datahoarder by /u/Aniconomics on 2025-02-07 06:09:25.

I have gPodder, which means I can download entire podcasts using their RSS feed links. But I also want to download episodes on specific topics. For example, if I search “Gordon Ramsay” on Podchaser, there are a total of 109 episodes discussing Gordon Ramsay. Is there a way to bulk download all these episodes off Podchaser without manually downloading each episode individually or inputting each podcast's RSS link into gPodder to download that one specific episode?
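
If you already have the RSS URLs of the shows involved (gPodder can export subscriptions), a keyword filter over each feed gets close to this. A minimal standard-library sketch; the parsing assumes typical RSS 2.0 structure, and as far as I know Podchaser itself has no bulk-export API, so the feed URLs are the assumed starting point:

```python
# Hedged sketch: filter podcast RSS feeds by keyword and collect the
# audio enclosure URLs, using only the standard library.

import urllib.request
import xml.etree.ElementTree as ET

def matching_enclosures(rss_xml: str, keyword: str):
    """Return (title, audio_url) for every item whose title or
    description mentions the keyword (case-insensitive)."""
    root = ET.fromstring(rss_xml)
    hits = []
    for item in root.iter("item"):
        title = item.findtext("title") or ""
        desc = item.findtext("description") or ""
        enc = item.find("enclosure")
        if enc is not None and keyword.lower() in (title + desc).lower():
            hits.append((title, enc.get("url")))
    return hits

def download_matching(rss_url: str, keyword: str):
    """Fetch one feed and download every matching episode."""
    with urllib.request.urlopen(rss_url) as resp:
        feed = resp.read().decode("utf-8", errors="replace")
    for title, url in matching_enclosures(feed, keyword):
        urllib.request.urlretrieve(url, url.rsplit("/", 1)[-1])
```

Looping `download_matching(feed_url, "Gordon Ramsay")` over an exported subscription list would approximate the Podchaser search, though episodes from shows you don't subscribe to would still be missed.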

3753
The original post: /r/datahoarder by /u/Houyhnhnm776 on 2025-02-07 05:39:54.
3754
The original post: /r/datahoarder by /u/thefool00 on 2025-02-07 02:55:55.

I have an annoyance that I thought the fine people of this board may be uniquely equipped to assist with. I have an HDD on my play rig that sits on my desk next to me. When it spins up, it makes this consistent "scratch scratch" noise every 3 or so seconds. It's not the death click; it's the noise drives make under heavy read/write operations (going to call it a scratch), but it's very quick: two consistent scratches within a second, always 2, then 3 seconds of silence, then scratch scratch again. The drive is healthy (tested), fairly new, and works great. Has anyone had this happen before and managed to solve it, or does anyone have troubleshooting tips?

3755
The original post: /r/datahoarder by /u/didyousayboop on 2025-02-07 02:21:55.

Link: https://blog.archive.org/2025/02/06/update-on-the-2024-2025-end-of-term-web-archive/

For those concerned about the data being hosted in the U.S., note the paragraph about Filecoin. Also, see this post about the Internet Archive's presence in Canada.

Full text:

Every four years, before and after the U.S. presidential election, a team of libraries and research organizations, including the Internet Archive, work together to preserve material from U.S. government websites during the transition of administrations.

These “End of Term” (EOT) Web Archive projects have been completed for term transitions in 2004, 2008, 2012, 2016, and 2020, with 2024 well underway. The effort preserves a record of the U.S. government as it changes over time for historical and research purposes.

With two-thirds of the process complete, the 2024/2025 EOT crawl has collected more than 500 terabytes of material, including more than 100 million unique web pages. All this information, produced by the U.S. government—the largest publisher in the world—is preserved and available for public access at the Internet Archive.

“Access by the people to the records and output of the government is critical,” said Mark Graham, director of the Internet Archive’s Wayback Machine and a participant in the EOT Web Archive project. “Much of the material published by the government has health, safety, security and education benefits for us all.”

The EOT Web Archive project is part of the Internet Archive’s daily routine of recording what’s happening on the web. For more than 25 years, the Internet Archive has worked to preserve material from web-based social media platforms, news sources, governments, and elsewhere across the web. Access to these preserved web pages is provided by the Wayback Machine. “It’s just part of what we do day in and day out,” Graham said. 

To support the EOT Web Archive project, the Internet Archive devotes staff and technical infrastructure to focus on preserving U.S. government sites. The web archives are based on seed lists of government websites and nominations from the general public. Coverage includes websites in the .gov and .mil web domains, as well as government websites hosted on .org, .edu, and other top level domains. 

The Internet Archive provides a variety of discovery and access interfaces to help the public search and understand the material, including APIs and a full text index of the collection. Researchers, journalists, students, and citizens from across the political spectrum rely on these archives to help understand changes on policy, regulations, staffing and other dimensions of the U.S. government. 

As an added layer of preservation, the 2024/2025 EOT Web Archive will be uploaded to the Filecoin network for long-term storage, where previous term archives are already stored. While separate from the EOT collaboration, this effort is part of the Internet Archive’s Democracy’s Library project. Filecoin Foundation (FF) and Filecoin Foundation for the Decentralized Web (FFDW) support Democracy’s Library to ensure public access to government research and publications worldwide.

According to Graham, the large volume of material in the 2024/2025 EOT crawl is because the team gets better with experience every term, and an increasing use of the web as a publishing platform means more material to archive. He also credits the EOT Web Archive’s success to the support and collaboration from its partners.

Web archiving is more than just preserving history—it’s about ensuring access to information for future generations. The End of Term Web Archive serves to safeguard versions of government websites that might otherwise be lost. By preserving this information and making it accessible, the EOT Web Archive has empowered researchers, journalists and citizens to trace the evolution of government policies and decisions.

More questions? Visit https://eotarchive.org/ to learn more about the End of Term Web Archive.

If you think a URL is missing from The End of Term Web Archive's list of URLs to crawl, nominate it here: https://digital2.library.unt.edu/nomination/eth2024/about/


For information about datasets, see here.

For more data rescue efforts, see here.

For what you can do right now to help, go here.


Updates from the End of Term Web Archive on Bluesky: https://bsky.app/profile/eotarchive.org

Updates from the Internet Archive on Bluesky: https://bsky.app/profile/archive.org

Updates from Brewster Kahle (the founder and chair of the Internet Archive) on Bluesky: https://bsky.app/profile/brewster.kahle.org

3756
The original post: /r/datahoarder by /u/Private_Mandella on 2025-02-07 01:27:30.

I'm trying to build out my own personal library that efficiently replicates as much knowledge as I can fit in there. I know a lot of people approach this from many different directions. Mirroring libgen or scihub is too big a project for me right now, so I'd like to have textbooks that would mostly recreate any degree you could get at most large US universities. I expect this would end up being around ~1000 textbooks and handbooks total by the end of it.

I started trying to map out how I would do this, and it is a lot of work. Collating syllabi and book recommendations with prerequisites takes time, especially given the number of departments in a typical university. OpenSyllabus is great, but it's not clear if their API would be able to help me. I've contacted them about pricing for self-learners, but they haven't gotten back to me.

There are lots of piecemeal examples of what I want, such as this math roadmap but I don't know if anyone has aggregated something approximating what I want.

Does anyone know if something like this exists? If not, I'll start building it out, but it's going to take a while.

3757
The original post: /r/datahoarder by /u/maybehelp244 on 2025-02-07 00:50:32.

The Demographic and Health Surveys (DHS) Program has collected, analyzed, and disseminated accurate and representative data on population, health, HIV, and nutrition through more than 400 surveys in over 90 countries for over 40 years. The project is funded by USAID and currently under a stop work order. Virtually all USAID staff involved have been fired, furloughed or put on admin leave. The contractor, ICF, is also under a stop work order which means all surveys have been stopped and all other activities have been put on hold including trainings, datasets download registrations, and support for researchers needing assistance with analysis of the data. The website is still up and can be accessed here: https://www.dhsprogram.com/ with limited functionality due to the stop work order. It is extremely unfortunate what is going on with the project as the data is considered the gold standard of international public health in developing countries.

3758
The original post: /r/datahoarder by /u/detroitcityy on 2025-02-07 00:39:57.

Hey, I was doing some research about the SilverFast software being better than the Epson software. Is that true? I basically scan CD booklets to print them on t-shirts. I've heard about IT8 calibration and the scratch and dust removal feature. Does this stuff work on a MacBook Pro M4?

3759
The original post: /r/datahoarder by /u/didyousayboop on 2025-02-07 00:21:31.

The blog post is here: https://lil.law.harvard.edu/blog/2025/02/06/announcing-data-gov-archive/

Here's the full text:

Announcing the Data.gov Archive

Today we released our archive of data.gov on Source Cooperative. The 16TB collection includes over 311,000 datasets harvested during 2024 and 2025, a complete archive of federal public datasets linked by data.gov. It will be updated daily as new datasets are added to data.gov.

This is the first release in our new data vault project to preserve and authenticate vital public datasets for academic research, policymaking, and public use.

We’ve built this project on our long-standing commitment to preserving government records and making public information available to everyone. Libraries play an essential role in safeguarding the integrity of digital information. By preserving detailed metadata and establishing digital signatures for authenticity and provenance, we make it easier for researchers and the public to cite and access the information they need over time.

In addition to the data collection, we are releasing open source software and documentation for replicating our work and creating similar repositories. With these tools, we aim not only to preserve knowledge ourselves but also to empower others to save and access the data that matters to them.

For suggestions and collaboration on future releases, please contact us at [lil@law.harvard.edu](mailto:lil@law.harvard.edu).

This project builds on our work with the Perma.cc web archiving tool used by courts, law journals, and law firms; the Caselaw Access Project, sharing all precedential cases of the United States; and our research on Century Scale Storage. This work is made possible with support from the Filecoin Foundation for the Decentralized Web and the Rockefeller Brothers Fund.

You can follow the Library Innovation Lab on Bluesky here.


Edit (2025-02-07 at 01:30 UTC):

u/lyndamkellam, a university data librarian, makes an important caveat here.

3760
The original post: /r/datahoarder by /u/Ok-Scientist-4165 on 2025-02-06 23:50:01.

Does anyone have a good system / device / process for scanning and OCRing multiple thousands of handwritten pages?

3761
The original post: /r/datahoarder by /u/lyndamkellam on 2025-02-06 23:45:56.

This was just posted by the Harvard Library Innovation Lab. https://lil.law.harvard.edu/blog/2025/02/06/announcing-data-gov-archive/ Note the Data Limitations: "data.gov includes multiple kinds of datasets, including some that link to actual data files, such as CSV files, and some that link to HTML landing pages. Our process runs a "shallow crawl" that collects only the directly linked files. Datasets that link only to a landing page will need to be collected separately."
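
The practical effect of that caveat is that only URLs pointing straight at data files were captured. A rough way to see which of your own dataset links would survive a shallow crawl; the extension heuristic here is my illustration, not the Lab's actual logic:

```python
# Hedged sketch: classify dataset links as direct file downloads (captured
# by a shallow crawl) vs. HTML landing pages (needing separate collection).
# The extension list is an assumption for illustration.

DATA_EXTENSIONS = {".csv", ".json", ".xml", ".zip", ".xls", ".xlsx", ".geojson"}

def is_direct_data_link(url: str) -> bool:
    """True if the URL's final path segment ends in a known data-file extension."""
    name = url.split("?", 1)[0].split("#", 1)[0].rsplit("/", 1)[-1]
    dot = name.rfind(".")
    return dot != -1 and name[dot:].lower() in DATA_EXTENSIONS

print(is_direct_data_link("https://example.gov/files/budget_2024.csv"))  # True
print(is_direct_data_link("https://example.gov/dataset/budget-2024"))    # False
```

Anything in the second category is what the Lab says "will need to be collected separately," typically with a deeper crawler that follows the landing page's own download links.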

3762
The original post: /r/datahoarder by /u/OneChrononOfPlancks on 2025-02-06 23:35:58.

Seems to affect larger packages, the torrents are always truncated.

Does anybody know the technical explanation why this is happening?

3763
The original post: /r/datahoarder by /u/brokewash on 2025-02-06 20:24:55.
3764
The original post: /r/datahoarder by /u/bobiversus on 2025-02-06 14:59:21.

https://www.bloomberg.com/news/articles/2025-02-06/hoarders-rush-to-save-us-health-data-after-string-of-trump-orders

https://archive.ph/TrYet (thanks, evildad53)

However, from the sounds of their efforts and sleuthing skills, these patients only have contracted Level 1 DataHoarding. They have not yet progressed to Level 5.

3765
The original post: /r/datahoarder by /u/Budget_Worldliness42 on 2025-02-06 14:44:42.

https://www.rollingstone.com/politics/politics-news/trump-national-archives-maralago-fbi-1234606476/

Apparently this has long been a goal of his. How much do we know about what information we can save? What other potential targets should we be looking at in terms of valuable information that could be taken down? I'd love to hear your thoughts.

3766
The original post: /r/datahoarder by /u/jessie15273 on 2025-02-06 12:07:18.

The idea of the dissolution of the department of education is a scary thought for me. I'm getting a few drives and will be hoarding a bunch of educational content. Have a little baby. Don't want her to be uneducated if things go really south with education!

Going to sift through threads for good drives and get cracking on a variety of curriculum. There's so many schooling resources available. I want to collect as much of a variety as I can. Can you think of anything else that would be useful? Things that work with schooling, like resources to be used with already developed materials. Wikipedia is on the list!

I want to be able to boot it all up on a single computer without internet. All of k-12.

3767
The original post: /r/datahoarder by /u/CalculatingLao on 2025-02-06 11:17:52.

Over the last few weeks this sub has basically just become a US politics news sub. Every day it's just arguments about politics, predictions about oncoming doom, and people just linking random news stories in what seems to be attempted karma farming.

Can we just have a pinned mega thread to contain it all in one place, and cut down on the spam?

I get that this is one of the most exciting things to happen for a lot of hoarders, and people are excited to put their skills and scripts to the test. However, not everyone lives in America.

3768
The original post: /r/datahoarder by /u/batmaniac77 on 2025-02-06 19:13:20.
3769
The original post: /r/datahoarder by /u/Legit_TheGamingwithc on 2025-02-06 18:17:09.

Decided to start backing up off-site and stuff. These 2 both seem pretty cool. Which one is better to use?

3770
The original post: /r/datahoarder by /u/EducationalOcelot4 on 2025-02-06 18:04:34.

Is there a current copy available somewhere? The best I've found was Jan 25th, and a LOT has happened since.

Bonus points if it's a torrent; all I keep finding are direct downloads, and that's impractical.

3771
The original post: /r/datahoarder by /u/boxed_knives on 2025-02-06 17:57:58.

Some context:

I plugged my LaCie Rugged Mini into my MacBook Pro (via a Satechi Type-C Pro Hub Adapter) to transfer some folders over and it functioned as normal. When I pressed eject, I was denied as "another program was using it" (which wasn't true). Just as I opened Disk Utility, my LaCie finally decided to eject itself, though I did so through DU just to be safe.

I decided to plug it in again. As I waited for it to show up on my desktop, I noticed the LED light on the hard drive go out. I quickly opened DU to find my LaCie listed as "Uninitialized Disk". It disappeared, and I then received a message stating that my hard drive had been ejected unsafely.

I plugged it in once again and it showed up and appeared to function as normal again.

Still, I'm slightly concerned as I don't think I've ever experienced something like this.

3772
The original post: /r/datahoarder by /u/DueStranger on 2025-02-06 17:31:00.

I tried using HTTrack but get "Bad Request" (400) errors. Not sure what I'm doing wrong. This is the site: https://acquisitiongateway.gov/periodic-table

Not sure why I'm being downvoted for asking this, but if there's another way to back up this site, I'd love to hear your thoughts. I played with the settings from a few YT videos, but nothing works.
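
One common cause of a blanket 400 from a crawler is a blocked or default User-Agent, and sending a browser-like one sometimes helps. This is a guess at the cause, not a confirmed fix, and it won't help if the page is rendered client-side by JavaScript (in which case a headless-browser tool is needed instead of HTTrack). A minimal standard-library sketch:

```python
# Hedged sketch: fetch a page with a browser-like User-Agent, which some
# sites require before they will serve a crawler. The UA string is an
# example value, not anything specific to this site.

import urllib.request

def browser_request(url: str) -> urllib.request.Request:
    """Build a request carrying browser-like headers."""
    return urllib.request.Request(url, headers={
        "User-Agent": ("Mozilla/5.0 (X11; Linux x86_64; rv:115.0) "
                       "Gecko/20100101 Firefox/115.0"),
        "Accept": "text/html,application/xhtml+xml",
    })

# Usage (network call commented out):
# req = browser_request("https://acquisitiongateway.gov/periodic-table")
# html = urllib.request.urlopen(req).read()
```

HTTrack has an equivalent browser-identity setting in its options; if the 400s persist even with that, the content is likely built by JavaScript after page load and a plain mirroring tool won't see it.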

3773
The original post: /r/datahoarder by /u/Sorry_Advantage_590 on 2025-02-06 17:00:56.

I won't pretend to be a talented coder, because I'm not, nor will I act as though I have full expertise on encryption and data erasure. But why isn't there a feature in VeraCrypt that allows for the destruction of data? VeraCrypt offers a hidden volume that gives you some plausible deniability, but what if there were a feature that could erase data when a certain password is input? That way, if you were compelled or forced to give a password, there would be no data to give, because it's erased. Just wondering if such a feature is possible? It would be cool nonetheless.

3774
The original post: /r/datahoarder by /u/kwarner04 on 2025-02-06 16:58:43.

It appears most of the reports and things people are posting online about all the spending are a result of queries built on the data posted at USAspending.gov. It's still up now, but as more people have started digging, I expect lots of finger pointing at both sides of the aisle...and I wouldn't be surprised if it gets harder to get.

Turns out, you can download a copy of the database so I went ahead and grabbed a copy.

Created a torrent to make it easy to replicate and share:

magnet:?xt=urn:btih:4GFCPALVPXB5HYPPRA5AZWFM3AG5YIAP&dn=usaspending-db_20250106.zip&xl=156276262643&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce

It's pretty slow uploading, so if you want to directly download the file, you can do so here: https://files.usaspending.gov/database_download/usaspending-db_20250106.zip

Probably easier to download and then just seed today & tomorrow...it wasn't super fast even on a 2 gig fiber connection...took about 8 hours. It's 145 GB compressed and expands to a PostgreSQL database of over 1.5TB. Here's a link to the directions they provide to decompress the backups: https://files.usaspending.gov/database_download/usaspending-db-setup.pdf

Normally, they require you to login to actually view the download link, but figured the folks here would appreciate not having to login. If you do want to check it out and verify, feel free: https://onevoicecrm.my.site.com/usaspending/s/database-download

PS...if anyone else has any recommendations on open source (non-piracy) torrent trackers, I'll gladly add to those as well.

3775
The original post: /r/datahoarder by /u/mrspooky84 on 2025-02-06 16:50:49.