It's A Digital Disease!


This is a sub that aims to bring data hoarders together to share their passion with like-minded people.

3676
 
 
The original post: /r/datahoarder by /u/shane_dev on 2025-02-09 10:31:36.
3677
 
 
The original post: /r/datahoarder by /u/Iamboringaf on 2025-02-09 07:29:20.

Even the storage on most modern phones is probably enough to hold a library larger than you could read in a lifetime. Or do you plan to share the books with others?

3678
 
 
The original post: /r/datahoarder by /u/Alystan2 on 2025-02-09 03:51:27.

Hi everyone, I am planning my storage setup and would like to hear your advice/ideas/suggestions/criticisms.

The use case: AI model storage (diffusion, LLMs maybe other models in the future), local code repo, some scraping and personal data potentially used for training.

Requirements:

  • 24 TB, expandable to 48 TB or more in the future.
  • no need for it to be online all the time (happy to turn it on/off when downloading/retrieving data).
  • some form of backup/redundancy in case of drive failure/lost sectors.
  • cost effective.

Initial plan:

  • second-hand rack server (Dell R730?) with 8x 3.5" bays
  • TrueNAS SCALE in RAIDZ2
  • 4x 8 TB 5400 RPM HDDs (is it worth paying for 2M-hour MTBF over 1M-hour?)

Current cost estimate: server ~300 USD, drives 4 × 150 = 600 USD, so ~900 USD total (note that I do not live in the US, but I have converted the prices available to me to USD for convenience).

Any feedback?
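
For reference, a quick sanity check on usable space (a rough sketch; it ignores ZFS metadata/slop overhead and the TB vs. TiB distinction): RAIDZ2 spends two drives' worth of capacity on parity, so 4x 8 TB lands well below the stated 24 TB target.

# Rough RAIDZ2 usable-capacity estimate (ignores ZFS overhead and TB vs. TiB).
def raidz2_usable_tb(drive_count: int, drive_size_tb: float) -> float:
    # RAIDZ2 reserves two drives' worth of capacity for parity.
    return (drive_count - 2) * drive_size_tb

for drives in (4, 6, 8):
    print(f"{drives} x 8 TB in RAIDZ2 ~= {raidz2_usable_tb(drives, 8):.0f} TB usable")
# 4 x 8 TB in RAIDZ2 ~= 16 TB usable  (short of the 24 TB requirement)
# 6 x 8 TB in RAIDZ2 ~= 32 TB usable
# 8 x 8 TB in RAIDZ2 ~= 48 TB usable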

3679
 
 
The original post: /r/datahoarder by /u/PreparationHbomb on 2025-02-09 03:40:08.
3680
 
 
The original post: /r/datahoarder by /u/Universal-Magnet on 2025-02-09 03:21:00.

Let’s say I have 1 hard drive and 1 backup. One of these hard drives fails, so I go to Best Buy and buy a new hard drive, then transfer everything over from the back-up to the new drive. What are the chances my 1 remaining hard drive is going to fail within the 45 minutes it takes me to buy a new hard drive and transfer everything over?
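
For a rough sense of scale, here is a back-of-the-envelope estimate assuming an exponential failure model and a spec-sheet MTBF of 1,000,000 hours (both assumptions; real-world annualized failure rates are higher, and restoring several terabytes usually takes longer than 45 minutes):

import math

# Probability that the surviving drive fails during the restore window,
# under an exponential failure model with an assumed spec-sheet MTBF.
MTBF_HOURS = 1_000_000   # assumption; real-world failure rates are higher
window_hours = 0.75      # the 45-minute buy-and-restore window

p_fail = 1 - math.exp(-window_hours / MTBF_HOURS)
print(f"P(remaining drive fails in 45 min) ~= {p_fail:.1e}")
# ~7.5e-07, i.e. roughly 1 in 1.3 million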

3681
 
 
The original post: /r/datahoarder by /u/tamerlein3 on 2025-02-09 03:20:52.

Hi, I have a bunch of HDDs that I would like to use for cold data storage. Ideally the workflow is to boot it up once every month or two, add some data (files), let it rebalance or heal bad sectors, copy off anything I need, then shut down again. What system is ideal for this type of workload? Open to all recommendations.

I've played with Ceph and ZFS, and it seems they both assume the system is always on.

I would prefer some sort of distributed system with some fault tolerance (e.g., being able to recover from one lost drive).
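
One building block for that "power up, verify, top up, power down" routine, whatever filesystem ends up underneath, is a checksum manifest kept on each drive so silent corruption shows up at every session. A minimal standard-library sketch (integrity checking only, not the cross-drive redundancy part; the manifest name is arbitrary):

import hashlib, json, os, sys

MANIFEST = "manifest.json"  # stored at the root of the cold drive (name is arbitrary)

def sha256(path, bufsize=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

def scan(root):
    return {
        os.path.relpath(os.path.join(dirpath, name), root): sha256(os.path.join(dirpath, name))
        for dirpath, _, files in os.walk(root)
        for name in files
        if name != MANIFEST
    }

if __name__ == "__main__":
    root = sys.argv[1]                          # e.g. the cold drive's mount point
    manifest_path = os.path.join(root, MANIFEST)
    current = scan(root)
    if os.path.exists(manifest_path):
        with open(manifest_path) as f:
            previous = json.load(f)
        for path, digest in previous.items():
            if current.get(path) != digest:
                print("CHANGED or MISSING:", path)
    with open(manifest_path, "w") as f:
        json.dump(current, f, indent=2)         # refresh the manifest after each session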

3682
 
 
The original post: /r/datahoarder by /u/futrfantastic on 2025-02-09 02:49:54.

Original Title: Bought a used Apple SuperDrive for $5, but it only reaches 3.9X ripping a DVD, compared to my HP 550s, which gets close to its advertised 7.8X (but is loud). Does this point to a faulty drive, or is this a known limitation of the SuperDrive?

3683
 
 
The original post: /r/datahoarder by /u/kitkat2628 on 2025-02-09 00:38:30.

Hi, I'm trying to find and play some deleted YouTube videos, but it's not working for me, and I wanted to check whether anyone can find a way to make any of the videos play. I've been able to find the videos on the Internet Archive, but they don't play. These are the links: https://youtube.com/watch?v=Qgni3OPac3Q, https://youtube.com/watch?v=Dpmt5Fc6NZo, https://youtube.com/watch?v=UVGuJO_WurI. Thank you!
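
For what it's worth, the Wayback Machine usually captured only the YouTube watch page, not the video stream itself, which may be why the saved copies don't play. A minimal sketch that asks the Wayback Machine availability API for the closest snapshot of each of the links above:

import json
import urllib.parse
import urllib.request

VIDEOS = [
    "https://youtube.com/watch?v=Qgni3OPac3Q",
    "https://youtube.com/watch?v=Dpmt5Fc6NZo",
    "https://youtube.com/watch?v=UVGuJO_WurI",
]

for video in VIDEOS:
    api = "https://archive.org/wayback/available?url=" + urllib.parse.quote(video, safe="")
    with urllib.request.urlopen(api) as resp:
        data = json.load(resp)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest:
        print(video, "->", closest["url"], "captured", closest["timestamp"])
    else:
        print(video, "-> no snapshot found")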

3684
 
 
The original post: /r/datahoarder by /u/RedFox0008 on 2025-02-09 00:18:52.

I had been using a BEYIMEI PCIe 1x SATA card (10 ports, 6 Gbps, PCI Express 3.0 to SATA 3.0 controller expansion card) on my old motherboard, but when I switched to my new AM5 DDR5 Asus board, I found that it is apparently too old for the new board. A reviewer on Amazon had the same problem with a different, newer motherboard, stating that if your motherboard is new enough to run DDR5, the card won't work.

My motherboard has 4 slots, and I have 9 HDDs of various sizes plus an optical drive, with no RAID or any other configuration beyond plain lettered drives in Windows.

What PCIe card should I be looking for for my setup? I have no idea what is or is not compatible with the new AM5/DDR5 boards.

3685
 
 
The original post: /r/datahoarder by /u/Intelligent-War-128 on 2025-02-09 00:07:24.
3686
 
 
The original post: /r/datahoarder by /u/evildad53 on 2025-02-09 00:03:06.

"A Heise investigation of used Seagate data center-grade hard drives that are being sold as new has suggested that the drives originated from Chinese cryptocurrency mining farms that used them to mine Chia several years ago... According to the report, these drives — many with 15,000 to 50,000 hours of prior use — had their internal records altered to appear unused." https://www.tomshardware.com/pc-components/hdds/seagates-fraudulent-hard-drives-scandal-deepens-as-clues-point-at-chinese-chia-mining-farms

3687
 
 
The original post: /r/datahoarder by /u/wickedplayer494 on 2025-02-08 23:25:50.
3688
 
 
The original post: /r/datahoarder by /u/OctoLiam on 2025-02-08 23:13:28.

Hi, I've recently gotten into data hoarding by accident and I'm really enjoying it; however, I've hit a bit of an impasse.

I got this PC roughly a year ago. Back then I didn't have any idea what I wanted to do with it, so I just slapped on Windows 11 Pro with Jellyfin and carried on. I then ended up getting into Docker containers using Docker Desktop and now have roughly 30 containers running.

On this PC I only have one 8TB drive with no backups so far, but I do want to change that: I was thinking of getting a couple of enterprise manufacturer-recertified drives and setting up some sort of RAID config. However, two things have been holding me back: 1. the case and where to physically put the drives, and 2. the OS. I've heard that most people recommend unRAID for this use case; I've never really touched any Linux OS before, but I am willing to learn and try.

Although I'm pretty happy to lose all the media like movies and TV shows on the 8TB drive, I don't really want to lose some things on the SSD, including: 1. the Jellyfin server data, which at this point I think is me being in denial, considering it's quite difficult to move Jellyfin from a Windows instance to a container instance; and 2. the container data, though that one won't be too bad to lose.

The hardware for the PC is:

  • Intel Core i5 12th gen 6 core CPU
  • Netac 512GB SSD
  • 8TB HDD (I think it is a WD Red)
  • Silverstone Fara 311 case
  • Silverstone ET550 550w
  • H610M-E Motherboard
  • 16GB DDR4 RAM

Is moving to unRAID wise for me and will there be any hardware changes that I need to make to do so? Thank you in advance!

3689
 
 
The original post: /r/datahoarder by /u/BuyHighValueWomanNow on 2025-02-08 23:10:03.

I know Reddit allows you to download your data. But are there any applications that let users reassemble that data into readable form, and possibly publish it in some Reddit-like format?
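
I'm not aware of a polished tool for this, but the export is just a set of CSV files, so even a short script makes it browsable. A minimal sketch that turns every CSV in the unzipped export into a Markdown file, one section per row; it deliberately hardcodes no Reddit-specific file or column names, since those depend on what your export contains:

import csv
import glob
import os
import sys

# Convert every CSV in a Reddit data-export folder into a simple Markdown file,
# one section per row, using whatever columns each CSV actually has.
export_dir = sys.argv[1]                     # path to the unzipped export
out_dir = os.path.join(export_dir, "markdown")
os.makedirs(out_dir, exist_ok=True)

for csv_path in glob.glob(os.path.join(export_dir, "*.csv")):
    name = os.path.splitext(os.path.basename(csv_path))[0]
    with open(csv_path, newline="", encoding="utf-8") as src, \
         open(os.path.join(out_dir, name + ".md"), "w", encoding="utf-8") as dst:
        dst.write(f"# {name}\n\n")
        for i, row in enumerate(csv.DictReader(src), start=1):
            dst.write(f"## {name} {i}\n\n")
            for key, value in row.items():
                if value:
                    dst.write(f"- **{key}**: {value}\n")
            dst.write("\n")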

3690
 
 
The original post: /r/datahoarder by /u/jumpycan on 2025-02-08 22:51:24.

My dad has 500-1,000 tapes of TV recordings from the '80s-'90s. I'm interested in digitizing them but have no idea where to start. I assume I need a converter and a VCR? Any help with specifics would be appreciated.

3691
 
 
The original post: /r/datahoarder by /u/Dr4g0nSqare on 2025-02-08 22:38:00.

The End of Term archive is primarily focused on federal sites. They explicitly state that state governments are out of scope and I assume organizations that receive federal grants are also out of scope.

I have noticed over the last couple of days that the Archive Warrior jobs on my VM are getting less and less frequent for the US Government project, but I think more things are at risk than just what's explicitly owned by the federal government.

I would like to put together a list of potential sites that might be affected by this administration but are out of scope for the End of Term archive.

Things like states that recently flipped, environmental research (especially in the Gulf of Mexico and Alaska), civil rights organizations that may lose funding, and anything else people can think of.

3692
 
 
The original post: /r/datahoarder by /u/Crazy_Dubs_Cartoons on 2025-02-08 20:23:13.

So, there is a LOT of stuff out there, but what do you deem worthy of being archived and copied onto multiple external HDDs or data tapes for super-long-term preservation?

And how do you do it? I really want to know :)

TL;DR: I optimize as much as possible, keeping only what I deem worthy among music, movies, books/manga/comics, games, and my own works. For whatever I discover to be shitty, I keep a TXT file in each category with the names of those shitty products. I purchase new external HDDs every 2 years, labeling each one with its year of purchase; each HDD holds a full copy of the previous ones and gets updated as required. When I'm dead, these will be my inheritance; if I have no offspring of my own, a notary will have to find an heir who meets specific requirements to be worthy of the content.


I deem worthy the following (not in any specific order of importance):

  1. music by real musicians (any genre, from jazz fusion to thrash metal; I never care about the notoriety of the artist/ensemble, only how good the tracks are, and if I like only one song by a certain artist/ensemble, so be it), the exception being jazz fusion albums... quality has to be at least 160 kbps Opus equivalent; I don't really care about lossless, but I analyze each music track with Audacity to check that the spectrum is above a certain quality threshold
  2. books/comic books/manga that are NOT mainstream slop passed off as masterpieces (the only "mainstream" works I archive are the really good classics, like Punisher MAX for instance)... quality has to be readable; the more compressed while still neat, the better
  3. movies of any year and any genre, but NO LONG-RUNNING TV SHOWS, except for selected episodes of anthology shows (example: the 1950s Twilight Zone) or very short series of 6 episodes max... my sweet spot is 720p x265, a great compression/quality ratio
  4. selected artworks by online artists but no full galleries; a lot of erotica too, very little drawn/painted porn (exception: if I really like a certain character, I'll have a folder with the best artwork featuring that one character)
  5. my own digital paintings (folders by year, since 2020); I keep them both as TIFF for possible reworking/conservation and as JPG to share on the web
  6. offline Wikipedia with WikiTaxi to browse it, updated yearly
  7. my own voice acting over originally non-English or muted animation videos; I recut them to make a new plot out of the footage, using AI to change my voice into other characters (the results are damn good); I keep them at 480p x264 for compatibility reasons
  8. my "movie recuts" of animated series (like Samurai 7, Elfen Lied, etc.) or the few live-action episodic, non-anthology series, where I meticulously removed all the filler (even those 3-4 seconds of pointless lingering during a scene) to create what I call "viewer's-time-respecting recuts"...
  9. video games of any era, for any console, indie games too, even hentai games I recently downloaded en masse from a certain forum (some of those XXX games are as good as or even better than official games; you know you've grown when you rate XXX games on gameplay, story, art style, involvement, etc. instead of the mere act of sex, lol... and the funny thing is, the less a game weighs, the better it tends to be in art style and gameplay: amazing pixel-art arcade-style games)
  10. my own personal "writer's vocabularies" that I have been writing since I was 16 years old (33 as of April 2025)... did you know there are about 120 synonyms for the word "slow" in English? 0_0

My way of archiving is like this:

  1. On 4/8-terabyte external HDDs, each kept in a zipped pouch and then stored inside a damp-free drawer
  2. A main folder for each category (games, movies, books, etc.)
  3. Subfolders specific to year, console, genre, etc. (some examples of folder structure: Games > PC > 1999, Games > Emulation > PS2, Movies > Masterpieces regardless of age, Movies > Pre-2000s > a folder for each year before 2000)
  4. I will eventually watch/read/play whatever I archive, and if it SUCKS, then to save space for the really worthy stuff I keep a TXT file in each main category folder called "Shit that Sucks" and write those titles there (e.g., why the hell would I keep a 40-gigabyte game that turns out to be a waste of data due to how objectively shitty it is... I'll just drop its title in the TXT file and free up those 40 gigabytes)

I use CrystalDiskInfo to monitor the health of my HDDs.

My plan is to keep the HDDs as an inheritance for the next generation, buying new HDDs every 2 years and labeling each HDD with its date of purchase.

If I die with no offspring, I'll have a notary keep them and give them only to someone who has the specific qualities I lay out in my will.

3693
 
 
The original post: /r/datahoarder by /u/tzrp95 on 2025-02-08 17:18:37.

Got a new HDD. Started a full format, which lasted several hours. The last time I checked, it was nearing completion. Then, after 20 minutes or so, the monitor had turned off (automatically), and I tripped over the cables and accidentally pulled out the power cable.

I think it was done formatting, judging by my time estimate since my last check. The HDD works fine.

If it wasn't done formatting, or was interrupted in the middle, it wouldn't work, no?

3694
3695
 
 
The original post: /r/datahoarder by /u/Sophia-512 on 2025-02-08 16:21:04.

How can I back up a subreddit? I've seen red arc recommended as a solution, but I'm not clear on how to use it.

3696
 
 
The original post: /r/datahoarder by /u/WorldTraveller101 on 2025-02-08 14:32:37.

Demo: https://youtu.be/8cB8TwJmcjk

https://preview.redd.it/4bop6aq2oxhe1.png?width=3442&format=png&auto=webp&s=e306cfbdf91ede146330e1686184642e46f5a1e4

https://preview.redd.it/3gz9uovkoxhe1.png?width=3442&format=png&auto=webp&s=1d51cc18d2a237590a987e169bf1446a5dabfb32

I’m excited to present BookLore, a self-hosted web application designed to streamline the process of managing and reading books. As someone who loves reading but found it challenging to organize and access my books across different devices, I wanted to create a solution that made it easy to store, manage, and read books directly from the browser.

The core idea behind BookLore is simplicity. You just need to add your books to a folder, and BookLore takes care of the rest. It supports popular formats like PDF and EPUB, and once the books are uploaded, the app organizes them, making it easy to find and enjoy them from any device, anywhere, as long as you have a browser.

Currently, the app is in its early stages of development, and I have exciting plans for its future. I aim to release BookLore in the coming months, and it will be fully open-source and hosted on GitHub, so anyone can contribute or deploy it themselves.

I’m looking forward to hearing your thoughts and feedback! If you have suggestions, feature requests, or any feedback on how the app can improve, feel free to let me know. I’m open to all ideas as I work to make BookLore the best book management and reading platform it can be.

Thanks for checking it out, and stay tuned for updates!

3697
 
 
The original post: /r/datahoarder by /u/signalwarrant on 2025-02-08 14:03:39.

This is a bit of code I have developed to use with the Crawl4AI Python package (GitHub - unclecode/crawl4ai: 🚀🤖 Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper). It works well for crawling a sitemap.xml; just give it the link to the sitemap you want to crawl.

You can find any site's sitemap.xml by looking in its robots.txt file (example: cnn.com/robots.txt). At some point I'll put this on GitHub, but I wanted to share it sooner rather than later. Use at your own risk.

  • Shows progress: X/Y URLs completed
  • Retries failed URLs only once
  • Logs failed URLs separately (to logfile.txt)
  • Writes clean Markdown output
  • Respects request delays
  • Streams results into multiple files (max 20 MB each; this is the file size limit for uploads to ChatGPT)

Change these values in the code below to fit your needs:

  • SITEMAP_URL = "https://www.cnn.com/sitemap.xml"  (change this to your sitemap URL)
  • MAX_DEPTH = 10  (limit recursion depth)
  • BATCH_SIZE = 1  (number of concurrent crawls)
  • REQUEST_DELAY = 1  (delay between requests, in seconds)
  • MAX_FILE_SIZE_MB = 20  (max file size before creating a new one)
  • OUTPUT_DIR = "cnn"  (directory to store multiple output files)
  • RETRY_LIMIT = 1  (retry failed URLs once)
  • LOG_FILE / ERROR_LOG_FILE  (general log and failed-URL log, both inside OUTPUT_DIR)

import asyncio
import json
import os
import xml.etree.ElementTree as ET
from urllib.parse import urljoin, urlparse
import aiohttp
from aiofiles import open as aio_open
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

# Configuration
SITEMAP_URL = "https://www.cnn.com/sitemap.xml"  # Change this to your sitemap URL
MAX_DEPTH = 10  # Limit recursion depth
BATCH_SIZE = 1  # Number of concurrent crawls
REQUEST_DELAY = 1  # Delay between requests (seconds)
MAX_FILE_SIZE_MB = 20  # Max file size before creating a new one
OUTPUT_DIR = "cnn"  # Directory to store multiple output files
RETRY_LIMIT = 1  # Retry failed URLs once
LOG_FILE = os.path.join(OUTPUT_DIR, "crawler_log.txt")  # Log file for general logging
ERROR_LOG_FILE = os.path.join(OUTPUT_DIR, "logfile.txt")  # Log file for failed URLs

# Ensure output directory exists
os.makedirs(OUTPUT_DIR, exist_ok=True)

async def log_message(message, file_path=LOG_FILE):
    """Log messages to a log file and print them to the console."""
    async with aio_open(file_path, "a", encoding="utf-8") as f:
        await f.write(message + "\n")
    print(message)

async def fetch_sitemap(sitemap_url):
    """Fetch and parse sitemap.xml to extract all URLs."""
    try:
        async with aiohttp.ClientSession() as session:
            async with session.get(sitemap_url) as response:
                if response.status == 200:
                    xml_content = await response.text()
                    root = ET.fromstring(xml_content)
                    urls = [elem.text for elem in root.findall(".//{http://www.sitemaps.org/schemas/sitemap/0.9}loc")]

                    if not urls:
                        await log_message("❌ No URLs found in the sitemap.")
                    return urls
                else:
                    await log_message(f"❌ Failed to fetch sitemap: HTTP {response.status}")
                    return []
    except Exception as e:
        await log_message(f"❌ Error fetching sitemap: {str(e)}")
        return []

async def get_file_size(file_path):
    """Returns the file size in MB."""
    if os.path.exists(file_path):
        return os.path.getsize(file_path) / (1024 * 1024)  # Convert bytes to MB
    return 0

async def get_new_file_path(file_prefix, extension):
    """Generates a new file path when the current file exceeds the max size."""
    index = 1
    while True:
        file_path = os.path.join(OUTPUT_DIR, f"{file_prefix}_{index}.{extension}")
        if not os.path.exists(file_path) or await get_file_size(file_path) < MAX_FILE_SIZE_MB:
            return file_path
        index += 1

async def write_to_file(data, file_prefix, extension):
    """Writes a single JSON object as a line to a file, ensuring size limit."""
    file_path = await get_new_file_path(file_prefix, extension)
    async with aio_open(file_path, "a", encoding="utf-8") as f:
        await f.write(json.dumps(data, ensure_ascii=False) + "\n")

async def write_to_txt(data, file_prefix):
    """Writes extracted content to a TXT file while managing file size."""
    file_path = await get_new_file_path(file_prefix, "txt")
    async with aio_open(file_path, "a", encoding="utf-8") as f:
        await f.write(f"URL: {data['url']}\nTitle: {data['title']}\nContent:\n{data['content']}\n\n{'='*80}\n\n")

async def write_failed_url(url):
    """Logs failed URLs to a separate error log file."""
    async with aio_open(ERROR_LOG_FILE, "a", encoding="utf-8") as f:
        await f.write(url + "\n")

async def crawl_url(url, depth, semaphore, visited_urls, queue, total_urls, completed_urls, retry_count=0):
    """Crawls a single URL, handles retries, logs failed URLs, and extracts child links."""
    async with semaphore:
        await asyncio.sleep(REQUEST_DELAY)  # Rate limiting
        run_config = CrawlerRunConfig(
            cache_mode=CacheMode.BYPASS,
            markdown_generator=DefaultMarkdownGenerator(
                content_filter=PruningContentFilter(threshold=0.5, threshold_type="fixed")
            ),
            stream=True,
            remove_overlay_elements=True,
            exclude_social_media_links=True,
            process_iframes=True,
        )

        async with AsyncWebCrawler() as crawler:
            try:
                result = await crawler.arun(url=url, config=run_config)
                if result.success:
                    data = {
                        "url": result.url,
                        "title": result.markdown_v2.raw_markdown.split("\n")[0] if result.markdown_v2.raw_markdown else "No Title",
                        "content": result.markdown_v2.fit_markdown,
                    }

                    # Save extracted data
                    await write_to_file(data, "sitemap_data", "jsonl")
                    await write_to_txt(data, "sitemap_data")

                    completed_urls[0] += 1  # Increment completed count
                    await log_message(f"✅ {completed_urls[0]}/{total_urls} - Successfully crawled: {url}")

                    # Extract and queue child pages
                    for link in result.links.get("internal", []):
                        href = link["href"]
                        absolute_url = urljoin(url, href)  # Convert to absolute URL
                        if absolute_url not in visited_urls:
                            queue.append((absolute_url, depth + 1))
                else:
                    await log_message(f"⚠️ Failed to extract content from: {url}")

            except Exception as e:
                if retry_count < RETRY_LIMIT:
                    await log_message(f"🔄 Retrying {url} (Attempt {retry_count + 1}/{RETRY_LIMIT}) due to error: {str(e)}")
                    await crawl_url(url, depth, semaphore, visited_urls, queue, total_urls, completed_urls, retry_count + 1)
                else:
                    await log_message(f"❌ Skipping {url} after {RETRY_LIMIT} failed attempts.")
                    await write_failed_url(url)

async def crawl_sitemap_urls(urls, max_depth=MAX_DEPTH, batch_size=BATCH_SIZE):
    """Crawls all URLs from the sitemap and follows child links up to max depth."""
    if not urls:
        await log_message("❌ No URLs to crawl. Exiting.")
        return

    total_urls = len(urls)  # Total number of URLs to process
    completed_urls = [0]  # Mutable count of completed URLs
    visited_urls = set()
    queue = [(url, 0) for url in urls]
    semaphore = asyncio.Semaphore(batch_size)  # Concurrency control

    while queue:
        tasks = []
        batch = queue[:batch_size]
        queue = queue[batch_size:]

        for url, depth in batch:
            if url in visited_urls or depth >= max_depth:
                continue
            visited_urls.add(url)
            tasks.append(crawl_url(url, depth, semaphore, visited_urls, queue, total_urls, completed_urls))

        await asyncio.gather(*tasks)

async def main():
    # Clear previous logs
    async with aio_open(LOG_FILE, "w") as f:
        await f.write("")
    async with aio_open(ERROR_LOG_FILE, "w") as f:
        await f.write("")

    # Fetch URLs from the sitemap
    urls = await fetch_sitemap(SITEMAP_URL)

    if not urls:
        await log_message("❌ Exiting: No valid URLs found in the sitemap.")
        return

    await log_message(f"✅ Found {len(urls)} pages in the sitemap. Starting crawl...")

    # Start crawling
    await crawl_sitemap_urls(urls)

    await log_message(f"✅ Crawling complete! Files stored in {OUTPUT_DIR}")

# Execute
asyncio.run(main())

3698
 
 
The original post: /r/datahoarder by /u/DevilishGod on 2025-02-08 14:00:28.
3699
 
 
The original post: /r/datahoarder by /u/AyaanMAG on 2025-02-08 12:11:12.

Hello folks, I come from a third-world country and backups priced in dollars are expensive for me. I have a total of 14 TB of storage, and approximately 4 TB of that is important and irreplaceable. $99 is close to the highest I can pay per year for backups.

Backblaze B2 comes out to roughly $288/yr (12 months × 4 TB × $6 per TB per month).

Amazon S3 Glacier Deep Archive sounds somewhat appealing, but if I can't gather the exorbitant $400 or so for retrieval, I'll be locked out of my data.

Backblaze Personal Backup isn't something I want to use, for two reasons: 1] it requires a Windows client, though that would be doable if I switched away from my headless Linux server install and used Proxmox instead; 2] my other ick with it is the lack of any guarantee of privacy; I would like to use something like rclone, Borg, restic, or other software to encrypt my data before it touches their servers.

IDrive sounds like a somewhat appealing option, with third-party client support and $99 per year for 5 TB, but I've heard bad things about it regarding unreliable backups.

What can I do?
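
Putting rough yearly numbers on the options above for the ~4 TB that actually matters (the B2 and IDrive figures are the ones quoted; the Deep Archive rate of about $1 per TB-month is an assumption and covers storage only, not retrieval or egress, which is the real catch for that tier):

# Rough yearly storage cost for 4 TB of irreplaceable data.
TB = 4

backblaze_b2 = TB * 6 * 12   # $6 per TB-month, as quoted above
idrive = 99                  # flat $99/yr for a 5 TB plan, as quoted above
deep_archive = TB * 1 * 12   # assumed ~$1 per TB-month, storage only

for name, cost in [("Backblaze B2", backblaze_b2),
                   ("IDrive 5 TB plan", idrive),
                   ("S3 Glacier Deep Archive (storage only)", deep_archive)]:
    print(f"{name}: ~${cost}/yr")
# Backblaze B2: ~$288/yr
# IDrive 5 TB plan: ~$99/yr
# S3 Glacier Deep Archive (storage only): ~$48/yr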

3700
 
 
The original post: /r/datahoarder by /u/gob_spaffer on 2025-02-08 11:44:31.

I am new to data hoarding and have been wanting to archive my stuff for long-term storage. I bought a suitcase of various discs from Japan, all inorganic HTL type. I have Victor "M-DISC", Sony 128 GB 4-layer, Mitsubishi "MABL"...

Anyway, I've just started out and I'm 5 discs in, no big deal, but I realised my software was set to burn at max speed, which for these discs seems to be around 5.8x.

It was only when I inspected the Japanese packaging later that I realised it says 2-4x speed.

I'm using an ASUS BW-16D1HU-U. The discs I have burned so far have passed the verification check.

TL;DR: I burned 2-4x discs at 6x; have I fucked up? Should I re-burn those at 4x?

Does writing at a higher speed, for example, impact the laser's ability to properly etch the surface in a way that might only show up with long-term aging, compared to burning at a slower rate?
