This post was submitted on 13 Jan 2026
656 points (98.7% liked)

Selfhosted


Reddit's API is effectively dead for archival. Third-party apps are gone. Reddit has threatened to cut off access to the Pushshift dataset multiple times. But 3.28 TB of Reddit history exists as a torrent right now, and I built a tool to turn it into something you can browse on your own hardware.

The key point: This doesn't touch Reddit's servers. Ever. Download the Pushshift dataset, run my tool locally, get a fully browsable archive. Works on an air-gapped machine. Works on a Raspberry Pi serving your LAN. Works on a USB drive you hand to someone.

What it does: Takes compressed data dumps from Reddit (.zst), Voat (SQL), and Ruqqus (.7z) and generates static HTML. No JavaScript, no external requests, no tracking. Open index.html and browse. Want search? Run the optional Docker stack with PostgreSQL – still entirely on your machine.
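
For the .zst path, a minimal sketch of the general idea (not the project's actual code) might look like the following. It assumes the dumps' usual one-JSON-object-per-line layout, the third-party zstandard and Jinja2 packages, and the typical Pushshift file naming; the template is a placeholder.

```python
# Sketch only: stream a Pushshift .zst dump (one JSON object per line)
# and write a static HTML page per post.
import io
import json
from pathlib import Path

import jinja2      # pip install Jinja2
import zstandard   # pip install zstandard

TEMPLATE = jinja2.Template(
    "<html><body><h1>{{ p.title }}</h1>"
    "<p>by {{ p.author }} | score {{ p.score }}</p>"
    "<div>{{ p.selftext }}</div></body></html>"
)

def render_dump(dump_path: str, out_dir: str) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Pushshift dumps need a large decompression window (long-distance matching).
    dctx = zstandard.ZstdDecompressor(max_window_size=2**31)
    with open(dump_path, "rb") as fh:
        lines = io.TextIOWrapper(dctx.stream_reader(fh), encoding="utf-8")
        for line in lines:
            post = json.loads(line)
            (out / f"{post['id']}.html").write_text(
                TEMPLATE.render(p=post), encoding="utf-8"
            )

if __name__ == "__main__":
    render_dump("RS_2024-12.zst", "archive/")
```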

API & AI Integration: Full REST API with 30+ endpoints – posts, comments, users, subreddits, full-text search, aggregations. Also ships with an MCP server (29 tools) so you can query your archive directly from AI tools.

Self-hosting options:

  • USB drive / local folder (just open the HTML files)
  • Home server on your LAN
  • Tor hidden service (2 commands, no port forwarding needed)
  • VPS with HTTPS
  • GitHub Pages for small archives

Why this matters: Once you have the data, you own it. No API keys, no rate limits, no ToS changes can take it away.

Scale: Tens of millions of posts per instance. PostgreSQL backend keeps memory constant regardless of dataset size. For the full 2.38B post dataset, run multiple instances by topic.
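
One standard way to keep memory flat as the table grows, and a plausible reading of that claim, is a server-side cursor in PostgreSQL. A sketch with assumed connection, table, and column names, not the project's actual code:

```python
# A named cursor in psycopg2 is server-side, so rows stream in batches and
# memory stays constant no matter how large the result set is.
import psycopg2  # pip install psycopg2-binary

conn = psycopg2.connect("dbname=archive user=archive host=localhost")
with conn, conn.cursor(name="posts_stream") as cur:
    cur.itersize = 10_000  # rows fetched per round trip
    cur.execute("SELECT id, title, score FROM posts ORDER BY created_utc")
    for post_id, title, score in cur:
        pass  # render or index each row; only one batch is ever held in memory
```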

How I built it: Python, PostgreSQL, Jinja2 templates, Docker. Used Claude Code throughout as an experiment in AI-assisted development. Learned that the workflow is "trust but verify" – it accelerates the boring parts but you still own the architecture.

Live demo: https://online-archives.github.io/redd-archiver-example/ GitHub: https://github.com/19-84/redd-archiver (Public Domain)

Pushshift torrent: https://academictorrents.com/details/1614740ac8c94505e4ecb9d88be8bed7b6afddd4

top 50 comments
[–] ICastFist@programming.dev 5 points 18 hours ago (1 children)

What's the size difference when you remove the porn stuff from the torrent?

[–] spicehoarder@lemmy.zip 11 points 16 hours ago

Willing to bet a 90% size reduction

[–] Butterphinger@lemmy.zip 5 points 1 day ago

grabs external

[–] inspxtr@lemmy.world 3 points 22 hours ago

Very cool! Do you know how your project compares with Arctic Shift? For those more interested in research with Reddit data, is there a benefit of one over the other?

[–] Mubelotix@jlai.lu 1 points 22 hours ago

I do not consent to this

[–] offspec@lemmy.world 54 points 2 days ago (3 children)

It would be neat for someone to migrate this data set to a Lemmy instance

[–] JackbyDev@programming.dev 3 points 18 hours ago

Lemmit already existed and was annoying as hell. It was the first account I remember blocking.

[–] TeddE@lemmy.world 22 points 1 day ago (2 children)

It would be inviting a lawsuit for sure. I like the essence of the idea, but it's probably more trouble than it's worth for all but the most fanatic.

[–] floquant@lemmy.dbzer0.com 9 points 1 day ago* (last edited 1 day ago) (1 children)

Is it though? That is (or was, and should be again) publicly accessible information that was created over the years by random internet users. I reject the notion that an American company can "own it" just because they ran the servers. Sure, they can hold copyright for their frontend and backend code, name and whatever. But posts and comments, no way.

Of course it would be dumb for someone under US jurisdiction, but we'll see how much an international DMCA claim is worth given the current relations anyway.

[–] TeddE@lemmy.world 3 points 1 day ago (1 children)

They don't own it; the individual posters own the content of their own posts. However, from the Reddit terms of service:

When Your Content is created with or submitted to the Services, you grant us a worldwide, royalty-free, perpetual, irrevocable, non-exclusive, transferable, and sublicensable license to use, copy, modify, adapt, prepare derivative works of, distribute, store, perform, and display Your Content and any name, username, voice, or likeness provided in connection with Your Content in all media formats and channels now known or later developed anywhere in the world. This license includes the right for us to make Your Content available for syndication, broadcast, distribution, or publication by other companies, organizations, or individuals who partner with Reddit.

And with each of those rights granted, Reddit's lawyers can defend those rights. So no, they don't own it "just because they ran the servers" - they own specific rights to copy granted to them by each poster.

(I don't like this arrangement, but ignorance of the terms of service isn't going to help someone who uploads a full copy of works Reddit holds extensive rights to.) On this subject, I think there needs to be an extensive overhaul to narrow what terms you can extend to the general public. The problem is I straight up don't trust anyone currently in power to make such a change with our interests in mind.

[–] Mavytan@feddit.nl 3 points 22 hours ago (1 children)

I'm not at all familiar with legalese, but wouldn't 'non-exclusive' in that statement mean that you, and others permitted by you, can redistribute the content as you see fit? Meaning that copying and redistributing reddit content doesn't necessarily violate reddit's terms of service but does violate the user's copyright?

[–] tatterdemalion@programming.dev 3 points 19 hours ago

Yeah so at worst you could get sued by some random reddit users that don't want their post history hosted on your site.

Given how little traction artists and authors have had with suing AI companies for blatant copyright infringement, I kinda doubt it would go anywhere.

[–] Olgratin_Magmatoe@slrpnk.net 6 points 1 day ago* (last edited 1 day ago) (3 children)

Might be easiest to set up an instance in a country that doesn't give a fuck about western IP law, then others can federate to it.

So yeah, fanatic levels of effort.

[–] fennesz12@feddit.dk 6 points 1 day ago (1 children)

Brb, setting up a Lemmy server in Red Star OS

[–] MonkeMischief@lemmy.today 5 points 1 day ago* (last edited 1 day ago) (1 children)
[–] A_Random_Idiot@lemmy.world 3 points 1 day ago

The chances are pretty high that it's probably Kim's computer, aren't they?

[–] floquant@lemmy.dbzer0.com 3 points 1 day ago* (last edited 1 day ago) (2 children)

Post and comments are not Reddit's IP anyway :3

[–] Buddahriffic@lemmy.world 2 points 1 day ago* (last edited 1 day ago)

They might have set up the user agreement for it. Stack Exchange did, and their whole business model was about catching businesses where some worker had copy/pasted code from a Stack Exchange answer, then getting a settlement out of it.

I agree with you in principle (hell, I'd even take it further and think only trademarks should be protected, other than maybe a short period for copyright and patent protection, like a few years), but the legal system might disagree.

Edit: I'd also make trademarks non-transferable and have them apply to individuals rather than corporations, so they can go back to representing quality rather than business decisions. Especially when some new entity that never had any relation to the original trademark holder just throws some money at them or their estate to buy the trust associated with the trademark.

load more comments (1 replies)
[–] 19_84@lemmy.dbzer0.com 3 points 1 day ago

This is one reason I support Tor deployment out of the box 😋

[–] cyberpunk007@lemmy.ca 10 points 1 day ago

Now this is a good idea.

[–] vane@lemmy.world 5 points 1 day ago* (last edited 1 day ago) (1 children)

How long does it take to download this 3 TB torrent?

[–] HugeNerd@lemmy.ca -2 points 17 hours ago

Boring. I want the Kuro5hin site. That was actually good and hysterically funny at the best times. ASCII reenactment players of Michael Crawford anyone?

[–] lautan@lemmy.ca 12 points 1 day ago

Thanks. This is great for mining data and urls.

[–] breakingcups@lemmy.world 121 points 2 days ago (2 children)

Just so you're aware, it is very noticeable that you also used AI to help write this post, and its use of language can throw a lot of people off.

Not to detract from your project, which looks cool!

[–] 19_84@lemmy.dbzer0.com 138 points 2 days ago (18 children)

Yes I used AI, English is not my first language. Thank you for the kind words!

load more comments (18 replies)
load more comments (1 replies)
[–] a1studmuffin@aussie.zone 52 points 2 days ago

This seems especially handy for anyone who wants a snapshot of Reddit from the pre-enshittification, pre-AI era, when content was more authentic and less driven by bots and commercial manipulation of opinion. Just choose the cutoff date you want and stick with that dataset.

So kinda like Kiwix, but for Reddit. That is so cool.

[–] frongt@lemmy.zip 54 points 2 days ago (4 children)

And only a 3.28 TB database? Oh, because it's compressed. Includes comments too, though.

load more comments (4 replies)
[–] BigDiction@lemmy.world 15 points 2 days ago

You should be very proud of this project!! Thank you for sharing.

[–] Tiger@sh.itjust.works 26 points 2 days ago (3 children)

What is the time range of the dataset? Up through which date does it go?

[–] 19_84@lemmy.dbzer0.com 41 points 2 days ago (1 children)

2005-06 to 2024-12

However, the data for 2025-12 has already been released; it just needs to be split and reprocessed for 2025 by Watchful1. Once that happens, you can host an archive up to the end of 2025. I will probably add support for importing data from the Arctic Shift dumps instead, so that archives can be updated monthly.

load more comments (1 replies)
load more comments (2 replies)