this post was submitted on 13 Jan 2026
656 points (98.7% liked)

Selfhosted


Reddit's API is effectively dead for archival. Third-party apps are gone. Reddit has threatened to cut off access to the Pushshift dataset multiple times. But 3.28TB of Reddit history exists as a torrent right now, and I built a tool to turn it into something you can browse on your own hardware.

The key point: This doesn't touch Reddit's servers. Ever. Download the Pushshift dataset, run my tool locally, get a fully browsable archive. Works on an air-gapped machine. Works on a Raspberry Pi serving your LAN. Works on a USB drive you hand to someone.

What it does: Takes compressed data dumps from Reddit (.zst), Voat (SQL), and Ruqqus (.7z) and generates static HTML. No JavaScript, no external requests, no tracking. Open index.html and browse. Want search? Run the optional Docker stack with PostgreSQL – still entirely on your machine.
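A minimal sketch of the static-generation idea, not the tool's actual code. It assumes the dump has already been decompressed to newline-delimited JSON (the layout Pushshift uses inside its .zst files) and that each record has `id`, `title`, `author`, and `selftext` fields. The names `render_post` and `build_site` are hypothetical, chosen only for illustration.

```python
import html
import json
from pathlib import Path

# Plain HTML template: no scripts, no external resources.
PAGE = """<!doctype html>
<html><head><meta charset="utf-8"><title>{title}</title></head>
<body><h1>{title}</h1><p>by {author}</p><pre>{body}</pre></body></html>
"""


def render_post(post: dict) -> str:
    """Render one post as a standalone HTML page, escaping all user text."""
    return PAGE.format(
        title=html.escape(post.get("title", "")),
        author=html.escape(post.get("author", "[deleted]")),
        body=html.escape(post.get("selftext", "")),
    )


def build_site(ndjson_lines, out_dir) -> list:
    """Write one HTML file per post plus an index.html that links to them."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    links = []
    for line in ndjson_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the dump
        post = json.loads(line)
        name = f"{post['id']}.html"
        (out / name).write_text(render_post(post), encoding="utf-8")
        title = html.escape(post.get("title", ""))
        links.append(f'<li><a href="{name}">{title}</a></li>')
    index = (
        "<!doctype html>\n<html><body><ul>\n"
        + "\n".join(links)
        + "\n</ul></body></html>\n"
    )
    (out / "index.html").write_text(index, encoding="utf-8")
    return links
```

Calling `build_site(open("posts.ndjson", encoding="utf-8"), "site")` would produce a folder that can be opened straight from disk; the input file name here is a placeholder.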

API & AI Integration: Full REST API with 30+ endpoints – posts, comments, users, subreddits, full-text search, aggregations. Also ships with an MCP server (29 tools) so you can query your archive directly from AI tools.

Self-hosting options:

  • USB drive / local folder (just open the HTML files)
  • Home server on your LAN
  • Tor hidden service (2 commands, no port forwarding needed)
  • VPS with HTTPS
  • GitHub Pages for small archives

Why this matters: Once you have the data, you own it. No API keys, no rate limits, no ToS changes can take it away.

Scale: Tens of millions of posts per instance. PostgreSQL backend keeps memory constant regardless of dataset size. For the full 2.38B post dataset, run multiple instances by topic.
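A rough sketch of why a database backend can keep memory flat: records are streamed in fixed-size batches, so only one batch is ever held in memory. This uses Python's built-in `sqlite3` purely as a stand-in for PostgreSQL so it runs anywhere. The table layout and the names `batched` and `load_posts` are assumptions for illustration, not the tool's schema.

```python
import itertools
import json
import sqlite3


def batched(iterable, size):
    """Yield lists of at most `size` items, one batch at a time."""
    it = iter(iterable)
    while True:
        batch = list(itertools.islice(it, size))
        if not batch:
            return
        yield batch


def load_posts(conn, ndjson_lines, batch_size=1000):
    """Stream NDJSON records into the database in fixed-size batches."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts "
        "(id TEXT PRIMARY KEY, subreddit TEXT, title TEXT, score INTEGER)"
    )
    total = 0
    # Generator expression: lines are parsed lazily, never all at once.
    records = (json.loads(line) for line in ndjson_lines if line.strip())
    for batch in batched(records, batch_size):
        conn.executemany(
            "INSERT OR REPLACE INTO posts VALUES (?, ?, ?, ?)",
            [
                (p["id"], p.get("subreddit"), p.get("title"), p.get("score", 0))
                for p in batch
            ],
        )
        conn.commit()
        total += len(batch)
    return total
```

Passing an open file handle as `ndjson_lines` keeps the whole pipeline lazy, so peak memory depends on `batch_size` rather than on the size of the dump.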

How I built it: Python, PostgreSQL, Jinja2 templates, Docker. Used Claude Code throughout as an experiment in AI-assisted development. Learned that the workflow is "trust but verify" – it accelerates the boring parts but you still own the architecture.

Live demo: https://online-archives.github.io/redd-archiver-example/

GitHub: https://github.com/19-84/redd-archiver (Public Domain)

Pushshift torrent: https://academictorrents.com/details/1614740ac8c94505e4ecb9d88be8bed7b6afddd4

(page 2) 50 comments
[–] SteveCC@lemmy.world 36 points 2 days ago (1 children)

Wow, great idea. So much useful information and discussion that users have contributed. Looking forward to checking this out.

[–] 19_84@lemmy.dbzer0.com 9 points 2 days ago

thank you!!! i built on great ideas from others! i can't take all the credit 😋

[–] tanisnikana@lemmy.world 29 points 2 days ago (1 children)

Reddit is hot stinky garbage but can be useful for stuff like technical support and home maintenance.

Voat and Ruqqus are straight-up misinformation and fascist propaganda, and if you excise them from your data set, your data will dramatically improve.

[–] 19_84@lemmy.dbzer0.com 15 points 2 days ago

the great part is that since everything is already built, it is easy to support additional data! there is even an issue template to submit a new data source! https://github.com/19-84/redd-archiver/blob/main/.github/ISSUE_TEMPLATE/submit-data-source.yml

[–] 19_84@lemmy.dbzer0.com 22 points 2 days ago (2 children)

PLEASE SHARE ON REDDIT!!! I have never had a reddit account and they will NOT let me post about this!!

[–] elbarto777@lemmy.world 10 points 2 days ago* (last edited 2 days ago)

Anyone doing this will be banned on that platform.

[–] Bazell@lemmy.zip 5 points 2 days ago* (last edited 2 days ago)

We can't share this on Reddit, but we can share it on other platforms. Basically, what you have done is scrape tons of data for AI learning. Something like "create your own AI Redditor". And greedy Reddit management will dislike it very much, even if you tell them that this is for cultural heritage. Your work is great anyway. Sadly, I do not have enough free space to download and store all this data.

[–] MedicPigBabySaver@lemmy.world 19 points 2 days ago (1 children)

Fuck Reddit and Fuck Spez.

[–] muusemuuse@sh.itjust.works 6 points 2 days ago (1 children)

You know what would be a good way to do it? Take all that content and throw it on a federated service like ours. Publicly visible. No bullshit. And no reason to visit Reddit to get that content. Take their traffic away.

[–] elbarto777@lemmy.world 3 points 2 days ago (1 children)

Where would it be hosted so that Conde Nast lawyers can't touch it?

[–] muusemuuse@sh.itjust.works 1 points 1 day ago (3 children)

What would they say? It’s information that’s freely available, no payment required, no accounts to simply read it, no copyrights. Where’s the legal issue in hosting a duplicate of the content?

[–] limelight79@lemmy.world 1 points 1 day ago (1 children)

It might fall under the same concept that recipes do - you can't copyright a recipe, but a collection of recipes (such as a book) is copyrightable.

In any case, they have a lot more money to pay lawyers than you or I do, I'll bet, so even if you are right, that doesn't mean you'll have the money to actually win.

[–] avidamoeba@lemmy.ca 9 points 2 days ago (1 children)

How does this compare to redarc? It seems to be similar.

[–] 19_84@lemmy.dbzer0.com 14 points 2 days ago (1 children)

redarc uses reactjs to serve the web app, while redd-archiver uses a hybrid architecture that combines static page generation with postgres search via flask. it is more like a static site generator with web app capabilities through docker and flask. the static pages with sorted indexes can be viewed offline and served on hosts like github and codeberg pages.
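
To make the "sorted indexes" part concrete, here is a small sketch of how a static generator could pre-sort posts and split them into index pages, so that no server is needed to sort at view time. It is an illustration under assumptions, not redd-archiver's code: `paginate_sorted`, the `score` field, and the `index-N.html` naming are all hypothetical.

```python
import math


def paginate_sorted(posts, key="score", per_page=100):
    """Sort posts by `key` (highest first) and split them into index pages.

    Returns a list of (filename, posts_on_page) tuples; page 1 is index.html.
    """
    ordered = sorted(posts, key=lambda p: p.get(key, 0), reverse=True)
    # Always emit at least one page, even for an empty archive.
    page_count = max(1, math.ceil(len(ordered) / per_page))
    pages = []
    for n in range(page_count):
        name = "index.html" if n == 0 else f"index-{n + 1}.html"
        pages.append((name, ordered[n * per_page:(n + 1) * per_page]))
    return pages
```

Each tuple can then be rendered to a plain HTML file with prev/next links, which is what lets the result work on static hosts.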

[–] avidamoeba@lemmy.ca 1 points 1 day ago (1 children)

Is there a difference in how much storage space is needed between the two approaches?

[–] 19_84@lemmy.dbzer0.com 2 points 1 day ago

redd-archiver will take up more disk space because the database exists along with the static html

[–] K3can@lemmy.radio 1 points 1 day ago (1 children)

Can anyone figure out what the minimum process is to just use the SSG function? I'm having a really hard time trying to understand the documentation.

[–] 19_84@lemmy.dbzer0.com 1 points 1 day ago (2 children)

did you check the quickstart?

[–] Howlinghowler110th@kbin.earth 6 points 2 days ago (1 children)

I think this is a good use case for AI, and I'm impressed with it. I wish the instructions were clearer about how to set it up, though.

[–] 19_84@lemmy.dbzer0.com 8 points 2 days ago

thank you! the instructions are a little overwhelming, check out the quickstart if you haven't yet! https://github.com/19-84/redd-archiver/blob/main/QUICKSTART.md
