
Self-host Reddit – 2.38B posts, works offline, yours forever (github.com)

submitted 2 months ago by 19_84@lemmy.dbzer0.com to c/selfhosted@lemmy.world

113 comments

Reddit's API is effectively dead for archival. Third-party apps are gone. Reddit has threatened to cut off access to the Pushshift dataset multiple times. But 3.28TB of Reddit history exists as a torrent right now, and I built a tool to turn it into something you can browse on your own hardware.

The key point: This doesn't touch Reddit's servers. Ever. Download the Pushshift dataset, run my tool locally, get a fully browsable archive. Works on an air-gapped machine. Works on a Raspberry Pi serving your LAN. Works on a USB drive you hand to someone.

What it does: Takes compressed data dumps from Reddit (.zst), Voat (SQL), and Ruqqus (.7z) and generates static HTML. No JavaScript, no external requests, no tracking. Open index.html and browse. Want search? Run the optional Docker stack with PostgreSQL – still entirely on your machine.
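To make the dump-to-static-HTML idea concrete, here is a minimal sketch. It is not the project's actual code: it assumes the dump has already been decompressed into newline-delimited JSON, the field names `title`, `author`, and `selftext` are assumptions about the record layout, and it uses plain string formatting from the standard library rather than Jinja2.

```python
import html
import json
from pathlib import Path

# Bare-bones page template: no JavaScript, no external requests.
PAGE = """<!doctype html>
<html><head><meta charset="utf-8"><title>{title}</title></head>
<body><h1>{title}</h1><p>by {author}</p><p>{body}</p></body></html>
"""


def render_post(record: dict) -> str:
    """Render one post record as a standalone HTML page."""
    # Field names are assumed for illustration; escape everything user-supplied.
    return PAGE.format(
        title=html.escape(record.get("title", "(untitled)")),
        author=html.escape(record.get("author", "[deleted]")),
        body=html.escape(record.get("selftext", "")),
    )


def build_site(ndjson_lines, out_dir: Path) -> int:
    """Write one HTML file per post plus an index.html; return the post count."""
    out_dir.mkdir(parents=True, exist_ok=True)
    links = []
    for n, line in enumerate(ndjson_lines):
        line = line.strip()
        if not line:
            continue  # skip blank lines in the dump
        record = json.loads(line)
        name = f"post-{n}.html"
        (out_dir / name).write_text(render_post(record), encoding="utf-8")
        label = html.escape(record.get("title", name))
        links.append(f'<li><a href="{name}">{label}</a></li>')
    index = (
        "<!doctype html>\n<html><body><ul>\n"
        + "\n".join(links)
        + "\n</ul></body></html>\n"
    )
    (out_dir / "index.html").write_text(index, encoding="utf-8")
    return len(links)
```

Because `build_site` accepts any iterable of lines, an open file handle can be passed in directly and the input is read one record at a time instead of being loaded whole.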

API & AI Integration: Full REST API with 30+ endpoints – posts, comments, users, subreddits, full-text search, aggregations. Also ships with an MCP server (29 tools) so you can query your archive directly from AI tools.

Self-hosting options:

USB drive / local folder (just open the HTML files)
Home server on your LAN
Tor hidden service (2 commands, no port forwarding needed)
VPS with HTTPS
GitHub Pages for small archives
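For the home-server option, any static file server should do, since the output is plain HTML. As a rough sketch using only Python's standard library (the function name `serve_archive` and the port are illustrative choices, not part of the tool):

```python
import functools
import http.server
import threading


def serve_archive(directory: str, host: str = "0.0.0.0", port: int = 8080):
    """Serve a folder of static HTML over HTTP and return the running server."""
    # Bind the handler to the archive folder instead of the current directory.
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    server = http.server.ThreadingHTTPServer((host, port), handler)
    # Run in a daemon thread so the caller decides how long to keep it alive.
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling `serve_archive("path/to/generated/html")` and keeping the process alive makes the archive reachable from other machines on the LAN at the host's IP on port 8080; `server.shutdown()` stops it. This is plain HTTP with no authentication, so it is only suitable for a trusted network.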

Why this matters: Once you have the data, you own it. No API keys, no rate limits, no ToS changes can take it away.

Scale: Tens of millions of posts per instance. PostgreSQL backend keeps memory constant regardless of dataset size. For the full 2.38B post dataset, run multiple instances by topic.

How I built it: Python, PostgreSQL, Jinja2 templates, Docker. Used Claude Code throughout as an experiment in AI-assisted development. Learned that the workflow is "trust but verify" – it accelerates the boring parts but you still own the architecture.

Live demo: https://online-archives.github.io/redd-archiver-example/

GitHub: https://github.com/19-84/redd-archiver (Public Domain)

Pushshift torrent: https://academictorrents.com/details/1614740ac8c94505e4ecb9d88be8bed7b6afddd4

(page 2) 50 comments


inspxtr@lemmy.world 3 points 2 months ago

Very cool! Do you know how your project compares with Arctic Shift? For those more interested in research with Reddit data, is there a benefit to one over the other?

K3can@lemmy.radio 1 points 2 months ago (1 children)

Can anyone figure out what the minimum process is to just use the SSG function? I'm having a really hard time trying to understand the documentation.

19_84@lemmy.dbzer0.com 1 points 2 months ago (1 children)

did you check the quickstart?

K3can@lemmy.radio 1 points 2 months ago (1 children)

Yes, both the standalone quickstart and the quickstart section of the readme (which differ from each other).

Is it possible to get the static sites without spinning up a DB backend?


Seefoo@lemmy.world 1 points 2 months ago

Does this decompress the files up front and leave them decompressed? Or does it only decompress as a post/subreddit is accessed? Basically, I am wondering what kind of storage footprint would be required to search through this.
