At risk of getting more technical, ik_llama.cpp has a good built-in web UI:
https://github.com/ikawrakow/ik_llama.cpp/
Getting more technical, it's also way better than ollama. You can run much smarter models on the same hardware than ollama can.
For reference, I'm running GLM-4 (667 GB of raw weights) on a single RTX 3090/Ryzen gaming rig, at reading speed, with pretty low quantization distortion.
And if you want a 'look this up on the internet for me' assistant (which you need for them to be truly useful), you need another docker project as well.
...That's just how LLM self-hosting is now. It's simply too hardware-intensive and ad hoc to be easy and smart and cheap. You can indeed host a small 'default' LLM without much tinkering, but it's going to be pretty dumb, and pretty slow on ollama defaults.
Ollama does have some features that make it easier to use for a first-time user, including:
- Automatically calculating how many layers fit in VRAM, loading that many, and splitting the model between main memory/CPU and VRAM/GPU. llama.cpp can't do that automatically yet.
- Automatically unloading the model from VRAM after a period of inactivity.
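To make the first point concrete, here is a rough sketch of the kind of arithmetic involved when picking the layer count by hand. It is not Ollama's actual algorithm: it assumes every layer is the same size, and the `reserve_gib` headroom for KV cache and buffers is a guessed placeholder.

```python
def layers_that_fit(vram_gib, n_layers, model_size_gib, reserve_gib=1.5):
    """Rough count of layers that fit in VRAM.

    Assumes all layers are equally sized and that `reserve_gib`
    (a guessed value) covers KV cache, buffers, and the desktop.
    """
    per_layer_gib = model_size_gib / n_layers
    usable_gib = max(vram_gib - reserve_gib, 0.0)
    return min(n_layers, int(usable_gib // per_layer_gib))


# Hypothetical numbers: an 18 GiB quantized model with 48 layers on an 11 GiB card.
print(layers_that_fit(vram_gib=11, n_layers=48, model_size_gib=18))  # 25
```

With llama.cpp you would then pass a number like that to `-ngl` / `--n-gpu-layers` yourself and adjust it if you run out of memory.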
I had an easier time setting up ollama than other stuff, and OP does apparently already have it set up.
Yeah. But it also messes stuff up from the llama.cpp baseline, hides or doesn't support some features/optimizations, and definitely doesn't support the more efficient iq_k quants of ik_llama.cpp or its specialized MoE offloading.
And that's not even getting into the various controversies around ollama (like broken GGUFs or indications they're going closed source in some form).
...It just depends on how much performance you want to squeeze out, and how much time you want to spend on the endeavor. Small LLMs are kinda marginal though, so IMO it's important if you really want to try; otherwise one is probably better off spending a few bucks on an API that doesn't log requests.
Bet. Looking into that now. Thanks!
I believe I have 11 GB of VRAM, so I should be good to run decent models from what I’ve been told by the other AIs.
In case I miss your reply, assuming a 3080 + 64 GB of RAM, you want the IQ4_KSS (or IQ3_KS, for more RAM for tabs and stuff) version of this:
https://huggingface.co/ubergarm/GLM-4.5-Air-GGUF
Part of it will run on your GPU, part will live in system RAM, but ik_llama.cpp does the quantization split and GPU offloading in a particularly efficient way for these kinds of 'MoE' models. Follow the instructions on that page.
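As a rough illustration of that GPU/RAM split (my understanding of the general idea, not ik_llama.cpp's actual code): the big, sparsely used expert tensors stay in system RAM, and everything else goes to the GPU. The tensor names and the regex below are illustrative assumptions modeled on typical GGUF naming; the model page has the real settings.

```python
import re


def place_tensors(names, cpu_pattern=r"ffn_.*_exps"):
    """Split tensor names into GPU and CPU groups.

    Tensors matching `cpu_pattern` (assumed here to be the MoE expert
    weights) stay in system RAM; everything else goes to the GPU.
    """
    rx = re.compile(cpu_pattern)
    placement = {"gpu": [], "cpu": []}
    for name in names:
        placement["cpu" if rx.search(name) else "gpu"].append(name)
    return placement


# Illustrative tensor names, not read from a real model file.
example = [
    "blk.0.attn_q.weight",
    "blk.0.ffn_gate_inp.weight",
    "blk.0.ffn_up_exps.weight",
    "blk.0.ffn_down_exps.weight",
]
print(place_tensors(example))
```

The point is that only a few experts are active per token, so leaving them in RAM costs less speed than offloading whole layers would.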
If you 'only' have 32GB RAM or less, that's trickier, and the next question is what kind of speeds you want. But it's probably best to wait a few days and see how Qwen3 80B looks when it comes out. Or just go with the IQ4_K version of this: https://huggingface.co/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF
And you don't strictly need the hyper-optimization of ik_llama.cpp for a small model like Qwen3 30B. Something easier like LM Studio or the llama.cpp docker image would be fine.
Alternatively, you could try to squeeze Gemma 27B into that 11GB VRAM, but it would be tight.
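For a feel of why that's tight, here's back-of-the-envelope math for weight size at a given quantization. The bits-per-weight values are round illustrative numbers, not the exact figures for any particular quant, and this ignores KV cache and context entirely.

```python
def weight_size_gb(params_billion, bits_per_weight):
    """Approximate weight size in GB (decimal): parameters x bits / 8."""
    return params_billion * bits_per_weight / 8


def fits(budget_gb, params_billion, bits_per_weight, overhead_gb=1.0):
    """True if the weights plus a guessed fixed overhead fit in the budget."""
    return weight_size_gb(params_billion, bits_per_weight) + overhead_gb <= budget_gb


print(weight_size_gb(27, 4.5))  # 15.1875 -> far too big for 11 GB
print(weight_size_gb(27, 3.0))  # 10.125  -> fits only with almost no headroom
print(fits(11, 27, 3.0))        # False once 1 GB of overhead is added
```

So a 27B model only gets near 11 GB at around 3 bits per weight, and that's before any room for context.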
How much system RAM, and what kind? DDR5?
ik doesn't have great documentation, so it'd be a lot easier for me to just point you places, heh.