this post was submitted on 09 Apr 2026
201 points (97.6% liked)

Fuck AI

6688 readers
524 users here now

"We did it, Patrick! We made a technological breakthrough!"

A place for all those who loathe AI to discuss things, post articles, and ridicule the AI hype. Proud supporter of working people. And proud booer of SXSW 2024.

AI, in this case, refers to LLMs, GPT technology, and anything listed as "AI" meant to increase market valuations.

founded 2 years ago
MODERATORS
 
you are viewing a single comment's thread
view the rest of the comments
[–] lime@feddit.nu 38 points 1 day ago (3 children)

i did my first machine learning course more than 10 years ago, so i'm not ashamed to admit that i bought beefier hardware to play around with local models in early 2023. i still like doing that. mostly because i know my gpu is powered entirely off of fossil-free energy and because i decided early on not to spew the output all over the internet unless it was poignant. or funny. not as in "the llm told a good joke", more as in "i compressed this poor thing to fit on a cd and now it can only talk about dolphins".

qwen3.5-12B really screams along on a 7900xtx. like, up to 70-100 tokens a second. perfect for seeing the results of your torture methods quickly.

[–] cecilkorik@piefed.ca 9 points 1 day ago (1 children)

gemma4 is also pretty amazing (both fast and unbelievably capable for its seemingly-small size) on modest hardware. TurboQuant seems like a really, really promising technique and I hope we'll start seeing the open source community developing it into something even more useful to keep democratizing the capabilities of this technology so we can all have access to the best and highest forms of it.

[–] lime@feddit.nu 1 points 13 hours ago

not tried gemma yet, i've stayed away from google stuff. maybe i'll give it a shot.

[–] turbofan211@lemmy.world 6 points 1 day ago (2 children)
[–] lime@feddit.nu 4 points 13 hours ago

one of my most recent fun activities came from discovering the "allow editing" button in koboldcpp. since the model is fed the entire conversation so far as its only context, and doesn't save data between iterations, you can basically re-write its memory on the fly. i knew this before but i'd never though to do it until there was an easy ui option for it, and it turned out to be a lot of fun, because when using a "thinking" model like qwen3.5 you can convince it that it's bypassing its own censorship.

basically you give the model a prompt to work off of, pause it in the middle of the thinking process, change previous thoughts to something it's been trained to filter out (like sex or violence or opinions critical of the ccp), and it will start second-guessing itself. sometimes it gets stuck in a loop, sometimes it overcomes the contradiction (at which point you can jump in again and tweak its memory some more) and sometimes it gets tied up in knots trying to prove a negative.

a previous experiment was about feeding stable diffusion images back into itself to see what happens. i was inspired by a talk at 37c3 where they demonstrated model collapse by repeatedly trying to generate the same image as they put in (i think this was how sora worked).

[–] cecilkorik@piefed.ca 1 points 21 hours ago

ollama is a comfortable starting point for most people. They offer cloud models but they're opt-in only and it's really primarily for local models. Nothing about the ecosystem is what you'd really call "stable", I suspect a lot of it is vibe coded (even the drivers and documentation for all these tools is sketchy AF, even when its coming from large companies, nvidia's documentation is atrocious) and you will often have really frustrating bugs and crashes and incompatibilities. It's a bit of a janky mess, if you're lucky it works out of the box but be cautious with it, it can be a bit fragile. Welcome to the brave new world.

ollama has a bunch of models you can direct download from them and run from the CLI with a super basic text interface, but where it starts to get more powerful is with the ollama server running on localhost:11434, it mimics OpenAI's API and can be connected to almost any other LLM frontend which makes it much more useful. OpenWebUI is a common one, I prefer LibreChat which is similar but the latest version has terrible context compression/summarization which has ruined it for me, they're both a bit frustratingly janky (again high likelihood of vibe coding throughout). Huggingface.co is the main source for user-created models, which are often optimized for particular usages and often have the models baked-in restrictions removed (look for keywords like abliterated, uncensored, heretic) although this can some with some loss in quality it may or may not be noticeable. The best format to look for is GGUF (in my experience) and you can use quantizations to get smaller versions of large models, named for the number of bits used, like Q4 (the numbers and letters after that are details not really worth diving into) means 4 bits, instead of the default 16 bits. That's a lot of quality loss numerically but it turns out to be not that bad in practice and it shrinks the memory usage (and download size) a lot. Q4 is generally considered a pretty reasonable and safe target for local usage and causes very little compromise in model quality while making it a lot more usable and faster. Other techniques like imatrix are useful for even smaller quants with smaller bits but also more quality loss.

The next step into the rabbithole is agentic tool-use harnesses. These are things that allow the AI to actually use tools (ranging from editing files or running command line tools or doing web-searches or far more complex things through things like MCP servers or "skills" which seem to be a somewhat newer and potentially better alternative). Harnesses range from extremely minimal, carefully controlled and directly manageable harnesses like pi/pi-mono, to the absolutely batshit crazy (do not recommend, terrifying) OpenClaw. There are more middle-ground harnesses like opencode (similar to open source claude code) or hermes. You can also redirect many commercial harnesses like claude, codex, cursor, etc onto your local Ollama either partially (to reduce API token usage) or completely (for privacy and control) How it works is, you just point them at your localhost:11434 ollama API and they talk to the models by providing all the context and prompts they need and if the model supports tool-calling it will send those command requests back to the harness and the harness will see them and run them. To be clear, the model is still just text. It's not "doing" anything. It's requesting the harness to do it. It spits out and receives specially formatted and tagged text for tools (which it has been specifically trained on) to teach it how to use tools and how to interpret the results. In my experience, commercial tool use harnesses like claude and codex are tightly coupled to their company's models and don't do a great job with open source/open weights models, so I don't really recommend wasting your time with them for local use.

For generative images or video ComfyUI seems to be the go-to runner and Civitai.com seems to be the main source for models and these are a different kind of models (stable diffusion rather than LLM) and I don't have a lot of experience with generative images and think any potential uses are even less ethically defensible than LLMs in a lot of ways, so I generally avoid them.

Hopefully that isn't too confusing or overwhelming and gives you at least a summary, starting point and many keywords to look into. I don't know everything about this topic and I'm still learning myself, this is what I have learned so far and some of it might be wrong. Good luck!

[–] TropicalDingdong@lemmy.world 1 points 21 hours ago (1 children)
[–] lime@feddit.nu 2 points 13 hours ago

yeah one of those framework machines with 128GB shared ram would have been amazing. shame they're sending money to racists.