LocalLLaMA

3450 readers
19 users here now

Welcome to LocalLLaMA! Here we discuss running and developing machine learning models at home. Let's explore cutting-edge open-source neural network technology together.

Get support from the community! Ask questions, share prompts, discuss benchmarks, get hyped at the latest and greatest model releases! Enjoy talking about our awesome hobby.

As ambassadors of the self-hosting machine learning community, we strive to support each other and share our enthusiasm in a positive constructive way.

Rules:

Rule 1 - No harassment or personal character attacks of community members, i.e. no name-calling, no generalizing entire groups of people that make up our community, no baseless personal insults.

Rule 2 - No comparing artificial intelligence/machine learning models to cryptocurrency, i.e. no comparing the usefulness of models to that of NFTs, no claiming the resource usage required to train a model is anything close to that of maintaining a blockchain or mining crypto, no implying it's just a fad/bubble that will leave people with nothing of value when it bursts.

Rule 3 - No comparing artificial intelligence/machine learning to simple text prediction algorithms, i.e. statements such as "LLMs are basically just simple text prediction like what your phone keyboard autocorrect uses, and they're still using the same algorithms from <over 10 years ago>."

Rule 4 - No implying that models are devoid of purpose or potential for enriching people's lives.

founded 2 years ago
MODERATORS
1
 
 

Built on Qwen, these models incorporate our latest advances in post-training techniques. MindLink demonstrates strong performance across various common benchmarks and is widely applicable in diverse AI scenarios.

72B 32B

2
3
4
20
submitted 3 days ago* (last edited 3 days ago) by Eyekaytee@aussie.zone to c/localllama@sh.itjust.works
 
 

GLM-4.5-Air is the lightweight variant of our latest flagship model family, also purpose-built for agent-centric applications. Like GLM-4.5, it adopts the Mixture-of-Experts (MoE) architecture but with a more compact parameter size. GLM-4.5-Air also supports hybrid inference modes, offering a "thinking mode" for advanced reasoning and tool use, and a "non-thinking mode" for real-time interaction. Users can control the reasoning behaviour with the reasoning-enabled boolean. Learn more in our docs.

Blog post: https://z.ai/blog/glm-4.5

Hugging Face:

https://huggingface.co/zai-org/GLM-4.5

https://huggingface.co/zai-org/GLM-4.5-Air
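For anyone wiring this into an existing client, here is a minimal sketch of toggling the thinking behaviour per request. It assumes an OpenAI-compatible chat endpoint, and the shape and name of the "reasoning" field are assumptions taken from the announcement's "reasoning-enabled boolean" wording; check the linked docs for the real parameter name and placement.

```python
# Minimal sketch: toggling GLM-4.5-Air's thinking mode per request.
# ASSUMPTIONS: endpoint URL, model id, and the exact "reasoning" field shape are
# placeholders based on the post's wording; consult the official docs for the real names.
import json
import urllib.request

payload = {
    "model": "glm-4.5-air",  # placeholder model id
    "messages": [{"role": "user", "content": "Summarize the MoE architecture in two sentences."}],
    "reasoning": {"enabled": False},  # hypothetical flag: non-thinking mode for low-latency replies
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",  # placeholder local endpoint
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["message"]["content"])
```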

5
 
 

I've been following the work that went into this video for a couple of months and have grown to love Level1Techs.

Check out their forum and especially Ubergarm

6
 
 

Mistral AI bets on transparency by making its environmental impact public. The French artificial intelligence startup, together with Ademe and Carbone 4, has published a study on the impact, particularly the CO₂ emissions, of training and using its models.

7
20
submitted 1 week ago* (last edited 1 week ago) by Smokeydope@lemmy.world to c/localllama@sh.itjust.works
8
 
 

What’s new in Le Chat.

Deep Research mode: Lightning fast, structured research reports on even the most complex topics.

Voice mode: Talk to Le Chat instead of typing with our new Voxtral model.

Natively multilingual reasoning: Tap into thoughtful answers, powered by our reasoning model — Magistral.

Projects: Organize your conversations into context-rich folders.

Advanced image editing directly in Le Chat, in partnership with Black Forest Labs.

9
 
 
10
11
 
 

cross-posted from: https://ani.social/post/16779655

| GPU | VRAM | Price (€) | Bandwidth (TB/s) | TFLOP16 | €/GB | €/TB/s | €/TFLOP16 |
|---|---|---|---|---|---|---|---|
| NVIDIA H200 NVL | 141 GB | 36284 | 4.89 | 1671 | 257 | 7423 | 21 |
| NVIDIA RTX PRO 6000 Blackwell | 96 GB | 8450 | 1.79 | 126.0 | 88 | 4720 | 67 |
| NVIDIA RTX 5090 | 32 GB | 2299 | 1.79 | 104.8 | 71 | 1284 | 22 |
| AMD RADEON 9070XT | 16 GB | 665 | 0.6446 | 97.32 | 41 | 1031 | 7 |
| AMD RADEON 9070 | 16 GB | 619 | 0.6446 | 72.25 | 38 | 960 | 8.5 |
| AMD RADEON 9060XT | 16 GB | 382 | 0.3223 | 51.28 | 23 | 1186 | 7.45 |

This post is part "hear me out" and part asking for advice.

Looking at the table above, AI GPUs look like a pure scam, and it would make much more sense (at least on paper) to use gaming GPUs instead, either through a Frankenstein build of PCIe switches or a high-bandwidth network.

So my question is whether anybody has built a similar setup and what their experience has been: what the expected overhead/performance hit is, and whether it can be made up for by simply having way more raw performance for the same price.
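If it helps to sanity-check the numbers, the derived columns appear to be nothing more than price divided by each spec. A small sketch using the RTX 5090 row from the table above:

```python
# Sketch: reproducing the table's derived columns (price divided by each spec),
# using the RTX 5090 row (2299 EUR, 32 GB, 1.79 TB/s, 104.8 TFLOP16) as the example.
price_eur, vram_gb, bandwidth_tbs, tflop16 = 2299, 32, 1.79, 104.8

print(f"EUR/GB:      {price_eur / vram_gb:.1f}")        # ~71.8, table lists 71
print(f"EUR/(TB/s):  {price_eur / bandwidth_tbs:.1f}")  # ~1284.4, table lists 1284
print(f"EUR/TFLOP16: {price_eur / tflop16:.1f}")        # ~21.9, table lists 22
```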

12
13
14
 
 

In brief

  • In late summer 2025, a publicly developed large language model (LLM) will be released — co-created by researchers at EPFL, ETH Zurich, and the Swiss National Supercomputing Centre (CSCS).

  • This LLM will be fully open: This openness is designed to support broad adoption and foster innovation across science, society, and industry.

  • A defining feature of the model is its multilingual fluency in over 1,000 languages.

15
 
 

Recently I've been experimenting with Claude and feeling the burn of the premium API usage. I wanted to know how much cheaper my local LLM is in terms of cost per output token.

Claude Sonnet is a good reference at $15 per 1 million output tokens, so I wanted to know how many tokens $15 worth of electricity powering my rig would generate by comparison.

(These calculations cover simple raw token generation only; in the real world there are costs for the initial hardware, ongoing maintenance as parts fail, and the human time to set everything up, which are much harder to factor into the equation.)

So how does one even calculate such a thing? Well, you need to know:

  1. how many watts your inference rig consumes at load
  2. how many tokens per second it can generate on average while inferencing (with the context relatively filled up; we want conservative estimates)
  3. the electricity rate you pay on your utility bill, in cost per kilowatt-hour

Once you have those constants you can extrapolate how many kilowatt-hours of runtime $15 of electricity buys, then figure out the total number of tokens you would expect to generate over that time given the TPS. A rough sketch of the math is below.
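Here is a minimal sketch of that calculation. The constants are made-up placeholders, not the measurements from the screenshot; plug in your own wall draw, generation speed, and electricity rate.

```python
# Minimal sketch: local cost-per-token from wall power, generation speed, and electricity rate.
# The constants are placeholders, not measurements from the post.
WATTS_AT_LOAD = 250        # wall draw of the inference rig while generating (W)
TOKENS_PER_SECOND = 20.0   # conservative average TPS with context mostly filled
PRICE_PER_KWH = 0.15       # utility rate ($ per kWh)
BUDGET = 15.0              # same $15 used for the Claude Sonnet comparison

kwh_bought = BUDGET / PRICE_PER_KWH                     # kWh that $15 buys
hours_of_runtime = kwh_bought / (WATTS_AT_LOAD / 1000)  # hours the rig can run on that
total_tokens = hours_of_runtime * 3600 * TOKENS_PER_SECOND
cost_per_million = BUDGET / (total_tokens / 1_000_000)

print(f"{kwh_bought:.0f} kWh -> {hours_of_runtime:.0f} h of generation")
print(f"~{total_tokens / 1e6:.1f}M tokens for ${BUDGET:.0f} (${cost_per_million:.2f} per 1M tokens)")
```

With these placeholder numbers that works out to roughly 28.8M tokens for $15, i.e. around $0.52 per million tokens; your own constants will move that figure a lot.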

The numbers shown in the screenshot are for a model fully loaded into VRAM on the ol' 1070 Ti 8GB. But even with the partially offloaded numbers for 22-32B models at 1-3 TPS, it's still a better deal overall.

I plan to offer the calculator as a tool on my site and release it under a copyleft license like the GPL if anyone is interested.

16
 
 

I have an unused Dell OptiPlex 7010 I wanted to use as the base for an inference rig.

My idea was to get a 3060, a PCIe riser, and a 500W power supply just for the GPU. Mechanically speaking, I had the idea of making a backpack of sorts on the side panel to fit both the GPU and the extra power supply, since unfortunately it's an SFF machine.

What's making me wary of going through with it is the specs of the 7010 itself: it's a DDR3 system with a 3rd-gen i7-3770. I have the feeling that as soon as it ends up offloading some of the model into system RAM it's going to slow down to a crawl. (Using koboldcpp, if that matters.)

Do you think it's even worth going through with?

Edit: I may have found a ThinkCentre that uses DDR4 and that I can buy if I manage to sell the 7010. Though I still don't know if it will be good enough.

17
18
26
Homelab upgrade WIP (infosec.pub)
submitted 1 month ago* (last edited 1 month ago) by Smokeydope@lemmy.world to c/localllama@sh.itjust.works
 
 

There's a lot more to this stuff than I thought there would be when starting out. I spent the day familiarizing myself with how to take apart my PC and swap GPUs, trying to piece everything together.

Apparently, in order for the PC to start up properly it needs a working display output. I thought the existence of an HDMI port on the motherboard implied the existence of onboard graphics, but apparently only certain CPUs have that capability, and my Ryzen 5 2600 doesn't. The Tesla P100 has no display output either, so I've hit a snag where the PC won't start up because it can't find a graphical output.

I'm going to try running multiple GPUs together over PCIe. Hoping I can mix an AMD RX 580 and an NVIDIA Tesla on the same board, fingers crossed, please work.

My motherboard thankfully supports 4x4x4x4 PCIe x16 bifurcation, which is a very lucky break I didn't know about going into this 🙏

Strangely, other configs for splitting the 16 lanes, like 8x8 or 8x4x4, aren't in my BIOS for some reason? So I'm planning to get a 4x bifurcation board, plug both cards in, and hope that the AMD one is recognized!

According to one source, the performance loss from running GPUs on 4x lanes for the kind of compute I'm doing is 10-15%, which is surprisingly tolerable actually.

I never really had to think about how PCIe lanes work or how to allocate them properly before.
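Once the cards are in, a quick way to confirm what link each device actually negotiated is to read the PCIe attributes Linux exposes under sysfs. A small sketch, assuming a Linux host with the standard current_link_width/current_link_speed attributes:

```python
# Sketch: print the negotiated PCIe link width and speed for every PCI device,
# handy for checking that a GPU really came up at x4 after bifurcation.
from pathlib import Path

for dev in sorted(Path("/sys/bus/pci/devices").iterdir()):
    try:
        width = (dev / "current_link_width").read_text().strip()
        speed = (dev / "current_link_speed").read_text().strip()
    except OSError:
        continue  # not every device exposes link attributes
    print(f"{dev.name}: x{width} @ {speed}")
```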

For now I'm using two power supplies: the one built into the desktop and the new Corsair 850e PSU. I chose that one because it should handle 2-3 GPUs while staying in my price range.

Also, the new 12V-2x6 port supports around 600W, enough for the Tesla, and it comes with a dual PCIe splitter, which was required for the Tesla's power cable adapter. So it all worked out nicely for a clean wiring solution.

Sadly I fucked up a little. The plastic PCIe release latch on the motherboard was brittle, and I fat-thumbed it too hard while having trouble removing the GPU initially, so it snapped off. I don't know if that's something fixable. Fortunately it doesn't seem to affect the security of the connection too badly. I intend to get a PCIe riser extension cable so there won't be much force on the now slightly loosened PCIe connection. I'll have the GPU and bifurcation board laid out nicely on the homelab table while testing, and get them mounted somewhere properly once I get it all working.

I need to figure out an external GPU mount system. I see people use server racks or nut-and-bolt metal chassis. I could get a thin plate of copper the size of the desktop's glass window as a base/heatsink?

19
 
 

I've recently been writing fiction and using an AI as a critic/editor to help me tighten things up (as I'm not a particularly skilled prose writer myself). Currently I've been trying two approaches: writing text in a basic editor and then either saving files to attach to a hosted LLM or copy-pasting into a local one, or using PyCharm and AI integration plugins for it.

Neither is particularly satisfactory, and I'm wondering if anyone knows of a good setup for this (preferably open source, but not necessary); integration with at least one of Ollama or OpenRouter would be needed.
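In the meantime, a bare-bones way to get drafts in front of a local model without an IDE plugin is to hit the Ollama HTTP API directly from a script. A minimal sketch, assuming Ollama is running on its default port; the model name and file path are placeholders:

```python
# Sketch: send a draft chapter to a local Ollama server for an editing pass.
# Assumes the default Ollama endpoint (localhost:11434); model name and file are placeholders.
import json
import urllib.request

draft = open("chapter01.txt", encoding="utf-8").read()
payload = {
    "model": "llama3.1",  # placeholder; use whatever model you have pulled
    "prompt": "You are a fiction editor. Point out weak prose, pacing issues, "
              "and awkward phrasing in the following chapter:\n\n" + draft,
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```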

Edit: Thanks for the recommendations everyone, lots of things for me to check out when I get the time!

20
26
submitted 1 month ago* (last edited 1 month ago) by splendoruranium to c/localllama@sh.itjust.works
 
 

I'm looking to locally generate voiceovers from text and also to try generating audiobooks. Does anyone have experience with sherpa-onnx? There also appear to be two separate frontends for Kokoro specifically dedicated to audiobook creation, but they both appear to be abandoned. Or am I barking up the completely wrong tree?
Thanks!

21
 
 

It seems Mistral finally released their own version of Small 3.1 2503 with CoT reasoning patterns baked in. Before this, the best CoT finetune of Small was DeepHermes with DeepSeek's R1 distill patterns. According to the technical report, Mistral trained their own reasoning patterns for this one, so it's not just another DeepSeek distill finetune.

HuggingFace

Blog

Magistral technical report

22
 
 

I'm limited to 24GB of VRAM, and I need pretty large context for my use-case (20k+). I tried "Qwen3-14B-GGUF:Q6_K_XL," but it doesn't seem to like calling tools more than a couple times, no matter how I prompt it.

Tried using "SuperThoughts-CoT-14B-16k-o1-QwQ-i1-GGUF:Q6_K" and "DeepSeek-R1-Distill-Qwen-14B-GGUF:Q6_K_L," but Ollama or LangGraph gives me an error saying these don't support tool calling.

23
 
 
  • It seems like it'll be the best local model that can be run fast if you have a lot of RAM and a medium amount of VRAM.
  • It uses a shared expert (like DeepSeek and Llama 4), so it'll be even faster on partially offloaded setups.
  • There are a ton of options for fine-tuning or training from one of their many partially trained checkpoints.
  • I'm hoping for a good reasoning finetune. Hoping Nous does it.
  • It has a unique voice because it has very little synthetic data in it.

llama.cpp support is in the works, and hopefully won't take too long since its architecture is reused from other models llama.cpp already supports.

Are y'all as excited as I am? Also is there any other upcoming release that you're excited for?

24
15
submitted 1 month ago* (last edited 1 month ago) by HumanPerson@sh.itjust.works to c/localllama@sh.itjust.works
 
 

I just set up a new dedicated AI server that is quite fast by my standards. I have it running with OpenWebUI and would like to integrate it with other services. I think it would be cool to have something like Copilot, where I can be writing code in a text editor and have it add a README or a function or something like that. I have also used some RAG stuff and like it, but I think it would be cool to have a RAG setup that can access live data, like having the most up-to-date docker-compose file and nginx configs for when I ask it about server stuff. So, what are you integrating your AI stuff with, and how can I get started?
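For the live-data idea, one low-tech starting point is to read the current config off disk and inject it into the prompt at request time, against whatever OpenAI-compatible endpoint the server exposes. A rough sketch, assuming an OpenAI-compatible /v1/chat/completions endpoint (llama.cpp's server, vLLM, and Ollama's compatibility layer all provide one); host, port, and model name are placeholders:

```python
# Sketch: "live data" context injection - read the current nginx config at request
# time and hand it to a local model via an OpenAI-compatible chat endpoint.
import json
import urllib.request

config = open("/etc/nginx/nginx.conf", encoding="utf-8").read()
payload = {
    "model": "local-model",  # placeholder model name
    "messages": [
        {"role": "system",
         "content": "Answer questions using this current nginx config:\n\n" + config},
        {"role": "user", "content": "Which ports is nginx listening on?"},
    ],
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",  # placeholder local endpoint
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

A fuller setup would regenerate or re-embed these documents on a schedule, but stuffing the live file into the system prompt is often enough for a handful of configs.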

25
 
 

Hello. Our community, c/localllama, has always been and continues to be a safe haven for those who wish to learn about the creation and local usage of 'artificial intelligence' machine learning models to enrich their daily lives and provide a fun hobby to dabble in. We come together to apply this new computational technology in ways that protect our privacy, and to build upon a collective effort to better understand how it can help humanity as an open source technology stack.

Unfortunately, we have been receiving an uptick in negative interactions from those outside our community recently. This is largely due to the current political tensions caused by our association with the popular and powerful tech companies who pioneered modern machine learning models for business and profit, as well as unsavory tech-bro individuals who care more about money than ethics. These companies and individuals continue to create animosity toward the entire field of machine learning, and everyone associated with it, through their illegal stealing of private data to train base models and very real threats to disrupt the economy by destroying jobs through automation.

There are legitimate criticisms to be had: the cost of creating models, how the art they produce is devoid of the soulful touch of human creativity, and how corporations are attempting to disrupt lives for profit instead of enriching them.

I did not want to be heavy-handed with censorship/mod actions prior to this post, because I believe that echo chambers are bad and genuine understanding requires discussion between multiple conflicting perspectives.

However, a lot of the negative comments we receive lately aren't made in good faith, with valid criticisms of the corporations or technologies grounded in an intimate understanding of them. No, instead it's base-level mudslinging by people with emotionally charged vendettas making nasty comments of no substance. Common examples are comparing models to NFTs, name-calling our community members as blind zealots for thinking models could ever be used to help people, and spreading misinformation with cherry-picked, unreliable sources to manipulatively exaggerate environmental impact and resource consumption.

While I am against echo chambers, I am also against our community being harassed and dragged down by bad actors who just don't understand what we do or how this works. You guys shouldn't have to be subjected to the same brain-rot antagonism with every post made here.

So I'm updating the guidelines by adding some rules I intend to enforce. I'm still debating whether or not to retroactively remove infringing comments from previous posts, but rest assured any new posts and comments will be moderated according to the following guidelines.

RULES:

Rule: No harassment or personal character attacks of community members, i.e. no name-calling, no generalizing entire groups of people that make up our community, no baseless personal insults.

Reason: More or less self-explanatory; personal character attacks and childish mudslinging against community members are toxic.

Rule: No comparing artificial intelligence/machine learning models to cryptocurrency, i.e. no comparing the usefulness of models to that of NFTs, no claiming the resource usage required to train a model is anything close to that of maintaining a blockchain or mining crypto, no implying it's just a fad/bubble that will leave people with nothing of value when it bursts.

Reason: This is a piss-poor whataboutism argument. It claims something that is blatantly untrue while attempting to discredit the entire field by stapling the animosity everyone has toward crypto/NFTs onto ML. Models already do more than cryptocurrency ever has. Models can generate text, pictures, and audio. Models can view/read/hear text, pictures, and audio. Models may simulate aspects of cognitive thought patterns to attempt to speculate or reason through a given problem. Once they are trained they can be copied and locally hosted for many thousands of years, which factors into the initial-energy-cost versus power-consumed-over-time equation.

Rule: No comparing artificial intelligence/machine learning to simple text prediction algorithms, i.e. statements such as "LLMs are basically just simple text prediction like what your phone keyboard autocorrect uses, and they're still using the same algorithms from <over 10 years ago>."

Reason: There are grains of truth to the reductionist statement that LLMs rely on mathematical statistics and probability for their outputs. The same can be said for humans, the statistical patterns in our own language, and the way our neurons come together to predict the next word in the sentence we type out. It's the intricate complexity of the process and the way information is processed that makes all the difference. ML models build on an entire college course worth of advanced mathematics and STEM concepts: high-dimensional matrices that map the relationships between pieces of information, and intricate hidden layers of perceptrons connecting billions of parameters into vast abstraction mappings. There were also major innovations and discoveries made in the 2000s, which we didn't have in the early days of computing, that made modern model training possible. All of that is a little more complicated than what your phone's autocorrect does, and the people who make the lazy reductionist comparison just don't care about the nuances.

Rule: No implying that models are devoid of purpose or potential for enriching people's lives.

Reason: Models are tools with great potential for helping people, from accessibility software for the disabled to enabling doctors to better heal the sick through advanced medical diagnostic techniques. The perceived harm models are capable of causing, such as job displacement, is rooted in our flawed late-stage capitalist society's pressure for increased profit margins at the expense of everyone and everything.

If you have any proposals for rule additions or wording changes, I will hear you out in the comments. Thank you for choosing to browse and contribute to this space.
