this post was submitted on 09 Aug 2025
13 points (88.2% liked)

LocalLLaMA

3503 readers
203 users here now

Welcome to LocalLLaMA! Here we discuss running and developing machine learning models at home. Let's explore cutting-edge open-source neural network technology together.

Get support from the community! Ask questions, share prompts, discuss benchmarks, get hyped about the latest and greatest model releases! Enjoy talking about our awesome hobby.

As ambassadors of the self-hosting machine learning community, we strive to support each other and share our enthusiasm in a positive, constructive way.

Rules:

Rule 1 - No harassment or personal character attacks on community members. I.e. no name-calling, no generalizing about entire groups of people who make up our community, no baseless personal insults.

Rule 2 - No comparing artificial intelligence/machine learning models to cryptocurrency. I.e. no comparing the usefulness of models to that of NFTs, no claiming the resource usage required to train a model is anything close to maintaining a blockchain or mining crypto, no implying it's just a fad/bubble that will leave people with nothing of value when it bursts.

Rule 3 - No comparing artificial intelligence/machine learning to simple text prediction algorithms. I.e. no statements such as "LLMs are basically just simple text predictors like what your phone keyboard autocorrect uses, and they're still using the same algorithms since <over 10 years ago>."

Rule 4 - No implying that models are devoid of purpose or potential for enriching people's lives.

founded 2 years ago

Total noob to this space, correct me if I'm wrong. I'm looking at getting new hardware for inference and I'm open to AMD, NVIDIA or even Apple Silicon.

It feels like consumer hardware comparatively gives you more value generating images than trying to run chatbots. Like, the models you can run at home are just dumb to talk to. But they can generate images of comparable quality to online services if you're willing to wait a bit longer.

Like, GPT-OSS 120B, assuming you can spare 80GB of memory, is still not GPT-5. But Flux Schnell is still Flux Schnell, right? So if diffusion is the thing, NVIDIA wins right now.

Other options might even be better for other uses, but chatbots are comparatively hard to justify. Maybe for more specific cases like code completion with zero latency or building a voice assistant, I guess.

Am I too off the mark?

top 13 comments
[–] Toes@ani.social 4 points 22 hours ago (1 children)

Framework has an AI machine on the market.

I haven't used it myself but perhaps it's worth looking into for your project.

https://frame.work/gb/en/desktop?tab=machine-learning

[–] rkd@sh.itjust.works 3 points 21 hours ago (2 children)

I'm aware of it; seems cool. But I don't think AMD fully supports the ML data types used in diffusion, so it's slower than NVIDIA.

[–] domi@lemmy.secnd.me 1 points 13 hours ago* (last edited 13 hours ago) (1 children)

Slower? Yes. But the alternative to a Framework Desktop for home use is a 30-40k Nvidia GPU, so I'm fine with slow.

Not to mention that it is more than fast enough for common use cases: https://github.com/geerlingguy/ollama-benchmark/issues/21#issuecomment-3164570956

[–] rkd@sh.itjust.works 1 points 10 hours ago

For image generation, you don't need that much memory. That's the trade-off, I believe. Get an NVIDIA card with 16GB of VRAM to run Flux, and have something like 96GB of RAM for GPT-OSS 120B. Or you give up on fast image generation and go with the AMD Ryzen AI Max+ 395 like you said, or Apple Silicon.

[–] Toes@ani.social 2 points 21 hours ago (1 children)

I wonder if that's a limitation of Mesa?

Could it be possible with AMDVLK?

[–] pepperfree@sh.itjust.works 3 points 11 hours ago

Lots of developers chose to write in CUDA, as ROCm support back then was a mess.

[–] TheLeadenSea@sh.itjust.works 5 points 1 day ago

I run a 14B model that is not too dumb, and it's definitely worth having as an offline local backup. I also use my NVIDIA 4080 with 16GB VRAM for image and video generation of adequate quality. I'd still say you get better quality from the closed models in some areas, and many open models require far too much VRAM for consumer hardware, but in general all local use cases work well locally, just a bit worse than closed online models. Except voice, which can be just as good.

[–] pepperfree@sh.itjust.works 2 points 21 hours ago (1 children)

There is a koboldcpp-rocm fork. KoboldCpp itself has basic image generation. https://github.com/YellowRoseCx/koboldcpp-rocm

[–] tal@lemmy.today 3 points 17 hours ago* (last edited 16 hours ago) (1 children)

I'm not sure I follow.

What KoboldCpp does is call out to an external generator for the images. It isn't doing the image generation computation itself.

Like, you create a prompt and hand it to KoboldCpp. It then computes a textual response. Part of that response is a prompt intended for an image-generation model. It feeds that prompt to something like Stable Diffusion or ComfyUI, and that does the image generation. It then takes the output and displays it inline with the text it's generated for you in the KoboldCpp web page. You run both the image generator and KoboldCpp side by side.

What OP is complaining about is that he feels consumer hardware (by which you're probably talking GPUs with up to about 24 GB of VRAM) doesn't have enough memory to run large LLM models, to have a chatbot on par with what typical cloud-based services are running. He is okay with the image generation side.

Llama.cpp can split a model across multiple GPUs. In theory, you can run quad 4090s or quad 7900 XTXs. Each of those has 24GB. The 7900 XTXs are maybe $1k each. I'm pretty sure that the 4090s used to go for $1.5k-$2k, but it looks like they're currently about $3k on Amazon. So $4k for the AMD route, and $12k for the Nvidia route.

For the 7900s, that's about 1420 watts, disregarding the rest of the system. For the 4090s, 1800W. A standard US household circuit is 15A or 20A at 120V, so 1800W or 2400W, which means in the US you're probably running close to circuit limitations. There are apparently some computer systems that use dual PSUs that can feed off multiple circuits. You'd need a power supply capable of feeding that, and given that this is considerably more heat than a lot of space heaters, probably cooling too.

That'll get you 96GB of VRAM (assuming, as is possible with llama.cpp, that your problem is one that can be split across multiple GPUs). Whether or not that's reasonable consumer hardware may be debatable, but unless you start going to dedicated AI compute hardware, which costs more, that's about what you have to work with.
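If you do go the multi-GPU route, llama.cpp exposes the split as a tensor-split option. A minimal sketch (not something I've benchmarked), assuming four 24GB cards, the llama-cpp-python bindings, and a hypothetical local GGUF path:

```python
# Minimal sketch: splitting one model across four GPUs with llama-cpp-python.
# Assumes a CUDA- or ROCm-enabled build; the model path is hypothetical.
from llama_cpp import Llama

llm = Llama(
    model_path="models/gpt-oss-120b-Q4_K_M.gguf",   # hypothetical path
    n_gpu_layers=-1,                                # offload all layers to GPU
    tensor_split=[0.25, 0.25, 0.25, 0.25],          # spread evenly across 4 cards
    n_ctx=8192,
)

out = llm("Summarize the VRAM vs. quality trade-off in one paragraph.",
          max_tokens=256)
print(out["choices"][0]["text"])
```

The same knobs (--n-gpu-layers, --tensor-split) exist on the llama-cli/llama-server command line if you'd rather not go through Python.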

There's also the approach some people have taken of using machines with unified memory to get more memory for the GPU. OP mentioned Apple hardware (like a Mac Studio, at up to 192GB), and I mentioned the AMD AI Max machines (at up to 128GB) that Framework is selling. Those probably aren't going to be able to crunch as quickly as dedicated hardware for your given problem, but they're a way to get parallel compute hardware with a lot of memory for less money.

Running cloud-based will save money if whatever you're doing doesn't keep your hardware constantly in use, since otherwise you're paying for idle hardware. That is, it's definitely going to be cheaper to run on shared hardware if your use case is intermittent, like a chatbot.

Llama.cpp does support clustering multiple machines (dunno about for training). I have not done this myself, and if you're thinking about buying hardware to do that, I'd probably look into whether what software you want to run can actually run in that kind of environment, and what kind of performance penalties you're looking at, but it's one possibility.

But I think the short answer to "can I locally run hardware that can do what the bleeding edge of cloud-based services are doing" is "yes, but not inexpensively". You are going to need to accept paying more for hardware than you would if you were sharing the cost with other people, performance tradeoffs if you're going to have less-beefy hardware doing the crunching, or quality-of-output tradeoffs if you want to use smaller models or otherwise limit memory usage.

I personally find running smaller models than GPT-5 locally to have value. But...OP might not; for him, the quality tradeoff might not be acceptable. I also don't mind things running more slowly, but that might not be an acceptable tradeoff for OP. I am not willing to pay for the kind of hardware that cloud-based commercial AI services are using to do AI compute, but it is possible to get that (well, barring any availability issues) if one throws enough money at it.

It's also possible to use something like vast.ai to rent remote cloud-based parallel compute hardware, if one is comfortable with trusting remote hardware and hoping that whoever is running that hardware isn't snorfling up data from one's compute job (I'd guess probably not, but one never knows, once one's data goes out into the broader world). That's not local, but it might be preferable to using ChatGPT or whatever service, which will be logging user chats.

[–] pepperfree@sh.itjust.works 2 points 12 hours ago* (last edited 11 hours ago) (1 children)

No, you can run SD and Flux-based models inside KoboldCpp. You can try it out using the original KoboldCpp in Google Colab. It loads GGUF models. Related discussion on Reddit: https://www.reddit.com/r/StableDiffusion/comments/1gsdygl/koboldcpp_now_supports_generating_images_locally/

Edit: Sorry, I kinda missed the point; maybe I was sleepy when writing that comment. Yeah, I agree that LLMs need a lot of memory to run, which is one of their downsides. I remember someone doing a comparison showing that an API with token-based pricing is cheaper than running it locally. But running image generation locally is cheaper than an API with step+megapixel pricing.

[–] tal@lemmy.today 2 points 10 hours ago

No, you can run SD and Flux-based models inside KoboldCpp.

Hmm. That does look like it. But I have a build from git within the last two weeks or so, and the only backend image generation options it lists are AI Horde, KCPP/Forge/A1111, OpenAI/DALL-E, ComfyUI, and Pollinations.ai.

Maybe there's some compile-time option that needs to be set to build it in.

investigates

Hmm. I guess it just always has the embedded support active, and that's what "KCPP" is for; you need to configure it at http://localhost:5001/sdui, and then I guess you presumably choose "KCPP/Forge/A1111" as the endpoint. Still not clear where one plonks in the model, but it clearly is there. Sorry about that!
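For anyone else poking at this, the embedded generator should also be reachable over KoboldCpp's A1111-style HTTP API. A rough sketch, assuming the default port 5001 and that the /sdapi/v1/txt2img endpoint is enabled in your build (I haven't verified it on every version):

```python
# Rough sketch: requesting an image from KoboldCpp's A1111-compatible endpoint.
# Port and endpoint path are assumptions based on a default local setup.
import base64
import requests

payload = {
    "prompt": "a watercolor llama sitting at a desk next to a GPU",
    "negative_prompt": "blurry, low quality",
    "steps": 20,
    "width": 512,
    "height": 512,
}

resp = requests.post("http://localhost:5001/sdapi/v1/txt2img",
                     json=payload, timeout=300)
resp.raise_for_status()

# A1111-style APIs return base64-encoded images in the "images" list.
with open("output.png", "wb") as f:
    f.write(base64.b64decode(resp.json()["images"][0]))
```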

[–] tal@lemmy.today 2 points 22 hours ago* (last edited 22 hours ago) (1 children)

It feels like consumer hardware comparatively gives you more value generating images than trying to run chatbots

While I personally get more use out of the hardware that way, you're also posting this to an LLM community. You're probably going to get people who do use LLMs more.

I also don't think that "image diffusion models small, LLM models large" is likely some sort of constant (I'm sure that the image generation people can make use of larger models), and the hardware is going to be a moving target. Those Framework Desktop machines have up to 128GB of unified memory, for example.

[–] rkd@sh.itjust.works 1 points 10 hours ago

That's a good point, though it seems there are several ways to make models fit in hardware with less memory. But there aren't many options to compensate for not having the ML data types that allow NVIDIA to be something like 8x faster sometimes.