Technology

83330 readers

3663 users here now

This is a most excellent place for technology news and articles.

Our Rules

Follow the lemmy.world rules.
Only tech related news or articles.
Be excellent to each other!
Mod approved content bots can post up to 10 articles per day.
Threads asking for personal tech support may be deleted.
Politics threads may be removed.
No memes allowed as posts, OK to post as comments.
Only approved bots from the list below, this includes using AI responses and summaries. To ask if your bot can be added please contact a mod.
Check for duplicates before posting, duplicates may be removed
Accounts 7 days and younger will have their posts automatically removed.

Approved Bots

founded 2 years ago

MODERATORS

L3s@lemmy.world

enu@lemmy.world

technopagan@lemmy.world

L4s@lemmy.world

L3s@hackingne.ws

478

Consumer hardware is no longer a priority for manufacturers (www.xda-developers.com)

submitted 1 month ago by throws_lemy@lemmy.nz to c/technology@lemmy.world

74 comments fedilink hide all child comments

you are viewing a single comment's thread
view the rest of the comments

[–] brucethemoose@lemmy.world 10 points 1 month ago* (last edited 1 month ago) (2 children)

This is not true. I have a single 3090 + 128GB CPU RAM (which wasn’t so expensive that long ago), and I can run GLM 4.6 350B at 6 tokens/sec, with measurably reasonable quantization quality. I can run sparser models like Stepfun 3.5, GLM Air or Minimax 2.1 much faster, and these are all better than the cheapest API models. I can batch Kimi Linear, Seed-OSS, Qwen3, and all sorts of models without any offloading for tons of speed.

…It’s not trivial to set up though. It’s definitely not turnkey. That’s the issue.

You can’t just do “ollama run” and expect good performance, as the local LLM scene is finicky and highly experimental. You have to compile forks and PRs, learn about sampling and chat formatting, perplexity and KL divergence, about quantization and MoEs and benchmarking. Everything is moving too fast, and is too performance sensitive, to make it that easy, unfortunately.

EDIT:

And if I were trying to get local LLMs setup today, for a lot of usage, I’d probably buy an AI Max 395 motherboard instead of a GPU. They aren’t horrendously priced, and they don’t slurp power like a 3090. 96GB VRAM is the perfect size for all those ~250B MoEs.

But if you go AMD, take all the finickiness for an Nvidia setup and multiply it by 10. You better know your way around pip and Linux, as if you don’t get it exactly right, performance will be horrendous, and many setups just won’t work anyway.

[+] melfie@lemy.lol 2 points 1 month ago* (last edited 1 week ago) (1 children)

[deleted]

[–] brucethemoose@lemmy.world 5 points 1 month ago* (last edited 1 month ago) (1 children)

I did find this calculator the other day

That calculator is total nonsense. Don't trust anything like that; at best, its obsolete the week after its posted.

I’d be hesitant to buy something just for AI that doesn’t also have RTX cores because I do a lot of Blender rendering. RDNA 5 is supposed to have more competitive RTX cores

Yeah, that's a huge caveat. AMD Blender might be better than you think though, and you can use your RTX 4060 on a Strix Halo motherboard just fine. The CPU itself is incredible for any kind of workstation workload.

along with NPU cores, so I guess my ideal would be a SoC with a ton of RAM

So far, NPUs have been useless. Don't buy any of that marketing.

I’m also not sure under 10 tokens per second will be usable, though I’ve never really tried it.

That's still 5 words/second. That's not a bad reading speed.

Whether its enough? That depends. GLM 350B without thinking is smarter than most models with thinking, so I end up with better answers faster.

But anyway, I'm get more like 20 tokens a second with models that aren't squeezed into my rig within an inch of their life. If you buy an HEDT/Server CPU with more RAM channels, it's even faster.

If you want to look into the bleeding edge, start with https://github.com/ikawrakow/ik_llama.cpp/

And all the models on huggingface with the ik tag: https://huggingface.co/models?other=ik_llama.cpp&sort=modified

You'll see instructions for running big models on a 4060 + RAM.

If you're trying to like batch process documents quickly (so no CPU offloading), look at exl3s instead: https://huggingface.co/models?num_parameters=min%3A12B%2Cmax%3A32B&sort=modified&search=exl3

And run them with this: https://github.com/theroyallab/tabbyAPI

[–] WhyJiffie@sh.itjust.works 1 points 1 month ago (1 children)

You can’t just do “ollama run” and expect good performance, as the local LLM scene is finicky and highly experimental. You have to compile forks and PRs, learn about sampling and chat formatting, perplexity and KL divergence, about quantization and MoEs and benchmarking. Everything is moving too fast, and is too performance sensitive, to make it that easy, unfortunately.

how do you have the time to figure all these out and keep being up to date? do you do this at work?

[–] brucethemoose@lemmy.world 1 points 1 month ago* (last edited 1 month ago)

As a hobby mostly, but its useful for work. I found LLMs fascinating even before the hype, when everyone was trying to get GPT-J finetunes named after Star Trek characters to run.

Reading my own quote, I was being a bit dramatic. But at the very least it is super important to grasp some basic concepts (like MoE CPU offloading, quantization, and specs of your own hardware), and watch for new releases in LocalLlama or whatever. You kinda do have to follow and test things, yes, as there's tons of FUD in open weights AI land.

As an example, stepfun 2.5 seems to be a great model for my hardware (single Nvidia GPU + 128GB CPU RAM), and it could have easily flown under the radar without following stuff. I also wouldn't know to run it with ik_llama.cpp instead of mainline llama.cpp, for a considerable speed/quality boost over (say) LM Studio.

If I were to google all this now, I'd probably still get links for setting up the Deepseek distillations from Tech Bro YouTubers. That series is now dreadfully slow and long obsolete.