LocalLLaMA

Qwen3-Next-80B-A3B Thinking and Instruct land in llama.cpp (lemmy.ca)

submitted 2 weeks ago* (last edited 2 weeks ago) by panda_abyss@lemmy.ca to c/localllama@sh.itjust.works

Unsloth has quants for both

This was a great day to check the news. I also saw that vLLM has just added support for Strix Halo.

[–] Dran_Arcana@lemmy.world 7 points 2 weeks ago (2 children)

That is correct, but you might be missing why this is useful. MoE models are great for CPU inference, which is considerably cheaper than GPU inference at scale. The Qwen 30B-A3B MoE and 8B dense models were widely considered similar in quality. If you have the VRAM, the 8B will be faster. If you don't, the 30B will be faster, as long as you have the ~19-22 GB of RAM it requires.

A very inexpensive used server with lots of memory channels but no GPU can do very cost-efficient inference in this scenario, and loads of people are asking for this.
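
To make the reasoning above concrete, here is a rough back-of-the-envelope sketch. It assumes token generation is limited purely by memory bandwidth, so each token costs one read of the active weights. The function names, the 4.5 bits per weight, and the 100 GB/s bandwidth figure are illustrative assumptions, not measurements, and the result is an upper bound that ignores compute, KV cache reads, and runtime overhead.

```python
def weights_gb(total_params_billion, bits_per_weight):
    """Approximate size of the quantized weights in GB (weights only)."""
    return total_params_billion * bits_per_weight / 8


def decode_tokens_per_sec(active_params_billion, bits_per_weight, bandwidth_gb_s):
    """Upper-bound decode speed if every token reads the active weights once."""
    bytes_per_token = active_params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Assumed numbers: ~4.5 bits per weight, 100 GB/s of memory bandwidth.
BITS = 4.5
BANDWIDTH_GB_S = 100

# A 30B MoE with 3B active parameters vs. an 8B dense model on the same memory.
moe_speed = decode_tokens_per_sec(3, BITS, BANDWIDTH_GB_S)
dense_speed = decode_tokens_per_sec(8, BITS, BANDWIDTH_GB_S)

print(f"30B-A3B MoE: {weights_gb(30, BITS):.1f} GB weights, <= {moe_speed:.0f} tok/s")
print(f"8B dense:    {weights_gb(8, BITS):.1f} GB weights, <= {dense_speed:.0f} tok/s")
```

Under these assumptions the MoE needs several times more memory to hold, but reads fewer bytes per token than the dense model, which is why lots of cheap RAM with decent bandwidth suits it.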

[–] SmokeyDope@piefed.social 2 points 2 weeks ago

Fantastic explanation, thank you

[–] avidamoeba@lemmy.ca 1 points 2 weeks ago (1 children)

So not an AMD AM5 dual-channel system. 😅

[–] Dran_Arcana@lemmy.world 2 points 2 weeks ago* (last edited 2 weeks ago)

Fast DDR5 or Strix Halo would probably be passable for a patient single user, but yeah, not really the target audience here for sure.
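
For a sense of why channel count matters here, a minimal sketch of theoretical peak memory bandwidth: total bus width in bytes times the transfer rate. The two configurations below (dual-channel DDR5-6000 and an 8-channel DDR4-3200 server) are assumed examples, each channel is taken as 64 bits wide, and real sustained bandwidth will be lower than these peaks.

```python
def peak_bandwidth_gb_s(channels, mega_transfers_per_sec, channel_width_bits=64):
    """Theoretical peak bandwidth in GB/s: bytes per transfer times transfers per second."""
    bytes_per_transfer = channels * channel_width_bits / 8
    return bytes_per_transfer * mega_transfers_per_sec / 1000


# Assumed example configurations, not benchmarks.
desktop = peak_bandwidth_gb_s(channels=2, mega_transfers_per_sec=6000)  # dual-channel DDR5-6000
server = peak_bandwidth_gb_s(channels=8, mega_transfers_per_sec=3200)   # 8-channel DDR4-3200

print(f"Dual-channel DDR5-6000: {desktop:.1f} GB/s peak")
print(f"8-channel DDR4-3200:    {server:.1f} GB/s peak")
```

Even with slower memory, the many-channel server ends up with roughly twice the peak bandwidth of the dual-channel desktop in this example.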