LocalLLaMA

Welcome to LocalLLaMA! Here we discuss running and developing machine learning models at home. Let's explore cutting-edge open-source neural network technology together.

Get support from the community! Ask questions, share prompts, discuss benchmarks, get hyped at the latest and greatest model releases! Enjoy talking about our awesome hobby.

As ambassadors of the self-hosting machine learning community, we strive to support each other and share our enthusiasm in a positive, constructive way.

Rules:

Rule 1 - No harassment or personal character attacks of community members. I.e., no name-calling, no generalizing entire groups of people that make up our community, no baseless personal insults.

Rule 2 - No comparing artificial intelligence/machine learning models to cryptocurrency. I.e., no comparing the usefulness of models to that of NFTs, no claiming that the resource usage required to train a model is anything close to that of maintaining a blockchain or mining for crypto, no implying it's just a fad/bubble that will leave people with nothing of value when it bursts.

Rule 3 - No comparing artificial intelligence/machine learning to simple text prediction algorithms. I.e., no statements such as "LLMs are basically just simple text prediction like what your phone keyboard autocorrect uses, and they're still using the same algorithms since <over 10 years ago>".

Rule 4 - No implying that models are devoid of purpose or potential for enriching people's lives.

founded 2 years ago

MODERATORS

pax@sh.itjust.works

noneabove1182@sh.itjust.works

Smokeydope@lemmy.world

MonsterBug@sh.itjust.works

Qwen3-Next-80B-A3B Thinking and Instruct land in llama.cpp (lemmy.ca)

submitted 2 weeks ago* (last edited 2 weeks ago) by panda_abyss@lemmy.ca to c/localllama@sh.itjust.works

Unsloth has quants for both

This was a great day to check the news. I also saw that vLLM has just added support for Strix Halo.

[–] avidamoeba@lemmy.ca 6 points 2 weeks ago (5 children)

How much RAM does something like that need?

[–] hendrik@palaver.p3x.de 3 points 2 weeks ago* (last edited 2 weeks ago) (4 children)

Don't MoE models just load into memory like every other model, and it's just that they pick a subset of the weights to multiply at each step, so they're faster? That'd make me think it'd need somewhere around 80GB of memory at 8-bit or 160GB at 16-bit precision, or something like 50GB at the average llama.cpp Q4_K_M...
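
The back-of-the-envelope numbers above can be reproduced with a small sketch. It assumes weight memory is simply parameter count times bits per weight, and it ignores KV cache, activations, and runtime overhead. The function name and the ~4.85 bits-per-weight figure used for Q4_K_M are assumptions for illustration, not values taken from the thread.

```python
# Rough weight-memory estimate: parameters x bits per weight / 8.
# This is a simplification: KV cache, activations, and runtime
# overhead are not counted, so real usage will be higher.

def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# An MoE model keeps all experts resident, so the total parameter
# count (80B here), not the active count (3B), determines memory.
TOTAL_PARAMS_B = 80

# The Q4_K_M bits-per-weight value is an assumed average.
for label, bits in [
    ("16-bit", 16.0),
    ("8-bit", 8.0),
    ("Q4_K_M (assumed ~4.85 bits/weight)", 4.85),
]:
    print(f"{label}: ~{weight_memory_gb(TOTAL_PARAMS_B, bits):.1f} GB")
```

Under these assumptions the estimates land close to the 160GB, 80GB, and roughly 50GB figures in the comment.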

[–] Dran_Arcana@lemmy.world 7 points 2 weeks ago (3 children)

That is correct, but you might be missing why this is useful. MoE models are great for CPU inference, which is considerably cheaper than GPU inference at scale. The Qwen 30B-A3B MoE and 8B dense models were widely considered similar in quality. If you have the VRAM, the 8B would be faster. If you don't, then the 30B would be faster (as long as you have the ~19-22 GB of RAM required).

A very inexpensive used server with lots of memory channels but no GPU can do very cost-efficient inference in this scenario, and loads of people are asking for this.
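
One way to see why memory channels matter here is a rough sketch that treats token generation as memory-bandwidth bound: each generated token has to read the active weights once, so the ceiling on speed is roughly bandwidth divided by the bytes of active weights. The function name, the 100 GB/s bandwidth, and the 4.5 bits per weight are hypothetical values chosen for illustration. Real throughput will be lower because compute, expert routing, and KV cache reads are ignored.

```python
# Upper-bound decode speed when generation is memory-bandwidth bound:
# tokens/s <= bandwidth / bytes of weights read per token.
# Ignores compute time, expert routing, and KV cache traffic.

def decode_tokens_per_second(
    active_params_billion: float,
    bits_per_weight: float,
    bandwidth_gb_s: float,
) -> float:
    """Rough ceiling on tokens/s for a given memory bandwidth."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


# Hypothetical values, not measurements from the thread.
BANDWIDTH_GB_S = 100.0   # assumed multi-channel server RAM bandwidth
BITS_PER_WEIGHT = 4.5    # assumed 4-bit-class quantization

moe = decode_tokens_per_second(3, BITS_PER_WEIGHT, BANDWIDTH_GB_S)    # 30B total, 3B active
dense = decode_tokens_per_second(8, BITS_PER_WEIGHT, BANDWIDTH_GB_S)  # 8B dense

print(f"30B-A3B MoE ceiling: ~{moe:.0f} tokens/s")
print(f"8B dense ceiling:    ~{dense:.0f} tokens/s")
print(f"MoE advantage:       ~{moe / dense:.1f}x")
```

In this simplified model the MoE reads about 3B weights per token against 8B for the dense model, so on the same RAM it has the higher ceiling, at the cost of keeping all 30B parameters in memory.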

[–] SmokeyDope@piefed.social 2 points 2 weeks ago

Fantastic explanation, thank you