this post was submitted on 10 Apr 2024
28 points (93.8% liked)

LocalLLaMA


From Simon Willison: "Mistral tweet a link to a 281GB magnet BitTorrent of Mixtral 8x22B—their latest openly licensed model release, significantly larger than their previous best open model Mixtral 8x7B. I’ve not seen anyone get this running yet but it’s likely to perform extremely well, given how good the original Mixtral was."

[–] Fubarberry@sopuli.xyz 3 points 1 year ago (1 children)

281GB

That's huge, I'm guessing we'll need to use a giant swap file?

[–] TheHobbyist@lemmy.zip 6 points 1 year ago* (last edited 1 year ago) (1 children)

You're right, but the model is also not quantized, so it's likely stored as 16-bit floats. If you quantize it, you get a substantially smaller model that runs faster, though it may be somewhat less accurate.
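As a rough sanity check, here's the back-of-the-envelope math (a sketch only; the ~140B parameter count is just inferred from the 281GB fp16 download, not an official figure, and ~4.5 bits/weight is a typical effective size for a 4-bit quant):

```python
# Back-of-the-envelope checkpoint size estimate. The parameter count is only
# inferred from the 281 GB fp16 torrent (2 bytes per weight), not official.

def model_size_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate checkpoint size in GB, ignoring metadata overhead."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 281e9 / 2  # ~140B parameters implied by a 281 GB fp16 checkpoint

print(f"fp16:  {model_size_gb(n_params, 16):.0f} GB")   # ~281 GB
print(f"4-bit: {model_size_gb(n_params, 4.5):.0f} GB")  # ~79 GB at ~4.5 bits/weight
```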

~~Knowing that the 4 bit quantized 8x7B model gets downscaled to 4.1GB, this might be roughly 3 times larger? So maybe 12GB? Let's see.~~

Edit: sorry, those numbers were for Mistral 7B, not Mixtral. For Mixtral 8x7B, the 4-bit quantized model is 26GB, so triple that would be roughly 78GB. Luckily, being an MoE, not all of it has to be loaded onto the GPU simultaneously.

From what I recall, it only uses about 13B parameters at a time. Comparing that to CodeLlama 13B quantized to 4 bits, which is 7.4GB, triple that would be about 22GB, so it would require a 24GB GPU. Someone double-check in case I've misunderstood something.
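Writing that estimate out (a sketch mirroring the numbers above; the ~13B active parameters and ~4.5 bits/weight are assumptions, and this counts weights only — KV cache and activations need additional VRAM on top):

```python
# Rough "active weights" VRAM estimate mirroring the comment above.
# Assumptions: ~13B parameters active per token, ~4.5 bits per weight for a
# typical 4-bit quantization (quantization metadata adds a little overhead).
active_params = 13e9
bits_per_weight = 4.5

active_weights_gb = active_params * bits_per_weight / 8 / 1e9
print(f"active weights: ~{active_weights_gb:.1f} GB")      # ~7.3 GB, close to CodeLlama 13B's 7.4 GB
print(f"~3x scale-up:   ~{3 * active_weights_gb:.0f} GB")  # ~22 GB, hence the 24 GB GPU estimate
```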

24GB GPUs include the AMD Radeon RX 7900 XTX and the Nvidia RTX 4090 (non-mobile).

[–] Audalin@lemmy.world 1 points 1 year ago* (last edited 1 year ago)

I thought MoEs had to be loaded entirely into (V)RAM, and the inference speedup was because you only need a fraction of the weights (the selected experts) to compute the next token. But the choice of experts can differ for each token, so you need them all ready; otherwise you keep moving data between disk <-> RAM <-> VRAM and get reduced performance.
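For anyone unfamiliar with that routing, here's a minimal, illustrative top-k MoE layer (a toy sketch, not Mixtral's actual implementation). All experts have to be instantiated in memory, even though only `top_k` of them run for any given token:

```python
# Toy sketch of MoE top-k routing (illustrative only, not Mixtral's code).
# It shows why every expert must be resident: the router may pick any of
# them for any token, even though only top_k experts run per token.
import torch
import torch.nn as nn

class TinyMoELayer(nn.Module):
    def __init__(self, dim=64, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)          # gating network
        self.experts = nn.ModuleList(                    # ALL experts live in memory
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                                # x: (tokens, dim)
        scores = self.router(x)                          # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)   # pick top_k experts per token
        weights = weights.softmax(dim=-1)                # normalize over the selected experts
        out = torch.zeros_like(x)
        for t in range(x.shape[0]):                      # naive per-token dispatch
            for w, e in zip(weights[t], idx[t]):
                out[t] += w * self.experts[int(e)](x[t])
        return out

layer = TinyMoELayer()
tokens = torch.randn(5, 64)
print(layer(tokens).shape)  # torch.Size([5, 64]) -- only 2 of 8 experts ran per token
```

That's also why runtimes that fit Mixtral onto smaller GPUs tend to offload whole layers to CPU RAM rather than trying to keep only the "active" experts on the GPU.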