IK sounds promising! I'll check it out and see if it can run in a container
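There's no official image as far as I know, but since it's a llama.cpp fork it should build the same way. A rough, untested sketch (CUDA flag spelling and the server binary name are guesses based on upstream llama.cpp, so they may differ in the fork):

```
# rough, untested sketch: build ik_llama.cpp inside a CUDA devel container
# (assumes it builds like upstream llama.cpp with CMake; the CUDA flag may be
#  -DLLAMA_CUDA=ON or -DGGML_CUDA=ON, and the binary may be "server" instead of
#  "llama-server", depending on where the fork branched)
docker run --rm -it --gpus all -p 8080:8080 -v /models:/models \
  nvidia/cuda:12.4.1-devel-ubuntu22.04 bash

# then inside the container:
apt-get update && apt-get install -y git cmake build-essential
git clone https://github.com/ikawrakow/ik_llama.cpp
cmake -S ik_llama.cpp -B ik_llama.cpp/build -DGGML_CUDA=ON
cmake --build ik_llama.cpp/build --config Release -j
ik_llama.cpp/build/bin/llama-server -m /models/some-model.gguf --host 0.0.0.0 --port 8080
```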
I'll take a look at both tabby and vllm tomorrow
Hopefully there's CPU offload in the works so I can test those crazy models in the future without too much fiddling (the server also has 128 GB of RAM)
Unfortunately I didn't set up NVLink, but Ollama auto-splits models across GPUs when they need it
I really just want a "set and forget" model server lol (that's why I keep mentioning the auto offload)
Ollama integrates nicely with OWUI
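If it helps anyone, wiring them up is mostly just pointing Open WebUI at Ollama's API. Roughly this (ports and volume names are whatever you prefer; adjust OLLAMA_BASE_URL if Ollama isn't running on the Docker host):

```
# Open WebUI container pointed at an existing Ollama instance
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
```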
omg, I'm an idiot. Your comment got me thinking and... I've been running Q4 without knowing it. I assumed Ollama ran the FP16 by default 😬
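If anyone else wants to sanity-check what they're actually running, `ollama show` prints the quantization, and you can pull a bigger quant explicitly. The tags below are just examples; the exact names depend on the model's library page:

```
# see what quant the default tag actually is (prints the quantization level, e.g. Q4_K_M)
ollama show gemma3:27b

# pull a specific quant instead of the default
# (tag is an example; check the model's library page for what actually exists)
ollama pull gemma3:27b-it-q8_0
```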
About vLLM: yeah, I see you have to specify how much to offload manually, which I wasn't a fan of. I have 4x 3090s in an ML server at the moment, but those handle all my AI workloads, so the VRAM is shared between TTS/STT/LLM/image gen
That's basically why I really want auto offload
Yeah, I'm currently running the Gemma 27B model locally. I recently took a look at vLLM, but the main reason I didn't want to switch is that it doesn't have automatic offloading (it seems to be a manual thing right now)
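For reference, the manual part looks roughly like this on the vLLM side: you pick the tensor-parallel split and how many GB spill to CPU yourself (model name and the numbers here are just placeholders):

```
# vLLM: the split/offload settings are explicit flags rather than automatic
vllm serve google/gemma-3-27b-it \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90 \
  --cpu-offload-gb 16
```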
Just read the L1 post and I'm only now realizing this is mainly for running quants, which I generally avoid
I guess I could spin it up just to mess around with it, but it probably wouldn't replace my main model
Thanks, will check that out!
I'm currently using Ollama to serve LLMs. What's everyone else using for these models?
I'm also using Open WebUI, and Ollama seemed the easiest (at the time) to use in conjunction with it
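For context, my Ollama side is basically just the stock container from their docs with the GPUs passed through:

```
# Ollama in Docker with GPU access
docker run -d --gpus=all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama
```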
Yeah, I went a little crazy with it and built out a server just for AI/ML stuff 😬
Looks to be 20 GB of VRAM
The Gemma 27B model has been solid for me. Using Chatterbox for TTS as well
I'm just gonna try vLLM; it seems like ik_llama.cpp doesn't have a quick Docker setup
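vLLM does ship an official image, so the container side there is basically one command. Something along these lines (the model name is a placeholder, and you'd add --tensor-parallel-size 4 for the four 3090s):

```
# vLLM's OpenAI-compatible server via the official image
docker run -d --gpus all --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model google/gemma-3-27b-it
```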