You can probably run a 7B LLM comfortably in system RAM, and maybe one of the smaller 13B ones.
Software to use
- https://github.com/ggerganov/llama.cpp - command-line tool. Basic, flexible.
- https://github.com/LostRuins/koboldcpp - precompiled llama.cpp with a UI - easy to start with
Models
In general, you want small GGML models. https://huggingface.co/TheBloke has a lot of them. There are SuperHOT versions of some models, but I'd avoid them for now. They're trained to handle bigger context sizes, but it seems that made them dumber too. There's a lot of new work coming out on bigger context lengths, so you should probably revisit that when you actually need it.
- https://huggingface.co/TheBloke/orca_mini_v2_13b-GGML - the q3_K_M.bin perhaps - might still be too big, depending on what you're running in the background
- https://huggingface.co/TheBloke/orca_mini_3B-GGML - very small model. Not sure how well it'll do
- https://huggingface.co/TheBloke/airoboros-7B-gpt4-1.4-GGML
- https://huggingface.co/TheBloke/vicuna-7B-v1.3-GGML
- https://huggingface.co/TheBloke/WizardLM-7B-V1.0-Uncensored-GGML
Each has different strengths: Orca is supposed to be better at reasoning, Airoboros is good at longer, more story-like answers, Vicuna is a very good all-rounder, and WizardLM is also a notably good all-rounder.
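If you're not sure whether a given quantization will fit in your RAM, a rough back-of-envelope estimate helps. This is a sketch, not exact: the bits-per-weight figures below are my assumptions (roughly 4.5 bits/weight for q4 variants, roughly 3.9 for q3_K_M), and actual file sizes vary a bit. Leave a GiB or two of headroom for context and runtime overhead on top of this.

```python
# Rough estimate of how much RAM a quantized GGML model needs.
# Assumption: bits-per-weight figures are approximate, not exact spec values.

def approx_model_gib(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate in-RAM size of a quantized model, in GiB."""
    total_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30

# A 7B model at ~4.5 bits/weight lands around 3.7 GiB;
# a 13B model at ~3.9 bits/weight around 5.9 GiB.
print(f"7B q4: ~{approx_model_gib(7, 4.5):.1f} GiB")
print(f"13B q3_K_M: ~{approx_model_gib(13, 3.9):.1f} GiB")
```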
For training, there are some tricks like QLoRA, but results aren't impressive from what I've read. Also, when training LLMs it can be pretty difficult to get the results you want. You should probably start with just running them and get comfortable with that, maybe try few-shot prompts (prompts with a few examples of writing styles), and then go from there.
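The few-shot idea can be sketched in a few lines: you put a couple of example question/answer pairs in the style you want in front of the real question, and the model tends to continue in the same style. The exact template below (the `Question:`/`Answer:` labels and separators) is just an assumption for illustration - match whatever format your chosen model was tuned on.

```python
# Minimal few-shot prompt builder. The "Question:"/"Answer:" template is an
# assumption for illustration; adapt it to your model's expected format.

def build_few_shot_prompt(examples, question):
    """examples: list of (question, answer) pairs the model sees as style guides."""
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in examples]
    # End with the real question and a dangling "Answer:" for the model to complete.
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)

examples = [
    ("What is a GGML model?",
     "A model converted to the GGML format so it can run on the CPU."),
    ("What does quantization do?",
     "It shrinks the model by storing weights in fewer bits."),
]
prompt = build_few_shot_prompt(examples, "Why pick a 7B model over a 13B one?")
print(prompt)
```

You'd then pass that string as the prompt to llama.cpp or koboldcpp and let the model complete the final answer.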