Would you share your experience with the models you've used? I have quite a bit of experience with the larger models, but the smaller ones tend to loop around or just fail at their tasks...
It depends a lot on the specific task and specific model (and version (and quant..)). For example, Phi 4 mini is good for math and logic, and the reasoning version is surprisingly good at tool calling/RAG, but the family sucks at everything else. Gemma 4 and Qwen 3.5 are well known for having fantastic general-purpose models in the 4B-9B range, but at the lower end they suck again because it's just scaled down, so they'll probably loop 50% of the time at the smallest sizes. For the very small models (350M-1.2B), LiquidAI's LFM 2.5 uses novel techniques to eliminate doom loops, and they have a vision variant - but it's still a tiny model, so don't try to code with it. And when you just want basic tool calling, even "old" models like llama 3.2, gemma 2, and qwen 2.5 are good, fast, and low-memory. If you search around you can find specific models that are the best at specific tasks in a given size range.
For any model where you notice looping, tune the sampler settings: reduce temperature and top_p, increase presence/frequency penalty, and reduce context size. If you have a specific task to do, fine-tuning is the single best way to both reduce memory usage and boost performance and quality. Remember that tiny models are not designed for 0-shot/1-shot use; they need lots of specific instruction and context in the prompt, and multi-shot prompts have a dramatic effect on output quality. Keep each prompt focused on one specific task (see the sketch below). Think of small models as children, SOTA models as experienced professionals, and middle-of-the-road models as average adults; you give the bigger ones more responsibility/agency, and the little ones more rules and guardrails.
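To make that concrete, here's a minimal sketch of those settings against a local OpenAI-compatible endpoint (llama.cpp's llama-server, Ollama, LM Studio all expose one). The URL, model name, and exact values are placeholder assumptions to tune for your setup, not recommendations, and the few-shot prompt shows the kind of "lots of specific instruction" small models need:

```python
# A minimal sketch, assuming a local OpenAI-compatible server at localhost:8080;
# the base_url and model name are placeholders for whatever your server exposes.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

# Multi-shot prompt: small models do far better when the expected format is demonstrated.
messages = [
    {"role": "system", "content": "Extract the city from the sentence. Reply with the city name only."},
    {"role": "user", "content": "I flew into Lisbon on Tuesday."},
    {"role": "assistant", "content": "Lisbon"},
    {"role": "user", "content": "The conference was held in Osaka this year."},
    {"role": "assistant", "content": "Osaka"},
    {"role": "user", "content": "We drove through Denver on the way home."},
]

resp = client.chat.completions.create(
    model="local-model",      # whatever name the server reports
    messages=messages,
    temperature=0.3,          # lower temperature -> fewer runaway loops
    top_p=0.9,                # trim the long tail of unlikely tokens
    presence_penalty=0.5,     # discourage repeating the same tokens
    frequency_penalty=0.5,
    max_tokens=16,            # hard cap as a last-resort loop guard
)
print(resp.choices[0].message.content)
```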
For coding you do want the biggest model you can fit, so this is where larger RAM shines (32GB+ iGPU). If you can fit a dense model, do that; MoE is okay but performs better on narrower tasks. Use the bleeding-edge forks of llama.cpp for things like turboquant and Multi-Token Prediction.
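As a rough guide to "the biggest model you can fit", here's a hedged back-of-envelope sketch: GGUF file size plus KV cache. The layer/head numbers below are illustrative Llama-3-8B-ish geometry and the Q6_K file size is an assumption; read the real values from your model's GGUF metadata:

```python
# Rough check of whether a model + KV cache fits in memory (illustrative numbers only).

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2.0):
    """K and V caches: one entry per layer per token, fp16 by default."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

GIB = 1024 ** 3
model_file_gib = 6.6   # e.g. an ~8B model at Q6_K (assumed size)
kv_gib = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, ctx_len=8192) / GIB

print(f"model  ~{model_file_gib:.1f} GiB")
print(f"kv     ~{kv_gib:.1f} GiB at 8k context")
print(f"total  ~{model_file_gib + kv_gib:.1f} GiB plus runtime overhead")
```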
The last thing is quants. If you're running something that isn't the bare model (like an unsloth dynamic quant), performance suffers the smaller you go, and smaller models are hit much harder. So max out the amount of memory you can dedicate to the model and pick larger quants like Q6/Q8. You can also quantize the k/v cache, but that too can have a negative effect. And again, if you can fine-tune for a task, you'll gain far more performance and quality while reducing memory.
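If you're loading through llama-cpp-python, those trade-offs look roughly like this. It's a sketch, not the definitive way to do it: the file name is a placeholder, and the type_k/type_v/flash_attn keyword names are from memory, so check them against the docs for your installed version:

```python
# Hedged sketch with llama-cpp-python (pip install llama-cpp-python).
import llama_cpp

llm = llama_cpp.Llama(
    model_path="model-Q6_K.gguf",     # prefer Q6/Q8 quants when memory allows
    n_ctx=8192,
    n_gpu_layers=-1,                  # offload all layers to the iGPU/GPU
    flash_attn=True,                  # generally needed for a quantized V cache
    type_k=llama_cpp.GGML_TYPE_Q8_0,  # quantize the KV cache to save memory...
    type_v=llama_cpp.GGML_TYPE_Q8_0,  # ...at some (usually small) quality cost
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize: the cat sat on the mat."}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```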