Is there any hope for people who can't even run 27B parameters, Qwen3.6 or otherwise? Are there any quantized models that do well with tool calling at smaller parameter sizes?
I don't have a crazy rig, just a modest gaming one, but in trying to understand more about agents and their capabilities, I am SOL with my 16GB of RAM and 8GB of VRAM. I can get most small, non-tool-calling models to perform well, but I've had major issues getting anything over 9B to do more than reasoning (it gets egregiously slow at higher parameter counts).
And so far, I can't get even Pi to extend itself or do any meaningful work with any of the models I can currently get to run.
I have 8GB VRAM but 32GB RAM. Qwen 3.6 35B runs nicely.
You should look at gemma-4-26B-A4B. You have 16+8=24GB total, and the Q4 quant is about 16GB. That doesn't leave much room for context, but it might run.
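For anyone who wants to redo that arithmetic for their own machine, here is a rough back-of-envelope sketch. The ~4.8 bits per weight figure for a Q4-style quant is my assumption, not a measured number, and the function names are made up for illustration. Real files vary, and the KV cache, the OS, and everything else running all eat into the leftover.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def headroom_gb(ram_gb: float, vram_gb: float, model_gb: float) -> float:
    """Memory left over for context, the OS, and other programs."""
    return ram_gb + vram_gb - model_gb


# Assumption: a Q4-style quant averages roughly 4.8 bits per weight.
model = weights_gb(26, 4.8)
left = headroom_gb(16, 8, model)

print(f"weights: ~{model:.1f} GB, headroom: ~{left:.1f} GB")
```

With 16GB RAM + 8GB VRAM that lands close to the "about 16GB" figure, with whatever remains shared between context and everything else on the machine.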
I have 8GB VRAM, but 32GB system RAM. I can run Qwen 3.6 35B at 30 tok/s. I also use Pi, and it's smart enough to extend itself (multishot, and maybe a few tries).
For you, you could try gemma-4-26B-A4B.
I think at 16GB you'd struggle to run the regular development tools nowadays, never mind any interesting inference.
I've got 32GB of RAM and a 6GB VRAM card; I tried both 27B and 35B with Pi. And it's a laptop. Speed isn't exactly a concern for me, I can enjoy real life while the agent is doing its thing. And while the models appear smart enough at first glance, once one reads a file that's more than 100 lines it loses all memory of anything I asked it to do. The lack of a failure state or any indication of what might be wrong here is just frustrating. Guess local models aren't for me, unless I move to Silicon Valley and redeem my free MacBook at a local Starbucks.
I suspect that with those specs, you're not in the game right now for reliably using local models for code generation. The easiest way in is a MacBook with at least 32GB of RAM. That should be able to run a 4-bit quantization of Qwen 3.6 in the MLX format really well.