Qwen 3.6 burns it to the ground. it was not even a challenge. Gemma4 seriously fails at toolcalls an...

v3ss0n • yesterday at 12:21 PM • 7 replies • view on HN

Qwen 3.6 burns it to the ground. It was not even a challenge. Gemma4 seriously fails at tool calls and agentic work. It got all messed up after 2-3 turns of vibecoding.

Replies

xrd • yesterday at 12:48 PM

How do you run it? vllm? llama.cpp?

Can you share some of the parameters you use to enable tool calling and agentic usage?

Or, higher level, some philosophies on what approaches you are using for tuning to get better tool calling and/or agentic usage?

I'm having surprisingly good success with unsloth/Qwen3.6-27B-GGUF:Q4_K_M (love unsloth guys) on my RTX3090/24GB using opencode as the orchestrator.

It concocts some misleading paths, but the code often compiles, and I consider that a victory.

You have to watch it like you would watch a 14 year old boy who says he is doing his homework but you hear the sound effects of explosions.

thot_experiment • yesterday at 5:57 PM

naw, i mean i prefer Qwen 3.6 to Gemma 90% of the time, especially the MoE with a light tune to make its tone more claude-like, but Gemma 4 is definitely better in some cases and I think they're pretty close in general.

The difference basically boils down to Gemma 4 making more assumptions and Qwen 3.6 sticking closer to the prompt. If your prompt is bad or leaves things up to the imagination, Gemma will do a better job; if you need strict prompt adherence, Qwen is better. Since local models are "dumb" I think it makes sense to prefer prompt adherence, but there are complex tasks that Gemma will complete much, much faster than Qwen because it makes the right assumptions the first time and, as a result, requires way fewer turns even with slower inference.

My speculation is that this comes from Google having a much better strategy for filtering their training data. I think this also shows up in the shape of the world knowledge of the models. Gemma's world knowledge seems deeper even though the models are of roughly equivalent size to the Qwen counterparts, so it's most likely just concentrated in places that are more relevant to my queries.

Most notably in my testing, Gemma 4 31b is the ONLY local model that will tell me the significance of 1738 correctly. Even most flagship/cloud models answer with some hallucinatory nonsense.

BoredomIsFun • today at 1:57 PM

> Qwen 3.6 burns it to the ground.

Not for creative writing or NLP.

59nadir • yesterday at 1:04 PM

Counter-point: I built an agent that can only interface with Kakoune, a much less common and more challenging situation for an LLM to find itself in, and Gemma4-A4B (8-bit quantized) does remarkably better at actually figuring out how to get text into buffers than Qwen3.6-35B-A3B, which is in a similar class to Gemma4 A4B.

Now, is this the usual use case? No, it's a benchmark I created specifically in order to put LLMs in situations where they can't just blast out their bash commands without having to interface with something else and adapt.

lambda • yesterday at 12:49 PM

Gemma 4 31b was working ok for me, but it was consuming tons of memory on SWA checkpoints, so I had to turn them way down, and as a 31b dense model it is fairly slow on a Strix Halo. I did have a lot of tool calling issues on 26b-a4b, though.

The Qwen models are quite solid though.

2ndorderthought • yesterday at 12:33 PM

Gemma4 is definitely not used for vibe/agentic coding. Not even worth trying. But it's a different weight class.

blurbleblurble • yesterday at 5:43 PM

I agree, but would add that Gemma 4 is really nice at vibing in ways Qwen 3.6 could never be.

Maybe it could be fun to hook them up via a2a protocol as left and right brain agents operating in tandem.
