Hacker News

gertlabs, last Monday at 4:27 PM (5 replies)

Gemma 4 26B really is an outlier in its weight class.

In our little-known, difficult-to-game benchmarks, it scored about as well as GPT 5.2 and Gemini 3 Pro Preview on one-shot coding problems. It had me re-reviewing our entire benchmarking methodology.

But it struggled in the other two sections of our benchmark: agentic coding and non-coding decision making. Tool use, iterative refinement, managing large contexts, and reasoning outside of coding brought the scores back down to reality. It actually performed worse when it had to use tools and a custom harness to write code for an eval than when given the chance to one-shot it. No doubt it's been overfit on common harnesses and agentic benchmarks. But the main problem is likely scaling context on small models.

Still, an incredible model, and incredible speed on an M-series MacBook. Benchmarks at https://gertlabs.com


Replies

seemaze, last Monday at 6:36 PM

That's funny, it failed my usual 'hello world' benchmark for LLMs:

“Write a single file web page that implements a 1 dimensional bin fitting calculator using the best fit decreasing algorithm. Allow the user to input bin size, item size, and item quantity.”

Qwen3.5, Nemotron, Step 3.5, and gpt-oss all passed on the first go.
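For reference, the best-fit-decreasing heuristic the prompt asks for is simple enough to sketch in a few lines. This is a plain-Python sketch of the core algorithm (not the single-file web page the prompt requests), with hypothetical function and variable names:

```python
def best_fit_decreasing(items, bin_size):
    """1-D bin packing: sort items largest-first, then place each item
    into the open bin with the least remaining space that still fits it."""
    remaining = []    # leftover capacity of each open bin
    assignment = []   # item sizes packed into each bin
    for item in sorted(items, reverse=True):
        if item > bin_size:
            raise ValueError(f"item {item} exceeds bin size {bin_size}")
        # best fit: the feasible bin whose leftover space is smallest
        best = None
        for i, free in enumerate(remaining):
            if free >= item and (best is None or free < remaining[best]):
                best = i
        if best is None:
            # no open bin fits; open a new one
            remaining.append(bin_size - item)
            assignment.append([item])
        else:
            remaining[best] -= item
            assignment[best].append(item)
    return assignment

# e.g. best_fit_decreasing([5, 4, 3, 2, 2], 6) packs into three bins:
# [[5], [4, 2], [3, 2]]
```

A web-page version would just wrap this in a form that reads bin size, item size, and item quantity, expands quantities into a flat item list, and renders the returned bins.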

prettyblocks, yesterday at 1:24 PM

For me, the vision/OCR is much better than in other models in its weight class.

datadrivenangel, last Monday at 7:21 PM

Overall it's a very good open-weights model! Notably, I found it makes more dumb coding mistakes than GPT-OSS on my M5, but it's fairly close overall.

iknowstuff, last Monday at 7:02 PM

Gemma 31B scoring below 26B-A4B?

neonstatic, yesterday at 5:13 AM

I have very mixed feelings about that model. I want to like it. It's very fast and seems to be fit for many uses. I strongly dislike its "personality", but it responds well to system prompts.

Unfortunately, my experience with it as a coding assistant is very poor. It doesn't understand libraries it seems to know about, it doesn't see the root causes of problems I want it to solve, and it refuses to use MCP tools even when asked. It also has a very strong fixation on the concept of time: anything past January 2025, which I think is its knowledge cutoff, the model will label as "science fiction" or "their fantasy" and role-play from there.