Early benchmarks show tremendous improvement over Kimi K2 Thinking, which didn't perform well on our benchmarks (and we do use the best available quantization).
Kimi K2.6 is currently the top open weights model in one-shot coding reasoning, a little better than GLM 5.1, and still a strong contender against SOTA models from ~3 months ago (comparable to Gemini 3.1 Pro Preview).
Agentic tests are still running; check back tomorrow. Open weights models typically struggle with longer contexts in agentic workflows, but GLM 5.1 still handled them very well, so I'm curious how Kimi ends up. Both the old Kimi and the new model are on the slower side, which probably makes them less usable for agentic coding work regardless. The old Kimi K2 model was severely benchmaxxed and was only really interesting for generating more variation and temperature, not for solving hard problems. The new one is a much stronger generalist.
Overall, the field of open weights models is looking fantastic. A new near-frontier release every week, it seems.
Comprehensive, difficult-to-game benchmarks at https://gertlabs.com/?mode=oneshot_coding
How would K2.6 compare to Sonnet 4.6 both price and performance wise?
I'm looking at your table now - is there a reason why you don't include cost? If Opus 4.7 is the winner but costs, say, 5x as much, that's important information.