Hacker News

gertlabs · today at 5:18 AM · 6 replies

I'm glad we're seeing a shift towards objectively scored tests.

We've been doing this at scale at https://gertlabs.com/rankings, and although the author appears to be running unique one-off samples, it's not surprising to see how well Kimi K2.6 performed. Based on our testing, for coding especially, Kimi is within statistical uncertainty of MiMo V2.5 Pro for the top open-weights model, and performs much better with tools than DeepSeek V4 Pro.
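
For what it's worth, here's a minimal sketch of what "within statistical uncertainty" can mean in practice: a paired bootstrap confidence interval on the mean score difference between two models over the same task set. The per-task pass/fail scores below are made up purely for illustration.

```python
import random

def bootstrap_diff_ci(scores_a, scores_b, n_boot=10_000, alpha=0.05, seed=0):
    """95% paired-bootstrap CI for mean(scores_a) - mean(scores_b).

    If the interval straddles zero, the two models are within
    statistical uncertainty of each other on this task set.
    """
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(n_boot):
        # Resample tasks with replacement, keeping pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        da = sum(scores_a[i] for i in idx) / n
        db = sum(scores_b[i] for i in idx) / n
        diffs.append(da - db)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot)]
    return lo, hi

# Made-up pass/fail results for two models on the same 20 tasks.
a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
b = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
lo, hi = bootstrap_diff_ci(a, b)
print(lo, hi)
```

If the interval straddles zero, the two models are statistically indistinguishable on that task set at the chosen confidence level, even if one has a higher headline score.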

GPT 5.5 has a comfortable lead, but Kimi is on par with or better than Opus 4.6. The problem with Kimi K2.6 is that it's one of the slower models we've tested.


Replies

tgv · today at 8:34 AM

This may be objectively scored, but it is not an indication of anyone's coding capabilities. This test measures which model almost accidentally came up with the best strategy (against other bots), which is not representative of coding. You would need to test 100 or more such puzzles, spread widely across the puzzle spectrum, to get an idea of which model is best at finding strategies involving an English dictionary.
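
To put a rough number on "100 or more": the standard worst-case sample-size formula for estimating a win rate gives a similar figure. A back-of-envelope sketch, not specific to this particular test:

```python
import math

def puzzles_needed(margin, p=0.5, z=1.96):
    """Puzzles needed to estimate a win rate to within +/- `margin`
    at ~95% confidence, using the worst-case variance at p = 0.5."""
    return math.ceil(z * z * p * (1 - p) / (margin * margin))

print(puzzles_needed(0.10))  # roughly 100 puzzles for a +/-10-point margin
print(puzzles_needed(0.05))  # roughly 4x that for +/-5 points
```

So a single puzzle, or even a handful, tells you very little; the ~100-puzzle ballpark is what it takes just to pin a win rate down to within ten points.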

Mashimo · today at 6:39 AM

Seems like in agentic workflows the Qwen Flash and DeepSeek Flash models are quite good.

Fits with another comment on here from yesterday saying the flash models are just better at tool calling.

Planning with GPT 5.5 and implementing with a flash model could be the bang-for-the-buck route.
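
Something like this, roughly. `call_model` is a hypothetical stand-in for whatever provider client you actually use, and the model names are illustrative only:

```python
def call_model(model: str, prompt: str) -> str:
    # Placeholder: route `prompt` to `model` via your provider's API
    # (OpenAI SDK, LiteLLM, etc.). Here it just echoes for the sketch.
    return f"[{model} output for: {prompt[:40]}...]"

def plan_then_implement(task: str) -> str:
    # 1. The stronger (slower, pricier) model writes the plan.
    plan = call_model("gpt-5.5", f"Write a step-by-step plan for: {task}")
    # 2. The cheaper flash model executes the plan with tool calls.
    return call_model("qwen-flash", f"Implement this plan:\n{plan}")

print(plan_then_implement("add retry logic to the HTTP client"))
```

The idea being you pay for the expensive model once per task for the plan, and let the fast, cheap model do the many tool-calling turns.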

veber-alex · today at 5:36 AM

In my experience benchmarks are pretty meaningless.

Not only is performance dependent on the language and tasks given, but also on the prompts used and the expected results.

In my own internal tests it was really hard to judge whether GPT 5.5 or Opus 4.7 is the better model.

They have different styles and it's basically up to preference. There were even times when I gave the win to one model, only to think about it more and change my mind.

At the end of the day I think I slightly prefer Opus 4.7.

cyanydeez · today at 10:08 AM

Curious: why can't you provide a measurement of context size for a human? Surely there's enough science to make a good approximator.

bazlightyear · today at 5:39 AM

Are your tests and results open source?

refulgentist · today at 5:41 AM

Any thoughts on using it on Fireworks? It's extremely fast there.
