Hacker News

Kimi K2.6 just beat Claude, GPT-5.5, and Gemini in a coding challenge

315 points by bazlightyear · today at 4:05 AM · 176 comments

Comments

0xbadcafebee · today at 5:31 AM

These posts are going to be a constant for the next year, because there's no objective way to compare models (past low-level numbers like token generation speed, average reasoning token amount, # of parameters, active experts, etc). They're all quite different in a lot of ways, they're used for many different things by different people, and they're not deterministic. So you're constantly gonna see benchmarks and tests and proclamations of "THIS model beat THAT model!", with people racing around trying to find the best one.

But there is no best one. There's just the best one for you, based on whatever your criteria are. It's likely we'll end up in a "Windows vs macOS vs Linux" style world, where people stick to their camps that do a particular thing a particular way.

gertlabs · today at 5:18 AM

I'm glad we're seeing a shift towards objectively scored tests.

We've been doing this at scale at https://gertlabs.com/rankings, and although the author looks to be running unique one-off samples, it's not surprising to see how well Kimi K2.6 performed. Based on our testing, especially for coding, Kimi is within statistical uncertainty of MiMo V2.5 Pro for the top open-weights spot, and it performs much better with tools than DeepSeek V4 Pro.

GPT 5.5 has a comfortable lead, but Kimi is on par with or better than Opus 4.6. The problem with Kimi 2.6 is that it's one of the slower models we've tested.

ninjahawk1 · today at 5:55 AM

Based on a study I read a couple of days ago, open-source models are expected to surpass cloud models within a couple of years at the current rate.

Looking back at ChatGPT and Claude from a couple of years ago, today's very small Qwen models are basically equal in coding to what those cloud-based models could do then. Also factoring in scaling laws, each doubling of parameters buys a diminishing gain: 9B to 18B is roughly a 40% improvement, while 18B to 35B is closer to 20%. So I expect at least the pricing of cloud-based models to change.

Adobe used to be $600 up front; it became $20 a month when distribution scaled.

gizmodo59 · today at 12:32 PM

“I did not wake up to be a loser. This loser attitude makes no sense to me.” - Frontier Model Labs (original quote by Jensen Huang on a podcast)

sieve · today at 5:40 AM

Kimi is really good.

I have been using Sonnet and others (DeepSeek, ChatGPT, MiniMax, Qwen) for my compiler/VM project, and the Claude Pro plan is mostly unusable for any serious coding effort. So I use Claude in chat mode in the browser, where it cannot needlessly read your entire project, and use Kimi with Pi on the OpenCode Go plan.

Kimi consistently exceeded Sonnet on the C+Python project. I never had to worry about it doing anything other than what I asked it to do. GLM crapped the bed once or twice; Kimi never did.

slashdave · today at 4:59 AM

I was surprised by the ranking until I read what the test was. It's not terribly relevant to coding.

The current ranking across all tests makes more sense (well, except for how well Gemini does):

https://aicc.rayonnant.ai

magicalhippo · today at 4:43 AM

In a single challenge, measured by how performant the solution was.

Kimi K2.6 is definitely a frontier-sized model, so on the one hand it's not that surprising it's up there with the closed frontier models.

Being open is nice though, even though it doesn't matter that much for folks like me with a single consumer GPU.

yanis_t · today at 6:43 AM

Anecdotal, but having used Claude Code exclusively for the last several months, I was pleasantly surprised by how capable Pi + Kimi K2.6 is. It's also much faster (via OpenRouter), and at a fraction of the cost.

ponyous · today at 6:16 AM

Kimi is nowhere near GPT or Opus unfortunately. I really wish it was. I’m running evals where models have to generate code that produces 3D models and it’s obvious that it lacks spatial understanding and makes many more code errors before it succeeds.

Maybe it’s better in one particular case here and there, and I think this blog post is an example of that.

aykutseker · today at 5:21 AM

This seems less like Kimi is better at coding than Claude and more like Kimi found the right strategy for this particular game.

Still interesting though. The fact that an open weight model is close enough for that to matter is probably the real story.

childintime · today at 12:40 PM

Is there a lo-slop model that stands out when using Zig?

adrian_b · today at 9:53 AM

> Xiaomi confirming that weights for their newer V2.5 Pro model are dropping soon

This has already happened.

I have downloaded both the big Pro model and the smaller but multimodal MiMo-V2.5.

https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro

https://huggingface.co/XiaomiMiMo/MiMo-V2.5

https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro-Base

https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Base

The download of MiMo-V2.5-Pro takes 963 GB, while that of MiMo-V2.5 takes 295 GB.

For comparison, the download of Kimi-K2.6 takes 555 GB.

codedokode · today at 9:52 AM

It's interesting that OpenAI promised to make AI accessible for everyone, but China actually did it.

jrecyclebin · today at 5:11 AM

I absolutely love Kimi's personality - some of the things it says are so out there! And it's been great for very focused, iterative work.

Its weakness is that it seems to yak on and on when it needs to plan out something big, or read through and make sense of a niche piece of a complex library, to the point where it can fill up its 256k window and rack up a bill (no cache). I have had a better experience with GLM 5.1 in those cases.

Anyone out there relate?

_pdp_ · today at 9:27 AM

Kimi is a capable model, but it needs a very good harness. With a good harness it is very capable; without one it can get into all kinds of issues (loops and such) that frontier models do not.

As I said, you can blame the model, but there is nothing here that the harness cannot take care of more deterministically.

zmmmmm · today at 7:07 AM

I've been switching across all different models this week with OpenCode and Pi - we're in an interesting place now because the open models are definitely "good enough" for a wide range of coding tasks and MUCH cheaper. They certainly aren't AS good, especially once you get into unfamiliar territory - custom enterprise frameworks etc where model knowledge falls off and general intelligence kicks in. But then, with time people will build up custom skills and agent files for those. And the open models will also get better.

I could easily see us in a place 2 years from now where this coding application is fully commoditised.

kmkrworks · today at 6:27 AM

I don't feel this is an optimal way of comparing models. I really don't think any current metric can single out the best model: this one prioritizes specific tasks over overall ability, and I'm not sure measuring overall ability is even possible.

justech · today at 5:09 AM

I’ve been maining Kimi K2.6 through opencode go and OpenRouter for a week, and I can say it’s the same experience as when I was maining Sonnet 3.5/4 late last year.

It's not as good or as fast as Claude Code on Opus now, but it's definitely enough for casual/hobby use. The best part is having multiple choices of provider: if opencode gimps their service, I’ll switch.

syntex · today at 10:51 AM

These benchmarks mean very little. The real test is model plus harness: an agentic system that can fulfill given goals.

bazlightyear · today at 6:06 AM

BTW, it looks like Kimi won the subsequent challenge too: https://aicc.rayonnant.ai/challenges/hexquerques/

ajdegol · today at 9:31 AM

I’ve been wondering about potential regression in coding models.

The initial models were corrected by programmers, which gave a very high-quality feedback signal. With vibe coding on the rise, you lose that signal.

elromulous · today at 4:51 AM

Is the site just slashdotted rn? Can anyone get to it?

PedroBatista · today at 4:45 AM

Great to know, but what was the cost both in terms of $$ and tokens used?

Not to invalidate these benchmark results, because they are useful, but the real usefulness is what the models are capable of doing when real people interact with them at scale.

Regardless, this is good news: now that Microsoft is basically giving up on their all-in strategy with GitHub's Copilot and Anthropic is playing the "I'm too good for you" game, it's about time they got pressed into not turning this AI world into a divide between the haves and the have-nots.

Frannky · today at 4:49 AM

I have to try Kimi. I was looking for an alternative. If you have any experience or advice, please share. I saw Kimi is at the top of the OpenRouter ranking.

alex7o · today at 8:09 AM

I don't know about you, but Kimi 2.6 from the Kimi subscription has been absolutely bad and useless for the past week, so I canceled my sub and stopped using it.

SomaticPirate · today at 5:20 AM

This seems to be testing the models on leetcode-style prompts that also require the model to implement TCP calls to send the results. Interesting, but probably not an apples-to-apples comparison. The fact that only Grok qualified for the first one seems suspect.
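Based on that description, each submission presumably has to compute its answer and then push it to a judge server over a raw TCP socket. A minimal sketch of what that plumbing might look like (the host, port, and JSON payload format are my assumptions, not documented by the challenge):

```python
import json
import socket


def submit_result(host: str, port: int, payload: dict) -> str:
    """Connect to a (hypothetical) judge server, send a newline-terminated
    JSON result, and return the server's acknowledgement."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(json.dumps(payload).encode() + b"\n")
        return sock.recv(1024).decode()
```

The interesting part is that a model must get this boilerplate right in one shot before its actual solution is even scored.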

koala-news · today at 5:48 AM

In my opinion, this kind of comparison is not very meaningful.

warabe · today at 8:16 AM

I’m not trying to add fuel to the fire, but will OpenAI's and Anthropic’s IPOs go smoothly?

beering · today at 4:46 AM

I’m a little confused as to the setup. It was asking each model to one-shot a script and then the scripts faced off? Were the models given a computer environment? Or a test server to iterate against?

imrozim · today at 6:09 AM

Same experience here. I use OpenRouter with Claude as a fallback for my startup. If Kimi is close in quality, the cost difference is hard to ignore.

bjoli · today at 6:43 AM

As a musician, I find the butchering of musical notation on Kimi's pricing page extremely off-putting.

muti · today at 7:13 AM

Doesn't seem like a very insightful result. Kimi won with the naive strategy. The other models didn't slide tiles at all, or didn't demonstrate understanding of the rules and claimed words that lost points. A strategy that did nothing would have beaten them.

We know these models can solve much more difficult problems; something isn't right.

gherkinnn · today at 9:17 AM

I never looked into the details of these benchmarks; I live with the assumption that most benchmarks of any kind are gamed and useless.

What I do see in my own work and that of others around me is that Claude consistently outperforms Gemini, and to a lesser extent Codex.

With Claude eating tokens with declining returns, concessions have to be made, and Codex is a usable middle ground.

I use Kimi in Kagi's Assistant for non-code or generic programming questions and am quite happy with its no-bullshit responses.

jakemanger · today at 4:56 AM

What are the GPU VRAM requirements for this thing?

Awesome to have an open model that can compete, but damn, it would be so much better if you could run it locally. Otherwise, it's so difficult to run (e.g. self-host) that it's just way more convenient to pay OpenAI, Anthropic, etc.

VeejayRampay · today at 11:34 AM

Crazy how people on Hacker News, who just gobble up anything if it's from OpenAI or Anthropic, suddenly become monocled sceptics when Chinese open models are "winning"

slopinthebag · today at 5:33 AM

Amazing. To me it feels like GLM 5.1, Kimi 2.6, DeepSeek 4 are all competitive both with each other and with the American models. Truly a great time to be alive.

I would like to see more effort making the flash variants work for coding. They are super economical to use to brute force boilerplate and drudgery, and I wonder just how good they can be with the right harness, if it provides the right UX for the steering they require.

As much as vibe coding has captured the zeitgeist, I think that long term, using these models as tools to generate code in the hands of skilled developers makes more sense. Companies can only go so long spending obscene amounts of money on subpar, unmaintainable code.

rvz · today at 5:28 AM

So we are now at the point where open weight models are rapidly catching up to the frontier models.

They are at best 30 days behind, and at worst a couple of months behind. The last issue is being able to run the best one on conventional hardware without a rack of GPUs.

MacBooks and Mac minis are behind on hardware, but within the next two years at worst, advancements in the M-series machines will make it possible.

All of this is why companies like Anthropic feel like they have to use "safety" to stop you from running local models on your machine and get you hooked on their casino wasting tokens with a slot machine named Claude.

pbreit · today at 5:06 AM

All my co-workers say Claude blows away Gemini. Is it really that good? And how can I try Kimi?

wg0 · today at 6:15 AM

About 40% of the stock market consists of 7 or 8 companies, all caught up in circular AI deals, collectively worth trillions of dollars in valuations.

Now imagine a company burning $200,000 a month on AI spend. Real numbers. Not every company is, but some are.

Why wouldn't such a company deploy an open-weight model (Kimi 2.6 or DeepSeek V4) on its own hardware (rented or otherwise) to save about $2.4 million a year?

And these are the landmines the Chinese cleverly set up. Not saying intentionally or otherwise.

But the end result is: good luck recouping your investments; you can pretty much kiss any ROI goodbye. The bucket has a hole in the bottom, and the bubble bust is guaranteed.

PS: Even without open-weight models the economics don't make sense, and the code generated by these SOTA models isn't reliable enough to be deployed as-is. Anyone claiming otherwise either hasn't worked on a real software stack with real users or hasn't used AI long enough to witness the slop and how hard it is to untangle or de-slopify the generated code. So these trillion-dollar valuations are absurd anyway.

PunchyHamster · today at 10:15 AM

That is not a programming challenge, the fuck

walrus01 · today at 5:38 AM

People thinking to self-host Kimi K2.6 had better be prepared for how big it is.

The Q8 K XL quantization, for instance, is around 600 GB on disk; I'd bet about 700 GB of VRAM is needed.

Quantizations lower than Q8 are probably worthless for quality.

Or 2.05 TB on disk for the full-precision GGUF.

https://huggingface.co/unsloth/Kimi-K2.6-GGUF

If you can afford the hardware to run Kimi K2.6 at any decent speed for more than 1 simultaneous user, you probably have a whole team of people on staff who are already very familiar with how to benchmark it vs Claude, GPT-5.5, etc.
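The sizes quoted above follow from a simple rule of thumb: weights take roughly bits-per-weight / 8 bytes per parameter, plus headroom for KV cache and activations. A naive back-of-the-envelope estimator (the 15% overhead figure is my own assumption, and real mixed-precision GGUFs like the Q8 K XL above come in smaller than this predicts):

```python
def est_vram_gib(n_params_billion: float, bits_per_weight: float,
                 overhead_frac: float = 0.15) -> float:
    """Naive VRAM estimate in GiB: weight bytes plus a fudge factor for
    KV cache and activations. Ignores mixed-precision quant layouts,
    which is why real GGUF files can be smaller than this suggests."""
    weight_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * (1 + overhead_frac) / 2**30


# A ~1T-parameter model at 8 bits per weight needs on the order of
# 930 GiB for weights alone, before any serving overhead.
```

This is only a sizing sketch; actual requirements depend on context length, batch size, and how the quantization mixes precisions across layers.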

qakajjqj · today at 5:30 AM

Yes, Gemini is a programming application

ant6n · today at 6:20 AM

What I would like to see is a comparison of how well the models work in long-running conversations:

  * do they lie and gaslight

  * do they start breaking down in very long chats (forgetting old context, just getting dumber)

  * do they constantly tell me how smart I am instead of solving the problem (yes-man behavior)

  * do they follow conventions and parameters set out early in the prompts, or forget them

  * if they can't read a given file (like a PDF), do they lie about it

  * is there a branch function to go back to an earlier state of the conversation

  * what is the quality of the presentation of results (structure, wording, excessive use of tables, appropriate use of headings)

  * how does the bot deal with user frustration (empathy?)

For example, ChatGPT 5.5 is fairly smart, but its presentation of results is kind of poor, unstructured, and unnecessarily long. It breaks down in long conversations (the long answers don't help here), and it can't deal with that except by lying and gaslighting. It also has very little empathy and mostly ignores user frustration. But at least there's branching, so one can go back without completely starting over.

Gemini doesn't feel quite as smart these days. It does well with very long conversations, except that it has bugs where all context gets lost or pruned, and it will lie and gaslight about it. There's also no branching, so once context is lost you have to start over. Presentation is decent. Empathy is fairly good, except that when users get frustrated, it gets more and more flustered and breaks down.
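If someone wanted to turn a checklist like this into a comparable number, the obvious sketch is a weighted rubric. The criteria names and weights below are purely illustrative placeholders of mine, not an established benchmark:

```python
# Hypothetical rubric for long-conversation quality; weights are
# illustrative and sum to 1.0.
CRITERIA = {
    "honesty": 0.25,            # admits failures instead of gaslighting
    "long_context": 0.25,       # holds up over very long chats
    "instruction_memory": 0.20, # keeps early conventions/parameters
    "presentation": 0.15,       # structure, wording, headings, tables
    "empathy": 0.15,            # handling of user frustration
}


def score(ratings: dict) -> float:
    """Weighted average of per-criterion ratings, each on a 0-10 scale."""
    return sum(CRITERIA[k] * ratings[k] for k in CRITERIA)
```

The hard part, of course, is producing the per-criterion ratings objectively; the aggregation itself is trivial.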


chillfox · today at 6:59 AM

Meanwhile, I can’t get Kimi K2.6 to edit a heredoc in a shell script without it fucking it up.

plexescor · today at 5:42 AM

I always thought Claude was the GOAT, but I guess it's time to change that notion and try Kimi K2.6.