mdasen • today at 6:52 AM • 11 replies

I'm a bit skeptical.

Cursor's benchmark finds that Cursor's model (Composer 2.5) is basically as good as Opus 4.8 max and GPT-5.5 xhigh, but at a fraction of the price.

Artificial Analysis' testing shows Composer 2.5 to be pretty far behind: https://artificialanalysis.ai/agents/coding-agents. If you look at the DeepSWE benchmark (which is probably the hardest to game at this point), GPT-5.5 xhigh gets a 64, Opus 4.8 max gets 56, and Composer 2.5 gets 16.

I don't doubt that Cursor works well for some people. It's beating DeepSeek v4 Pro in the DeepSWE benchmark and that's a very capable model. But I'm skeptical of the claims that it's a competitor for Opus 4.8 and GPT-5.5. It just seems convenient that their model does so well on their own benchmark while third party benchmarks have it far behind. Maybe it's a really great benchmark and a better measure than third party ones - I'd love for a cheap model to do as well as the expensive ones.

Replies

leerob • today at 5:03 PM

(I work at Cursor) When Composer 2.5 launched, we initially scored very competitively on AA's composite benchmark. I believe 3rd place overall. They have recently updated to use DeepSWE, which has more of a focus on very long-horizon tasks, and Composer isn't as good at those yet. We're aware and working on this for our next model.

Overall, some benchmarks show Composer doing well, others not so much. We think the model is very capable at the given price point. There's lots to improve! If you see any specific behaviors or places where the model isn't very good, let me know here or email me at lrobinson at cursor.com.

CuriouslyC • today at 3:47 PM

Not hard to understand what's going on here. They RL'd around patterns in their data and specific capabilities, so of course they'd construct a benchmark that's aligned with the training set.

Ironically, their benchmark might be more accurate than artificial analysis for a narrow slice of things that Cursor's Eigencustomer is really interested in. Otherwise I'd take it as just another data point.

burmanm • today at 9:08 AM

DeepSWE is slightly flawed in the sense that it uses only its own harness, and that causes issues for models that are not correctly supported by it. There's a huge amount of evidence that the harness plays a big role in how these models work, and yet DeepSWE entirely removes that (and has probably only tested that it works fine with some favourite model of theirs).

There are also issues with cost calculation (as that harness doesn't use caches) and so on, as reported in their GitHub issues.

None of the benchmarks are perfect, but that does explain a lot of the variations between benchmarks.

justachillguy • today at 11:04 AM

Naturally, given it’s their benchmark they have overfitted their model somewhat to it.

famouswaffles • today at 6:58 AM

Cursor sessions are pretty much what Composer models are RL'd on. This bench and the training data are/should be basically the same distribution.

muzani • today at 9:23 AM

Anecdotally, I find Composer 2.5 to be useless. I do use light LLMs like Claude Haiku and some of Cursor's older free models, but Composer is negative productivity for me.

datadrivenangel • today at 7:33 AM

For lighter interactive agentic coding, where you type stuff into an IDE and a minute or three later get results back for review, Composer 2.5 is honestly pretty great. The results get notably worse for larger tasks, though.

WinstonSmith84 • today at 9:47 AM

That benchmark seems to match my experience. GPT 5.5 is significantly better than Opus 4.8. Last time I tried Composer 2.5 it was truly dumb, and Fable to me looks to be on par with GPT 5.5 but ... different overall. The best approach is to have an LLM peer review between GPT and Opus (now Fable) for the best outcome.

apothegm • today at 11:12 AM

Composer writes the worst, stupidest, most naive and straight-up brain-dead code you could imagine. Fast and cheap is about all it’s got going for it. I mostly use it for “sort these lines alphabetically” and stuff that’s a smidge too complex for regex find/replace.

ciaf • today at 8:10 AM

By the same token, Fable 5 is given a score of 77 vs 76 for GPT 5.5

whazor • today at 8:33 AM

I mean, they train their model on their training data. So it should score well on their own benchmark.
