Huh, according to that model card this is a 137B total parameter model.
Performance doesn't seem that good:
- MAI-Code-1-Flash (137B-A5B) = 51% on SWE-bench Pro
- Qwen3.6-35B-A3B = 49.5% on SWE-bench Pro (https://huggingface.co/Qwen/Qwen3.6-35B-A3B)
They benchmark against Claude Haiku, but Haiku is not good: it's worse than tiny open models you can run locally or via API at 10% of the cost.
Qwen is definitely the model to beat as of mid-2026. I didn't benchmark with SWE-bench, since my use case is OpenClaw [1], but I found both Qwen 3.6 35B A3B and, more impressively, Qwen 3.5 122B A10B starting to be competitive with closed flash models. The NVFP4 quant of the latter is what I'm running now on DGX.
[1] https://srinathh.medium.com/mid-size-local-models-are-now-co...
The takeaway is that this is a smaller model that competes with Haiku. I would hope they come out with a "Sonnet"-competing model, then Opus. I have been wondering why Microsoft is kind of "sleeping" on offering its own models on Copilot; maybe it was part of their deal with OpenAI? Not sure.
> 137B-A5B
Yeah, not a 5B param model as the earlier title implied!
So what other models use less than half of Haiku's tokens while providing a higher success rate?
While I agree directionally, I'll caveat that "cost per token" != "cost per task". Qwen3.6 tends to think about 1.6x more than Haiku, so Haiku ends up costing only about double on the same tasks. More detail from comparing their Artificial Analysis metrics:
Qwen3.6-35B-A3B vs Claude Haiku 4.5
reasoning mode · AA Intelligence Index v4.0
[Bubble chart. x → cost to run the index (USD), lower is better. y → AA intelligence index, higher is better. Bubble area = output speed (tokens / sec). Up and to the left is better: cheaper, smarter, faster. Qwen3.6-35B-A3B (●, ~196 t/s) sits upper left; Claude Haiku 4.5 (○, ~93 t/s) sits lower right.]
┌─────────────────────┬──────────┬──────────┬───────────┐
│ model │ AA index │ run cost │ out speed │
├─────────────────────┼──────────┼──────────┼───────────┤
│ Qwen3.6-35B-A3B ●│ 43.5 │ $280 │ 196 t/s │
│ Claude Haiku 4.5 ○│ 37.1 │ $620 │ 93 t/s │
└─────────────────────┴──────────┴──────────┴───────────┘
COST PER TOKEN ≠ COST PER TASK
output tokens per index run:
Haiku 4.5 87.3M (79.3M reasoning + 8.0M answer)
Qwen3.6 143.2M (131.7M reasoning + 11.5M answer)
→ Qwen emits 1.64× more output
── output speed (tokens / sec) ────────── raw rate · higher = faster
Qwen3.6 100% ~196 t/s
Haiku 4.5 ~47% ~93 t/s
→ Qwen ~2.1× faster per token
╎ 1.64× more tokens < 2.1× faster rate
▼
── solution speed (per finished answer) ── higher = faster
Qwen3.6 100%
Haiku 4.5 ~78%
→ Qwen ~1.3× FASTER to a solution
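SCORECARD
┌──────────────────┬──────────────┬─────────────┬───────────────────┐
│ model            │ intelligence │ cost / task │ speed to solution │
├──────────────────┼──────────────┼─────────────┼───────────────────┤
│ Qwen3.6-35B-A3B  │ 43.5         │ $280        │ ~1.3× faster      │
│ Claude Haiku 4.5 │ 37.1         │ $620        │ (slower)          │
└──────────────────┴──────────────┴─────────────┴───────────────────┘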
→ Qwen wins all three. The reasoning blow-up (1.64×) is smaller than
the raw-speed edge (2.1×), so Qwen stays ahead per task.
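For anyone who wants to check the arithmetic, here is a minimal Python sketch that recomputes the ratios from the figures quoted above. It assumes time to a finished answer scales as output tokens divided by output speed, which ignores time to first token, input processing, and provider queueing, so treat the result as a rough estimate. The `runs` dict and the `compare` helper are names made up for this illustration.

```python
# Figures copied from the comparison above (one full AA index run per model).
runs = {
    "Qwen3.6-35B-A3B": {"output_tokens_m": 143.2, "tokens_per_sec": 196, "run_cost_usd": 280},
    "Claude Haiku 4.5": {"output_tokens_m": 87.3, "tokens_per_sec": 93, "run_cost_usd": 620},
}


def compare(a: dict, b: dict) -> dict:
    """Compare model a against model b on tokens, speed, time and cost."""
    token_ratio = a["output_tokens_m"] / b["output_tokens_m"]  # how much more a emits
    speed_ratio = a["tokens_per_sec"] / b["tokens_per_sec"]    # how much faster a streams
    # Assumption: time to solution ~ tokens / speed, so the speedup is
    # the raw-speed edge divided by the extra tokens emitted.
    solution_speedup = speed_ratio / token_ratio
    cost_ratio = b["run_cost_usd"] / a["run_cost_usd"]         # how much more b costs
    return {
        "token_ratio": token_ratio,
        "speed_ratio": speed_ratio,
        "solution_speedup": solution_speedup,
        "cost_ratio": cost_ratio,
    }


result = compare(runs["Qwen3.6-35B-A3B"], runs["Claude Haiku 4.5"])
for name, value in result.items():
    print(f"{name}: {value:.2f}x")
```

The per-task picture only flips if the extra reasoning tokens outgrow the raw-speed edge, i.e. if `token_ratio` exceeds `speed_ratio`.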
Dave Citron here, from the MAI team. Thanks for the feedback; we're getting the model card updated to call out 5B active parameters (137B total).
On benchmarks: in the same VS Code harness, MAI-Code-1-Flash scored 51.2% on SWE-bench Pro vs. Haiku's 35.2%, which we see as a pretty big leap. Going forward, we'll include additional models in our benchmarks, such as Qwen 3.6 and Gemma 4.