Dave Citron here, from the MAI team. Thanks for the feedback; we're updating the model card to call out 5B active parameters (137B total).
On benchmarks: in the same VS Code harness, MAI-Code-1-Flash scored 51.2% on SWE-bench Pro vs. Haiku's 35.2%, which we see as a pretty big leap. Going forward, though, we'll include additional models in our benchmarks, such as Qwen 3.6 and Gemma 4.
Hey Dave, I’d love to add your new model to the harness I’m going to open-source very soon. I’m going to publish benchmarks on real-world tasks.
Have you run it through DeepSWE? I understand that's probably a big ask for this class of model, but it would be interesting to see regardless.
Even if it can't fully pass many tasks, there are so many tests for most of the scenarios that you can get a fairly rich report beyond the pass@1 stat. See e.g. this DeepSWE report against the Minimax M3 model: https://entrpi.github.io/misc/deep-swe-minimax-m3/