It's live on OpenRouter now.
In my personal benchmark it's bad. So far the benchmark has been a really good indicator of instruction following and agentic behaviour in general.
To those who are curious, the benchmark is just the model's ability to follow a custom tool-calling format. I ask it to do coding tasks using chat.md [1] + MCPs, and so far it's just not able to follow the format at all.
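For the curious, here's a minimal sketch of what that kind of pass/fail check can look like. The tag syntax and names below are hypothetical illustrations, not chat.md's actual format:

```python
import json
import re

# Hypothetical custom tool-call format (illustrative only, not chat.md's
# actual syntax): the model must wrap each call in explicit tags with a
# JSON argument body, e.g.
#   <tool_call name="read_file">{"path": "src/main.py"}</tool_call>
TOOL_CALL = re.compile(
    r'<tool_call name="(?P<name>[\w.-]+)">(?P<args>.*?)</tool_call>',
    re.DOTALL,
)

def follows_format(output: str) -> bool:
    """Pass only if the model emitted at least one call and every call's
    argument body is valid JSON."""
    calls = TOOL_CALL.findall(output)
    if not calls:
        return False
    for _name, args in calls:
        try:
            json.loads(args)
        except json.JSONDecodeError:
            return False
    return True
```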
I love the idea of chat.md.
I'm developing a personal text editor with vim keybindings, and I paused work on it because I couldn't come up with an interface that felt right. This could be it.
I think I'll update my editor to do something like this but with intelligent "collapsing" of extra text to reduce visual noise.
Could also be the provider that is bad. Happens way too often on OpenRouter.
Be careful with OpenRouter. They routinely host quantized versions of models via their listed providers, and the models just suck because of that. Use the original providers only.
Custom tool-calling formats are iffy in my experience. The models are all reinforcement-learned to follow specific ones, so it's always a battle and feels to me like I'm using the tool wrong.
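As a hedged sketch of why it's a battle: a model trained on a native JSON tool-call shape tends to drift back to it, so a harness with a custom format often ends up parsing both. Everything here (tag syntax, names, the assumed native shape) is hypothetical:

```python
import json
import re

# Hypothetical custom tag format the prompt asks for.
CUSTOM = re.compile(r'<tool_call name="([\w.-]+)">(.*?)</tool_call>', re.DOTALL)

def parse_tool_call(output: str):
    """Prefer the custom tag format, but fall back to a bare JSON object
    shaped like the native format the model was trained on."""
    m = CUSTOM.search(output)
    if m:
        return m.group(1), json.loads(m.group(2))
    try:
        native = json.loads(output)  # e.g. {"name": "...", "arguments": {...}}
        return native["name"], native.get("arguments", {})
    except (json.JSONDecodeError, KeyError, TypeError):
        return None  # neither format: the model broke protocol entirely
```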
Have you had good results with the other frontier models?