It's live on OpenRouter now.
In my personal benchmark it's bad. So far the benchmark has been a really good indicator of instruction following and agentic behaviour in general.
To those who are curious, the benchmark is just the model's ability to follow a custom tool-calling format. I ask it to do coding tasks using chat.md [1] + MCPs, and so far it's just not able to follow the format at all.
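For the curious, here's a minimal sketch of what that kind of pass/fail check can look like. The tag syntax and names below are hypothetical illustrations, not chat.md's actual format:

```python
import json
import re

# Hypothetical custom tool-call format (illustrative only, not chat.md's
# actual syntax): the model must wrap each call in explicit tags with a
# JSON argument body, e.g.
#   <tool_call name="read_file">{"path": "src/main.py"}</tool_call>
TOOL_CALL = re.compile(
    r'<tool_call name="(?P<name>[\w.-]+)">(?P<args>.*?)</tool_call>',
    re.DOTALL,
)

def follows_format(output: str) -> bool:
    """Pass only if the model emitted at least one call and every call's
    argument body is valid JSON."""
    calls = TOOL_CALL.findall(output)
    if not calls:
        return False
    for _name, args in calls:
        try:
            json.loads(args)
        except json.JSONDecodeError:
            return False
    return True
```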
I love the idea of chat.md.
I'm developing a personal text editor with vim keybindings, and I paused work on it because I couldn't come up with an interface that felt right. This could be it.
I think I'll update my editor to do something like this but with intelligent "collapsing" of extra text to reduce visual noise.
Could also be the provider that is bad. Happens way too often on OpenRouter.
Be careful with OpenRouter. They routinely host quantized versions of models via their listed providers, and the models just suck because of that. Use the original providers only.
Custom tool-calling formats are iffy in my experience. The models are all reinforcement-learned to follow specific ones, so it's always a battle and feels to me like I'm using the tool wrong.
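As a hedged sketch of why it's a battle: a model trained on a native JSON tool-call shape tends to drift back to it, so a harness with a custom format often ends up parsing both. Everything here (tag syntax, names, the assumed native shape) is hypothetical:

```python
import json
import re

# Hypothetical custom tag format the prompt asks for.
CUSTOM = re.compile(r'<tool_call name="([\w.-]+)">(.*?)</tool_call>', re.DOTALL)

def parse_tool_call(output: str):
    """Prefer the custom tag format, but fall back to a bare JSON object
    shaped like the native format the model was trained on."""
    m = CUSTOM.search(output)
    if m:
        return m.group(1), json.loads(m.group(2))
    try:
        native = json.loads(output)  # e.g. {"name": "...", "arguments": {...}}
        return native["name"], native.get("arguments", {})
    except (json.JSONDecodeError, KeyError, TypeError):
        return None  # neither format: the model broke protocol entirely
```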
Have you had good results with the other frontier models?