Hacker News

pcwelder · yesterday at 5:28 PM · 4 replies

It's live on openrouter now.

In my personal benchmark it's bad. So far the benchmark has been a really good indicator of instruction following and agentic behaviour in general.

For those who are curious, the benchmark is just the model's ability to follow a custom tool-calling format. I ask it to do coding tasks using chat.md [1] + MCPs, and so far it's just not able to follow the format at all.

[1] https://github.com/rusiaaman/chat.md
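The benchmark described above can be checked mechanically: render the custom format, then score whether the model's output conforms. Below is a minimal sketch of such a checker, assuming a hypothetical XML-style tool-call syntax (illustrative only, not chat.md's actual format):

```python
import re

# Hypothetical custom tool-call syntax (an assumption for illustration,
# not chat.md's real format): the model must emit
#   <tool name="...">...payload...</tool>
# instead of its provider's native function-calling schema.
TOOL_CALL_RE = re.compile(
    r'<tool name="(?P<name>[\w-]+)">(?P<args>.*?)</tool>',
    re.DOTALL,
)

def follows_format(output: str) -> bool:
    """True if the response contains at least one well-formed tool call."""
    return TOOL_CALL_RE.search(output) is not None

def extract_calls(output: str) -> list[tuple[str, str]]:
    """Pull (tool name, argument payload) pairs out of a model response."""
    return [(m.group("name"), m.group("args").strip())
            for m in TOOL_CALL_RE.finditer(output)]
```

A benchmark run would then just be a pass rate: the fraction of responses for which `follows_format` is true and the extracted calls are usable.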


Replies

data-ottawa · yesterday at 6:29 PM

Custom tool-calling formats are iffy in my experience. The models are all reinforcement-learned to follow specific ones, so it's always a battle and feels to me like using the tool wrong.

Have you had good results with the other frontier models?

manofmanysmiles · yesterday at 6:04 PM

I love the idea of chat.md.

I'm developing a personal text editor with vim keybindings and paused work because I couldn't think of a good interface that felt right. This could be it.

I think I'll update my editor to do something like this but with intelligent "collapsing" of extra text to reduce visual noise.

nolist_policy · yesterday at 6:08 PM

Could also be the provider that is bad. Happens way too often on OpenRouter.

sergiotapia · yesterday at 6:25 PM

Be careful with OpenRouter. Its listed providers routinely host quantized versions of models, and the models just suck because of that. Use the original providers only.
