stephantul • today at 4:34 AM

We’ve been on the receiving end of this complaint with Semble. I think it is a valid complaint, but constructing a benchmark for this kind of thing is very difficult and expensive, because you have to cover every (harness) x (model) x (MCP/CLI) combination.

With traditional ML/tooling, not showing benchmarks was usually a red flag. But for LLM tooling, I’m not so sure.
