Hacker News

stephantul, today at 4:34 AM

We’ve been on the receiving end of this complaint with Semble. I think it is a valid complaint, but constructing a benchmark for this kind of thing is very difficult and expensive, because you have to cover every (harness) x (model) x (MCP/CLI) combination.

With traditional ML/tooling, not showing benchmarks was usually a red flag. But for LLM tooling, I’m not so sure.