My comment above wasn't meant to be rude. And you do have extensive benchmarks against grep etc., so it's clear you understand the importance of that.
But I still think you're missing the harder but more important proof, which is agent evals. Have you done any of that?
I would personally love to find tools in this space that can make agents more efficient, and I do believe there's scope for massive improvements compared to default workflows. But my evals with RTK and Headroom have made me wary: a tool can look like it should work, make sense conceptually, pass non-agentic benchmarks, and still make an actual agentic workflow worse.
It was directed at the parent, who implied that we didn’t think about this.
I agree with your point about the evals and how you can get discontinuities: good search can be worse than bad search when agents can do many searches. We’re working on it.