The core of the problem is that there are a million tools that make AI better, and no ways to measure whether AI is working better.
Big companies with popular products have it. They do something between normal product analytics and chatbot evals to figure out if users are being successful in their sessions. That's the job.
But any given dev, with between 3 and 50 sessions a day? Like, I have no idea what makes the LLM better. It's all vibes.
My company has a whole stack here. Preferred harnesses, preferred models, skills, the shape of our code, everything. There's gotta be a way to measure whether this setup is working for us, at 1 / 1-million-th the scale of a Claude Code.
And the effort to produce valid benchmarks is tremendous. You are probably right and that’s very annoying. We already had flame wars over frameworks and this is way worse, your vibes vs. my vibes. Who would have thought non-deterministic outputs would lead us here?
There is an answer: these tools should benchmark by cost per correct answer, not just tokens saved.
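To make that concrete, here is a rough sketch of the metric. The `Run` record, the dollar figures, and the idea that you already have some pass/fail check per session (tests, review, whatever) are all assumptions for illustration, not anything a specific tool reports.

```python
from dataclasses import dataclass


@dataclass
class Run:
    cost_usd: float  # total spend for one session (assumed known from billing or token counts)
    correct: bool    # did the result pass your own check? (hypothetical pass/fail signal)


def cost_per_correct(runs: list[Run]) -> float:
    """Total spend across all runs divided by the number of correct runs."""
    total = sum(r.cost_usd for r in runs)
    n_correct = sum(1 for r in runs if r.correct)
    if n_correct == 0:
        return float("inf")  # nothing worked, so every dollar was wasted
    return total / n_correct


# Made-up numbers: the "tool" setup is 25% cheaper per run but fails more often.
baseline = [Run(0.40, True), Run(0.40, True), Run(0.40, False), Run(0.40, True)]
with_tool = [Run(0.30, True), Run(0.30, False), Run(0.30, False), Run(0.30, True)]

print(f"baseline:  ${cost_per_correct(baseline):.2f} per correct answer")
print(f"with tool: ${cost_per_correct(with_tool):.2f} per correct answer")
```

With these invented numbers the cheaper-per-run setup ends up more expensive per correct answer, which is exactly what a tokens-saved figure would hide.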
> and no ways to measure whether AI is working better.
What I do with my product is explicitly tell you to ask your agent. I have real-world examples and real-world repositories that you can try it with:
https://gitsense.com
https://github.com/gitsense/smart-ripgrep
https://github.com/gitsense/smart-codex
Average token savings are not what I am most interested in, though. I am more interested in knowing that the AI doesn't load unnecessary files into context, which can affect reasoning.
You can just ask the agent after a task: "How many files do you think you skipped reading because you knew each file's purpose first?"
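If you would rather not rely on the agent's self-report, a crude check is possible from a session log. This is only a sketch: it assumes you can extract the set of files the agent read and that you have some notion of which files were actually needed (for example, the files in the final diff). Both inputs and the function name are hypothetical.

```python
def unnecessary_reads(files_read: set[str], files_needed: set[str]) -> tuple[set[str], float]:
    """Return the files that were read but not needed, and their share of all reads."""
    wasted = files_read - files_needed
    ratio = len(wasted) / len(files_read) if files_read else 0.0
    return wasted, ratio


# Made-up session: four files loaded into context, two of them actually mattered.
read = {"api/routes.py", "api/auth.py", "docs/changelog.md", "tests/test_auth.py"}
needed = {"api/auth.py", "tests/test_auth.py"}

wasted, ratio = unnecessary_reads(read, needed)
print(sorted(wasted), f"{ratio:.0%} of reads were unnecessary")
```

"Needed" is the fuzzy part here. A file can inform the answer without appearing in the diff, so treat the ratio as a rough signal rather than a score.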