trjordan yesterday at 8:03 PM (3 replies)

The core of the problem is that there are a million tools that make AI better, and no ways to measure whether AI is working better.

Big companies with popular products have it. They do something between normal product analytics and chatbot evals to figure out if users are being successful in their sessions. That's the job.

But any given dev, with between 3 and 50 sessions a day? Like, I have no idea what makes the LLM better. It's all vibes.

My company has a whole stack here. Preferred harnesses, preferred models, skills, the shape of our code, everything. There's gotta be a way to measure whether this setup is working for us, at one-millionth the scale of a Claude Code.


Replies

sdesol today at 12:48 AM

> and no ways to measure whether AI is working better.

What I do with my product is explicitly tell you to ask your agent. I have real-world examples and real-world repositories that you can try it with:

https://gitsense.com

https://github.com/gitsense/smart-ripgrep

https://github.com/gitsense/smart-codex

Average token savings are not what I am most interested in, though. I am more interested in knowing that the AI doesn't load unnecessary files into context, which can affect reasoning.

After a task, you can just ask the agent: "How many files do you think you avoided reading because you knew each file's purpose first?"

lackoftactics today at 12:13 AM

And the effort to produce valid benchmarks is tremendous. You are probably right, and that's very annoying. We already had flame wars over frameworks, and this is way worse: your vibes vs. my vibes. Who would have thought non-deterministic outputs would lead us here?

jahala yesterday at 9:37 PM

There is an answer: these tools should benchmark by cost per correct answer, not just tokens saved.
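
For what it's worth, here is a rough sketch of how that metric could be computed. Everything in it is an assumption for illustration: the `Run` record, the `cost_usd` and `correct` fields, and the made-up numbers. It also assumes you already have some way to judge whether a run's answer was correct, which is the hard part.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One agent session on a benchmark task (hypothetical record shape)."""
    cost_usd: float   # total spend for the run, e.g. tokens priced in dollars
    correct: bool     # whether the final answer passed your own check


def cost_per_correct_answer(runs: list[Run]) -> float:
    """Total spend across all runs divided by the number of correct runs.

    Failed runs still cost money, so they count in the numerator
    but not in the denominator.
    """
    total_cost = sum(r.cost_usd for r in runs)
    n_correct = sum(1 for r in runs if r.correct)
    if n_correct == 0:
        return float("inf")  # nothing worked, so every correct answer is infinitely expensive
    return total_cost / n_correct


# Made-up numbers: the "saver" setup spends less per run but is right less often.
baseline = [Run(0.40, True), Run(0.40, True), Run(0.40, False), Run(0.40, True)]
saver = [Run(0.25, True), Run(0.25, False), Run(0.25, False), Run(0.25, False)]

print(f"baseline: ${cost_per_correct_answer(baseline):.2f} per correct answer")
print(f"saver:    ${cost_per_correct_answer(saver):.2f} per correct answer")
```

In this toy data the token-saving setup is cheaper per run and still loses on cost per correct answer, which is the gap a tokens-saved number hides.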