esperent • today at 12:51 AM

I did some evals with pi and GPT 5.5. I tested RTK on / headroom on / both on / both off (all with the standard pi system instructions and no AGENTS.md).

I forget the exact tests I used (a couple of the standard agent evals that people use, one Python and one TypeScript because those are what I use).

I don't claim it was an exhaustive test, or even a good one. It's possible I could have spent a day or so tuning my AGENTS.md and the pi system prompt/tool instructions and gotten better results, because if there's one thing running evals taught me, it's that subtle differences there can change the results a lot.

However, I got clearly better results with both off, enough to convince me to stop the tests immediately after 3 rounds.

The problem was that while context use did go down (sometimes), the number of turns to complete went up, so the overall cost of the conversation was higher.
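A rough sketch of why that can happen, assuming the whole context is re-sent as input on every turn and ignoring prompt caching. Every number here is made up for illustration (these are not my eval results), and the function and parameter names are hypothetical:

```python
def conversation_cost(turns, base_context, tool_tokens_per_turn,
                      output_tokens_per_turn, input_price=1, output_price=4):
    """Return (total_cost, peak_context) for a simplified agent loop.

    Assumption: each turn re-sends the full accumulated context as input,
    then appends that turn's model output and tool result to the context.
    Prices are relative units per token, not real pricing.
    """
    total = 0
    context = base_context
    peak = 0
    for _ in range(turns):
        peak = max(peak, context)
        total += context * input_price + output_tokens_per_turn * output_price
        context += tool_tokens_per_turn + output_tokens_per_turn
    return total, peak


# Hypothetical baseline: bigger tool results, fewer turns.
baseline = conversation_cost(turns=10, base_context=5000,
                             tool_tokens_per_turn=2000,
                             output_tokens_per_turn=300)

# Hypothetical token-saving tool: smaller tool results, more turns.
with_tool = conversation_cost(turns=14, base_context=5000,
                              tool_tokens_per_turn=1200,
                              output_tokens_per_turn=300)

print("baseline  (cost, peak context):", baseline)
print("with tool (cost, peak context):", with_tool)
```

With these made-up inputs the peak context is lower with the tool, but the total cost is higher, because every extra turn pays for the entire context again.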

It's made me very aware of one thing: so many people are sharing these kinds of tools, but either with zero evals (or evals that are suspiciously hard to reproduce) or, in the case of this one, with extensive benchmarks testing the wrong thing.

I'm sure this tool does use fewer tokens than grep, and the benchmarks prove it, but that's not what matters here. What matters is, does an agent using it get the same quality of work done more quickly and for lower cost?

Replies

zobzu • today at 1:03 AM

with AI the "they could so they never wondered if they should" will be a very frequent thing.
