esperent, today at 2:55 PM

> LLMs are intensely and emphatically non-deterministic and LLM guidance basically becomes gardening advice

Have you ever tried doing evals on moderately complex but bounded tasks?

I spent some time doing it when testing these "token reducing" tools like Headroom, RTK etc., as well as customizing my Pi tools. What I found interesting was that despite LLMs being non-deterministic, for a given toolset and prompt the results were highly consistent for a given eval, across multiple models. (I tested at the time using GPT 5.4 mini, 5.5, 5.3 codex, and Gemini 3 flash, initially running sets of 5 evals on each task, but once I realized how consistent the results were, dropping to sets of 3.)
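For anyone who wants to try the same kind of check, here is a rough sketch of what "sets of N evals per task, per model" could look like. It is not my actual harness: `run_once` is a hypothetical callable you would supply that runs one eval and returns an outcome label such as "pass" or "fail", and the agreement metric (share of runs matching the most common outcome) is just one simple way to put a number on consistency.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple


def agreement_rate(outcomes: List[str]) -> float:
    """Fraction of runs whose outcome matches the most common outcome."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    most_common_count = Counter(outcomes).most_common(1)[0][1]
    return most_common_count / len(outcomes)


def consistency_table(
    run_once: Callable[[str, str], str],  # hypothetical: (model, task) -> outcome label
    models: List[str],
    tasks: List[str],
    runs: int = 5,
) -> Dict[Tuple[str, str], float]:
    """Run each (model, task) pair `runs` times and record the agreement rate."""
    table = {}
    for model in models:
        for task in tasks:
            outcomes = [run_once(model, task) for _ in range(runs)]
            table[(model, task)] = agreement_rate(outcomes)
    return table
```

If the agreement rate sits at or near 1.0 across the board with `runs=5`, that is the kind of signal that would justify dropping to `runs=3`.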

Aside: in my tests, RTK and Headroom made the overall context use higher for roughly equivalent results. The context use for those specific tool calls went down, but the number of model turns went up, and the overall context use went up with it.
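To make the distinction between per-tool-call savings and overall context use concrete, here is a small sketch of the bookkeeping. The `Turn` record and its field names are assumptions for illustration, and the numbers in the example are made up to show the shape of the effect; they are not my measurements.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Turn:
    prompt_tokens: int       # context sent to the model on this turn
    tool_output_tokens: int  # tokens returned by tool calls on this turn


def summarize(turns: List[Turn]) -> Dict[str, int]:
    """Total up turns, tool output tokens, and overall context for one session."""
    return {
        "turns": len(turns),
        "tool_output_tokens": sum(t.tool_output_tokens for t in turns),
        "total_context_tokens": sum(t.prompt_tokens for t in turns),
    }


# Made-up illustrative sessions: the "compressed" one has smaller tool
# outputs per call but needs more turns to finish the same task.
baseline = [Turn(1000, 400), Turn(1500, 400), Turn(2000, 400)]
compressed = [Turn(1000, 150), Turn(1200, 150), Turn(1400, 150),
              Turn(1600, 150), Turn(1800, 150)]

baseline_summary = summarize(baseline)
compressed_summary = summarize(compressed)
```

Comparing only `tool_output_tokens` would make the compressed session look better; comparing `total_context_tokens` shows whether the extra turns ate the savings.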