Hacker News

CommieBobDole today at 1:42 PM

This has always been a thing with IT advice, though - the more complex a system and the outcome, the harder it is to clearly define "better" or "worse". Add in the fact that LLMs are intensely and emphatically non-deterministic and LLM guidance basically becomes gardening advice.

Heck, even the 'benchmarks' are mostly somebody's attempt to crystallize their vibes with varying amounts of success.


Replies

esperent today at 2:55 PM

> LLMs are intensely and emphatically non-deterministic and LLM guidance basically becomes gardening advice

Have you ever tried doing evals on moderately complex but bounded tasks?

I spent some time doing it when testing these "token reducing" tools like Headroom, RTK etc., as well as customizing my Pi tools. What I found interesting was that despite LLMs being non-deterministic, for a given toolset and prompt the results were highly consistent for a given eval, across multiple models. (At the time I tested using GPT 5.4 mini, 5.5, 5.3 codex, and Gemini 3 flash, initially running sets of 5 evals on each task, but once I realized how consistent the results were, I dropped to sets of 3.)

Aside: in my tests, RTK and Headroom made the overall context use higher for roughly equivalent results. The context use for those specific tool calls went down, but the number of model turns and overall context use went up.
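
For anyone wanting to try the same kind of check, here is a minimal sketch of a repeated-run eval in Python. It is not the harness used for the tests above. `RunResult`, `summarize`, `run_eval`, and the `run_once` callable are all hypothetical names, and `run_once` is assumed to wrap whatever agent or API you are testing and to report pass/fail, turn count, and total context tokens for one run.

```python
from statistics import mean, pstdev
from typing import Callable, NamedTuple


class RunResult(NamedTuple):
    """Outcome of a single eval run (hypothetical structure)."""
    passed: bool          # did the run satisfy the task's check?
    turns: int            # number of model turns used
    context_tokens: int   # total context consumed across all turns


def summarize(runs: list[RunResult]) -> dict[str, float]:
    """Aggregate repeated runs of one (model, task) combination."""
    tokens = [r.context_tokens for r in runs]
    avg_tokens = mean(tokens)
    return {
        "pass_rate": mean(1.0 if r.passed else 0.0 for r in runs),
        "mean_turns": mean(r.turns for r in runs),
        "mean_tokens": avg_tokens,
        # Coefficient of variation: a small value means the runs agree.
        "token_cv": pstdev(tokens) / avg_tokens if avg_tokens else 0.0,
    }


def run_eval(
    run_once: Callable[[str, str], RunResult],
    models: list[str],
    tasks: list[str],
    n: int = 5,
) -> dict[tuple[str, str], dict[str, float]]:
    """Run every task n times on every model and summarize each set."""
    return {
        (model, task): summarize([run_once(model, task) for _ in range(n)])
        for model in models
        for task in tasks
    }
```

Comparing `mean_tokens` and `mean_turns` with and without a given tool enabled measures overall context use rather than per-call savings, and a low `token_cv` across a set is one rough signal that fewer runs per set are enough.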

dofm today at 1:54 PM

Gardening advice. Better analogy.