Hacker News

quantumleaper, yesterday at 2:00 PM

How are you iterating on a system prompt and tool descriptions without an eval that gives you hard numbers for improvement or regression?


Replies

yogthos, yesterday at 5:41 PM

I look at what the model is doing in the loop and whether the harness is catching the problem cases: the model having to write scripts to balance parens, the model trying the same thing over and over again, and all the other cases I explained in detail in the blog post.

Even without having hard numbers, it's pretty easy to see from the log whether the model is getting stuck or not.
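
For illustration only, here is a rough sketch of how the "same thing over and over" check could be automated against a log. It is not taken from the blog post or the harness; the log shape (a list of `(tool, arguments)` pairs) and the function names `repeated_calls` and `longest_streak` are assumptions made up for this example.

```python
from collections import Counter
from typing import Iterable

# Hypothetical log shape: one (tool_name, arguments) pair per model action.
Call = tuple[str, str]


def repeated_calls(calls: Iterable[Call], threshold: int = 3) -> dict[Call, int]:
    """Return each (tool, arguments) pair seen at least `threshold` times."""
    counts = Counter(calls)
    return {call: n for call, n in counts.items() if n >= threshold}


def longest_streak(calls: list[Call]) -> int:
    """Return the length of the longest run of identical consecutive calls."""
    best = current = 0
    previous = None
    for call in calls:
        # Extend the run if this call matches the previous one, else restart.
        current = current + 1 if call == previous else 1
        best = max(best, current)
        previous = call
    return best
```

A long streak or a high repeat count is only a hint that the model may be stuck; reading the log around the flagged calls is still what tells you why.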