Hacker News

charcircuit, today at 8:30 AM

A pretty simple one would be to have every model try to one-shot every ticket your company has, and then measure each model's acceptance rate.
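A minimal sketch of how that tally could look, assuming each attempt is logged as a (model, ticket, accepted) record. The model names, ticket IDs, and the `acceptance_rates` helper are hypothetical, made up for illustration:

```python
from collections import defaultdict

def acceptance_rates(results):
    """Return the fraction of accepted attempts per model.

    results: iterable of (model_name, ticket_id, accepted) tuples,
    where accepted is True if the one-shot attempt was accepted.
    """
    accepted = defaultdict(int)
    total = defaultdict(int)
    for model, _ticket, ok in results:
        total[model] += 1
        if ok:
            accepted[model] += 1
    return {model: accepted[model] / total[model] for model in total}

# Hypothetical log of one-shot attempts (not real data).
sample = [
    ("model-a", "T-1", True),
    ("model-a", "T-2", False),
    ("model-b", "T-1", True),
    ("model-b", "T-2", True),
]
rates = acceptance_rates(sample)
print(rates)
```

How "accepted" gets decided (review, merge, passing tests) is left open here, since the comment does not say.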


Replies

sam_goody, today at 8:35 AM

Except that if you tried one-shotting your ticket twenty times, at different hours of the day and on different days of the week, you would see enough variation to get different benchmark results even if you used the same model every time. Even more so if you fiddled with the thinking settings or changed the prompt.

That is because the models are non-deterministic, because they are constantly updated and changed, and because they are throttled according to the number of users, releases, etc.
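One rough way to account for this, sketched under assumptions: repeat the whole benchmark several times per model, then only trust a difference between models if it is larger than the run-to-run spread. The numbers, the `k=2.0` threshold, and the helper names below are all hypothetical, and this is a crude heuristic rather than a proper significance test:

```python
import statistics

def run_spread(rates):
    """Mean and sample standard deviation of per-run acceptance rates."""
    return statistics.mean(rates), statistics.stdev(rates)

def gap_exceeds_noise(rates_a, rates_b, k=2.0):
    """True if the gap between the two means is larger than k times
    the larger of the two run-to-run standard deviations.

    k is an arbitrary illustrative threshold, not a statistical standard.
    """
    mean_a, sd_a = run_spread(rates_a)
    mean_b, sd_b = run_spread(rates_b)
    noise = k * max(sd_a, sd_b)
    return abs(mean_a - mean_b) > noise

# Hypothetical acceptance rates from four repeated runs per model.
model_a = [0.60, 0.64, 0.58, 0.62]
model_b = [0.61, 0.66, 0.59, 0.63]
model_c = [0.80, 0.82, 0.81, 0.79]

print(gap_exceeds_noise(model_a, model_b))  # small gap, within the spread
print(gap_exceeds_noise(model_a, model_c))  # large gap, beyond the spread
```

Spreading the repeated runs over different hours and days, as described above, would fold the time-of-day effects into the measured spread.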
