Benchmarks are great, but I feel like there’s a better way. This seems quite subjective.
What you really need is an objective benchmark.
I actually really like subjective benchmarks, so long as it's a human (ideally me) grading the results. LLM as judge never made much sense.
> What you really need is an objective benchmark
"When are all the software engineers unemployed?"