I actually really like subjective benchmarks, so long as it's a human (ideally me) grading the results. LLM-as-judge never made much sense to me.
The issue is that you can't do unsupervised learning if you require humans.