The issue is that you can't do unsupervised learning if you require humans.
LLMs grading the answers is relying on the LLM knowing the answer and not just hallucinating it. You also have issues if/when the model refuses to answer, or if it gets stuck in a loop (e.g. if running locally with a heavily quantized model).
I'm investigating/experimenting with using traditional NLP (stanza, spaCy, etc.) to try and grade the responses according to different metrics (is the response in first/second/third person?, is it written as poetry, prose, or drama? etc.). I'm also thinking about using information extraction and synonym detection to handle data queries and the like.
Obviously there are advantages to not having to do work yourself.
But for a benchmark with the goal of picking a model to replace a human on some task? I really think the human should judge which is best.
I haven’t gotten very far yet but I had an idea for a personalized benchmark tool that walks through your git history and helps you craft prompts for tasks that bugs or features already implemented by hand so you can compare how different LLMs would do it.