
eli, today at 12:52 PM

Obviously there are advantages to not having to do work yourself.

But for a benchmark with the goal of picking a model to replace a human on some task? I really think the human should judge which is best.

I haven’t gotten very far yet, but I had an idea for a personalized benchmark tool that walks through your git history and helps you craft prompts for tasks based on bugs or features you already implemented by hand, so you can compare how different LLMs would do them.
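
To make the idea concrete, here is a rough sketch of what the first step might look like. It is not the actual tool. It assumes `git` is on the PATH, and the names `Commit`, `parse_log`, `read_commits`, and `draft_prompt` are made up for illustration. It reads recent non-merge commits and turns each one into a draft prompt that a human would then edit by hand.

```python
import subprocess
from dataclasses import dataclass

# Separators for this sketch: ASCII unit/record separators, which are
# unlikely to appear in commit messages. git expands %x1f and %x1e to them.
FIELD_SEP = "\x1f"
RECORD_SEP = "\x1e"
LOG_FORMAT = "%H%x1f%s%x1f%b%x1e"  # full hash, subject, body


@dataclass
class Commit:
    sha: str
    subject: str
    body: str


def parse_log(raw: str) -> list[Commit]:
    """Split raw `git log` output (in LOG_FORMAT) into Commit records."""
    commits = []
    for record in raw.split(RECORD_SEP):
        # Strip only newlines: Python's bare strip() treats \x1f as
        # whitespace and would eat the separator before an empty body.
        record = record.strip("\n")
        if not record:
            continue
        sha, subject, body = record.split(FIELD_SEP, 2)
        commits.append(Commit(sha, subject, body.strip()))
    return commits


def read_commits(repo: str, limit: int = 20) -> list[Commit]:
    """Read the most recent non-merge commits from the repo at `repo`."""
    result = subprocess.run(
        ["git", "-C", repo, "log", f"-n{limit}", "--no-merges",
         f"--pretty=format:{LOG_FORMAT}"],
        capture_output=True, text=True, check=True,
    )
    return parse_log(result.stdout)


def draft_prompt(commit: Commit) -> str:
    """Build a starting-point prompt; a human is expected to rewrite it."""
    lines = [
        f"Starting from the code as of {commit.sha}^, make this change:",
        commit.subject,
    ]
    if commit.body:
        lines += ["", commit.body]
    return "\n".join(lines)
```

Calling `read_commits(".")` in a repository and passing each result to `draft_prompt` gives one draft per commit. The hand-written diff (`git show <sha>`) stays available as the reference to judge each model's attempt against, which keeps the human as the judge.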