logoalt Hacker News

binarymaxtoday at 6:32 PM3 repliesview on HN

I disagree that evaluation is always a coding task. Evaluation is scrutiny for the person who wants the thing. It’s subjective. So, unless you’re evaluating something purely objective, such as an algorithm, I don’t see how a self contained, self “improving “ agent accomplishes the subjectivity constraint - as by design you are leaving out the subject.


Replies

NitpickLawyertoday at 7:15 PM

Sure. There will always be subjective tasks where the person who asks for something needs to give feedback. But even there we could come up with ways to make it easier / faster / better ux. (one example I saw my frontend colleagues do is use a fast model to create 9 versions of a component, in a grid. And they "at a glance" decide which one is "better", and use that going forwards).

OTOH, there's loads you can do for evaluation before a human even sees the artifact. Things like does the site load, does it behave the same, did anything major change on the happy path, etc etc. There's a recent-ish paper where instead of classic "LLM as a judge" they used LLMs to come up with rubrics, and other instances check original prompt + rubrics on a binary scale. Saw improvements in a lot of evaluations.

Then there's "evaluate by having an agent do it" for any documentation tracking. Say you have a project, you implement a feature, and document the changes. Then you can have an agent take that documentation and "try it out". Should give you much faster feedback loops.

ranyumetoday at 6:53 PM

In science there are ways to surface subjectivity (cannot be counted) into observable quantized phenomena. Take opinion polls for instance: "approval" of a political figure can mean many things and is subjective, but experts in the field make "approval" into a number through scientific methods. These methods are just an approximation and have many IFs, they're not perfect (and for presidential campaign analysis in particular they've been failing for reasons I won't clarify here), but they're useful nonetheless.

Another thing that get quantized is video preferences to maximize engagement.