> You cannot simply make a claim that (model + harness) X is better than Y, but then have no disc...

spider-mario • today at 2:50 PM • 3 replies • view on HN

> You cannot simply make a claim that (model + harness) X is better than Y, but then have no discernible difference in the output.

You definitely can in principle; that’s the entire point of the comment you are responding to. If one tool completes it in 10 minutes with little hand holding, and the other does it in one hour at 4× the cost and while needing a lot of steering, the former is arguably better even if the end result is the same.

Whether that’s specifically true and demonstrable of GPT and Claude is another question, but your blanket statement doesn’t hold as a general rule.

Replies

neosat • today at 2:58 PM

That's a fair callout and I agree my statement was too general in just mentioning 'output', as you correctly pointed out. To define 'better' you would indeed need to agree on the dimensions you would evaluate candidates against.

I think a more appropriate rephrasing would be 'You cannot simply make a claim that (model + harness) X is better than Y, but then have no discernible difference on dimensions you care about'. In the case of latest of claude code vs codex with gpt 5.5) both are similar enough in the dimensions people will care about in evaluating (vs. differing wildly in cost or time taken).

runako • today at 2:57 PM

This obviously correct take will get pushback, so let me add some other examples:

- which tool required more detailed goal-setting in the prompt?

- did one tool ask follow-up questions up front vs spread out over implementation?

- did either tool match existing coding styles?

- did either tool remind you about potential conflicts between what you asked it to build and other parts of the codebase?

There are a lot of ways to compare agents besides just the code. (Similarly, working engineers are not evaluated just on their code output.)

SiempreViernes • today at 4:41 PM

The colleague implicitly agreed that comparing the output was a valid way to settle the matter as they took part in the test, so they weren't using "better" in the way you propose.

alt Hacker News

Replies