fiso64 • today at 7:07 AM

I think benchmarks like this are too subjective and narrow to be useful. For example, whether a patch "bloats" the codebase really depends on the situation: if it's building a feature that will grow in the future, or refactoring code that has a long history of bugs, then a larger patch might in fact be good. It's not clear from the blog just how much context the LLM judge receives about the long-term project goals and history.

Benchmarks should evaluate only the final result. Maybe ask the coder to build a full app, or implement many large new features for an existing app in sequence, with a larger set of requirements, or have another LLM roleplay as the human to make the instructions a little more underspecified. When done, have a reviewer harness test the product, not the code, for 5 hours. Count the number of bugs and weight them by severity. "Taste" would then become an automatic consequence of correctness.

(Full disclosure, I'm not a software engineer.)
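
For illustration only, the scoring step proposed in the comment above (count the bugs a reviewer harness finds and weight them by severity) could be sketched as below. The severity labels, the weights, and the `weighted_bug_score` helper are all hypothetical; the comment does not specify any of them, and a real harness would need its own scale.

```python
from collections import Counter

# Hypothetical severity weights; the comment does not specify any scale.
SEVERITY_WEIGHTS = {"critical": 10, "major": 5, "minor": 1}


def weighted_bug_score(bug_severities):
    """Return the sum of severity weights over all reported bugs (lower is better)."""
    counts = Counter(bug_severities)
    unknown = set(counts) - set(SEVERITY_WEIGHTS)
    if unknown:
        raise ValueError(f"unknown severity labels: {sorted(unknown)}")
    return sum(SEVERITY_WEIGHTS[label] * n for label, n in counts.items())


# Example: bugs a reviewer harness might report for one submission.
bugs_found = ["critical", "minor", "minor", "major"]
print(weighted_bug_score(bugs_found))  # 10 + 1 + 1 + 5 = 17
```

Under this sketch, two submissions would be compared by their scores alone, so the choice of weights carries most of the judgment about what counts as a serious defect.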

Replies

iLoveOncall • today at 8:16 AM

> Full disclosure, I'm not a software engineer

Then maybe you should abstain, because your comment is a complete load of nonsense.

Bad code is bad code regardless of the history or scope of the feature. Maintainability is important because you can never know if a feature will be built upon in the future or not.

Bloat is bad regardless, because it increases the overall complexity of the whole software development lifecycle, for the whole team, forever (or until refactored out): the code is harder to keep track of and understand when writing new requirements, harder to write, harder to read and review, harder to debug, etc.

You can write extremely poor code that has no bugs; that doesn't make it tasteful. This is simply a ridiculous statement.
