Generating big chunks of code is rarely what I want from an agent. They really shine for stuff like combing through logs or scanning dozens of source files to explain a test failure. Which benchmark covers that? I want the debugging benchmark that tests mastery of build systems, CLIs, etc.
Probably want to look at SWE-Bench Pro or Terminal-Bench 2. They cover longer-horizon tasks that need more than just writing a bit of code in one file. And SWE-Bench Pro in particular is not yet saturated, unlike many other common benchmarks. Plain SWE-Bench and LCB are not really useful anymore because they are already being gamed hard so the developers can quote high numbers in a repo README or press release.
Build systems are tested by CompileBench (Quesma's benchmark).
Disclaimer: I'm the founder.
Generating big chunks of code is all I do, all day.
I don't write code by hand any more, neither at work, nor for side projects.
I work mostly in Rust and TypeScript at a developer tools company.
Oh yes! I now let agents build my environments via kubectl / helm and let them debug issues.
It's amazing! Saves hours of work!
I create the basic helm config, settings, etc., and when there's a conflict or something isn't working, I let an agent fix it!
Create it!
I agree. Also good for small changes that need to be applied consistently across an entire codebase.
I recently refactored our whole app from hard deletes to soft deletes. There are obviously various ways to skin this particular cat, but the way I chose needed all our deletions updated, and also needed queries updated to exclude soft-deleted rows, except in specific circumstances (e.g., admins restoring accidentally deleted data).
Of course, this is not hard to do manually, but it is a bloody chore and tends to be error prone. But the agent made short work of it, for which I was very grateful.
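For anyone unfamiliar with the pattern: here's a minimal sketch of what that refactor amounts to, using an in-memory SQLite table as a stand-in (the table and column names are hypothetical, not from the commenter's app). Deletes become UPDATEs that stamp a `deleted_at` column, normal queries filter stamped rows out, and the admin restore path just clears the stamp.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE accounts (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        deleted_at TEXT  -- NULL means the row is live
    )
""")
conn.executemany("INSERT INTO accounts (name) VALUES (?)",
                 [("alice",), ("bob",)])

def soft_delete(account_id):
    # Instead of DELETE, stamp the row with the deletion time.
    conn.execute(
        "UPDATE accounts SET deleted_at = datetime('now') WHERE id = ?",
        (account_id,))

def live_accounts():
    # The default query path excludes soft-deleted rows.
    return [row[0] for row in conn.execute(
        "SELECT name FROM accounts WHERE deleted_at IS NULL ORDER BY id")]

def restore(account_id):
    # Admin-only path: un-delete by clearing the stamp.
    conn.execute(
        "UPDATE accounts SET deleted_at = NULL WHERE id = ?",
        (account_id,))

soft_delete(1)
print(live_accounts())   # ['bob']
restore(1)
print(live_accounts())   # ['alice', 'bob']
```

The chore in a real codebase is exactly what the comment describes: every `DELETE` call site and every default query has to change consistently, which is mechanical but easy to get subtly wrong by hand.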