Generating big chunks of code is rarely what I want from an agent. They really shine for stuff like combing through logs or scanning dozens of source files to explain a test failure. Which benchmark covers that? I want the debugging benchmark that tests mastery of build systems, CLIs, etc.
Probably want to look at SWE-Bench Pro or Terminal-Bench 2. They cover longer-horizon tasks that need more than just writing a bit of code in one file. And SWE-Bench Pro in particular is not yet saturated, unlike many other common benchmarks. Plain SWE-Bench and LCB are not really useful anymore because they are already being gamed hard so the developers can quote high numbers in a repo README or press release.
Build systems are tested by CompileBench (Quesma's benchmark).
Disclaimer: I'm the founder.
Generating big chunks of code is all I do, all day.
I don't write code by hand any more, neither at work, nor for side projects.
I work mostly in Rust and TypeScript at a developer tools company.
Oh yes! I now let agents build my environments via kubectl / helm and let them debug issues.
It's amazing! Saves hours of work!
I create the basic helm config, settings, etc., and when there's a conflict or something isn't working, I let an agent fix it!
Create it!
I agree. Also good for small changes that need to be applied consistently across an entire codebase.
I recently refactored our whole app from hard deletes to soft deletes. There are obviously various ways to skin this particular cat, but the way I chose needed all our deletions updated, and also needed queries updated to exclude soft-deleted rows, except in specific circumstances (e.g., admins restoring accidentally deleted data).
Of course, this is not hard to do manually, but it is a bloody chore and tends to be error prone. But the agent made short work of it, for which I was very grateful.
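For anyone unfamiliar with the pattern: here's a minimal sketch of what that refactor amounts to, using an in-memory SQLite table as a stand-in (the table and column names are hypothetical, not from the commenter's app). Deletes become UPDATEs that stamp a `deleted_at` column, normal queries filter stamped rows out, and the admin restore path just clears the stamp.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE accounts (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        deleted_at TEXT  -- NULL means the row is live
    )
""")
conn.executemany("INSERT INTO accounts (name) VALUES (?)",
                 [("alice",), ("bob",)])

def soft_delete(account_id):
    # Instead of DELETE, stamp the row with the deletion time.
    conn.execute(
        "UPDATE accounts SET deleted_at = datetime('now') WHERE id = ?",
        (account_id,))

def live_accounts():
    # The default query path excludes soft-deleted rows.
    return [row[0] for row in conn.execute(
        "SELECT name FROM accounts WHERE deleted_at IS NULL ORDER BY id")]

def restore(account_id):
    # Admin-only path: un-delete by clearing the stamp.
    conn.execute(
        "UPDATE accounts SET deleted_at = NULL WHERE id = ?",
        (account_id,))

soft_delete(1)
print(live_accounts())   # ['bob']
restore(1)
print(live_accounts())   # ['alice', 'bob']
```

The chore in a real codebase is exactly what the comment describes: every `DELETE` call site and every default query has to change consistently, which is mechanical but easy to get subtly wrong by hand.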