Hacker News

agent5ravi · yesterday at 5:05 PM · 3 replies

The resolve rate numbers are interesting but I keep coming back to the regression question. In my experience doing code review on a real codebase, the hard part of maintenance is not fixing the thing that broke. It is understanding whether your fix preserves the invariants the original author had in mind but did not write down.

A benchmark that checks CI pass/fail captures the first part. It cannot capture the second. An agent that makes CI green by weakening an assertion or bypassing a check will score well here but create a time bomb.

The monorepo point from yuyuqueen hits this. When the agent can see the full dependency graph, it is less likely to fix something locally while breaking a downstream assumption. The biggest maintenance failures I have seen are not wrong logic. They are fixes that are locally correct but violate an unwritten contract between components.
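To make the "unwritten contract" concrete, here is a hypothetical sketch (names invented): a producer that happened to return sorted output, a downstream consumer that silently depends on that ordering, and a locally-correct "fix" that breaks it.

```python
import bisect

def list_users_sorted(db):
    # Original implementation: output happens to be sorted.
    return sorted(db)

def list_users_fast(db):
    # "Fix": skips the sort for speed. Locally correct, returns the
    # same elements, and the producer's own tests still pass.
    return list(db)

def has_user(users, name):
    # Downstream component: binary search, which silently assumes
    # sorted input. Nothing in the signature documents this.
    i = bisect.bisect_left(users, name)
    return i < len(users) and users[i] == name

db = ["carol", "alice", "bob"]
# has_user(list_users_sorted(db), "alice") finds the user;
# has_user(list_users_fast(db), "alice") misses it, despite "alice"
# being present, because bisect on unsorted input is undefined.
```

With the full dependency graph visible, an agent (or reviewer) at least has a chance of seeing the `bisect` call before accepting the fast version.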


Replies

rekornode · yesterday at 9:13 PM

CI pass/fail captures regression, but there's a layer beneath it that benchmarks can't touch: what exactly did the agent submit to each external API, and can you prove it after the fact? In the benchmark context this doesn't matter; everything runs locally. In production it does. The agent calls a third-party service at 2am, the service claims it returned an error, your agent retried, and you got billed twice. Your logs say one thing, their logs say another. The integrity problem isn't just "did the code work"; it's "what was the exact request/response pair, timestamped, by whom, provably." CI solves the first. Something else has to solve the second.
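One common shape for "something else" is a tamper-evident audit log: record the exact request/response pair with a timestamp, MAC each entry, and hash-chain entries together. A minimal sketch, assuming a key both parties (or an escrow) hold; the names `audit_record` and `AUDIT_KEY` are invented for the example.

```python
import hashlib
import hmac
import json
import time

# Assumed: a key shared with the counterparty or an escrow service,
# so MACs are verifiable by someone other than you.
AUDIT_KEY = b"example-key-held-by-both-parties"

def audit_record(request: dict, response: dict, prev_digest: str) -> dict:
    """Log one API call: exact payloads, timestamp, chained to the prior entry."""
    entry = {
        "ts": time.time(),      # when the call happened
        "request": request,     # exact payload submitted
        "response": response,   # exact payload received
        "prev": prev_digest,    # hash-chain link to the previous entry
    }
    payload = json.dumps(entry, sort_keys=True).encode()
    entry["mac"] = hmac.new(AUDIT_KEY, payload, hashlib.sha256).hexdigest()
    return entry

def verify(entry: dict) -> bool:
    """Recompute the MAC; any edit to the logged payloads breaks it."""
    body = {k: v for k, v in entry.items() if k != "mac"}
    payload = json.dumps(body, sort_keys=True).encode()
    expected = hmac.new(AUDIT_KEY, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(entry["mac"], expected)
```

This doesn't settle whose clock is right, but it does pin down "this exact request got this exact response" in a way neither side can quietly rewrite afterward.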

westurner · yesterday at 5:32 PM

> It is understanding whether your fix preserves the invariants the original author had in mind but did not write down.

This may also be the limit to the quality of an automated port to another language. What isn't encoded as automated tests or manual test procedure cannot be verified.

So often I'm amazed at what it's possible to accomplish from a prompt that's certainly insufficient, with insufficient context: "It should have been necessary to specify more context there," or "I would have thought that it wasn't possible to do that without reading in more context than just one source code file," and then, a few prompts later, "there's where we failed for trying to skimp on context."

Preventing architectural rework as a human developer also requires substantial ahead-of-time codebase review.

Are AGENTS.md files the best place to summarize more comprehensive codebase review and useful dense context like guidelines for testing and architectural components in order to avoid rework?
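For what it's worth, one shape such a file might take (entirely illustrative; the section names and paths below are invented, not a standard):

```markdown
# AGENTS.md (illustrative sketch)

## Architecture
- `core/` owns all persistence; services in `api/` must not touch the DB directly.

## Invariants not covered by tests
- Monetary totals are always non-negative; clamp at the source, never in callers.
- Public list-returning APIs are consumed downstream in sorted order.

## Testing
- Run the fast suite before proposing a change; the integration suite before merge.
```

Whether this is the right home is open, but the "invariants not covered by tests" section is exactly the unwritten-contract problem from the parent comment.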

oliver_dr · yesterday at 8:36 PM

[dead]