the harness bottleneck is real - I've been working on ai code security stuff and the biggest issue isn't model capability, it's that most tools treat the model's output as gospel. they'll take a suggested fix and apply it without checking that it even compiles, let alone whether it introduces new vulns. I've seen fixes that patch one CVE but break the auth logic entirely.
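the cheap mitigation is to gate every suggested patch behind a throwaway checkout before it gets anywhere near the real tree. rough sketch of the idea in python - `validate_fix`, the local clone, and `make test` as the build-plus-test gate are all placeholder names I made up, not any particular tool's api:

```python
import shutil
import subprocess
import tempfile

def validate_fix(repo_dir: str, patch_path: str) -> bool:
    """apply a model-suggested patch in a throwaway clone and gate it
    on build + tests instead of trusting it blindly."""
    work = tempfile.mkdtemp()
    try:
        # cheap local clone so a bad patch can't touch the real tree
        subprocess.run(["git", "clone", "--quiet", "--local", repo_dir, work],
                       check=True)
        # --check first: reject patches that don't even apply cleanly
        # (patch_path should be absolute, since we run from the clone)
        if subprocess.run(["git", "apply", "--check", patch_path],
                          cwd=work).returncode != 0:
            return False
        subprocess.run(["git", "apply", patch_path], cwd=work, check=True)
        # the actual gate: compile + run tests. a "fix" that breaks auth
        # logic should die here, not in prod. a vuln-scanner pass for the
        # new-vulns case would slot in here too.
        return subprocess.run(["make", "test"], cwd=work).returncode == 0
    finally:
        shutil.rmtree(work, ignore_errors=True)
```

the point is just that "does it apply, does it build, do the tests pass" is a cheap floor, and a lot of tools skip even that.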
the edit tool point hits though. when you give the model a better interface for expressing changes (structured diffs instead of free-form patches), error rates drop. but nobody talks about this because benchmarks measure "did it solve the problem", not "how many attempts did it take" or "what's the blast radius when it fails". idk, maybe I'm just jaded from debugging too many of these.
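to make the structured-diffs point concrete, here's the shape of interface I mean - an exact-anchor edit format where a stale or hallucinated anchor gets rejected instead of mis-applied. `Edit` and `apply_edits` are illustrative names, not a real library:

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Edit:
    path: str
    old: str   # exact text the model claims to be replacing
    new: str

def apply_edits(edits: list[Edit]) -> list[Edit]:
    """apply anchored edits; return the ones that got rejected."""
    rejected = []
    for e in edits:
        src = Path(e.path).read_text()
        # hard precondition: the quoted anchor must appear exactly once.
        # a stale or hallucinated anchor fails fast here instead of
        # turning into a silently mis-applied patch.
        if src.count(e.old) != 1:
            rejected.append(e)
            continue
        Path(e.path).write_text(src.replace(e.old, e.new, 1))
    return rejected
```

the win is the failure mode: a free-form patch that doesn't apply where the model thought it would can still land somewhere, while this just bounces and you retry with a small blast radius.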