I’ve found this to be critical for having any chance of getting agents to generate code that is actually usable.
The more frequently you can verify correctness in some automated way, the more likely the overall solution will be correct.
I’ve found that good enough acceptance criteria (both positive and negative) are usually sufficient for agents to complete one-off tasks without a human making a lot of changes. Essentially, if you’re willing to give up maintainability and other related properties, this works fairly well.
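As a rough sketch of what positive and negative acceptance criteria might look like for a one-off task, here is a small Python example. Everything in it is hypothetical: `slugify` stands in for agent-generated code, and `check_acceptance` is an illustrative harness, not part of any particular tool.

```python
import re


def slugify(title: str) -> str:
    """Hypothetical stand-in for the agent-generated code under test."""
    if not isinstance(title, str) or not title.strip():
        raise ValueError("title must be a non-empty string")
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    if not slug:
        raise ValueError("title has no usable characters")
    return slug


def check_acceptance(fn) -> list[str]:
    """Run the criteria against fn and return a list of failure messages."""
    failures = []

    # Positive criteria: inputs that must produce an exact output.
    positive = {
        "Hello, World!": "hello-world",
        "  Agents & Tests ": "agents-tests",
    }
    for given, expected in positive.items():
        actual = fn(given)
        if actual != expected:
            failures.append(f"{given!r}: expected {expected!r}, got {actual!r}")

    # Negative criteria: inputs that must be rejected with ValueError.
    for bad in ["", "   ", "!!!"]:
        try:
            result = fn(bad)
        except ValueError:
            continue  # rejected as required
        failures.append(f"{bad!r}: expected ValueError, got {result!r}")

    return failures


if __name__ == "__main__":
    problems = check_acceptance(slugify)
    print("accepted" if not problems else "\n".join(problems))
```

The idea is that the agent can rerun a check like this after every change, so the negative cases catch output that merely looks plausible on the happy path.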
I’ve yet to find agents that can generate code meant for long-term maintenance without a ton of human feedback or manual code changes.