Have you tried using an additional agent to verify the outputs? It seems to help when the supervising agent has a small context demand on it (i.e., run this command, make sure it returns 0, and re-invoke the main coding agent with the error message if it doesn't).
Yeah I've experimented with that pattern. The meta-agent approach works for catching obvious stuff, like "did the build pass" or "does this file actually exist." But the harder bugs are semantic. The agent writes a function that returns the right shape of data but with wrong values, or adds a fallback that masks the real failure. A supervising agent reading the same code often has the same blind spots.
What's worked better for me is building verification into the workflow itself, like explicit test assertions the agent has to pass before it can claim "done," plus a rule that any API call must show a real response, not a mock. Basically treating the AI like a junior dev who needs guard rails, not a senior who just needs a code review.