The trick is, with the setup I mentioned, you change the rewards.
The concept is:
Red Team (Test Writers), write tests without seeing implementation. They define what the code should do based on specs/requirements only. Rewarded by test failures. A new test that passes immediately is suspicious as it means either the implementation already covers it (diminishing returns) or the test is tautological. Red's ideal outcome is a well-named test that fails, because that represents a gap between spec and implementation that didn't previously have a tripwire. Their proxy metric is "number of meaningful new failures introduced" and the barrier prevents them from writing tests pre-adapted to pass.
Green Team (Implementers), write implementation to pass tests without seeing the test code directly. They only see test results (pass/fail) and the spec. Rewarded by turning red tests green. Straightforward, but the barrier makes the reward structure honest. Without it, Green could satisfy the reward trivially by reading assertions and hard-coding. With it, Green has to actually close the gap between spec intent and code behavior, using error messages as noisy gradient signal rather than exact targets. Their reward is "tests that were failing now pass," and the only reliable strategy to get there is faithful implementation.
Refactor Team, improve code quality without changing behavior. They can see implementation but are constrained by tests passing. Rewarded by nothing changing (pretty unusual in this regard). Reward is that all tests stay green while code quality metrics improve. They're optimizing a secondary objective (readability, simplicity, modularity, etc.) under a hard constraint (behavioral equivalence). The spec barrier ensures they can't redefine "improvement" to include feature work. If you have any code quality tools, it makes sense to give the necessary skills to use them to this team.
It's worth being honest about the limits. The spec itself is a shared artifact visible to both Red and Green, so if the spec is vague, both agents might converge on the same wrong interpretation, and the tests will pass for the wrong reason. The Coordinator (your main claude/codex/whatever instance) mitigates this by watching for suspiciously easy green passes (just tell it) and probing the spec for ambiguity, but it's not a complete defense.
This is very interesting, but like sibling comments, I'm very curious as to how you run this in practice. Do you just tell Claude/Copilot to do what you describe?
And do you have any prompts to share?
This seems quite amazing really, thanks for sharing
What is the scope of projects / features you’ve seen this be successful at?
Do you have a step before where an agent verifies that your new feature spec is not contradictory, ambiguous etc. Maybe as reviewed with regards to all the current feature sets?
Do you make this a cycle per step - by breaking down the feature to small implementable and verifiable sub-features and coding them in sequence, or do you tell it to write all the tests first and then have at it with implementation and refactoring?
Why not refactor-red-green-refactor cycle? E.g. a lot of the time it is worth refactoring the existing code first, to make a new implementation easier, is it worth encoding this into the harness?
This seems like a tremendous amount of planning, babysitting, verification, and token cost just to avoid writing code and tests yourself.
I'm curious how this works if the green team writes an implementation that makes a network call like an RPC.
Red team might not anticipate this if the spec does detail every expected RPC (which seems unreasonable: this could vary based on implementation). But a unit test would need mocks.
Is green team allowed to suggest mocks to add to the test? (Even if they can't read the tests themselves?) This also seems gamaeable though (e.g. mock the entire implementation). Unless another agent makes a judgement call on the reasonability of the mock (though that starts to feel like code review more generally).
Maybe record/replay tests could work? But there are drawbacks in the added complexity.
Someone directly down from you suggested looking up Mike Postock's TDD skill, so I did:
https://github.com/mattpocock/skills/blob/main/tdd%2FSKILL.m...
Everything below quoted from that skill, and serves as a much better rebuttal than I had started writing:
DO NOT write all tests first, then all implementation. This is "horizontal slicing" - treating RED as "write all tests" and GREEN as "write all code."
This produces crap tests:
Tests written in bulk test imagined behavior, not actual behavior You end up testing the shape of things (data structures, function signatures) rather than user-facing behavior Tests become insensitive to real changes - they pass when behavior breaks, fail when behavior is fine
You outrun your headlights, committing to test structure before understanding the implementation
Correct approach:
Vertical slices via tracer bullets.
One test → one implementation → repeat. Each test responds to what you learned from the previous cycle. Because you just wrote the code, you know exactly what behavior matters and how to verify it.
Seems like red team is incentivized to write tests that violate the spec since you're rewarding failed tests.
How do you make sure Red Team doesn't just write subtly broken tests?
How do you define visibility rules? Is that possible for subagents?
You guys are describing wonderful things, but I've yet to see any implementation. I tried coding my own agents, yet the results were disappointing.
What kind of setup do you use ? Can you share ? How much does it cost ?