When you write tests with LLM-generated code you're not trying to prove correctness in a mathematically sound way.
I think of it more as "locking" the behavior to whatever it currently is.
Either you do the red-green-with-multiple-adversarial-sub-agents -thing or just do the feature, poke the feature manually and if it looks good then you have the LLM write tests that confirm it keeps doing what it's supposed to do.
The #1 reason TDD failed is because writing tests is BOORIIIING. It's a bunch of repetition with slight variations of input parameters, a ton of boilerplate or helper functions that cover 80% of the cases, but the last 20% is even harder because you need to get around said helpers. Eventually everyone starts copy-pasting crap and then you get more mistakes into the tests.
LLMs will write 20 test cases with zero complaints in two minutes. Of course they're not perfect, but human made bulk tests rarely are either.