> Tests created a similar false comfort. Having 500+ tests felt reassuring, and AI made it easy to generate more. But neither humans nor AI are creative enough to foresee every edge case you’ll hit in the future; there were several times during the vibe-coding phase when I’d come up with a test case and realise the design of some component was completely wrong and needed a total rework. This was a significant contributor to my lack of trust and the decision to scrap everything and start from scratch.
This matches my experience: tests are perhaps the most challenging part of working with AI.
What’s especially awful is refactoring existing shit code that has no tests to begin with, where the feature is confusing or unknowingly misused in multiple other places.
AI will write test cases verifying the logic works at all (fine), but the actual behavior, especially what an integration test should cover, goes untested.
I don’t have a great answer to this yet, especially because this has been most painful for me in a React app, where I don’t know testing best practices. But I’ve been eyeing behavior-driven development paired with spec-driven development (for AI) as a potential answer here.
Curious if anyone has an approach or framework for generating good tests.
The false comfort usually comes from line coverage. I had a model at 97.2% coverage, 92 tests. Ran equivalence partitioning on it (partition inputs into classes, test one from each) and found 6 real gaps: a search scope testing 1 of 5 fields, a scope checking return type but not filtering logic, a missing state machine branch. SimpleCov said the file was covered. The logical input space was not. The technique is old (ISTQB calls it specification-based testing) but the manual overhead made it impractical until recently. Agents made it possible to apply across 60+ models, which is the one thing they have changed for testing so far.
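The gap between line coverage and input-space coverage described above can be sketched in miniature. Everything here is invented for illustration (the real case was a Ruby model under SimpleCov; this is Python, and `search` and its five-field spec are hypothetical), but it shows the pattern: one test fully covers the lines while four of the five equivalence classes fail.

```python
# Hypothetical buggy implementation: the spec (imagined here) says
# search should match on name, email, city, tags, and notes, but the
# code only checks "name" -- a single happy-path test still gives it
# 100% line coverage.
def search(records, term):
    return [r for r in records if term.lower() in r["name"].lower()]

RECORDS = [
    {"name": "Ada", "email": "ada@example.com", "city": "London",
     "tags": "vip", "notes": "prefers phone"},
]

# Equivalence partitioning: one representative input per class,
# where each class is "the term matches via field X".
CLASSES = {
    "name": "Ada",
    "email": "ada@",
    "city": "London",
    "tags": "vip",
    "notes": "phone",
}

def run_partition_check():
    # Fields whose equivalence class the implementation misses.
    return [field for field, term in CLASSES.items()
            if not search(RECORDS, term)]

print(run_partition_check())  # ['email', 'city', 'tags', 'notes']
```

A coverage tool reports `search` as fully covered after the `"name"` case alone; the partition check is what surfaces the other four classes.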
Use TLA+ and have it go back and forth with you to spec out your system behavior, then iterate, trying to link the TLA+ spec to the actual code implementing it.
Pull out as many pure functions as possible and exhaustively test the input and output mappings.
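For a pure function over a small domain, the whole input/output mapping can be written down as a table and checked exhaustively. A minimal sketch (the function, name, and tiers are hypothetical):

```python
# Hypothetical pure function extracted from a larger flow: no I/O,
# no state, so its behavior is fully described by a mapping table.
def shipping_tier(weight_kg: int) -> str:
    if weight_kg <= 0:
        raise ValueError("weight must be positive")
    if weight_kg <= 2:
        return "small"
    if weight_kg <= 20:
        return "medium"
    return "large"

# The mapping, pinned down at every boundary of each tier.
EXPECTED = {1: "small", 2: "small", 3: "medium", 20: "medium", 21: "large"}

def check_mapping():
    return all(shipping_tier(w) == out for w, out in EXPECTED.items())

print(check_mapping())  # True
```

Because the function is pure, the table is the test: no mocks, no setup, and a design change that shifts a boundary fails loudly at the exact input that moved.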
I've always thought that writing good tests (unit, integration or e2e) is harder than the actual coding by maybe an order of magnitude.
The tricky part of unit tests is coming up with creative mocks and ways to simulate various situations based on the input data, without touching the actual code.
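A minimal sketch of that "simulate situations without touching the real code" point, using Python's `unittest.mock` (the `format_user` function and its dependency are invented for illustration):

```python
from unittest import mock

# Hypothetical code under test: takes its data source as a parameter,
# so tests can substitute a mock for the real network call.
def format_user(user_id, fetch):
    try:
        user = fetch(user_id)
    except ConnectionError:
        return "user unavailable"
    return user["name"].title()

def run_tests():
    # Happy path: a Mock returns canned input data.
    fake = mock.Mock(return_value={"name": "ada"})
    assert format_user(1, fake) == "Ada"
    fake.assert_called_once_with(1)

    # Failure path: simulate a situation (network down) that is hard
    # to produce against the real backend, via side_effect.
    broken = mock.Mock(side_effect=ConnectionError)
    assert format_user(1, broken) == "user unavailable"
    return "ok"

print(run_tests())  # ok
```

The creativity the comment describes lives in choosing those `return_value`/`side_effect` scenarios, which is exactly the part an AI tends to need spelled out.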
For integration tests, it's massaging the test data and inputs to hit every edge case of an endpoint.
For e2e tests, it's massaging the data, finding selectors that aren't going to break every time the HTML is changed, and winnowing down to the important things to test, since exhaustive e2e tests need hours to run and are a full-time job to maintain. You want to test all the main flows, but also stuff like handling a back-end system failure, which doesn't get exercised in smoke tests or normal user operations.
That's a ton of creativity for AI to handle. You pretty much have to tell it every test and how to build it.