I am not sure how others are doing this, but here is our process:
- meaningful test coverage
- internal software architecture is explicitly baked into the prompts; we try not to go wild with vibing, but rather spec things well and keep Claude on a short leash
- each feature built is followed by a round of refactoring (with Claude, but under the oversight of an opinionated human). We spend 50% building, 50% refactoring, at least. Sometimes it feels like 30/70. Code quality matters to us, as those codebases are large, and not doing this leads to a very noticeable drop in Claude's perceived 'intelligence'.
- performance tests as per usual - designed by our infra engineers, not vibed
- static code analysis, and a hierarchical system of guardrails (small claude.md + lots of files referenced there for various purposes). I'm not quite fond of how that works: Claude has always been very keen to ignore instructions and go his own way (see: "short leash, refactor often").
- pentests with regular human beings
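
To illustrate the "small claude.md + lots of referenced files" layout from the guardrails bullet, here is a minimal sketch of a check that every file linked from the root file actually exists. Everything in it is an assumption made for illustration, not the tooling described above: the file names under `docs/`, the use of markdown links as the reference syntax, and the helper names `referenced_files` and `missing_references` are all hypothetical.

```python
import re

# Matches markdown links to .md files, e.g. [Testing rules](docs/testing.md).
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+\.md)\)")


def referenced_files(root_text: str) -> list[str]:
    """Return local .md paths linked from the root guardrail file, in order."""
    # Skip anything that looks like a URL; only local files are guardrails here.
    return [path for path in LINK_RE.findall(root_text) if "://" not in path]


def missing_references(root_text: str, existing: set[str]) -> list[str]:
    """Return referenced paths that are not in the set of existing files."""
    return [path for path in referenced_files(root_text) if path not in existing]


# Hypothetical root file: short, and delegating the details to other files.
EXAMPLE_ROOT = """\
# Project rules (hypothetical)
- Architecture: [layers](docs/architecture.md)
- Testing: [coverage rules](docs/testing.md)
- Style: [refactoring checklist](docs/refactoring.md)
"""

if __name__ == "__main__":
    present = {"docs/architecture.md", "docs/testing.md"}
    print(missing_references(EXAMPLE_ROOT, present))
```

A check like this could run in CI next to the static analysis, so a renamed or deleted guardrail file gets flagged rather than silently dropped from what the agent reads. It does nothing about instructions being ignored; that part still needs the short leash and the refactoring rounds.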
The one project I mentioned - 2 months for a complete rewrite - was about a week of working on the code and almost 2 months spent on reviews and tests. Of course, some of that time was wasted, as we were doing this for the first time on such a large codebase. The rewritten app has been doing fine in production for a while now.
I can only compare the outputs to the quality of the outputs of our regular engineering teams. It compares fine vs. good dev teams, IMHO.
The part about refactoring is very interesting and reassuring. I sometimes think I'm holding it wrong when I end up refactoring most of the agent's code towards our "opinionated" style, even after laying it out in md files. Thank you very much for this insight.
Very nice insight, that’s where the value is: even with a lot of time spent refactoring, testing and reviewing, the coding phase is so compressed ("gzipped") that it’s still worth using an imperfect LLM. Even with humans we have all those post-coding phases, so a good structure around the code generation leads to a lot of gains. It depends on the industry and what’s being developed, for sure.