You're holding it wrong.
Set the boundaries and guidelines before it starts working. Don't leave it space to do things you don't understand.
I.e.: enforce conventions, set specific and measurable/verifiable goals, and define skeletons of the resulting solutions if you want/can.
To give an example: I do a lot of image-similarity work, and I wanted to try the Redis VectorSet stuff while it was still in beta, back when the PHP extension for Redis (the fastest one, written in C as a proper language extension, not a runtime lib) didn't support the new commands. I cloned the repo, fired up Claude Code, and pointed it at a local copy of the Redis VectorSet documentation I'd put in the directory root, telling it I wanted it to update the extension to support the new commands I'd want/need to handle VectorSets. This was, idk, maybe a year ago. So not even Opus. It nailed it. But I chickened out of pushing that into a production environment, so I then told it to just write me a PHP runtime client that mirrors the functionality of Predis (a pure-PHP Redis client) but does so via shell commands executed by PHP (lmao, I know).
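To make the shape of that concrete, here's a minimal sketch of what such a shell-exec client could look like. This is not the actual client from the story: the class and method names are hypothetical, and it just shells out to `redis-cli`; the VADD/VSIM argument order follows the Redis VectorSet docs (`VADD key VALUES n v1..vn element`, `VSIM key VALUES n v1..vn COUNT k`).

```php
<?php
// Hypothetical sketch: a tiny Predis-style wrapper that shells out to
// redis-cli instead of speaking the RESP protocol itself.
final class ShellVectorSetClient
{
    public function __construct(private string $cli = 'redis-cli') {}

    // Build the escaped command line; kept separate so it can be
    // inspected without a running Redis server.
    public function buildCommand(string ...$args): string
    {
        return $this->cli . ' ' . implode(' ', array_map('escapeshellarg', $args));
    }

    /** @param float[] $vector */
    public function vadd(string $key, array $vector, string $element): string
    {
        $args = ['VADD', $key, 'VALUES', (string) count($vector),
                 ...array_map('strval', $vector), $element];
        return trim((string) shell_exec($this->buildCommand(...$args)));
    }

    /** @param float[] $vector  @return string[] element names, most similar first */
    public function vsim(string $key, array $vector, int $count = 10): array
    {
        $args = ['VSIM', $key, 'VALUES', (string) count($vector),
                 ...array_map('strval', $vector), 'COUNT', (string) $count];
        $out = (string) shell_exec($this->buildCommand(...$args));
        return array_values(array_filter(array_map('trim', explode("\n", $out))));
    }
}
```

Escaping every token with `escapeshellarg` is the one non-negotiable part of the approach; element names containing spaces or quotes would otherwise break (or inject into) the command line.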
Define the boundaries, give it guard rails, use design patterns and examples (where possible) that can be used as reference.
So, in my experience, evaluating Opus 4.6 in an existing codebase has gone like this.
You say "Do this thing".
- It does the thing (takes 15 min). Looks incredibly fast. I couldn't code that fast. It's inhuman. So far all the fantastical claims hold up.
But still. You ask "Did you do the thing?"
- It says oops, I forgot to do that sub-thing. (+5m)
- it fixes the sub-thing (+10m)
You ask, "Is the change well integrated with the system?"
- It says not really, let me rehash this a bit. (+5m)
- It irons out the wrinkles (+10m)
You ask, "Does this follow best engineering practices? Is it good code, something we can be proud of?"
- It says not really, here are some improvements. (+5m)
- It implements the best practices (+15m)
You say to look carefully at the change set and see if it can spot any potential bugs or issues.
- It says oh, I've introduced a race condition at line 35 in file foo and a null-correctness bug at line 180 of file bar. Fixing. (+15m)
You ask if there's test coverage for these latest fixes.
- It says "i forgor" and adds them. (+15m)
Now the change set has shrunk a bit and is superficially looking good. Still, you must read the code line by line, and an experienced eye will still find weird stuff happening in several of the functions: redundant operations, resources that aren't always freed. (60m)
You ask why it's implemented in such a roundabout way and how it intends for the resources to be freed up?
- It says "you're absolutely right" and rewrites the functions. (+15m)
You ask if there's test coverage for these latest fixes.
- It says "i forgor" and adds them. (+15m)
Now the 15 minutes of amazingly fast AI code gen has ballooned into taking most of the afternoon.
Telling Claude to be diligent, not write bugs, or to write high-quality code flat-out does not work. And even if such prompting can reduce the odds of omissions or lapses, you still always, always, always have to check the output. It cannot find all the bugs and mistakes on its own. If there are bugs in its training data, you can assume there will be bugs in its output.
(You can make it run through much of this Socratic checklist on its own, but this doesn't really save wall clock time, and doesn't remove the need for manual checking.)
They aren't holding it wrong; it's a fundamental limitation of not writing the code yourself. You can make it easier to understand later when you review it, but you still need to put in that effort.
Enforce conventions, be specific, and define boundaries… in English?!
You are correct, but developers are not yet ready to face it. The argument you'll always get rests on the flawed premise that it's less effort to write it yourself (while the same people work in teams that have others writing code for them every day of the week).