It comes down to trust. I was not able to trust GPT-4.1 or Sonnet 3.5 with anything other than short, well-specified tasks. If I let them go too long (e.g. in long Cursor sessions), they would lose the plot and start thrashing.
With better models and harnesses (e.g. Claude Code), I can now trust the AI more than I would trust a junior developer in the past.
I still review Claude's plans before it begins, and I try out its code after it finishes. I do catch errors on both ends, which is why I haven't taken myself out of the loop yet. But we're getting there.
Most of the time, the way I "verify" the code is behavioral: does it do what it's supposed to do? Have I tried sufficient edge cases during QA to pressure-test it? Do we have good test coverage to prevent regressions and check critical calculations? That's about as far as I ever took human code verification. If anything, I have more confidence in my codebases now.