Kind of orthogonal to the discussion, but could you broadly describe the code you're working on that both models are bad at? One thing I'm still struggling with is figuring out what types of code LLMs can vs cannot write.
Rules of thumb:
The more your toolchain (compilers, linters, etc) can statically verify, the better agents will do.
The terser the code, the better agents will do.
The more often similar problems have been solved in open source, the better agents will do. Agents seem particularly good at plumbing together different pieces of software.
Anything that requires a judgement call, as opposed to having one obvious way to do it, will get worse results from an agent.
As the scope of the request grows, agents get worse at it. This can be mitigated somewhat using various techniques ("write a plan", "do step 1 of the plan", etc) but never fully resolved. At some point the task is so big that it's necessary to do large parts by hand.
C code formally proven correct with Frama-C WP has been... marginal. The models do better than I expected at the proof portion (with ChatGPT 5.5 seeming to have a meaningful lead), but they all have a hard time with (a) writing really good C code to begin with and (b) following the rule that they must not change the C code's semantics or performance as a cheat to simplify proof obligations. They also tend to be insanely and consistently verbose on the first proof pass... e.g. 8 lines of C code might end up at 200+ lines annotated and proven, but after simplification passes end up at 40 lines. I find I spend 90%+ of tokens on those simplification passes, and haven't really found a way to avoid the over-annotate-and-then-optimize cycle by getting the models to be a bit more sane the first time around.
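For anyone who hasn't seen this style of code, here is a minimal sketch of what a small C function with ACSL annotations looks like once it is reasonably tight. The function `max_index` and its contract are my own illustration, not code from the project described above, and I have not run this exact file through the provers, so treat the annotations as the shape of the thing rather than a checked proof. The intent is that something like `frama-c -wp -wp-rte` could discharge it.

```c
#include <stddef.h>

/* Hypothetical example: index of the first maximum element of a[0..n-1].
 * The ACSL contract lives in the special comment below, so an ordinary
 * C compiler ignores it and builds the function as usual. */

/*@ requires n > 0;
    requires \valid_read(a + (0 .. n - 1));
    assigns \nothing;
    ensures 0 <= \result < n;
    ensures \forall integer k; 0 <= k < n ==> a[k] <= a[\result];
*/
size_t max_index(const int *a, size_t n)
{
    size_t best = 0;

    /*@ loop invariant 1 <= i <= n;
        loop invariant 0 <= best < i;
        loop invariant \forall integer k; 0 <= k < i ==> a[k] <= a[best];
        loop assigns i, best;
        loop variant n - i;
    */
    for (size_t i = 1; i < n; i++) {
        /* Strict comparison keeps the earliest maximum on ties. */
        if (a[i] > a[best]) {
            best = i;
        }
    }
    return best;
}
```

Even here the annotations are longer than the code. A verbose first pass typically adds many intermediate assertions and helper lemmas on top of a contract like this, which is where the line count balloons.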
> Kind of orthogonal to the discussion, but could you broadly describe the code you're working on that both models are bad at?
Commonly, anything that hasn't already been done across 100 different projects on GitHub.
Making a React app with a CRUD backend: LLMs are great. They've been trained on this.
Doing new work on complex non-public codebases or in niche problems that aren't commonly solved: Completely different story. Sometimes they'll find enough information to piece together a path toward a solution, but that doesn't mean it's a good solution. I also have to feed in a lot more context and frequently stop them when they go down bad paths.
For the complex work I don't have the LLMs write code, but I may have them do a proof of concept. I have to write and understand everything myself. There are times when I'll think the LLM output looks good until I go through it line by line and realize it's done something completely unnecessary, or happened to get the right result for the wrong reasons. For unknown problems they're good at getting something to work through brute force if you let them consume enough tokens, but the result may rely on safety fallbacks from the OS or elsewhere instead of being a proper solution. I always chuckle when they encounter intermittent errors and the first idea is to add a retry mechanism so the error is ignored.
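To make the retry point concrete, here is a rough sketch in C of the pattern I mean, next to a variant that at least keeps the failures visible. All names here (`op_fn`, `run_with_silent_retry`, `run_with_reported_retry`, `flaky_op`) are made up for illustration and don't come from any real codebase.

```c
#include <stdbool.h>

/* Hypothetical operation type: returns true on success. */
typedef bool (*op_fn)(void *ctx);

/* The pattern agents tend to reach for: retry until it works and
 * throw away every failure, so the intermittent bug is never seen. */
bool run_with_silent_retry(op_fn op, void *ctx, int max_attempts)
{
    for (int i = 0; i < max_attempts; i++) {
        if (op(ctx)) {
            return true;
        }
    }
    return false;
}

/* Same loop, but the caller learns how many attempts failed and can
 * log it, alert on it, or refuse to ship until the cause is found. */
struct retry_report {
    bool ok;
    int failures;
};

struct retry_report run_with_reported_retry(op_fn op, void *ctx,
                                            int max_attempts)
{
    struct retry_report report = { false, 0 };
    for (int i = 0; i < max_attempts; i++) {
        if (op(ctx)) {
            report.ok = true;
            return report;
        }
        report.failures++;
    }
    return report;
}

/* Stand-in for an intermittently failing call: fails a fixed number
 * of times, then succeeds. */
struct flaky {
    int failures_left;
};

bool flaky_op(void *ctx)
{
    struct flaky *f = ctx;
    if (f->failures_left > 0) {
        f->failures_left--;
        return false;
    }
    return true;
}
```

With a `struct flaky` that fails twice, both functions report success, but only the second tells you that two attempts failed on the way there. Neither one fixes the underlying bug, which is the real point: the retry makes the symptom go away without anyone understanding why it happened.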