I feel like the last 2-3 generations of models (after gpt-5.3-codex) didn't really improve much, they just changed stuff around and made different tradeoffs.
I disagree. They improved enormously, especially at staying consistent on long tasks. I have a task that has been running for 32 days (400M+ tokens) via Codex, and that's only since gpt-5.4.