> It's not really clear whether Opus 4.5+ represents a level shift on this frontier or just inhabits a place on that curve that delivers higher performance, but at rapidly diminishing returns to inference cost.
I think we're reaching the point where more developers need to start right-sizing the model and effort level to the task. For a while it was easy to get comfortable with using the best model at the highest setting for everything, but as the models continue to scale and reasoning token budgets grow, that's no longer a safe default unless you have unlimited budgets.
I welcome the idea of having multiple points on this curve that I can choose from, depending on the task. I'd welcome an option for an even larger model that I could pull out for complex and important tasks, even if I had to let it run for 60 minutes in the background and it made my entire 5-hour token quota disappear in one question.
I know not everyone wants this mental overhead, though. I predict we'll see more attempts at smart routing to different models depending on the task, along with the predictable complaints from everyone when the results are less than predictable.
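To make the routing idea concrete, here's a minimal sketch of what a per-task router could look like; the tier names, keywords, and cost multipliers below are illustrative assumptions, not any vendor's actual models or pricing.

```python
# Minimal sketch of per-task model routing (illustrative assumptions only:
# tier names, keywords, and cost multipliers are not real models or pricing).
from dataclasses import dataclass

@dataclass
class Tier:
    name: str
    relative_cost: float  # rough cost multiplier vs. the cheapest tier

TIERS = {
    "small": Tier("haiku-class", 1.0),    # rote, well-specified edits
    "medium": Tier("sonnet-class", 5.0),  # routine features and refactors
    "large": Tier("opus-class", 25.0),    # ambiguous or cross-cutting work
}

def pick_tier(task: str) -> Tier:
    """Crude keyword heuristic standing in for a real task classifier."""
    t = task.lower()
    if any(k in t for k in ("rename", "typo", "format", "bump version")):
        return TIERS["small"]
    if any(k in t for k in ("migrate", "architecture", "design", "flaky")):
        return TIERS["large"]
    return TIERS["medium"]

if __name__ == "__main__":
    for task in ("fix a typo in the README", "migrate the auth layer", "add a CLI flag"):
        tier = pick_tier(task)
        print(f"{task!r:35} -> {tier.name} (~{tier.relative_cost:g}x cost)")
```

The predictable failure mode is the one people will complain about: the heuristic mis-scores a task, the cheap model makes a mess, and you pay for the expensive one anyway.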
Human dev labor cost is still the long pole in the tent, even if you multiply today's subsidized subscription cost by 10x. If the capability improvement trajectory continues, developers should prepare for a new economy where fewer devs achieve more productivity by shifting a substantial share of the labor budget to AI.
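For a rough sense of that comparison (every figure below is an assumption for illustration, not actual pricing or salary data):

```python
# Back-of-the-envelope comparison; every number here is an assumption.
subscription_per_month = 200            # assumed top-tier plan, USD/month
unsubsidized_multiplier = 10            # "what if the real cost is 10x"
dev_fully_loaded_per_month = 15_000     # assumed fully loaded dev cost, USD/month

ai_cost = subscription_per_month * unsubsidized_multiplier
ratio = dev_fully_loaded_per_month / ai_cost
print(f"AI at 10x: ${ai_cost:,}/mo vs. dev: ${dev_fully_loaded_per_month:,}/mo "
      f"(dev is ~{ratio:.0f}x the AI spend)")
```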
The problem is that half the time you don't know you need the better model until the lesser model has made a massive mess. Then you have to do it again on the good model, wasting money. The "auto" modes don't seem to do a good job of picking a model, IME.
> I know not everyone wants this mental overhead, though.
I’m curious how to even do it. I have no idea how to choose which model to use in advance of a given task, regardless of the mental overhead.
And unless you can predict perfectly what you need, there’s going to be some overuse due to choosing the wrong model and having to redo some work with a better model, I assume?
Isn't that essentially GPT Pro Extended Thinking?
Except developers can’t even do that. Estimating any non-trivial task that hasn’t been done before is essentially a random guess.
> For a while it was easy to get comfortable with using the best model at the highest setting for everything, but as the models continue to scale and reasoning token budgets grow, that's no longer a safe default unless you have unlimited budgets.
For a while I used Cerebras Code for 50 USD a month, with them running a GLM model and giving you millions of tokens per day. It did a lot of heavy lifting in a software migration I was doing at the time (and made it DOABLE in the first place), BUT there were about 10 different places where the migration got fucked up and had to be fixed manually - files left over after refactoring (worse, basically duplicated ones), constants and routes that were dead code, development pages that weren't removed when they were superseded by others, and so on.
I would say that Claude Code, throwing Opus at most problems (with Sonnet or Haiku as sub-agents for simple, well-specified tasks), is actually way better, simply because it fucks things up less often and review iterations at least catch when things are going wrong like that. Worse models (and pretty much every one that I can afford to run locally, even ones that need ~80 GB of VRAM, in the context of an org wanting to self-host stuff) will be confidently wrong and plant time bombs in your codebase that you won't even be aware of unless you pay close attention to everything - even when the task was rote bullshit that any model worth its salt should have resolved with 0 issues.
My fear is that the models that would let me truly be as productive as I want, with any degree of confidence, might be Mythos tier, and the economics of that just wouldn't work out.