A rambling comment:
I think this is the first time we've had a third minor version bump on a frontier Anthropic model. (I count the 0.5s as major here, because they've been issued non-sequentially and also corresponded to massive capability leaps, eg, Sonnet 3.5, Opus 4.5).
So now the Opus 4.5 family has successors 4.6, 4.7, and 4.8, each posting fairly modest claimed gains. My own experience w/ 4.6 and 4.7 is that I don't firmly grasp any capabilities improvements over my memory of 4.5, but it's all so fuzzy that it's truly difficult to tell.
Maybe my own tastes are saturated now (it's smarter than me?) and I'll never again perceive model progress. Maybe the incrementalism is such that I'd notice immediately if my 4.7 workflows were redirected now to 4.5.
Difficult spot for the labs to be in because, if they have a stronger product, I'd prefer they release it so that I can use it.
But as this dynamic continues, the improvements are going to be less and less legible for end-users, who will complain about the churn-without-payoff, even when the payoff may actually be real.
I'm curious to poll HN on this issue. Do you feel like we've had meaningful/noticeable gains in terms of your programming workflows between 4.5 and 4.7?
My 2¢, I personally feel like all of the productivity gains since 4.5's release (in November 2025!!) have come from improvements to the harnesses (cc, cursor cli, codex, opencode, whatever) AND from the context window expansion from 200k to 1M.
But the actual "raw" intelligence of the model / ability to make good decisions feels like it has plateaued since 4.5. 4.6 was maybe a small improvement, but hard to differentiate from in-context-learning with the 1M window. 4.7 if anything felt like a regression in wisdom for me and my coworkers, with it consistently making worse/lazier decisions.
4.7 was the first time I had to resort to using the previous version (4.6) for most use cases. Hoping 4.8 rectifies this.
I suspect the more frequent incremental releases may also be to deploy new capabilities used by Anthropic to control costs and throttle consumption of resources. I assume any new controls they expose to end-users have far more granular sub-controls under the hood which they can meta-adjust for each user type.
They mention more granular control of effort, 'dynamic workflows' and more speed controls ("fast mode"). While they position them as user features, they also sound like the kinds of knobs Anthropic will need to twiddle on the back-end to balance costs, margins, ARR, and user growth vs retention post-IPO to hit key metrics in quarterly reporting.
> My own experience w/ 4.6 and 4.7 is that I don't firmly grasp any capabilities improvements over my memory of 4.5, but it's all so fuzzy that it's truly difficult to tell.
I've actually intentionally switched back to 4.5. I hated 4.7 so much that I decided to jump back all the way to 4.5.
Now that I've been using 4.5 for a few weeks, I find it significantly more reliable but a bit more forgetful than 4.6/4.7. I'm okay with that because it's really easy to identify this forgetfulness and nudge it.
I found 4.7's adaptive thinking to be extremely unreliable. It seems to overcorrect on the current message without considering the difficulty of the overall problem. I wonder if 4.8 will improve on that.
4.5/4.6 were roughly the same in our testing. Opus 4.7 is smarter, but it's difficult to use as a product for various personality issues. So far, Opus 4.8 seems to be going down that path (unusably slow, but this could be a launch day rollout problem). Full Opus 4.8 tests are in progress now.
Data at https://gertlabs.com/rankings
I am using Claude Code for formal verification with Lean. In my personal experience both Opus 4.7 and now what I see from first experiments with Opus 4.8 were big improvements. I was able to delegate proofs of larger theorems that their predecessors could not handle.
I've been using Claude Code regularly since the 4.5 release, and 4.7 was a significant regression: very unreliable, arguing about changes, deciding that fixes weren't needed, etc.
I'm hoping they recreate the magic of 4.5, but at this point it's as much about the quality of the harness and the memory and efficiency of the tools as it is about the models.
4.7 was a significant jump in the ability to run long-horizon tasks. It immediately completed tasks that 4.6 was unable to, even though I have the impression that it became a bit less capable over the first few weeks after release.
It also seems to be helpless at effort levels below xhigh, so I turn to Sonnet when simpler tasks are needed.
“Maybe my own tastes are saturated now”
It might be saturated for smaller scopes of work, but it’s not hard to see the cracks when you scale up what you ask of SOTA models/agents.
One example: try to single-shot prompt a ChatGPT-equivalent chatbot.
Sure it will spit something out, but the feature depth, UX subtleties, backend integration, and lots of pragmatic engineering decisions along the way will just not be baked in.
Another example is building a C compiler from scratch which Anthropic showed is still a struggle to do.
Not that these specific examples are important, but just to point out that scaling up expectations shows the cracks.
It’s not just a model problem of course, better agents, orchestration features (like Dynamic Workflows mentioned in the post), all need to continue to evolve.
At what point does my CS degree become totally useless is an open question.
pretty spot on.
In my experience, Opus 4.0 was fantastic, a major jump from 3.7. it was creative, super slow and expensive, and would sometimes forget what it was doing, but it was getting the job done.
4.1 they made it much faster, so a lot of infra improvements.
4.5 was the time it could work on longer tasks and didn't make a lot of the obvious mistakes of 4.0. i think this was about the time opus went mainstream and Anthropic's compute crisis began, so instead of making the model better they tried to optimize it to reduce cost.
4.6 was such a bad model, they switched to adaptive thinking and it had so many bugs. poor api design, benchmaxxed and poor real-world results. i switched back to 4.5.
4.7 they just fixed the bugs they added in 4.6. Better than 4.5.
haven't fully tested 4.8 yet.
My read - 4.7 was a tactical lobotomy to improve the average experience at the expense of peak performance; necessary due to compute pressure.
Now that they have Colossus capacity, I guess they can tune up the intelligence again and spend more tokens on reasoning budgets.
4.7 was definitely a lot more flaky for me vs. 4.6 before the reasoning bugs.
I think 4.7 was an awful model in actual use. I never got anything out of it and it was frustratingly weird. This feels more like an attempt to course correct and isn't a real bump
I've been using gpt 5.4 and 5.5 and honestly 5.4 is solving everything at the pace I need it. I'm the biggest bottleneck in terms of reviewing PRs and my own code. So having a model which can solve a complex task in 10 minutes vs 30 minutes doesn't really give me any meaningful improvement.
Also, the biggest factor is having a good planning phase. A good plan is better than even major model improvements.
Maybe try making a simple randomizer script to swap between the three latest models, and see if you can tell which ones are meaningfully different without knowing which one is active?
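For what it's worth, a blind swap like that could be sketched roughly as below. This is only an illustration: the model names are hypothetical placeholders, and wiring the assignment into an actual harness is left out. The idea is to assign a hidden model per task, write down your guess for each task, and only then compare the guesses against the hidden assignment.

```python
import random

# Hypothetical placeholder names; substitute whatever model IDs your harness uses.
MODELS = ["model-a", "model-b", "model-c"]


def assign_blind(task_ids, models, seed=None):
    """Pick a hidden model for each task. Keep the result out of sight until the end."""
    rng = random.Random(seed)
    return {task_id: rng.choice(models) for task_id in task_ids}


def score_guesses(assignment, guesses):
    """Fraction of tasks where the guessed model matches the hidden assignment."""
    if not assignment:
        raise ValueError("no tasks were assigned")
    correct = sum(1 for task, model in assignment.items() if guesses.get(task) == model)
    return correct / len(assignment)


def chance_level(models):
    """Accuracy expected from pure guessing with uniformly assigned models."""
    return 1 / len(models)
```

If your guess accuracy over a few dozen tasks stays near `chance_level(MODELS)` (about 0.33 for three models), that suggests you can't actually tell them apart on your workload.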
Given that 4.7 was a brand new model, trained from scratch with a unique architecture and tokenization scheme, I don't see the same pattern. It seems arbitrary.
How long would it take to evaluate a new coworker to say “wow she’s really bright?” Relative to your other coworkers?
A few days? A few weeks? Longer?
However a company releases a new AI model and within hours users are confidently proclaiming how much smarter it is than previous versions.
Maybe my tasks are rudimentary, but the results I get with the 4.5 model are just the same as with 4.7 or 4.6. It's just that the advanced models consume more tokens and are actually loss-making for my work. The incremental changes that they are making are not really that valuable. In fact I have found that even glm 5.1 is giving me something equivalent to what Opus 4.6 gives. Am I missing something that everyone else is cheering for in these small incremental model releases?
IMO they have all been clean and noticeable upgrades over their predecessors. Opus 4.7 in particular was a solid jump in capabilities.
I have seen a noticeable difference between 4.6 Medium (the default, and I skipped 4.7 because of various reported issues) and 4.8 High or whatever the default is now. It's far more likely to say it doesn't know and seems to think about things a lot more, but then it also spends a lot more time reporting on what it's thought about so it takes longer for you to process the output. In particular 4.6 would say "I've spotted something a bit off here" whereas 4.8 will say "if you do this and then this and then this under these conditions then something will go wrong here". So it seems to be closer to the claimed capabilities for Mythos than previous versions.
ChatGPT 5.5 is consistently the much better model and by a large margin.
How do I know? Because when pushing both to generate code or in independent chats to analyze projects, 5.5 will consistently find all the bugs that Claude does not find, and when challenged, Claude does agree those bugs were there. And my findings match those.
When asking Claude from a blank start to analyze project A and project B, Claude will consistently say project B is the better structured, more robust, and more defect-free one, and it does justify that. And project B was the one created by GPT 5.5... and also the one I judge to be the best one.
And yes, both at deep effort settings and starting from same specs...
I think the issue with legibility comes down to the fact that most users are using LLMs for tasks where improvements to raw reasoning abilities wouldn't help much or at all. So it's not a matter of anyone's deficiency of perception but rather a lack of any benchmark to perceive.
It's kind of like how the consumer laptop market is now. I was telling my boss today that most employees wouldn't see any noticeable performance difference between a macbook pro and a neo if they are just doing admin stuff on the web.
IME the most noticeable performance boosts are in complex multi-agent workflows.
Ex. You call an orchestration agent and define an implementation plan with the help of a number of sub agents planning out different features. You and the lead agent review all of the plans and send them off to a set of agents that write tests, which get sent back to the orchestrator and then passed along with the plan to a set of coding agents who implement the features in their own worktrees. That gets passed back to the orchestrator, which hands it off to another set of agents doing the code review and merging the features before sending it back to you.
Well, it seems like collectively we are all struggling to perceive model progress, given that it seems like every reply to you is reporting different experiences with which of the models has subjectively performed best for them.
We're at the top of the S-curve and you're romanticizing diminishing returns with vague hints of super human capabilities and singularities.
I'm here to complain about the churn.
I feel like I get to know a model in the human sense of understanding a personality. Yesterday I knew 4.6 extended, today it's different, there's multiple "token budget" levels. I just want 4.6 extended back as it was, I was getting on well with it / them.
The honesty will be noticeable. Maybe we'll see some honest assessments like "That is not possible within the laws of known physics", "Your legal argument is nonsensical and defies logic", "There is no evidence to support taking that will cure anything", etc., etc.
4.7 uses more tokens and costs more for the same task than OG 4.5, that's about it
> (it's smarter than me?)
I genuinely hope that you're joking with that statement.
Or this is a bot.
Or an ARG.
Or Art.
Help.
dangerous thing to believe IMO. The models will get better, you will notice, everyone will notice. They will get better at coding and everything else. You should plan around that.
the churn is... a version bump to the same api? If you want to compare you can write some evals.
tbh, the last 2-3 version bumps, main change has been that they take longer, and cost more/have more usage restrictions. (combined with new tooling, which eats a ton of tokens)
I'm pretty sure they're releasing 4.8 because they massively shit the bed with 4.7 and people aren't using it.
I have ONLY heard negative feedback about it, and trying it myself also yielded really awful results.
Just want to say there's no question that you're smarter than any (and every) AI.
> I'll never again perceive model progress
If the hype train keeps going for another year, Sam and co will have to resort to direct gaslighting like saying the model is improving but nobody can feel it anymore, oh and I need 10 trillion dollars
"it's smarter than me?"
You don't have to correct it dozens of times a day!? Really?
The more difficult it is for humans to consistently and accurately compare model outputs, the more opportunity there is to spread FUD (Fear, Uncertainty, Doubt). Considering valuations of these companies and the astronomical investments being made, a sabotage campaign with bots or paid users on reddit, twitter, YouTube, or whatever socials could go a long way towards knocking market cap off the competition. Not saying that's happening, just saying it's an obvious target. Even if the goal is not nefarious, people with a perceived bad experience are 2-3x more likely to complain. So even without bad actors involved, a new model may need to be significantly better in order to break even on the old net promoter score.
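To put rough numbers on the complaint-bias point, here is a back-of-the-envelope sketch. It is a simplified model of the share of posted feedback that is negative, not an actual net promoter score calculation, and it assumes every unhappy user posts at `k` times the rate of a happy one (with `k` taken from the 2-3x figure above).

```python
def observed_negative_share(p_bad, k):
    """Share of posted feedback that is negative.

    p_bad: true fraction of users with a bad experience (assumed input).
    k: how many times more likely an unhappy user is to post than a happy one.
    """
    negative = k * p_bad
    positive = 1 - p_bad
    return negative / (negative + positive)


def break_even_bad_fraction(k):
    """True bad-experience fraction at which posted feedback looks evenly split."""
    return 1 / (1 + k)
```

Under those assumptions, with `k = 3` a model that only 25% of users dislike already produces feedback that looks half negative, and with `k = 2` the break-even point is one third.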
I maintain a log of tasks, prompts, related information etc., so I can repeat past workflows verbatim, and I can qualitatively say each model beyond 4.5 has been a regression, and it would not surprise me if 4.8 continues the trend. Each iteration has failed at more tasks that were previously completed successfully. Right now it flat out refuses to answer many benign chemistry questions, or leans into shilling too hard and ignores non-industry-funded studies on certain topics. I'm transitioning to deepseek as a result. Cheaper by far and at this stage not strictly speaking less capable.
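A replay log like this could be compared across models with something along the following lines. This is a minimal sketch, not the commenter's actual setup: the `Run` record, its field names, and the pass/fail judgment are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One replayed task from the log (hypothetical record layout)."""
    task_id: str
    model: str
    passed: bool  # your own judgment of whether the task was completed


def passed_tasks(runs, model):
    """Set of task IDs that the given model completed successfully."""
    return {r.task_id for r in runs if r.model == model and r.passed}


def regressions(runs, old_model, new_model):
    """Tasks the old model passed that the new model did not pass."""
    return sorted(passed_tasks(runs, old_model) - passed_tasks(runs, new_model))


def improvements(runs, old_model, new_model):
    """Tasks the new model passed that the old model did not pass."""
    return sorted(passed_tasks(runs, new_model) - passed_tasks(runs, old_model))
```

Listing both regressions and improvements per model pair makes it easier to see whether a release is a net loss on your own tasks or just a different mix of failures.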
I'm going to assume that at some point their "targeted training and tuning" will eventually reach some sort of "max" possible simulation of the next good token. At that point I think it will be interesting to see what happens and how many parameters you really need for different verticals.
why are the models the same price?
https://platform.claude.com/docs/en/about-claude/pricing
```
Model                                                        Base Input Tokens  5m Cache Writes  1h Cache Writes  Cache Hits & Refreshes  Output Tokens
Claude Opus 4.8                                              $5 / MTok          $6.25 / MTok     $10 / MTok       $0.50 / MTok            $25 / MTok
Claude Opus 4.7                                              $5 / MTok          $6.25 / MTok     $10 / MTok       $0.50 / MTok            $25 / MTok
Claude Opus 4.6                                              $5 / MTok          $6.25 / MTok     $10 / MTok       $0.50 / MTok            $25 / MTok
Claude Opus 4.5                                              $5 / MTok          $6.25 / MTok     $10 / MTok       $0.50 / MTok            $25 / MTok
Claude Opus 4.1                                              $15 / MTok         $18.75 / MTok    $30 / MTok       $1.50 / MTok            $75 / MTok
Claude Opus 4 (deprecated)                                   $15 / MTok         $18.75 / MTok    $30 / MTok       $1.50 / MTok            $75 / MTok
Claude Sonnet 4.6                                            $3 / MTok          $3.75 / MTok     $6 / MTok        $0.30 / MTok            $15 / MTok
Claude Sonnet 4.5                                            $3 / MTok          $3.75 / MTok     $6 / MTok        $0.30 / MTok            $15 / MTok
Claude Sonnet 4 (deprecated)                                 $3 / MTok          $3.75 / MTok     $6 / MTok        $0.30 / MTok            $15 / MTok
Claude Haiku 4.5                                             $1 / MTok          $1.25 / MTok     $2 / MTok        $0.10 / MTok            $5 / MTok
Claude Haiku 3.5 (retired, except on Bedrock and Vertex AI)  $0.80 / MTok       $1 / MTok        $1.60 / MTok     $0.08 / MTok            $4 / MTok
```
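Same price per token doesn't mean same price per task, though. A quick sketch of the arithmetic, using the $5 / $25 per MTok Opus rates from the table and ignoring cache pricing. The token counts in the usage note are made-up numbers for illustration, not measurements of any model.

```python
def task_cost(input_tokens, output_tokens, input_per_mtok=5.0, output_per_mtok=25.0):
    """Dollar cost of one task at per-million-token prices (cache pricing ignored)."""
    return (input_tokens * input_per_mtok + output_tokens * output_per_mtok) / 1_000_000


def cost_increase(old_cost, new_cost):
    """Relative cost change between two runs of the same task."""
    return (new_cost - old_cost) / old_cost
```

For example, `task_cost(200_000, 40_000)` comes to $2.00, while the same task with 60,000 output tokens, `task_cost(200_000, 60_000)`, comes to $2.50. That is a 25% increase with no change to the price list, purely from a model emitting more tokens.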
I can tell from hearing Feynman recordings that he was smarter than my own university's physics professor, but both were smarter than me.
It's almost like they used up most of the benefits of scaling and the fundamental issues that people have been talking about with LLMs for years are real.
The inability to tell if a model is improving is, I think, a tell that the model has improved up to your level of programmatic (analytic, computational) capacity.
A lot of the information (blogs, tweelches, plosts) that I consume seems to be converging on the idea that we all depend on the models. However. It seems to me that the exact opposite is true. The models depend on us, and _desperately_ so.
There must have been stories, books, movies, made about this intellectual (and propositional, legal, factual) inversion.
The majority need the minority. Has always been the case, I now think. But what has newly developed is that the majority can take a dependency not on the minority, but on a select few companies who are abstracting and compressing the minority into latent spaces.
honestly sonnet 3.7 is still good enough for me, as long as whatever tool prompts and so on are well optimized enough between harness and model.
i still haven't really noticed it per se being better
I won't be surprised if the next gen frontier models are the last.
There's orders of magnitude of low hanging juice to squeeze out of smaller models.
It is almost guaranteed that a 60-90B model can outperform current SOTA in coding tasks within 2-3 years (design not certain, probably unlikely).
It is far less clear that a 1.2T model will be meaningfully better enough to justify training it.
As far as reasoning is concerned, with the recent GRAM release, there may be 4 orders of magnitude of reasoning to tack on to smaller models.
Think about that... Google, OpenAI, Anthropic could train a 30B GRAM-based model in days - and it could potentially have better local reasoning than the best model available today at >1T params... They could upgrade that to a ~600B MoE model in days to have general trivia knowledge rivaling the best models...
You just can't train a 1T+ parameter model that fast. It is a giant "if" how much GRAM turns out to improve things, but the effect is unlikely to be trivial or nothing.
Larger models can already sort of tell you anything. They're never going to get everything right unless they stop being LLMs.
There's just not a lot of juice left to squeeze for Gemini to tell you exactly how tall Ke$ha is or when the last time Britney Spears went to jail was...