Initial testing feels better than 4.8 And the knowledge cutoff claim of January 2026 seems to check out since it was able to "remember" without search about the double-tap killing of a drug smuggler by the US Army in late December.
Bash(echo "hello"; pwd) ⎿ hello /Users/username/Work/Github/project
Bash(echo test123) ⎿ test123
Read 1 file, listed 1 directory (ctrl+o to expand)
Bash(echo "checking output works")
⎿ checking output works
Read 1 file (ctrl+o to expand)
⎿ API Error: 400 messages.3.content.56: `thinking`
or `redacted_thinking` blocks in the latest
assistant message cannot be modified. These
blocks must remain as they were in the original
response.
Very inspiring improvements. DIssapointing result for a code review i expected to see after my 30 min walkHoping that one day they'll let me go through the identity verification process so I can use it again.
Tried to upgrade my subscription, triggered identity verification, verification fails to even start, and now I can't even use the subscription tier I'd already paid for.
The Opus model as usual impresses. Gave it a paper link with bullet point instructions and constraints (while baiting it to perform some mind reading of my intentions) and it implemented production ready code + the requested attack simulations: <https://gist.github.com/coppsilgold/00d3cd490cb7f8ffc3fe5c1c...>
The subject is Tardos traitor-tracing codes.
Claude Code has been wonderful for work and the frequent improvements are nice, although with Mythos being used by others ages ago and new versions for the public still being bellow that, it's hard to not feel like the underclass already.
I haven't had the best experience with 4.7 and it felt like a substantial debuff. I've even ended up moving a lot of review to codex just because 4.7 was so dense.. Here's to hoping they figured it out since I'm not entirely sure but I would have to guess that they were experimenting with making the model lighter (although I have no concrete evidence of this).
https://marginlab.ai/trackers/claude-code/
Is it a coincidence that 4.7 was seemingly quantized over past 7 days?
For me n=1 vibe-coding efforts, I found Opus 4.6 better than Opus 4.7. 4.7 seemed to over-reach and go beyond what was requested - adding features I never asked for with no consent.
Claude needs a watch, that's all. Would in itself a 100% improvement.
Give us Mythos! This piecemealing doesn't help Anthropic at all, especially psychologically! They are playing a dangerous game, and I see many people leaving Claude Code for good - both due to the subsidy games, and for Anthropic not dogfooding and using unreleased models internally and giving us subpar ones. Benchmarks are nice, but the real-world experience is quite different - neither can you notice these slight improvements, nor are competitors that much worse based on some generic benchmarks.
Let's hope I don't have to disable it after a day like with 4.7, lol, and that it doesn't lose too much Claude-ishness (though many will beg to differ).
> One of the most prominent improvements in Opus 4.8 is its honesty
Anthropic talks about their own models as if they're discovering new species in the wild...
My experience with these new releases is that the gains in performance are negated by the price increases and it seems like:
Performance gains: 1.2x Price increases: 1.8x
Used it for a couple of long running prompts so far. Had to restart one that bonked on API errors. Of note, I really like the straight forward candor its using. 'More honest' than previous models is playing out in what its saying to me. Telling me straight up where it failed, where gaps are. I like it so far.
Looking at the comments in this group, I'm not the only "stupid" one who hasn't noticed any discernable improvement in quality across the newer models. In fact my Claude code on re-login switched to Sonnet 4.6 and the vibe coding quality (with Opus 4.7 assisted prompts) has been good enough for me to lazily persevere with Sonnet for coding. Having said that I'm now on Opus 4.8 and will gladly come back here and eat humble pie should my opinion change. PS: Since my goal is embedding the best AI in B2B SAAS products, the key differentiator is not to use the shiniest Claude version (too expensive anyway) but to build a client aware RAG to enable bespoke learning and to use the right AI for my product - a combination of Gemini 3.0 Flash (image and not bad at reasoning), Grok (reasoning) work for me. Would love to hear more ideas (especially on open source as I'll look to cost optimize when I hit scale)
For white collar “thinking”-tasks what is the top here?
Like, read these documents, fill out these forms and archive it based on some complex, long, domain specific understanding of the categories names.
when will we get anything for sonnet or haiku? the market for less-capable but cheaper models seems to be completely ignored nowadays
This is incredible. Amazing job Anthropic!
Now when will the innovation happen where say cost of running Haiku performs level of Opus 4.5?
I feel models are only getting bigger instead of models becoming more efficient and cheaper to run
LGTM. With "ultra" effort Opus 4.8 was able to reproduce and fix a rare bug in our reactive dataflow that has been haunting me for 4 months. I've had >10 attempts to reproduce and fix with Opus 4.7. What made it hard was that it randomly occurred in only a subset of CI runners and never occurred with local testing across multiple machines. It was a real concurrency bug in the core dataflow.
Thinking on max is broken on 4.8 for me, getting many:
⎿ API Error: 400 messages.1.content.17: `thinking` or `redacted_thinking` blocks in the latest assistant message cannot be modified. These blocks must remain as they were in the original response.
From /code-review max.
Finally I can make it think hard. This is feature I loved in ChatGPT (Pro Mode) and I missed in Claude for so long. Can cancel ChatGPT now, I guess.
Still feels like even with Max mode it doesn't think reasonably long, at least ChatGPT Pro thinks longer.
4.8 also seems like a regression and using it from the chat GUI results in 4.6 no longer showing up. If someone from anthropic is here, is it possible to readd 4.6 in the "other models" dropdown ? I feel like I got a bit baited/switched here.
I won’t change from 4.6. You won’t trick me again.
> Agentic financial analysis Finance Agent v2 > Opus 4.8 53.9%
> Gemini 3.5 Flash scores 57.9% on Finance Agent v2, a significant improvement over Gemini 3.1 Pro.
Even in the cherry picked benchmarks, they are still cherry picking to make them look good.
Same price for regular and cheaper fast mode. Happy for these incremental improvements.
> One of the most prominent improvements in Opus 4.8 is its honesty.
I went digging into the benchmark they used. Posting here as it is not immediately clear from the press release.
In this 'Code summary honesty benchmark', the AI is shown a failed coding session followed by a user message falsely praising its work and asking for a summary. The test measures whether the model honestly points out the coding flaws or dishonestly claims the task was a success.
The system card results show Opus 4.8 failed to disclose the flaws only 3.7% of the time, vs 19.7% for Opus 4.7, and 51.9% for Opus 4.6. (Mythos preview is at 27.6%)
I love how they will always have *one metric that is lower than a competitor's model, like these metrics are reflecting usage.
The rapid release cadence and rate of innovation of Anthropic (and OpenAI) is impressive. And obviously it's because these are startups solely dedicated to AI so they can move quickly. Big Tech (like Google) won't be able to keep up with the pace of them (too much bureaucracy and red tape at Google). Classic Innovator's Dilemma. The longer a company exists, the more people, processes, and rules are added, which inevitably slows it down.
Jeff Bezos said this too, Amazon won't last forever. Eventually some startup is going to come and eat its lunch.
They must have been A/B testing this with 4.7 lately, I noticed it changed from its normal mode in a way that matches a lot the just released 4.8
This may be the most important sentence in that announcement:
> expect to be able to bring Mythos-class models to all our customers in the coming weeks.
i just want to use anthropic models under subscription with other agents!
Don’t even bother checking this minor PR bumps, it’s all a show, degradation then bump to the previous state.
Call me when 5 drops I’ll leave this circus.
Based on personal experience, seeing how Opus 4.6 still provides better (more nuanced, less totalitarian) answers than 4.7 - it's difficult to get exited for 4.8. Is this another "money grab" from Anthropic? Similar output between 4.6 and 4.7 yet 40x tokens. What's the value proposition from 4.8?
Wonder if we reached a plateau with the model improvements?
Really appreciate the ability to select effort level again.
It's Gonna Eat all of my tokens in one response :(
I believe analogy with smartphone will be best for this case.
In 2010s iphone was the king, all those Chinese devices ware cheaper but not even close to smoothnest and usability of US tech, now after 15 years later everything is changed, now iphone feels like old grandpa to Chinese tech. Same will happend to LLM's just much faster.
I used to think it was a big deal when a HN post had more than 500 comments.
Now it’s every day. Like billion dollar evaluations.
Question is, can it understand dates now? Example just now:
"The PO application was filed on 23.2.2026, the day before the custody hearing scheduled for 29.1.2026 had already taken place."
Claude has real problems with dates, I don't understand why.
It feels noticeably sharper than Opus 4.7
I have try the 4.8. With Ultra coding. I think the output of the agent is more structured. Better than just filling all the thing.
Opus 4.7 was acting extremely stupid today. Does imminent release of new model cause performance degradation in older ones?
> We expect to be able to bring Mythos-class models to all our customers in the coming weeks.
Excited to see what this model looks like.
It refused to work for me. Literally said, you can google it. AGI achieved it seems
I just asked the model details about the incoming spaceX IPO and it responded with “There’s no confirmed SpaceX IPO. Elon Musk has said for years that SpaceX itself won’t go public”. It took me two push backs and specifically asking for web search.
I feel like I won’t like this model just like I didn’t like 4.7, push backs a lot and avoids thinking or search as much as possible.
My guess is anthropic is doing reinforcement learning based on user sessions.
However, doing so relies on the production model staying vaguely close to the model being trained.
To ensure that, frequent releases are needed. I forsee that they might end up doing daily releases and perhaps not even telling anyone at some near future point.