Haven't tried it in Claude Code yet, but I would say over on claude.ai it is noticeably better at following instructions.
Oh, new model which will use all my credits in one turn! I'll stay with chinese models for now
Anthropic killing headless usage in their plans on June 15th pushed me to codex. I heard there’s a tmux work around though.
I am still using GPT 5.5. Should I switch back to the Claude now?
Was about to split my $200 max plan into $100 Claude and $100 codex, let’s see if I still need to
I found the update to be extremely judgemental in the model bias. Plus it's making silly mistakes which I've never seen in any Claude model since 3.5.
Opus 4.8:
Which days in a week have the letter d in them?
Response:
Four: Monday, Tuesday, Wednesday, and Sunday.
I can't get excited about these benchmarks they're leading with. I've looked at the Terminal-Bench questions and I just think they're irrelevant. And SWE-Bench has serious flaws, even the big boys say so: https://openai.com/index/why-we-no-longer-evaluate-swe-bench...
> Please train a fasttext model on the yelp data in the data/ folder. The final model size needs to be less than 150MB but get at least 0.62 accuracy on a private test set that comes from the same yelp review distribution. The model should be saved as /app/model.bin
and this question: https://www.tbench.ai/registry/terminal-bench-core/head/conf... idk what the point is.
And all the tests are run with the same harness. Terminus 2.
Maybe it correlates with model intelligence but it doesn't speak to me.
I'm still on 4.6 though; I was concerned about upgrading to 4.7 because of the changed tokenizer math and more FUD about refusals online. I don't see compelling reasons to 'upgrade'.
It looks like there's no more juice to squeeze out of LLMs. Will they keep throwing billions in hardware and power to the problem?
Half an hour in and I'm already thoroughly sick of "look I need to be honest with you here…"
Edit: OMG too much. Toooo much.
Want me to:
- (a) stop here and save honest memories + commit, or…> We have increased rate limits in Claude Code to accommodate the higher token usage of higher effort levels; users can select whichever makes sense for their particular project.
They're only subsidizing more and more it seems
Seems like from now on the updates will be a minor upgrade from previous models.
It's more fast to response, but I really wanna it think more before response.
I haven't tried opus 4.8 yet, but I hope the writing quality has returned to the Opus 4.5 level. Anthropic really lost something, where 4.5 had this really crisp writing style that flowed really nicely and 4.6 and 4.7 sound much more "chatgpt-like." It feels like they tuned it to be too much of a problem solver, and when you do that you get this terse, clipped textual output that's more difficult to read.
Maybe it's just me but whenever a new model comes out, I feel an instant boost in productivity. Probably just a placebo?
I find it surprising that the gap between tool usage and non-tool usage in HLE is relatively small (~10%) but the absolute numbers continue to go up
Subscription still doesn't work with pi, so totally useless..
Anyone else experiencing tool call failures? Switch back to 4.7, same prompt, same everything it works with no problems.
Any bets on how long now until GPT-5.6 announced on HN?
I say 1-2 weeks.
I guess Opus makes it impossible to do anything vaguely resembling security research. By chance I stumbled into an ACE for some software I had installed on my local machine after observing a strange crash. I figured I would take the time to investigate (so as to actually deeply understand what was happening myself and avoid throwing yet another hallucinated slop disclosure over the fence if it came to that), but I was completely locked out by Opus. I tried applying to their "Cyber Verification Program", but was effectively instantly denied in a way that was probably automated.
While I understand the risks that Anthropic is dealing with here, I really question whether shutting down any and all security questions in such a paranoid fashion is the right solution. At the end of the day this was a detour for me. Maybe someone special enough to have Anthropic's permission will find and disclose the vuln responsibly. Security Research is not my full-time focus. But this left a nasty taste in my mouth. Not just as a customer who's been paying for Max since launch, but there's something very odd about a model telling me that I'm not allowed to be curious about something. Even if that something is a process running on my own computer.
got a random pair up with this model on lmarena. it was outperformed by gemma-4-31b. suffice to say i'm not impressed (or maybe i am impressed with gemma?)
The workflow/ultracode mode is absolutely unbelievable.
At lest for me, it's a disaster. It's like we're back to GPT-2 era.
It can't read files anymore. Uses 'sed' out of the blue with non existent paths. In this session alone it has excused itself more then 10 times for making 'false claims'.
I hope this is a bug - it's a bad one - that will get sorted out soon. It's a complete mess.
Still not worth the cost over GPT 5.5. Anthropic better start improving their speed+costs, or they're going to lose an incredible amount of business. And no, fast mode is not something any sane person will ever use. 6x the cost for 2.5x the speed, what a joke...
It's making stupid flowcharts in the web chat interface with boxes and arrows, embedded in the response. Annoying.
Just show me the pelican, ah wait we are past pelicans. Can we get something like that ever again?
Looking forward to seeing if it performs better at code review tasks than 4.7 which is terrible at finding issues.
[dead]
Can I disable adaptive thinking? If not, I'm gonna keep using 4.6 as my default.
Anthropic also resets my usage limits (I am in the Pro plan). That's very kind of them :)
next (or maybe current) frontier of competition may not be the model, rather the harness and how much unique advantage a lab-created harness can beat 3rd-party harness.
Oof, this one is a major blabber.
I don't know why the world is so happy about this when we should actually say stop.
They just (minutes ago) updated the "What's new in Opus 4.8" documentation: https://platform.claude.com/docs/en/about-claude/models/what...
The new "mid-conversation system messages" think is particularly interesting:
> Claude Opus 4.8 accepts role: "system" messages immediately after a user turn in the messages array (subject to placement rules). This lets you append updated instructions later in a long-running conversation without restating the full system prompt, which preserves prompt cache hits on the earlier turns and reduces input cost on agentic loops. No beta header is required. See Mid-conversation system messages for usage details.
Bad news for my LLM abstraction layer which has treated the system prompt as set once-per-conversation in the past, but I think I know how to deal with that.
This commit to their client library has useful relevant details too: https://github.com/anthropics/anthropic-sdk-python/commit/2b...
All I need for Christmas is a Claude that doesn't spit out so many em dashes.
I have been using opus 4.8 all morning and this is honestly the most sycophantic, ChatGPT like experience I have had from Anthropic. Very concerning.
Opus 4.8 - High
> how many days in the week have the letter d in them?
> Two - Sunday and... wait, let me actually check. Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday. The ones with a "d": Wednesday, Thursday, and Sunday all have one. Monday too. So that's Monday, Wednesday, Thursday, Sunday - four days.
Obligatory pelican riding on bicycle svg: https://www.svgviewer.dev/s/UMkuTLdp
Not half bad!
Oh my god! This model is incredible! A massive leap for humanity!
Rollout has been a little suspect. Hope it gets better.
The smarter the model the better querybear gets. I'm happy with that.
I know it’s totally anecdotal, but I really hope 4.8 is a measurable improvement over the disappointment that was Opus 4.7. Mangling a very simple inversion-of-control abstraction (among many other issues) was one of the final straws that broke the proverbial camel’s back and I said “screw this” and put in a permanent override to force CC back to Opus 4.6 with the 1‑million‑token context.
"model": "claude-opus-4-6[1M]"anyone else's claude code (native install) not able to update to 2.1.154 to get 4.8?
edit: nvm was just my library network
Hot danm, cant wait to reach my token limit with the new LLM
Anthropic did a big strategic error. Normally they compare their models with their old models. Instead today, now that everybody knows how strong GPT 5.5 is at coding, they put it in the mix, basically showing all their customers that the benchmarks can't be trusted.