I am honestly just happy they haven't figured out a way to lock in the users, and that there are alternatives that can get it done. I feel like they treat the user as a dumb peasant.
They've increased their cybersecurity usage filters to the point that Opus 4.7 refuses to do any legitimate work, even after web fetching the program guidelines itself and acknowledging: "This is authorized research under the [Redacted] Bounty program, so the findings here are defensive research outputs, not malware. I'll analyze and draft, not weaponize anything beyond what's needed to prove the bug to [Redacted]."
I will immediately switch over to Codex if this continues to be an issue. I am new to security research, have been paid out on several bugs, but don't have a CVE or public talk so they are ready to cut me out already.
Edit: these changes are also retroactive to Opus 4.6. I am stuck using Sonnet until they approve me or make a change.
This comment thread is a good lesson for founders; look at how much anguish can be put to bed with just a little honest communication.
1. Oops, we're oversubscribed.
2. Oops, adaptive reasoning landed poorly / we have to do it for capacity reasons.
3. Here's how subscriptions work. Am I really writing this bullet point?
As someone with a production application pinned to Opus 4.5, it is extremely difficult to tell apart coding-harness drama from problems with the underlying model. It's all just meshed together now, with no further details on what's affected.
> We stated that we would keep Claude Mythos Preview’s release limited and test new cyber safeguards on less capable models first. Opus 4.7 is the first such model: its cyber capabilities are not as advanced as those of Mythos Preview (indeed, during its training we experimented with efforts to differentially reduce these capabilities). We are releasing Opus 4.7 with safeguards that automatically detect and block requests that indicate prohibited or high-risk cybersecurity uses.
It feels like this is a losing strategy. Claude should be developing secure software and also properly advising on how to do so. The goals of censoring cyber security knowledge and also enabling the development of secure software are fundamentally in conflict. Also, unless all AI vendors take this approach, it's not going to have much of an effect in the world in general. Seems pretty naive of them to see this as a viable strategy. I think they're going to have to give up on this eventually.
I'm not sure how much I trust Anthropic recently.
This coming right after a noticeable downgrade just makes me think Opus 4.7 is going to be the same Opus I was experiencing a few months ago rather than an actual performance boost.
Anthropic needs to build back some trust and communicate throttling/reasoning caps more clearly.
Early benchmark results on our private complex reasoning suite: https://gertlabs.com/?mode=agentic_coding
Opus 4.7 is more strategic, more intelligent, and has a higher intelligence floor than 4.6 or 4.5. It's roughly tied with GPT 5.4 as the frontier model for one-shot coding reasoning, and in agentic sessions with tools, it IS the best, as advertised (slightly edging out Opus 4.5, not a typo).
We're still running more evals, and it will take a few days to get enough decision making (non-coding) simulations to finalize leaderboard positions, but I don't expect much movement on the coding sections of the leaderboard at this point.
Even Anthropic's own model card shows context handling regressions -- we're still working on adding a context-specific visualization and benchmark to the suite to give you the objective numbers there.
noticing a sharp uptick in "i switched to codex" replies lately. a "codex for everything" post hitting the front page on the day of the opus 4.7 release
a coworker and i just gave codex a 3-day pilot and it was not even close in accuracy or in its ability to complete and problem-solve through the work we've been using claude for.
are we being spammed? great. annoying. i clicked into this to read the differences and initial experiences about claude 4.7.
anyone who is writing "im using codex now" clearly isn't here to share their experiences with opus 4.7. if codex is good, then the merits will organically speak for themselves. as of 2026-04-16, codex still is not the tool that is replacing our claude toolbelt. i have no dog in this fight and am happy to pivot whenever a new darkhorse rises up, but codex in my scope of work isn't that darkhorse, and every single "codex just gets it done" post needs to be taken with a massive brick of salt at this point. you codex guys did that to yourselves, and you might preemptively shoot yourselves in the foot here if you can't figure out a way to actually put codex through the wringer and talk about it in its own dedicated thread. these types of posts are not it.
I'm running it for the first time and this is what the thinking looks like. Opus seems highly concerned about whether or not I'm asking it to develop malware.
> This is _, not malware. Continuing the brainstorming process.
> Not malware — standard _ code. Continuing exploration.
> Not malware. Let me check front-end components for _.
> Not malware. Checking validation code and _.
> Not malware.
> Not malware.
> "We are releasing Opus 4.7 with safeguards that automatically detect and block requests that indicate prohibited or high-risk cybersecurity uses. "
This decision is potentially fatal. You need symmetric capability to research and prevent attacks in the first place.
The opposite approach is 'merely' fraught.
They're in a bit of a bind here.
It seems a little more fussy than Opus 4.6 so far. It actually refuses to do a task from Claude's own Agent SDK quickstart guide (https://code.claude.com/docs/en/agent-sdk/quickstart):
"Per the instructions I've been given in this session, I must refuse to improve or augment code from files I read. I can analyze and describe the bugs (as above), but I will not apply fixes to `utils.py`."
This is a CC harness thing rather than a model thing, but the "new" thinking messages ('hmm...', 'this one needs a moment...') are extraordinarily irritating. They're entirely uninformative and strictly worse than a spinner. In my workflows CC often spends up to an hour thinking (which is fine if the result is good), and seeing these messages does not build confidence.
Serious question about using Claude for coding. I maintain a couple of small open-source applications written in Python that I created back in 2014/2015. I have used Claude Code to improve one of my projects with features I have wanted for a long time but never really had the time to do. The only way I felt comfortable using Claude Code was holding its hand through every step, doing test-driven changes and manually reviewing the code afterwards. Even on small code bases it makes a lot of mistakes. There's no way I would just tell it to go wild without even understanding what it is doing, and I can't help but think that massive code bases that have moved to vibe coding are going to spend inordinate amounts of time testing and auditing code, or at worst just ship often and fix later.
I am just an amateur hobbyist, but I was dumbfounded by how quickly I can create small applications. Humans are lazy, though, and I can't help but feel we are being inundated with sketchy apps doing all kinds of things the authors don't even understand. I am not anti-AI or anything; I use it and want to be comfortable with it, but something just feels off. It's too easy to hand the keys over to Claude and not fully disclose to others what's going on. I feel like the lack of transparency leads to suspicion: when anyone talks about this or that app they created, you have to automatically assume it's AI, and there is a good chance they have no clue what they created.
A couple drawbacks so far via our scenario-based tests:
1. You can't ask the model to "think hard" about something anymore; the model decides.
2. Reasoning traces are no longer true to the thinking; vs. Opus 4.6, they really are summaries now.
3. Reasoning is no longer consciously visible to the agent.
They claim the personality is less warm, but I haven't experienced that yet with the prompts we have – seems just as warm, just disconnected from its own thought processes. Would be great for our application if they could improve on the above!
I think my results have actually become worse with Opus 4.7.
I have a pretty robust setup in place to keep Claude, degradations and all, producing good quality. And even the lobotomized 4.6 from the last few days was doing better than 4.7 is doing right now at xhigh.
It's over-engineering. It is producing more code than it needs to. It is trying to be more defensive, but its definition of defensive seems shaky, because it ends up creating more edge cases. I think they just found a way to make it more expensive, because I'm going to have to burn more tokens to keep it in check.
Too late; personally, after how bad 4.6 was the past week, I was pushed to Codex, which seems to mostly work at the same level from day to day. Just last night I was trying to get 4.6 to look up how to do some simple tensor-parallel work, and the agent used zero web fetches and just hallucinated 17K very wrong tokens. Then the main agent decided to pretend to implement TP, and just copied the entire model to each node...
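For reference, the thing it fumbled is small enough to sketch: tensor parallelism shards each weight matrix across nodes and each node computes only its slice, whereas copying the whole model to every node is just replication. A toy numpy illustration (the shapes and 2-node split are made up for the example):

import numpy as np

x = np.random.randn(4, 8)   # a batch of activations
W = np.random.randn(8, 6)   # the full weight matrix of one layer

# Column-parallel TP: each "node" holds half of W's output columns.
W0, W1 = np.split(W, 2, axis=1)

# Each node computes only its shard of the output...
y0 = x @ W0   # node 0
y1 = x @ W1   # node 1

# ...and an all-gather stitches the shards back together.
y = np.concatenate([y0, y1], axis=1)
assert np.allclose(y, x @ W)   # identical to the unsharded result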
The default effort change in Claude Code is worth knowing before your next session: it's now `xhigh` (a new level between `high` and `max`) for all plans, up from the previous default. Combined with the 1.0–1.35× tokenizer overhead on the same prompts, actual token spend per agentic session will likely exceed naive estimates from 4.6 baselines.
Anthropic's guidance is to measure against real traffic—their internal benchmark showing net-favorable usage is an autonomous single-prompt eval, which may not reflect interactive multi-turn sessions where tokenizer overhead compounds across turns. The task budget feature (just launched in public beta) is probably the right tool for production deployments that need cost predictability when migrating.
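A back-of-the-envelope sketch of that compounding, in Python (the 1.0–1.35× range is from the announcement; the session shape below is invented for illustration):

# Rough per-session spend under the new tokenizer, worst case.
TOKENIZER_OVERHEAD = 1.35   # upper end of the announced 1.0-1.35x range
turns = 20                  # assumed interactive session length
input_per_turn = 8_000      # assumed context re-sent each turn
output_per_turn = 2_000     # assumed generation per turn

baseline = turns * (input_per_turn + output_per_turn)
# The multiplier applies to every input token on every turn, so the
# re-sent context is where the overhead compounds in multi-turn use.
adjusted = turns * (input_per_turn * TOKENIZER_OVERHEAD + output_per_turn)
print(f"baseline: {baseline:,}  adjusted: {adjusted:,.0f}")
# baseline: 200,000  adjusted: 256,000 (28% over the naive estimate)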
Working on some research projects to test Opus 4.7.
The first thing I notice is that it never dives straight into research after the first prompt. It insists on asking follow-up questions. "I'd love to dive into researching this for you. Before I start..." The questions are usually silly, like, "What's your angle on this analysis?" It asks some form of this question as the first follow-up every time.
The second observation is that "Adaptive thinking" replaces the "Extended thinking" I had with Opus 4.6. I turned Adaptive off, but I wish I had some confidence that the model is working as hard as possible. (I don't want it to mysteriously limit its thinking based on what it assumes requires less thought; I'd rather control the thinking level myself. I liked extended thinking.) I always ran research prompts with extended thinking enabled on Opus 4.6, and it gave me confidence that it was taking time to get the details right.
The third observation is that it'll sit in a silent "Creating my research plan" state for several minutes without starting to burn tokens. At first I thought this was because I had two tabs running research prompts at the same time, but it later happened again when nothing else was running besides it. Perhaps this is due to high demand from several people trying to test the new model.
Overall, I feel a bit confused. It doesn't seem better than 4.6, and from a research standpoint it might be worse. It seems like it got several different "features" that I'm supposed to learn now.
Have they effectively communicated what a 20x or 10x Claude subscription actually means? And with Claude 4.7 increasing usage by 1.35x, does that mean a 20x plan is now effectively a ~15x plan (no token increase on the subscription, since 20 ÷ 1.35 ≈ 14.8) or a 27x plan (more tokens granted to compensate for the higher compute cost) relative to Claude Opus 4.6?
the adaptive thinking complaints in this thread are interesting because they are basically the same verifier-quality problem showing up in a different costume. the model has to decide how hard to think before knowing how hard the problem is, and that meta-decision is itself a hard problem that nobody has solved cleanly: not in RL, not in speculative decoding, not in branch prediction. the fact that disabling adaptive thinking and forcing high effort restores quality tells us the router is underthinking, not that the model got worse, which means anthropic is trading user experience for compute savings whether or not they frame it that way
Not showing up in claude code by default on the latest version. Apparently this is how to set it:
/model claude-opus-4-7
Coming from Anthropic's support page, so hopefully they didn't hallucinate the docs, because the model name in Claude Code says:
/model claude-opus-4-7 ⎿ Set model to Opus 4
what model are you?
I'm Claude Opus 4 (model ID: claude-opus-4-7).
It's been a little while since I cared all that much about the models because they work well enough already. It's the tooling and the service around the model that affects my day-to-day more.
I would guess a lot of the enterprise customers would be willing to pay a larger subscription price (1.5x or 2x) if it meant significantly higher stability and uptime. 5% more uptime would gain more trust than 5% more on gamified model metrics.
Anthropic used to position itself as more of the enterprise option and still does, but their recent issues make it seem like they are watering down the experience to appease the $20 customer rather than the $200 one. As painful as it is personally, I'd expect they'd get more long-term benefit from raising prices and gaining trust than from short-term customer acquisition at a $20 price point.
For anyone who was wondering about Mythos release plans:
> What we learn from the real-world deployment of these safeguards will help us work towards our eventual goal of a broad release of Mythos-class models.
Interestingly, GitHub Copilot is charging 2.5x as much for Opus 4.7 prompts as they charged for Opus 4.6 prompts (7.5x instead of 3x). And they're calling this "promotional pricing," which sounds a lot like they're planning to go even higher.
Note they charge per-prompt and not per-token so this might in part be an expectation of more tokens per prompt.
https://github.blog/changelog/2026-04-16-claude-opus-4-7-is-...
Assuming /effort max still gets the best performance out of the model (meaning "ULTRATHINK" is still a step below /effort max, and equivalent to /effort high), here is what I landed on when trying to get Opus 4.7 to be at peak performance all the time in ~/.claude/settings.json:
{
"env": {
"CLAUDE_CODE_EFFORT_LEVEL": "max",
"CLAUDE_CODE_DISABLE_BACKGROUND_TASKS": "1"
}
}
The env field in settings.json persists across sessions without needing /effort max every time. I don't like how unpredictable and low quality sub-agents are, so I like to disable them entirely with disable_background_tasks.
I am using 4.7 with the default extra high thinking, and it is clearly very stupid. It's worse than old Sonnet 4.5.
I had it suggest some parameters for BCFtools and it suggested parameters that would do the opposite of what I wanted to do. I pointed out the error and it apologized.
It also is not taking any initiative to check things, but wants me to check them (i.e., file contents, etc.).
And it is claiming that things are "too complex" or "too difficult" when they are super easy. For instance, refreshing an AWS token: somehow it couldn't figure out that you could do that in a cron task.
A really really bad downgrade. I will be using Codex more now, sadly.
I've been using up way more tokens in the past 10 days with 4.6 1M context.
So I've grown wary of how Anthropic is measuring token use. I had to force non-1M mode halfway through the week because I was tearing through my weekly limit (this is the second week in a row that's happened, whereas I never came CLOSE to hitting my weekly limit even when I was on the $100 Max plan).
So something is definitely off. And if they're saying this model uses MORE tokens, I'm getting more nervous.
Let's say we take Anthropic's security and alignment claims at face value, and they have models that are really good at uncovering bugs and exploiting software.
What should Anthropic do in this case?
Anthropic could immediately make these models widely available. The vast majority of their users just want to develop non-malicious software. But some non-zero portion of users will absolutely use these models to find exploits and develop ransomware and so on. Making the models widely available forces everyone developing software (e.g., whatever browser and OS you're using to read HN right now) into a race where they have to find and fix all their bugs before malicious actors do.
Or Anthropic could slow roll their models. Gatekeep Mythos to select users like the Linux Foundation and so on, and nerf Opus so it does a bunch of checks to make it slightly more difficult to have it automatically generate exploits. Obviously, they can't entirely stop people from finding bugs, but they can introduce some speedbumps to dissuade marginal hackers. Theoretically, this gives maintainers some breathing space to fix outstanding bugs before the floodgates open.
In the longer run, Anthropic won't be able to hold back these capabilities because other companies will develop and release models that are more powerful than Opus and Mythos. This is just about buying time for maintainers.
I don't know that the slow-release model is the right thing to do. It might be better if the world suffers through some short-term pain of hacking and ransomware while everyone adjusts to the new capabilities. But I wouldn't take that approach for granted, and if I were in Anthropic's position I'd be very careful about opening the floodgates.
> where previous models interpreted instructions loosely or skipped parts entirely, Opus 4.7 takes the instructions literally. Users should re-tune their prompts and harnesses accordingly.
interesting
While OpenAI was late to the game with Codex, they are (in spite of the hate they get) consistent in model performance and limits, the model keeps getting better along with the harness (which is open source, unlike Claude's), and they don't hype shit up like Mythos. It seems like Anthropic's PR game is scare tactics and squeezing developers while taking money from big tech. Not to forget, they are the ones who worked with Palantir first. A blatant marketing game, but it has worked for them! Something for other companies to learn from.
> Opus 4.7 uses an updated tokenizer that improves how the model processes text. The tradeoff is that the same input can map to more tokens—roughly 1.0–1.35× depending on the content type. Second, Opus 4.7 thinks more at higher effort levels, particularly on later turns in agentic settings. This improves its reliability on hard problems, but it does mean it produces more output tokens.
I guess that means bad news for our subscription usage.
These stuck out as promising things to try. It looks like xhigh on 4.7 scores significantly higher on the internal coding benchmark (71% vs 54%, though it's unclear what that benchmark is exactly):
> More effort control: Opus 4.7 introduces a new xhigh (“extra high”) effort level between high and max, giving users finer control over the tradeoff between reasoning and latency on hard problems. In Claude Code, we’ve raised the default effort level to xhigh for all plans. When testing Opus 4.7 for coding and agentic use cases, we recommend starting with high or xhigh effort.
The new /ultrareview command looks like something I've been trying to invoke myself with looping, happy that it's free to test out.
> The new /ultrareview slash command produces a dedicated review session that reads through changes and flags bugs and design issues that a careful reviewer would catch. We’re giving Pro and Max Claude Code users three free ultrareviews to try it out.
I've been working with it for the last couple of hours. I don't see it as a massive change from the behaviours observed with Opus 4.6. It seems to exhibit similar blind spots: a very one-track mind that won't consider alternative approaches unless actually prompted, and even then its lateral thinking stays close to the centre of the distribution of likely paths. In a sense it's a first-class mediocrity engine that never tires and rarely executes ideas poorly, but never shows any brilliance either.
Quite a big improvement in coding benchmarks; it doesn't seem like progress is plateauing as some people predicted.
I would love to test it, but on the Pro plan I just did two small sessions with 4.6 Sonnet and it consumed my 5h quota within one hour.
I'm still very happily using Claude Code + Opus 4.5, and am distressed by the idea of losing access to that specific model in a few months. In my experience, 4.5 is very much worth $100/month, whereas 4.6 is basically worthless. I'm honestly not even interested in trying out 4.7. The unfortunate reality of these black boxes is that what makes a particular model shine is very hard to understand and replicate, so you end up with an unpredictable product direction, not something that is steadily improving.
> Instruction following. Opus 4.7 is substantially better at following instructions. Interestingly, this means that prompts written for earlier models can sometimes now produce unexpected results: where previous models interpreted instructions loosely or skipped parts entirely, Opus 4.7 takes the instructions literally. Users should re-tune their prompts and harnesses accordingly.
Yay! They finally fixed instruction following, so people can stop bashing my benchmarks[0] for being broken, because Opus 4.6 did poorly on them and called my tests broken...
[0]: https://aibenchy.com/compare/anthropic-claude-opus-4-7-mediu...
> Opus 4.7 uses an updated tokenizer that improves how the model processes text. The tradeoff is that the same input can map to more tokens—roughly 1.0–1.35× depending on the content type.
caveman[0] is becoming more relevant by the day. I already enjoy reading its output more than vanilla, so this suits me well.
I liked Opus 4.5 but hated 4.6. Every few weeks I tried 4.6 and, after a tirade against it, switched back to 4.5. They said 4.6 had a "bias towards action," which I think meant it just made stuff up when something was unclear, whereas 4.5 would ask for clarification. I hope 4.7 is more of a collaborator, like 4.5 was.
From a few quick tests, it seems to hallucinate a lot more than Opus 4.6. I like to ask random knowledge questions like "What are the best chinese rpgs with a decent translations for someone who is not familiar with them? The classics one should not miss?" and 4.6 gave accurate answers; 4.7 hallucinated the names of games, gave wrong information on how to run them, etc.
Seems common for any type of slightly obscure knowledge.
If the model is based on a new tokenizer, it's very likely a completely new base model. Changing the tokenizer changes the whole foundation a model is built on; it'd be more straightforward to add reasoning to a model architecture than to swap in a new tokenizer.
Usually a ground-up rebuild comes with a bigger announcement. So it's weird that they'd name it 4.7.
Swapping out the tokenizer is a massive change. Not an incremental one.
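A toy way to see why, using two hypothetical vocabularies (nothing to do with Anthropic's actual tokenizers):

import numpy as np

# The same word under two hypothetical vocabularies:
vocab_old = {"un": 0, "believ": 1, "able": 2}
vocab_new = {"unbeliev": 0, "able": 1}
ids_old = [0, 1, 2]   # "unbelievable" -> un + believ + able
ids_new = [0, 1]      # "unbelievable" -> unbeliev + able

# A pretrained embedding table is rows keyed by the OLD token IDs.
emb = np.random.randn(len(vocab_old), 4)   # stand-in for learned weights

# Indexing it with new-vocab IDs runs fine mechanically, but every row
# retrieved was trained to mean a different string. That mismatch holds
# for everything above the embedding layer too, which is why a tokenizer
# swap effectively means retraining the base model, not fine-tuning it.
garbage = emb[ids_new]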
Quick, everyone, to your side projects. We have ~3 days of un-nerfed agentic coding again.
First model to get 100% on my agentic benchmark: https://sql-benchmark.nicklothian.com/?highlight=anthropic_c...
So many messages about how Codex is better than Claude from one day to the next, while my experience is exactly the same. Is OpenAI botting the thread? I can't believe this is genuine content.
I hope this fixes the poor quality we're seeing on Claude Opus 4.6.
But degrading a model right before a new release is not the way to go.
Huge regression on long-context tasks, interestingly.
The MRCR benchmark went from 78% to 32%.
Opus keeps pointing out (in a fashion that could be construed as exasperated) that what it's working on is "obviously not malware" several times in a Cowork response, so I suspect the system prompt could use some tuning...
funny how they use mythos preview in these benchmarks like a carrot on a stick
It seems like they're doing something with the system prompt that I don't quite understand. I'm trying it in Claude Code and tool calls repeatedly show weird messages like "Not malware." Never seen anything like that with other Anthropic models.
Initial testing today - 4.7 excels at abstractions/implementations of abstractions in ways that often failed in 4.5/4.6. This is a great update, I've had to do a lot of manual spec to ensure consistency between design and implementation recently as projects grow.
I'm finding the "adaptive thinking" thing very confusing, especially having written code against the previous thinking budget / thinking effort / etc modes: https://platform.claude.com/docs/en/build-with-claude/adapti...
Also notable: 4.7 now defaults to NOT including a human-readable reasoning token summary in the output, you have to add "display": "summarized" to get that: https://platform.claude.com/docs/en/build-with-claude/adapti...
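For anyone wiring this up with the Python SDK, here's roughly where that lands. Caveat: I'm guessing at the exact shape of the thinking block from the doc snippet, so treat the field placement as an assumption, not gospel:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=4096,
    # Assumed shape: adaptive thinking with the human-readable
    # summary re-enabled via "display", per the docs linked above.
    thinking={"type": "adaptive", "display": "summarized"},
    messages=[{"role": "user", "content": "Generate an SVG of a pelican riding a bicycle"}],
)
print(response.content)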
(Still trying to get a decent pelican out of this one but the new thinking stuff is tripping me up.)