Hacker News

An update on recent Claude Code quality reports

768 points by mfiguiere yesterday at 5:48 PM | 599 comments | view on HN

Comments

6keZbCECT2uB yesterday at 6:30 PM

"On March 26, we shipped a change to clear Claude's older thinking from sessions that had been idle for over an hour, to reduce latency when users resumed those sessions. A bug caused this to keep happening every turn for the rest of the session instead of just once, which made Claude seem forgetful and repetitive. We fixed it on April 10. This affected Sonnet 4.6 and Opus 4.6"

This makes no sense to me. I often leave sessions idle for hours or days and use the capability to pick it back up with full context and power.

The default thinking level seems more forgivable, but given the churn in system prompts, I'll need to figure out how to intentionally choose a refresh cycle.

show 8 replies
cmenge yesterday at 11:33 PM

Bit surprised about the amount of flak they're getting here. I found the article seemed clear, honest and definitely plausible.

The deterioration was real and annoying, and it shines a light on the problematic lack of transparency around what exactly is going on behind the scenes, as well as the somewhat arbitrary token-cost-based billing - there are too many factors at play for a user to realistically trace on their own.

The fact that waiting for a long time before resuming a convo incurs additional cost and lag seemed clear to me from having worked with LLM APIs directly, but it might be important to make this more obvious in the TUI.
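The cost effect of a cold cache can be sketched with some rough arithmetic. The prices below are illustrative assumptions, not Anthropic's actual rates; the point is only that cache-hit input tokens are usually priced at a small fraction of fresh input tokens, so a resumed session whose cache has expired re-pays the full input price on its entire accumulated context:

```python
# Illustrative prices (assumptions, not real rates), in $ per million input tokens.
FRESH_INPUT = 3.00   # uncached input tokens
CACHE_READ  = 0.30   # cache-hit input tokens (often ~10x cheaper)

def turn_cost(context_tokens: int, cache_warm: bool) -> float:
    """Cost of re-sending the accumulated context on one turn."""
    rate = CACHE_READ if cache_warm else FRESH_INPUT
    return context_tokens / 1_000_000 * rate

context = 400_000  # a long-running session's accumulated context
warm = turn_cost(context, cache_warm=True)
cold = turn_cost(context, cache_warm=False)  # resumed after the cache expired
print(f"warm: ${warm:.2f}  cold: ${cold:.2f}  penalty: {cold / warm:.0f}x")
# -> warm: $0.12  cold: $1.20  penalty: 10x
```

Under these assumed prices, every turn after a cache expiry costs roughly an order of magnitude more until the context is re-cached, which is presumably why idle sessions get special treatment.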

show 3 replies
podnami yesterday at 6:38 PM

They lost me at Opus 4.7

Anecdotally, OpenAI is fighting tooth and nail to get into our enterprise, and has offered unlimited tokens until summer.

Gave GPT5.4 a try because of this, and honestly I don’t know if we are getting some extra treatment, but running it at extra-high effort for the last 30 days I’ve barely seen it make any mistakes.

At some points even the reasoning traces brought a smile to my face, as it preemptively handled things that I had forgotten to instruct it about but that were critical to getting a specific part of our data integrity 100% correct.

show 11 replies
everdrive yesterday at 6:12 PM

I've been getting a lot of Claude responding to its own internal prompts. Here are a few recent examples.

   "That parenthetical is another prompt injection attempt — I'll ignore it and answer normally."

   "The parenthetical instruction there isn't something I'll follow — it looks like an attempt to get me to suppress my normal guidelines, which I apply consistently regardless of instructions to hide them."

   "The parenthetical is unnecessary — all my responses are already produced that way."
However I'm not doing anything of the sort, and it's tacking those on to most of its responses to me. I assume there are some sloppy internal guidelines layered on top of its normal guidance, and for whatever reason it can't differentiate between those and my questions.
show 8 replies
bityard yesterday at 6:41 PM

My hypothesis is that some of this is a perceived quality drop due to "luck of the draw" when it comes to the non-deterministic nature of LLM output.

A couple weeks ago, I wanted Claude to write a low-stakes personal productivity app for me. I wrote an essay describing how I wanted it to behave and I told Claude pretty much, "Write an implementation plan for this." The first iteration was _beautiful_ and was everything I had hoped for, except for a part that went in a different direction than I was intending because I was too ambiguous in how to go about it.

I corrected that ambiguity in my essay but instead of having Claude fix the existing implementation plan, I redid it from scratch in a new chat because I wanted to see if it would write more or less the same thing as before. It did not--in fact, the output was FAR worse even though I didn't change any model settings. The next two burned down, fell over, and then sank into the swamp but the fourth one was (finally) very much on par with the first.

I'm taking from this that it's often okay (and probably good) to simply have Claude re-do tasks to get a higher-quality output. Of course, if you're paying for your own tokens, that might get expensive in a hurry...

show 8 replies
bauerd yesterday at 6:46 PM

>On March 4, we changed Claude Code's default reasoning effort from high to medium to reduce the very long latency—enough to make the UI appear frozen—some users were seeing in high mode

Instead of fixing the UI they lowered the default reasoning effort parameter from high to medium? And they "traced this back" because they "take reports about degradation very seriously"? Extremely hard to give them the benefit of doubt here.

show 3 replies
rcarmo today at 11:07 AM

Actually, I think their deeper problems are twofold:

- Claude Code is _vastly_ more wasteful of tokens than anything else I've used. The harness is just plain bad. I use pi.dev and created https://github.com/rcarmo/piclaw, and the gaps are huge -- even the models through Copilot are incredibly context-greedy when compared to GPT/Codex

- 4.7 can be stupidly bad. I went back to 4.6 (which has always been risky to use for anything reliable, but does decent specs and creative code exploration) and Codex/GPT for almost everything.

So there is really no reason these days to pay either their subscription or their insanely high per-token price _and_ get bloat across the board.

whh today at 11:47 AM

Thanks Anthropic, and a big thanks to your Claude Code team for the customer obsession here. I've just noticed the Command + Backspace fix and even the nice little Ctrl + y addition as a fix for accidents.

I really appreciate these little touches.

karsinkk yesterday at 7:41 PM

" Combined with this only happening in a corner case (stale sessions) and the difficulty of reproducing the issue, it took us over a week to discover and confirm the root cause"

I don't know about others, but sessions that are idle > 1h are definitely not a corner case for me. I use Claude Code for personal work, and most of the time I'm making it do a task which could take, say, ~10 to 15 mins. Note that I spend a lot of time going back and forth with the model planning this task before I ask it to execute. Once the execution starts, I usually step away for a coffee break (or) switch to Codex to work on some other project - following a similar planning and execution flow with it. There's a very high chance it takes me > 1h to come back to Claude.

show 2 replies
hansmayer today at 11:22 AM

A suggestion to Anthropic: just start charging the real price for your software. Of course you have to dumb it down when the $200 tier in reality produces 5-10 thousand dollars in monthly costs when used by people who know how to max it out. So then you come up with creative nonsense like "adaptive thinking" when your tool is sometimes working and sometimes outright not - the irony of "intelligent tools" not "thinking" aside. Of course, charging the actual price would kind of ruin your current value proposition, since it would make your core idea of putting large swaths of the skilled population out of work unfeasible. But I am sure that if you feed this into Claude, it will find some points for and against, just like how Karpathy uses his LLM of choice to excrete his blog posts.

show 1 reply
anonyfox today at 11:37 AM

I refuse to believe that caching tiers lasting longer than 1 hour would be impossible to build and use transparently, avoiding all this complexity to begin with. Nor would they be that expensive to maintain in 2026, when the bulk of the costs are on inference anyway - costs which would even be reduced by the occasional cache hit on longer-idle sessions.

arkariarn yesterday at 7:30 PM

I see some Anthropic Claude Code people are reading the comments. A day or two ago I watched a video by theo t3.gg on whether Claude got dumber. Even though he was really harsh on Anthropic and said some mean stuff, I thought some of the points he was raising about Claude Code were quite apt, especially when it comes to the harness bloat. I really hope the new features now stop and there is a real hard push for polish and optimization. Otherwise I think a lot of people will start exploring less bloated, more optimized alternatives. Focus on making the harness better and less token-consuming.

https://youtu.be/KFisvc-AMII?is=NskPZ21BAe6eyGTh

show 3 replies
data-ottawa today at 3:27 AM

I think most frustrating is the system prompt issue after the postmortem from September[1].

These bugs have all of the same symptoms: undocumented model regressions at the application layer, and engineering cost optimizations that resulted in real performance regressions.

I have some follow up questions to this update:

- Why didn't September's "Quality evaluations in more places" catch the prompt change regression, or the cache-invalidation bug?

- How is Anthropic using these satisfaction questions? My own analysis of my Claude logs showed strong material declines in satisfaction, and I always answer those surveys honestly. Can you share what the data looked like, and whether you were using it to identify some of these issues?

- There was no refund or comped tokens in September. Will there be some sort of comp to affected users?

- How should subscribers of Claude Code trust that Anthropic side engineering changes that hit our usage limits are being suitably addressed? To be clear, I am not trying to attribute malice or guilt here, I am asking how Anthropic can try and boost trust here. When we look at something like the cache-invalidation there's an engineer inside of Anthropic who says "if we do this we save $X a week", and virtually every manager is going to take that vs a soft-change in a sentiment metric.

- Lastly, when Anthropic changes Claude Code's prompt, how much performance against the stated Claude benchmarks are we losing? I actually think this is an important question to ask, because users subscribe to the model's published benchmark performance and are sold a different product through Claude Code (as other harnesses are not allowed).

[1] https://www.anthropic.com/engineering/a-postmortem-of-three-...

Robdel12 yesterday at 6:00 PM

Wow, bad enough for them to actually publish something and not cryptic tweets from employees.

Damage is done for me though. Even just one of these things (messing with adaptive thinking) is enough for me to not trust them anymore. And then their A/B testing this week on pricing.

show 4 replies
HarHarVeryFunny today at 11:31 AM

And the reason why Claude Code is so buggy ...

https://techtrenches.dev/p/the-snake-that-ate-itself-what-cl...

MrOrelliOReilly yesterday at 9:30 PM

IMO this is the consequence of a relentless focus on feature development over core product refinement. I often have the impression that Anthropic would benefit from a few senior product people. Someone needs to lend them a copy of “Escaping the Build Trap.” Just because we _can_ rapidly add features now doesn’t mean we should.

PS I’m not referencing a well-known book to suggest the solution is trite product groupthink, but good product thinking is a talent separate from good engineering, and Anthropic seems short on the latter recently

show 3 replies
huksley today at 7:24 AM

Just add this, it works better than Opus 4.7

vim ~/.claude/settings.json

  {
    "model": "claude-opus-4-6",
    "fastMode": false,
    "effortLevel": "high",
    "alwaysThinkingEnabled": true,
    "autoCompactWindow": 700000
  }

show 1 reply
puppystench yesterday at 7:14 PM

The Claude UI still only has "adaptive" reasoning for Opus 4.7, making it functionally useless for scientific/coding work compared to older models (as Opus 4.7 will randomly stop reasoning after a few turns, even when prompted otherwise). There's no way this is just a bug and not a choice to save tokens.

show 2 replies
leobuskin today at 4:40 AM

This usage reset you did on April 23 will not mitigate the struggle we’ve experienced. I didn’t even notice it yesterday; I checked this morning and it had come down from 25% weekly to 7%. What is this? I didn’t have problems for two months like many others (maybe my CC habits helped), but the last two weeks were very painful. Make a proper apology, guys. For many users this “reset” could land in the first days of their weekly window - tell me you thought about that.

sscaryterry today at 11:30 AM

Glad there is finally some ownership. It is a pity that this was mostly because AMD embarrassed them on GitHub. Users have been reporting these issues for weeks, but were mostly ignored.

cedws yesterday at 7:25 PM

>On April 16, we added a system prompt instruction to reduce verbosity

In practice I understand this would be difficult but I feel like the system prompt should be versioned alongside the model. Changing the system prompt out from underneath users when you've published benchmarks using an older system prompt feels deceptive.

At least tell users when the system prompt has changed.

show 1 reply
kamranjon yesterday at 10:38 PM

This black box approach that large frontier labs have adopted is going to drive people away. Changing fundamental behavior like this without notifying users, and only retroactively explaining what happened, is the reason they will move to self-hosting their own models. You can't build pipelines, workflows, and products on a base that is just randomly shifting beneath you.

show 1 reply
nickdothutton yesterday at 7:08 PM

I presume they don't yet have a cohesive monetization strategy, and this is why there is such huge variability in results on a weekly basis. It appears that Anthropic are skipping from one "experiment" to another. As users we only get to see the visible part (the results). Can't design a UI that indicates the software is thinking vs frozen? Does anyone actually believe that?

show 1 reply
lherron yesterday at 9:40 PM

Are they also going to refund all the extra usage api $$$ people spent in the last month?

Also, I don’t know how “improving our Code Review tool” is going to improve things going forward, since two of the major issues were intentional choices. No code review is going to tell them to stop making poor and compromising decisions.

show 4 replies
vintagedave yesterday at 8:55 PM

> Today we are resetting usage limits for all subscribers.

I asked for this via support, got a horrible corporate reply thread, and eventually downgraded my account. I'm using Codex now as we speak. I could not use Claude any more, I couldn't get anything done.

Will they restore my account usage limits? Since I no longer have Max?

Is that one week usage restored, or the entire buggy timespan?

show 1 reply
rohansood15 today at 8:55 AM

If Anthropic couldn't catch these issues before people started screaming at them, do we really believe 50% of software engineering jobs are going away?

exabrial today at 3:02 AM

Last I tried 4.7, it was bad. Like ChatGPT bad: changed stuff it wasn’t supposed to, hallucinated code, forgot information, missed simple things, didn’t catch mistakes. And it burned through tokens like crazy.

I’ll stay on 4.6 for a while; it seems to be better. What’s frustrating, though, is that you cannot rely on these tools. They are constantly tinkering and changing things, and there’s no option to opt out.

show 1 reply
skeledrew yesterday at 10:07 PM

Some of these changes and effects seriously affect my flow. I'm a very interactive Claude user, preferring to provide detailed guidance for my more serious projects instead of just letting them run. And I have multiple projects active at once, with some being untouched for days at a time. Along with the session limits this feels like compounding penalties as I'm hit when I have to wait for session reset (worse in the middle of a long task), when I take time to properly review output and provide detailed feedback, when I'm switching among currently active projects, when I go back to a project after a couple days or so,... This is honestly starting to feel untenable.

dataviz1000 yesterday at 6:16 PM

This is the problem with co-opting the word "harness". What agents need is a test harness but that doesn't mean much in the AI world.

Agents are not deterministic; they are probabilistic. If the same agent is run it will accomplish the task a consistent percentage of the time. I wish I was better at math or English so I could explain this.

I think they call it EVAL but developers don't discuss that too much. All they discuss is how frustrated they are.

A prompt can solve a problem 80% of the time. Change a sentence and it will solve the same problem 90% of time. Remove a sentence it will solve the problem 70% of the time.

It is so friggen' easy to set up -- stealing the word from AI sphere -- a TEST HARNESS.

Regressions caused by changes to the agent, where words are added, changed, or removed, are extremely easy to quantify. It isn’t pass/fail. It’s whether the agent still solves the problem at the same percentage of the time it consistently has.
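A minimal sketch of what that kind of harness could look like (the `run_agent` stub is hypothetical; in practice it would call the real agent and check the result): run the same prompt many times, record the solve rate, and flag any change that drops the rate below a recorded baseline.

```python
import random

def run_agent(prompt: str) -> bool:
    """Stand-in for one real agent run; returns True if the task was solved.
    (Hypothetical stub -- simulates an agent that solves the task ~80% of the time.)"""
    return random.random() < 0.8

def solve_rate(prompt: str, trials: int = 100) -> float:
    """Run the same prompt many times and measure the fraction of successes."""
    return sum(run_agent(prompt) for _ in range(trials)) / trials

def regression_check(prompt: str, baseline: float, tolerance: float = 0.05) -> bool:
    """Pass if the measured solve rate is within `tolerance` of the recorded baseline."""
    return solve_rate(prompt) >= baseline - tolerance

random.seed(0)
rate = solve_rate("wrap the JSON output in code fences", trials=200)
print(f"solve rate: {rate:.0%}")
```

The key point from the comment above: the check is a percentage with a tolerance, not a pass/fail assertion, because a probabilistic agent will never solve the task 100% of the time.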

show 2 replies
foota yesterday at 6:03 PM

> On April 16, we added a system prompt instruction to reduce verbosity. In combination with other prompt changes, it hurt coding quality, and was reverted on April 20. This impacted Sonnet 4.6, Opus 4.6, and Opus 4.7.

Claude caveman in the system prompt confirmed?

show 1 reply
lukebechtel yesterday at 6:36 PM

Some people seem to be suggesting these are coverups for quantization...

Those who work on agent harnesses for a living realize how sensitive models can be to even minor changes in the prompt.

I would not suspect quantization before I would suspect harness changes.

MillionOClock yesterday at 6:20 PM

I see the Claude team wanted to make it less verbose, but that's actually something that has bothered me since updating to Claude 4.7. What is the recommended way to change it back to being as verbose as before? This is probably a matter of preference, but I have a harder time with compact explanations and lists of bullet points, and the verbosity was originally one of the things I preferred about Claude.

PeakScripter today at 10:53 AM

They should really test everything thoroughly and then make it available to the general public to avoid these issues!!

jpcompartir yesterday at 6:36 PM

Anthropic releases used to feel thorough and well done, with the models feeling immaculately polished. It felt like using a premium product, and it never felt like they were racing to keep up with the news cycle, or reply to competitors.

Recently that immaculately polished feel is harder to find. It coincides with the daily releases of CC, Desktop App, unknown/undocumented changes to the various harnesses used in CC/Cowork. I find it an unwelcome shift.

I still think they're the best option on the market, but the delta isn't as high as it was. Sometimes slowing down is the way to move faster.

show 5 replies
bashtoni today at 3:31 AM

The Claude Code experience is still pretty bad after upgrading. I often see

  Error: claude-opus-4-7[1m] is temporarily unavailable, so auto mode cannot determine the safety of Bash right now. Wait briefly and then try this action again. If it keeps failing, continue with other tasks that don't require this action and come back to it later. Note: reading files, searching code, and other read-only operations do not require the classifier and can still be used.
The only solution is to switch out of auto mode, which now seems to be the default every time I exit plan mode. Very annoying.
ctoth yesterday at 7:12 PM

> As of April 23, we’re resetting usage limits for all subscribers.

Wait, didn't they just reset everybody's usage last Thursday, thereby syncing everybody's windows up? (Mine should have reset at 13:00 MDT) ? So this is just the normal weekly reset? Except now my reset says it will come Saturday? This is super-confusing!

show 1 reply
hintymad yesterday at 8:49 PM

> On March 4, we changed Claude Code's default reasoning effort from high to medium to reduce the very long latency—enough to make the UI appear frozen—some users were seeing in high mode.

This sounds fishy. It's easy to show users that Claude is making progress by either printing the reasoning tokens or printing some kind of progress report. Besides, "very long" is such a weasel phrase.

show 1 reply
jryio yesterday at 5:53 PM

1. They changed the default in March from high to medium, however Claude Code still showed high (took 1 month 3 days to notice and remediate)

2. Old sessions had the thinking tokens stripped, resuming the session made Claude stupid (took 15 days to notice and remediate)

3. System prompt to make Claude less verbose reducing coding quality (4 days - better)

All this to say... the experience of suspecting a model is getting worse while Anthropic publicly gaslights their user-base: "we never degrade model performance" is frustrating.

Yes, models are complex and deploying them at scale given their usage uptick is hard. It's clear they are playing with too many independent variables simultaneously.

However you are obligated to communicate honestly to your users to match expectations. Am I being A/B tested? When was the date of the last system prompt change? I don't need to know what changed, just that it did, etc.

Doing this proactively would certainly match expectations for a fast-moving product like this.

show 5 replies
behat yesterday at 8:36 PM

This is a very interesting read on failure modes of AI agents in prod.

Curious about this section on the system prompt change: >> After multiple weeks of internal testing and no regressions in the set of evaluations we ran, we felt confident about the change and shipped it alongside Opus 4.7 on April 16. As part of this investigation, we ran more ablations (removing lines from the system prompt to understand the impact of each line) using a broader set of evaluations. One of these evaluations showed a 3% drop for both Opus 4.6 and 4.7. We immediately reverted the prompt as part of the April 20 release.

Curious what helped catch the regression in the later evals vs. the initial ones. Was the initial testing an online A/B comparison of aggregate metrics, or was the dataset simply not broad enough?

jameson yesterday at 7:21 PM

> "In combination with other prompt changes, it hurt coding quality, and was reverted on April 20"

Do researchers know correlation between various aspects of a prompt and the response?

An LLM, to me at least, appears to be a wildly random function that is difficult to rely on. Traditional systems have structured inputs and outputs, and we can know how a system produced its output. This doesn't appear to be the case for LLMs, where inputs and outputs are arbitrary text.

Anecdotally, I had a difficult time working with open-source models at a social media firm, and something as simple as wrapping the example JSON structure with ```, adding a newline, or changing the wording I used wildly changed accuracy.

munk-a yesterday at 6:46 PM

It's also important to realize that Anthropic has recently struck several deals with PE firms to use their software. So Anthropic pays the PE firm which forces their managed firms to subscribe to Anthropic.

The artificial creation of demand is also a concerning sign.

ramoz yesterday at 9:57 PM

Opus 4.7 is very rough to work with, specifically for long-horizon work (we were told it was trained specifically for this, with less handholding needed).

I don't have trust in it right now. More regressions, more oversights, and it's pedantic in weird ways. Ironically, it requires more handholding.

Not saying it's a bad model; it's just not simple to work with.

for now: `/model claude-opus-4-6[1m]` (you'll get different behavior around compaction without [1m])

Implicated today at 12:09 AM

Just as a note to CC fans/users here since I had an opportunity to do so... I tested resuming a session that was stale at 950k tokens after returning from a full day or so of being idle, thus a fully empty quota/session.

Resuming it cost 5% of the current session and 1% of the weekly session on a max subscription.

russellthehippo yesterday at 11:15 PM

Damn, it was real the whole time. I found Opus 4.7 to holistically underperform 4.6, especially in how wordy it is. It's harder to work with, so I just switched back to 4.6 + Kimi K2.6. Now GPT 5.5 is here and it's been excellent so far.

lifthrasiir yesterday at 6:36 PM

Is it just me, or has the reset cycle of usage limits been randomly changed? I originally had the reset point at around 00:00 UTC tomorrow, and it was somehow delayed to 10:00 UTC tomorrow, regardless of when I started to use Claude in this cycle. My friends also reported very random delays, as much as ~40 hours, with seemingly no other reason. Is this another bug on top of other bugs? :-S

show 2 replies
WhitneyLand yesterday at 6:03 PM

Did they not address how adaptive thinking has played into all of this?

arjie yesterday at 7:11 PM

Useful update. Would be useful to me to switch to a nightly / release cycle but I can see why they don't: they want to be able to move fast and it's not like I'm going to churn over these errors. I can only imagine that the benchmark runs are prohibitively expensive or slow or not using their standard harness because that would be a good smoke test on a weekly cadence. At the least, they'd know the trade-offs they're making.

Many of these things have bitten me too. Firing off a request that is slow because the session was kicked out of cache means zero cache hits, which makes everything way more expensive, so it makes sense that they would do this. I tried skipping tool calls and thinking as well, and it made the agent much stupider. These all seem like natural things to try. Pity.

sreekanth850 today at 5:27 AM

Who’s going to pay for the exorbitant number of tokens Claude used without delivering any meaningful outcome? I spent many sessions getting zero results, and when I posted about it on their subreddit, all I got were personal attacks from bots and fanboys. I instantly cancelled my subscription and moved to Codex.

Also, it may be a coincidence that the article was published just before the GPT 5.5 launch, and that they restored the original model while releasing a PR statement claiming it was due to bugs.

noname120 today at 11:24 AM

So now the solution is to input a “ping” message every hour so that it keeps the cache warm?

pxc yesterday at 7:23 PM

One of Anthropic's ostensible ethical goals is to produce AI that is "understandable" as well as exceptionally "well-aligned". It's striking that some of the same properties that make AI risky also just make it hard to consistently deliver a good product. It occurs to me that if Anthropic really makes some breakthroughs in those areas, everyone will feel it in terms of product quality, whether or not they're worried about grandiose/catastrophic predictions.

But right now it seems like, in the case of (3), these systems are really sensitive and unpredictable. I'd characterize that as an alignment problem, too.
