Hey all, Boris from the Claude Code team here.
We've been investigating these reports, and a few of the top issues we've found are:
1. Prompt cache misses when using 1M token context window are expensive. Since Claude Code uses a 1 hour prompt cache window for the main agent, if you leave your computer for over an hour then continue a stale session, it's often a full cache miss. To improve this, we have shipped a few UX improvements (eg. to nudge you to /clear before continuing a long stale session), and are investigating defaulting to 400k context instead, with an option to configure your context window to up to 1M if preferred. To experiment with this now, try: CLAUDE_CODE_AUTO_COMPACT_WINDOW=400000 claude.
2. People pulling in a large number of skills, or running many agents or background automations, which sometimes happens when using a large number of plugins. This was the case for a surprisingly large number of users, and we are actively working on (a) improving the UX to make these cases more visible to users and (b) more intelligently truncating, pruning, and scheduling non-main tasks to avoid surprise token usage.
In the process, we ruled out a large number of hypotheses: adaptive thinking, other kinds of harness regressions, model and inference regressions.
We are continuing to investigate and prioritize this. The most actionable thing for people running into this is to run /feedback, and optionally post the feedback ids either here or in the Github issue. That makes it possible for us to debug specific reports.
Boris, you're seeing a ton of anecdotes here and Claude has done something that has affected a bunch of their most fervent users.
Jeff Bezos famously said that if the anecdotes are contradicting the metrics, then the metrics are measuring the wrong things. I suggest you take the anecdotes here seriously and figure out where/why the metrics are wrong.
So Anthropic is trying to save money on infrastructure, we all get it. However, it's not ok to degrade the performance your users have paid for. Last week the issue was that you reduced the default "effort" level, now the prompt cache is shortened. Several users experience far more restrictive usage limits lately.
There is only so much you can do through "UX improvements" or some smart routing on the backend. Your flagship product is actively getting worse, and if users need to fiddle with hidden settings and keep track of GitHub issues every week they will start voting with their money.
Why did this become an issue seemingly overnight when 1M context has been available for a while, and I assume prompt caching behavior hasn't changed?
EDIT: prompt caching behavior -did- change! 1hr -> 5min on March 6th. I'm not sure how starting a fresh session fixes it, as it's just rebuilding everything. Why even make this available?
It feels like the rules changed and the attitude from Anth is "aw I'm sorry you didn't know that you're supposed to do that." The whole point of CC is to let it run unattended; why would you build around the behavior of watching it like a hawk to prevent the cache from expiring?
For me definitely the worst regression was the system prompt telling claude to analyze file to check if it's malware at every read. That correlates with me seeing also early exhausted quotas and acknowledgments of "not a malware" at almost every step.
It is a horrible error of judgement to insert a complex request for such a basic ability. It is also an error of judgement to make claude make decisions whether it wants to improve the code or not at all.
It is so bad, that i stopped working on my current project and went to try other models. So far qwen is quite promising.
The /clear nudge isn't a solution though. Compacting or clearing just means rebuilding context until Claude is actually productive again. The cost comes either way. I get that 1M context windows cost more than the flat per-token price reflects, because attention scales with context length, but the answer to that is honest pricing or not offering it. Not annoying UX nudges. What’s actually indefensible is that Claude is already pushing users to shrink context via, I presume, system prompt. At maybe 25% fill:
“This seems like a good opportunity to wrap it up and continue in a fresh context window.”
“Want to continue in a fresh context window? We got a lot of work done and this next step seems to deserve a fresh start!”
If there’s a cost problem, fix the pricing or the architecture. But please stop the model and UI from badgering users into smaller context windows at every opportunity. That is not a solution, it’s service degradation dressed as a tooltip.> Since Claude Code uses a 1 hour prompt cache window for the main agent, if you leave your computer for over an hour then continue a stale session, it's often a full cache miss. To improve this, we have shipped a few UX improvements (eg. to nudge you to /clear before continuing a long stale session), and are investigating defaulting to 400k context instead
I don’t understand this. I frequently have long breaks. I never want to clear or even compact because I don’t want to lose the conversations that I’ve had and the context. Clearing etc causes other issues like I have to restate everything at times and it misses things. I do try to update the memory which helps. I wish there was a better solution than a time bound cache
I don't want a nudge. I want a clear RED WARNING with "You've gone away from your computer a bit too long and chatted too much at the coffee machine. You're better off starting a new context!"
Hey Boris - why is the best way to get support making a Hacker News or X post, and hoping you reply? Why does Anthropic Enterprise Support never respond to inquiries?
OpenAI (Codex) keeps on resetting the usage limits each time they fuck up...
I have yet to see Anthropic doing the same. Sorry but this whole thing seems to be quite on purpose.
I suspect 1M token context is questionable value because of the secondary effect of burning quota vs getting work done.
I think the model select that let me choose 1M made sense because I could decide if I was working on large documents and compacting more often was more effective.
Would it be possible to increase the cache duration if misses are a frequent source of problems?
Maybe using a heartbeat to detect live sessions to cache longer than sessions the user has already closed. And only do it for long sessions where a cache miss would be very expensive.
Boris,
Even if Anthropic is working in good faith to lower infrastructure costs, developers need more than 5 minutes to notice that CC completed a task, review its changes and ask it to merge. Only developers who do not review code changes can live with such a TTL...
Consider making this value configurable as the ideal TTL value is different for each person. If people are willing to pay more for 30 minutes TTL than 5 minutes, they should be able to.
> Since Claude Code uses a 1 hour prompt cache window for the main agent
this seems a bit awkward vs the 5 hour session windows.
if i get rate limited once, I'll get rate limited immediately again on the same chat when the rate limit ends?
any chance we can get some form of deffered cache so anything on a rate limited account gets put aside until the rate limit ends?
As another data point, I pay for Pro for a personal account, and use no skills, do nothing fancy, use the default settings, and am out of tokens, with one terminal, after an hour. This is typically working on a < 5,000 line code base, sometimes in C, sometimes in Go. Not doing incredibly complicated things.
Ah, so cache usage impacts rate limits. There goes the ”other harnesses aren’t utilizing the cache as efficiently” argument.
Hi Boris,
Long term claude code user here. Is the first time i've had to setup a hook to codex to review claude output.
Is hallucinating like never before
Is missing key concepts/instructions in context like never before
Is writing bad code that will "pass test" much more. Before it use to try be critic and do good code, now it will try to hack test and bypass intructions for a green pass.
Have you considered poking the cache?
When a user walks away during the business day but CC is sitting open, you can refresh that cache up to 10x before it costs the same as a full miss. Realistically it would be <8x in a working day.
Am I so out of touch?
No! It’s the children who are wrong!
Hi, thanks for Claude Code. I was wondering though if you'd considering adding a mode to make text green and characters come down from the top of the screen individually, like in The Matrix?
I’ve seen the /clear command prompt and I found the verbiage to be a bit unclear. I think clarifying that the cache has expired and providing an understandable metric on the impact - ie “X% of your 5-hour window” for Pro/Mad users and details on token use for API users. A pop-up that requires explicit acknowledgment might also help, although that could be more of an annoyance to enterprise users.
One pattern I use frequently is using one high level design and implementation agent that I’ll use for multiple sessions and delegate implementation to lower level agents.
In this case it’d be helpful to have one of two options:
1. If Claude CLI could create an auto compaction of the conversation history before cache expiration. For example, if I’m beyond X minutes or Y prompts in a conversation and I’ve been inactive for a threshold it could auto-compact close to the expiration and provide that as an option on resume. 2. If I could configure cache expiration proactively and Anthropic could use S3 or a similar slow load mechanism to offload the cache for a longer period - possibly 24-72h.
I can appreciate that longer KV cache expiration would complicate capacity management and make inference traffic less fungible but I wouldn’t mind waiting seconds to minutes for it to load from a slower store to resume without quota hits.
Does this 60min ttl of also apply to claude code web?
I have regularly sessions open for multiple days.
Is that a pattern that is not advised?
Could we get an option to use Opus with a smaller context window? I noticed that results get much worse way earlier than when you reach 1M tokens, and I would love to have a setting so that I could force a compaction at eg 300k tokens.
You've created quite a conundrum.
The only people who are going to run into issues are superpower users who are running this excessively beyond any reasonable measure.
Most people are going to be quite happy with your service. But at the same time, and this is just a human nature thing people are 10 times more likely we complain about an issue than to compliment something working well.
I don't know how to fix this, but I strongly suspect this isn't really a technical issue. It's more of a customer support one.
> defaulting to 400k context instead, with an option to configure your context window to up to 1M if preferred
This seems really useful!
I'm surprised that "Opus 4.6" (200K) and "Opus 4.6 1M" are the only Opus options in the desktop app, whereas in the CLI/TUI app you don't seem to even get that distinction.
I bet that for a lot of folks something like 400k, 600k or 800k would work as better defaults, based on whatever task they want to work on.
Boris, wasnt this the same thing ~2 weeks ago? Is it the same cache misses as before? What's the expected time till solved? Seems like its taking a while
Thank you for your responses, especially on a Sunday. They give us some insights and at least a couple temporary workarounds to use, while the issues are being addressed :) much appreciated
/loop message ping every 4 minutes
keeps the cache warm while the CC REPL is not active.
Hello Boris! How do I increase the 1 hour prompt cache window for the main agent? I would love to be able to set that to, say, 4 hours. That gives me enough time to work on something, go teach a class, grab a snack, and come back and pick up where I left off.
Resizing the context window seems like a very good idea to me. I noticed a decline of productivity when the 1M context window was released and I'd like to bring it back to 200k, because it was totally fine for the things I was working on.
shouldn't compaction be interactive with the user as to what context will continue to be the most relevant in the future??? what if the harness allowed for a turn to clarify the user's expected future direction of the conversation and did the consolidation based upon the addition info?
there definitely seems to be a benefit to pruning the context and keeping the signal to noise high wrt what is still to be discussed.
> To improve this, we have shipped a few UX improvements (eg. to nudge you to /clear before continuing a long stale session)
Is this really an improvement? Shouldn't this be something you investigate before introducing 1M context?
What is a long stale session?
If that's not how Claude Code is intended to be used it might as well auto quit after a period of time. If not then if it's an acceptable use case users shouldn't change their behavior.
> People pulling in a large number of skills, or running many agents or background automations, which sometimes happens when using a large number of plugins.
If this was an issue there should have been a cap on it before the future was released and only increased once you were sure it is fine? What is "a large number"? Then how do we know what to do?
It feels like "AI" has improved speed but is in fact just cutting corners.
Where can i learn about concepts like prompt cache misses? I don't have a mental model how that interacts with my context of 1M or 400k tokens... I can cargo cult follow instructions of course but help us understand if you can so we can intelligently adapt our behavior. Thanks.
Why are you all of a sudden running into so many issues like this? Could it be that all of the Anthropics employees have completely unlimited and unbounded accounts, which means you don't get a feeling of how changes will affect the customers?
I have a feature request: I build an mcp server, but now it has over 60 tools. Most sessions i really don’t need most of them. I suppose I could make this into several servers. But it would maybe be nice to give the user more power here. Like let me choose the tools that should be loaded or let me build servers that group tools together which can be loaded. Not sure if that makes sense …
Have you tried asking Mythos for a fix?
from looking at the raw requests, that cant seem right?
its all "cache_control": { "type": "ephemeral" } there is no "ttl" anywhere.
// edit: cc_version=2.1.104.f27
How can we turn of 1m context? I don't find it has ever helped.
Claude Code cache is not 1 hour. There is a "Closed as not planned" issue in GitHub that confirms that it has been moved to 5 minutes since March: https://github.com/anthropics/claude-code/issues/46829. I started seeing the massive degradation exactly on the 23rd of March, hence after a few days I unsubscribed because it was completely unusable, with a ~5h session being depleted in as little as 15-20 mins.
People, just switch to MiniMax and ditch CC completely. It's not worth it any more.
There's an issue someone raised showing that prompt caches are only 5 minutes.
The reply seems to be: oh huh, interesting. Maybe that's a good thing since people sometimes one-shot? That doesn't feel like the messaging I want to be reading, and the way it conflicts with the message here that cache is 1 hour is confusing.
https://news.ycombinator.com/item?id=47741755
Is there any status information or not on whether cache is used? It sure looks like the person analyzing the 5m issue had to work extremely hard to get any kind of data. It feels like the iteration loop of people getting better at this stuff would go much much better if this weren't such a black box, if we had the data to see & understand: is the cache helping?
Number 2 makes me chuckle honestly. Too many people going down the 10x rabbit holes on youtube. Next up, a framework that 100xs your workflow. You know its good because it comes with 300 agents and 20 mcp servers and 1200 skills
Pulling all the skills and agents in the world in, when unused are a big hit. I deleted all of mine and added back as needed and there was an improvement.
Running Claude Cowork in the background will hit tokens and it might not be the most efficient use of token use.
Last, but not least, turning off 1M token context by default is helpful.
Can you explain why Opus 4.6 suddenly becomes dumb as a sack of potatoes, even if context is barely filled?
Can you explain why Opus 4.6 will be coming up with stupid solutions only to arrive at a good one when you mention it is trying to defraud you?
I have a feeling the model is playing dumb on purpose to make user spend more money.
This wasn't the case weeks ago when it actually working decently.
Eh you say that every time and yet it keeps happening.
Boris, is the KV cache TTL now reduced to 5 minutes from 1 hour?
I think this may be the biggest concern for people building tools on the API: https://github.com/anthropics/claude-code/issues/46829
I would argue that KV caching is a net gain for Ant and a well-maintained cache is the biggest thing that can generate induced demand and a thriving third party ecosystem. https://safebots.ai/papers/KV.pdf
Wait what? If I get told to come back in three hours because I'm using the product too much, I get penalized when I resume?
What's the right way to work on a huge project then? I've just been saying "Please continue" -- that pops the quota?
[flagged]
One thing I didn't see anywhere here, except your mention about pulling in large number of skills, is that the token consumption is significantly higher for users with many agents, skills, and MCPs installed, and many are mere ghosts. The 5m TTL from #46829 compounds the effect: in my case, I found ~20k tokens of ghost context I hadn't intentionally opened. Each idle period after 5m wastes that as a full cache miss.
Boris, would you please confirm on-record: is the current cache TTL for the main agent context 1h or 5m? Issue #46829 was closed as "not planned".