Hey, Boris from the Claude Code team here. Normally, when you have a conversation with Claude Code...

bcherny • yesterday at 7:02 PM • 54 replies • view on HN

Hey, Boris from the Claude Code team here.

Normally, when you have a conversation with Claude Code, if your convo has N messages, then (N-1) messages hit prompt cache -- everything but the latest message.

The challenge is: when you let a session idle for >1 hour, when you come back to it and send a prompt, it will be a full cache miss, all N messages. We noticed that this corner case led to outsized token costs for users. In an extreme case, if you had 900k tokens in your context window, then idled for an hour, then sent a message, that would be >900k tokens written to cache all at once, which would eat up a significant % of your rate limits, especially for Pro users.

We tried a few different approaches to improve this UX:

1. Educating users on X/social

2. Adding an in-product tip to recommend running /clear when re-visiting old conversations (we shipped a few iterations of this)

3. Eliding parts of the context after idle: old tool results, old messages, thinking. Of these, thinking performed the best, and when we shipped it, that's when we unintentionally introduced the bug in the blog post.

Hope this is helpful. Happy to answer any questions if you have.

Replies

dbeardsl • yesterday at 7:28 PM

I appreciate the reply, but I was never under the impression that gaps in conversations would increase costs nor reduce quality. Both are surprising and disappointing.

I feel like that is a choice best left up to users.

i.e. "Resuming this conversation with full context will consume X% of your 5-hour usage bucket, but that can be reduced by Y% by dropping old thinking logs"

➕ show 6 replies

btown • yesterday at 7:32 PM

Is there a way to say: I am happy to pay a premium (in tokens or extra usage) to make sure that my resumed 1h+ session has all the old thinking?

I understand you wouldn't want this to be the default, particularly for people who have one giant running session for many topics - and I can only imagine the load involved in full cache misses at scale. But there are other use cases where this thinking is critical - for instance, a session for a large refactor or a devops/operations use case consolidating numerous issue reports and external findings over time, where the periodic thinking was actually critical to how the session evolved.

For example, if N-4 was a massive dump of some relevant, some irrelevant material (say, investigating for patterns in a massive set of data, but prompted to be concise in output), then N-4's thinking might have been critical to N-2 not getting over-fixated on that dump from N-4. I'd consider it mission-critical, and pay a premium, when resuming an N some hours later to avoid pitfalls just as N-2 avoided those pitfalls.

Could we have an "ultraresume" that, similar to ultrathink, would let a user indicate they want to watch Return of the (Thin)king: Extended Edition?

➕ show 3 replies

jwr • today at 7:24 AM

These controversies erupt regularly, and I hope that you will see a common thing with most of them: you make a decision for your users without informing them.

Please fight this hubris. Your users matter. Many of us use your tools for everyday work and do not appreciate having the rug pulled from under them on a regular basis, much less so in an underhanded and undisclosed way.

I don't mind the bugs, these will happen. What I do not appreciate is secretly changing things that are likely to decrease performance.

➕ show 3 replies

Terretta • yesterday at 10:28 PM

This violates the principle of least surprise, with nothing to indicate Claude got lobotomized while it napped when so many use prior sessions as "primed context" (even if people don't know that's what they were doing or know why it works).

The purpose of spending 10 to 50 prompts getting Claude to fill the context for you is it effectively "fine tunes" that session into a place your work product or questions are handled well.

// If this notion of sufficient context as fine tune seems surprising, the research is out there.)

Approaches tried need to deal with both of these:

1) Silent context degradation breaks the Pro-tool contract. I pay compute so I don't pay in my time; if you want to surface the cost, surface it (UI + price tag or choice), don't silently erode quality of outcomes.

2) The workaround (external context files re-primed on return) eats the exact same cache miss, so the "savings" are illusory — you just pushed the cost onto the user's time. If my own time's cheap enough that's the right trade off, I shouldn't be using your machine.

uxcolumbo • yesterday at 8:48 PM

I don't envy you Boris. Getting flak from all sorts of places can't be easy. But thanks for keeping a direct line with us.

I wish Anthropic's leadership would understand that the dev community is such a vital community that they should appreciate a bit more (i.e. not nice sending lawyers afters various devs without asking nicely first, banning accounts without notice, etc etc). Appreciate it's not easy to scale.

OpenAI seems to be doing a much better job when it comes to developer relations, but I would like to see you guys 'win' since Anthropic shows more integrity and has clear ethical red lines they are not willing to cross unlike OpenAI's leadership.

cowlby • today at 1:37 PM

Ahh that makes sense. Sometimes it's convenient to re-use an older conversation that has all the context I need. But maybe it's just the last 20% that's relevant.

It would be nice to be able to summarize/cut into a new leaner conversation vs having to coax all the context back into a fresh one. Something like keep the last 100,000 tokens.

I believe /compact achieves something like this? It just takes so long to summarize that it creates friction.

kuboble • yesterday at 9:02 PM

As some others have mentioned.

I think the best option would be tell a user who is about to resurrect a conversation that has been evicted from cache that the session is not cached anymore and the user will have to face a full cost of replaying a session, not only the incremental question and answer.

(In understand under the hood that llms are n^2 by default but it's very counter intuitive - and given how popular cc is becoming outside of nerd circles, probably smaller and smaller fraction of users is aware of it)

I would like to decide on it case by case. Sometimes the session has some really deep insight I want to preserve, sometimes it's discardable.

➕ show 2 replies

isaacdl • yesterday at 7:19 PM

Thanks for giving more information. Just as a comment on (1), a lot of people don't use X/social. That's never going to be a sustainable path to "improve this UX" since it's...not part of the UX of the product.

It's a little concerning that it's number 1 in your list.

ceuk • yesterday at 7:20 PM

Is having massive sessions which sit idle for hours (or days) at a time considered unusual? That's a really, really common scenario for me.

Two questions if you see this:

1) if this isn't best practice, what is the best way to preserve highly specific contexts?

2) does this issue just affect idle sessions or would the cache miss also apply to /resume ?

➕ show 2 replies

fidrelity • yesterday at 7:08 PM

Just wanted to say I appreciate your responses here. Engaging so directly with a highly critical audience is a minefield that you're navigating well.

Thank you.

➕ show 3 replies

saadn92 • yesterday at 7:40 PM

I leave sessions idle for hours constantly - that's my primary workflow. If resuming a 900k context session eats my rate limit, fine, show me the cost and let me decide whether to /clear or push through. You already show a banner suggesting /clear at high context - just do the same thing here instead of silently lobotomizing the model.

➕ show 1 reply

artdigital • yesterday at 11:37 PM

I'm also a Claude Code user from day 1 here, back from when it wasn't included in the Pro/Max subscriptions yet, and I was absolutely not aware of this either. Your explanation makes sense, but I naively was also under the impression that re-using older existing conversations that I had open would just continue the conversation as is and not be a treated as a full cache miss.

My biggest learning here is the 1 hour cache window. I often have multiple Claudes open and it happens frequently that they're idle for 1+ hours.

This cache information should probably get displayed somewhere within Claude Code

➕ show 1 reply

mtilsted • yesterday at 8:20 PM

Then you need to update your documentation and teach claude to read the new documentation because here is what claude code answered:

Question: Hey claude, if we have a conversation, and then i take a break. Does it change the expected output of my next answer, if there are 2 hours between the previous message end the next one?

Answer: No. A 2-hour gap doesn't change my output. I have no internal clock between messages — I only see the conversation content plus the currentDate context injected each turn. The prompt cache may expire (5 min TTL), which affects cost/latency but not the response itself.

  The only things that can change output across a break: new context injected (like updated date), memory files being modified, or files on disk changing.

-- This answer directly contradict your post. It seems like the biggest problem is a total lack of documentation for expected behavior.

A similar thing happens if I ask claude code for the difference between plan mode, and accept edits on.

Then Claude told me the only difference was that with plan mode it would ask for permission before doing edits. But I really don't think this is true. It seems like plan mode does a lot more work, and present it in a total different way. It is not just a "I will ask before applying changes" mode.

➕ show 1 reply

bobkb • yesterday at 8:08 PM

Resuming sessions after more than 1 hour is a very common workflow that many teams are following. It will be great if this is considered as an expected behaviour and design the UX around it. Perhaps you are not realising the fact that Claude code has replaced the shells people were using (ie now bash is replaced with a Claude code session).

➕ show 1 reply

kccqzy • yesterday at 10:33 PM

This just does not match my workflow when I work on low-priority projects, especially personal projects when I do them for fun instead of being paid to do them. With life getting busy, I may only have half an hour each night with Claude to make some progress on it before having to pause and come back the next day. It’s just the nature of doing personal projects as a middle-aged person.

The above workflow basically doesn’t hit the rate limit. So I’d appreciate a way to turn off this feature.

ryanisnan • yesterday at 7:47 PM

Why does the system work like that? Is the cache local, or on Claude's servers?

Why not store the prompt cache to disk when it goes cold for a certain period of time, and then when a long-lived, cold conversation gets re-initiated, you can re-hydrate the cache from disk. Purge the cached prompts from disk after X days of inactivity, and tell users they cannot resume conversations over X days without burning budget.

➕ show 1 reply

andrewingram • today at 11:49 AM

This points to a fairly fundamental mismatch between the realities of running an LLM and the expectations of users. As a user, I _expect_ the cost of resuming X hours/days later to be no different to resuming seconds or minutes later. The fact that there is a difference, means it's now being compensated for in fairly awkward ways -- none of the solutions seem good, just varying degrees of bad.

Is there a more fundamental issue of trying to tie something with such nuanced costs to an interaction model which has decades of prior expectation of every message essentially being free?

➕ show 1 reply

iidsample • yesterday at 7:24 PM

We at UT-Austin have done some academic work to handle the same challenge. Will be curious if serving engines could modified. https://arxiv.org/abs/2412.16434 .

The core idea is we can use user-activity at the client to manage KV cache loading and offloading. Happy to chat more!!

Joeri • yesterday at 7:39 PM

This sounds like one of those problems where the solution is not a UX tweak but an architecture change. Perhaps prompt cache should be made long term resumable by storing it to disk before discarding from memory?

➕ show 2 replies

winternewt • today at 11:39 AM

> Adding an in-product tip to recommend running /clear when re-visiting old conversations (we shipped a few iterations of this)

I feel like I'm missing something here. Why would I revisit an old conversation only to clear it?

To me it sounds like a prompt-cache miss for a big context absolutely needs to be a per-instance warning and confirmation. Or even better a live status indicating what sending a message will cost you in terms of input tokens.

toephu2 • yesterday at 10:01 PM

How does the Claude team recommend devs use Claude Code?

1) Is it okay to leave Claude Code CLI open for days?

2) Should we be using /clear more generously? e.g., on every single branch change, on every new convo?

Folcon • today at 10:45 AM

Hi Boris

I'm curious why 1 hour was chosen?

Is increasing it a significant expense?

Ever since I heard about this behaviour I've been trying to figure out how to handle long running Claude sessions and so far every approach I've tried is suboptimal

It takes time to create a good context which can then trigger a decent amount of work in my experience, so I've been wondering how much this is a carefully tuned choice that's unlikely to change vs something adjustable

ohcmon • yesterday at 8:17 PM

Boris, wait, wait, wait,

Why not use tired cache?

Obviously storage is waaay cheaper than recalculation of embeddings all the way from the very beginning of the session.

No matter how to put this explanation — it still sounds strange. Hell — you can even store the cache on the client if you must.

Please, tell me I’m not understanding what is going on..

otherwise you really need to hire someone to look at this!)

➕ show 3 replies

8note • yesterday at 8:32 PM

reasonably, if i'm in an interactive session, its going to have breaks for an hour or more.

whats driving the hour cache? shouldnt people be able to have lunch, then come back and continue?

are you expecting claude code users to not attend meetings?

I think product-wise you might need a better story on who uses claude-code, when and why.

Same thing with session logs actually - i know folks who are definitely going to try to write a yearly RnD report and monthly timesheets based on text analysis of their claude code session files, and they're going to be incredibly unhappy when they find out its all been silently deleted

➕ show 1 reply

looshch • today at 10:00 AM

> We tried a few different approaches to improve this UX

how about acknowledging that you fucked up your own customers’ money and making a full refund for the affected period?

> Educating users on X/social

that is beyond me

ты не Борис, ты максимум борька

BoppreH • yesterday at 9:45 PM

Isn't that exactly what people had been accusing Anthropic of doing, silently making Claude dumber on purpose to cut costs? There should be, at minimum, a warning on the UI saying that parts of the context were removed due to inactivity.

nhinck3 • today at 10:45 AM

So is it for latency or is it for cost?

Why did you lie 11 days ago, 3 days after the fix went in, about the cause of excess token usage?

the-grump • yesterday at 8:31 PM

That is understandable, but the issue is the sudden drop in quality and the silent surge in token usage.

It also seems like the warning should be in channel and not on X. If I wanted to find out how broken things are on X, I'd be a Grok user.

try-working • yesterday at 11:30 PM

You created this issue by setting a timer for cache clearing. Time is really not a dimension that plays any role in how coding agent context is used.

dnnddidiej • yesterday at 9:49 PM

It is too suprising. Time passed should not matter for using AI.

Either swallow the cost or be transparent to the user and offer both options each time.

willsmith72 • today at 3:26 AM

Wow so that's why you did #2? The explanation in the CLI is really not clear. I thought it was just a suggestion to compact, no idea it was way more expensive than if I hadn't left it idle for an hour.

You guys really need to communicate that better in the CLI for people not on social

noname120 • today at 11:05 AM

Why not automatically run a compaction close to the 1-hour mark? Then the cache miss won’t have such a bad impact.

foobarbecue • today at 12:21 PM

Hi Boris! Wanted to let you know that I find those ads with you saying "now when you code, you use an agent" obnoxious because of that incorrect statement. I have no interest in slop coding. I find it way more ergonomic and effective to use code to tell a machine precisely what to do than to use English to tell it vaguely. I hate that your ad is misleading so many non-coders, who will actually believe your lie that nobody codes anymore. Probably doesn't help that YouTube was playing it as an interruption in every video I watched. I probably saw it 100 times and was getting to the "throw the remote at the tv" stage XD.

Confiks • today at 1:32 AM

So you made this change completely invisible to the user, without the user being able to choose between the two behaviors, and without even documenting it in the (extremely verbose) changelog [1]? I can't find it, the Docs Assistant can't find it (well, it "I found it!" three times being fed your reply with a non-matching item).

I frequently debug issues while keeping my carefully curated but long context active for days. Losing potentially very important context while in the middle of a debugging session resulting in less optimal answers, is costing me a lot more money than the cache misses would.

In my eyes, Claude Code is mainly a context management tool. I build a foundation of apparent understanding of the problem domain, and then try to work towards a solution in a dialogue. Now you tell me Anthrophic has been silently breaking down that foundation without telling me, wasting potentially hours of my time.

It's a clear reminder that these closed-source harnesses cannot be trusted (now or in the future), and I should find proper alternatives for Claude Code as soon as possible.

[1] https://code.claude.com/docs/en/changelog

troupo • yesterday at 7:24 PM

> We tried a few different approaches to improve this UX: 1. Educating users on X/social

No. You had random developers tweet and reply at random times to random users while all of your official channels were completely silent. Including channels for people who are not terminally online on X

➕ show 1 reply

albert_e • today at 9:10 AM

> The challenge is: when you let a session idle for >1 hour, when you come back to it and send a prompt, it will be a full cache miss, all N messages. We noticed that this corner case led to outsized token costs for users.

I dont agree with this being characterized as a "corner case".

Isn't this how most long running work will happen across all serious users?

I am not at my desk babysitting a single CC chat session all day. I have other things to attend to -- and that was the whole point of agentic engineering.

Dont CC users take lunch breaks?

How are all these utterly common scenarios being named as corner cases -- as something that is wildly out of the norm, and UX can be sacrificed for those cases?

infogulch • yesterday at 8:23 PM

How big is the cache? Could you just evict the cache into cheap object storage and retrieve it when resuming? When the user starts the conversation back up show a "Resuming conversation... ⭕" spinner.

mandeepj • today at 1:14 AM

> that would be >900k tokens written to cache all at once

Probably that's why I hit my weekly limits 3-4 days ago, and was scheduled to reset later today. I just checked, and they are already reset.

Not sure if it's already done, shouldn't there be a check somewhere to alert on if an outrageous number of tokens are getting written, then it's not right ?

arcza • yesterday at 10:49 PM

You need to seriously look at your corporate communications and hire some adults to standarise your messaging, comms and signals. The volatility behind your doors is obvious to us and you'd impress us much more if you slowed down, took a moment to think about your customers and sent a consistent message.

You lost huge trust with the A/B sham test. You lost trust with enshittification of the tokenizer on 4.6 to 4.7. Why not just say "hey, due to huge input prices in energy, GPU demand and compute constraints we've had to increase Pro from $20 to $30." You might lose 5% of customers. But the shady A/B thing and dodgy tokenizer increasing burn rate tells everyone inc. enterprise that you don't care about honesty and integrity in your product.

I hope this feedback helps because you still stand to make an awesome product. Just show a little more professionalism.

0123456789ABCDE • today at 8:22 AM

2. could you bring back the _compact and accept plan_? even if it is not the default option.

baq • today at 12:04 PM

maybe you could surface an expected cache miss to the user

samusiam • today at 10:41 AM

For idle sessions I would MUCH rather pay the cost in tokens than reduced quality. Frankly, it's shocking to me that you would make that trade-off for users without their knowledge or consent.

nextaccountic • yesterday at 7:55 PM

what about selling long term cache space to users?

or even, let the user control the cache expiry on a per request basis. with a /cache command

that way they decide if they want to drop the cache right away , or extend it for 20 hours etc

it would cost tokens even if the underlying resource is memory/SSD space, not compute

FuckButtons • today at 12:11 AM

From a utility perspective using a tiered cache with some much higher latency storage option for up to n hours would be very useful for me to prevent that l1 cache miss.

airstrike • today at 2:26 AM

Why is time the variable you're solving for? Why can't I keep that cache warm by keeping the session open?

taspeotis • today at 7:01 AM

Hi, thanks for Claude Code. I was wondering though if you'd considering adding a mode to make text green and characters come down from the top of the screen individually, like in The Matrix?

chris1993 • today at 12:25 AM

So this explains why resuming a session after a 5-hour timeout basically eats most of the next session. How then to avoid this?

useyourforce • today at 12:56 AM

I actually have a suggestion here - do not hide token count in non-verbose mode in Claude Code.

gverrilla • yesterday at 7:18 PM

I drop sessions very frequently to resume later - that's my main workflow with how slow Claude is. Is there anything I can do to not encounter this cache problem?

growt • yesterday at 8:07 PM

Wasn’t cache time reduced to 5 minutes? Or is that just some users interpretation of the bug?

alt Hacker News

Replies

🔗 View 4 more replies