I think the most frustrating part is the system prompt issue, coming after the postmortem from September[1].
These bugs all share the same symptoms: undocumented regressions at the application layer, and engineering cost optimizations that resulted in real performance regressions.
I have some follow-up questions about this update:
- Why didn't September's "Quality evaluations in more places" catch the prompt-change regression or the cache-invalidation bug?
- How is Anthropic using these satisfaction questions? My own analysis of my Claude logs showed strong, material declines in satisfaction, and I always answer those surveys honestly. Can you share what the data looked like and whether you were using it to identify some of these issues?
- There were no refunds or comped tokens in September. Will there be some sort of compensation for affected users this time?
- How can Claude Code subscribers trust that Anthropic-side engineering changes that eat into our usage limits are being suitably addressed? To be clear, I am not trying to attribute malice or guilt here; I am asking how Anthropic can build trust on this. With something like the cache-invalidation bug, there is an engineer inside Anthropic who can say "if we do this we save $X a week", and virtually every manager is going to take that over a soft change in a sentiment metric.
- Lastly, when Anthropic changes Claude Code's prompt, how much performance do we lose against the published Claude benchmarks? I think this is an important question to ask, because users subscribe on the strength of the model's published benchmark performance but are sold a different product through Claude Code (as other harnesses are not allowed).
[1] https://www.anthropic.com/engineering/a-postmortem-of-three-...