I like your style, and I appreciate you trying to get to the truth. We are both aware that we are engaging in persuasive writing here, so part of the rhetorical game is in what we choose to emphasize and what we choose to leave out.
> How likely do you think this is? Do you think it is more likely than the other three I mentioned?
I won't write down probability estimates, because frankly, I have no idea. Unless you are yourself a decision-maker at Anthropic, which, from what I can infer, you aren't, both of us are speculating. However, I can try to address each of your explanations at face value, because I don't think any of them makes Anthropic look any better than the explanation I provided.
> (1) One possibility is they are having capacity and/or infrastructure problems so the model performance is degraded.
As far as I understand it, scaling issues would result in increased latency or requests being dropped, not model quality being lower. However, there is a very widespread rumor that Anthropic is routing traffic to quantized models during peak times to help decrease costs. Boris Cherny, Thariq Shihipar, and others have repeatedly denied this is happening [1]. I would be more concerned if this were the actual explanation, because as a user of the Claude Code Max plan and of the API, I have the expectation that each dollar I spend buys me access to the same model without opaque routing in the background.
> (2) Another possibility is that they are not as tuned to what customers want relative to what their engineers want.
There is actually a strong case for this: the gap between the high benchmark scores and the qualitatively low performance reported on real-world tasks after launch. I suspect quite a bit of RL training was spent optimizing for beating those benchmarks, which resulted in overfitting the model to particular kinds of tasks. I'm not claiming this is nefarious in any way or that it is something only Anthropic is guilty of doing: these benchmarks are supposed to be a good representation of general software tasks, and using them as a training ground is expected.
> (3) It is also possible they have slowed their models down due to safety concerns. To be more specific, they are erring on the side of caution (which would be consistent with their press releases about safety concerns of Mythos).
This would be the most concerning to me. I don't want to get too deeply into a political/philosophical argument, but I am very much on the other side of the e/accy vs. P(doomy) debate, and I strongly believe that keeping these tools under the control of some council of enlightened elders who claim to know what is best for humanity is ultimately futile.
If the result of the behind-the-scenes "cerebration" is an actual effort to try and slow down AI development or access, I don't have much confidence in the future of Anthropic.
I agree that there are incentives other than pure profit maximization here (I don't want to get into "my friend at Anthropic told me such and such" games, but I also believe this is the case). I'm sure there is some tension between these objectives inside Anthropic, but what is interesting is that lower model quality and maximizing user engagement could, at least in principle, align with both constraints.
I strive to be decently Bayesian and embrace uncertainty. I'm sharing my probability estimates because it helps me to stop and think ("is this roughly what I think?" and "let me spend a minute making sure before I say so"). But yeah, of course, they are my priors, and they are fuzzy. Hopefully, on reflection, I can pin them down to within +/- 15% or so. But at least you can see how my takes compare with each other. And down the road I can see how I did.
Thanks for getting into some of the details ...
>> (1) One possibility is they are having capacity and/or infrastructure problems so the model performance is degraded.
> As far as I understand it, scaling issues would result in increased latency or requests being dropped, not model quality being lower.
Yes, many scaling issues would manifest in that way -- but not all. It seems plausible for Anthropic to have other ways to degrade model performance that don't show up in the latency or reliability metrics. I need to research more... (I'll try to think more on your other points later).