
Topfi · today at 11:55 AM

For what it's worth, here are some of my experiences, as I have recently had some major deviations from what I've come to expect:

1.) Opus 4.7 via the API is great. Unlike 4.6, I have found the model to degrade far less beyond 120k tokens; even 600k can be relied upon. Task Inference, Task Evaluation, Task Adherence, and tool calling all do very well on my evals. However, for the first time in a while, I ended my Claude Max subscription, because after their post-mortem [0] I saw, for the first time, true, reproducible, incredibly frustrating regressions in model output when using Claude Code.

Yes, this was after their post in the last week of April 26, and yes, I had been fortunate enough never to be affected by regressions up to this point. The model via API with other harnesses provides consistent, useful, high-quality output, but the recent changes have become an avalanche of "this requires more than two changes so we should table this for later" and "it seems the subagent finding was wrong and this is not actionable", with a healthy mix of suggestions that are clearly there to save tokens but go against explicit instructions.

I understand that they are compute constrained, but as someone who until recently never maxed out their weekly limits and almost never their 5-hour limits on the Max 5x plan, these changes are not just frustrating (and make reasonable users think the model was nerfed rather than the harness) but also cost more, as I now have to prompt four times and spend thousands of tokens more for a task that the same harness with the same model previously did far more efficiently. I regularly check the numbers, and yes, by trying to be more efficient, they made what I am costing them far higher, going beyond what I pay for the subscription; a rough sketch of that arithmetic is below. Ironically, and I must emphasise this, I did not have regressions before, which suggests at least some major luck in A/B testing.
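
To make that concrete, here is a minimal back-of-the-envelope sketch in Python. Every number is hypothetical (illustrative token counts and a placeholder rate, not Anthropic's actual pricing); the point is only that a harness which trims tokens per attempt but forces retries costs more per completed task:

    # All numbers are made up for illustration.
    PRICE_PER_TOKEN = 0.000015  # placeholder rate, not a real price

    def task_cost(tokens_per_attempt: int, attempts: int) -> float:
        """Total cost of one completed task across all attempts."""
        return tokens_per_attempt * attempts * PRICE_PER_TOKEN

    before = task_cost(tokens_per_attempt=40_000, attempts=1)  # one clean run
    after = task_cost(tokens_per_attempt=25_000, attempts=4)   # "efficient" runs, retried
    print(f"before: ${before:.2f}, after: ${after:.2f}")
    # before: $0.60, after: $1.50 -- the per-attempt saving is wiped out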

2.) GPT-5.5 is amazing, a true jump of a kind I have not seen since GPT-5, and far more than even GPT-5.4 it is approaching the way Anthropic models handle task inference, which has also led to far reduced reasoning needs. I very much like it, with the exception of the reduced context window and the degradation in compaction (a sketch of the general technique is below). GPT-5.4 did compaction so consistently well that the 272k standard window before the price increase was of no concern, and going beyond it was reliably possible. With GPT-5.5, the cost per token is doubled and compaction is far less reliable, leading to loss of task adherence and, in certain cases, preventing task completion. I am aware GPT-5.5 is a new pretrain (though how new is debatable, given that frontend work is still abhorrently poor and has been since Horizon Alpha, which I maintain was worse than GPT-4.1), and I am hopeful they can integrate some of the solutions they were leveraging for GPT-5.4 compaction. Until then, it remains a great model for very challenging and complex blockers, but not a drop-in replacement for GPT-5.4.
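
For anyone unfamiliar with compaction, the general technique is to fold the oldest turns into a summary once the context approaches the window limit. Below is a minimal Python sketch of that idea; it is not OpenAI's actual implementation, and summarize(), the token estimate, and the constants are all stand-ins:

    # Naive threshold-triggered compaction (generic sketch, not GPT-5.x's method).
    BUDGET = 272_000   # context window, in tokens
    TRIGGER = 0.8      # compact once 80% of the window is used
    KEEP_RECENT = 10   # newest turns kept verbatim

    def count_tokens(messages: list[str]) -> int:
        return sum(len(m) // 4 for m in messages)  # crude ~4-chars-per-token estimate

    def summarize(messages: list[str]) -> str:
        # Stand-in: in practice this would be another model call.
        return f"[summary of {len(messages)} earlier turns]"

    def maybe_compact(messages: list[str]) -> list[str]:
        if count_tokens(messages) < BUDGET * TRIGGER:
            return messages
        head, tail = messages[:-KEEP_RECENT], messages[-KEEP_RECENT:]
        return [summarize(head)] + tail

The failure mode described above falls straight out of this structure: whatever the summary step drops is gone for good, so an unreliable summarize() silently loses task state.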

3.) Kimi K2.6 is great for the API price: efficient, fast, and very strong on all my metrics. I like it very much, far more than Deepseek V4 Pro or any Qwen, Z.AI, or Meta model, and I truly am impressed. Composer 2 has shown how much further you can take the base given the right data, and if I had to pay exclusively API pricing without any subscriptions, I think I'd have no problem leaning on K2.6 for most needs; a rough illustration of that calculus is below. It is what I'd love to see from Mistral or Apple, and it shows that an open-weight company can't just succeed in a few narrow areas (Z.AI with tool calling, Deepseek with world knowledge, Mistral with being European, etc.) but has to provide a balanced product across the board. I just wish they'd expose Agent Swarms via the API; there are a few experiments I'd like to try.
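
As a rough illustration of that API-pricing calculus, with entirely made-up rates (check the providers' current price sheets for real numbers):

    # Hypothetical per-month cost comparison; rates are placeholders, NOT real prices.
    RATES_PER_1M_TOKENS = {"kimi-k2.6": 0.60, "frontier-model": 15.00}  # USD, made up

    def monthly_cost(model: str, tasks: int, tokens_per_task: int) -> float:
        return tasks * tokens_per_task * RATES_PER_1M_TOKENS[model] / 1_000_000

    for model in RATES_PER_1M_TOKENS:
        print(model, f"${monthly_cost(model, tasks=300, tokens_per_task=50_000):.2f}")
    # kimi-k2.6 $9.00 vs frontier-model $225.00 at these placeholder rates --
    # a ~25x gap, which is why covering "most needs" on pure API pricing is plausible.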

[0] https://www.anthropic.com/engineering/april-23-postmortem