Hacker News

OsrsNeedsf2P yesterday at 10:09 PM

Do your benchmark results indicate any level of regression on Opus 4.6 or 4.5 since their first release?


Replies

gertlabs yesterday at 10:27 PM

We only have some basic time filtering (https://gertlabs.com/?days=30), but most of our samples are from the last 2 months. This is a visualization we plan to add when we've collected more historical data.

But we did heavily resample Claude Opus 4.6 during the height of the degraded performance fiasco, and my takeaway is that API-based eval performance was... about the same. Claude Opus 4.6 was just never significantly better than 4.5.

But we don't really know whether you're getting a different model when authenticated via OAuth/subscription vs. calling the API and paying usage prices. I definitely noticed performance issues recently, too, so I suspect it had more to do with subscription-only degradation and/or hastily shipped harness changes.
