
gertlabs yesterday at 10:27 PM

We only have some basic time filtering (https://gertlabs.com/?days=30), but most of our samples are from the last 2 months. This is a visualization we plan to add when we've collected more historical data.

But we did heavily resample Claude Opus 4.6 during the height of the degraded performance fiasco, and my takeaway is that API-based eval performance was... about the same. Claude Opus 4.6 was just never significantly better than 4.5.

But we don't really know whether you get a different model when authenticated via OAuth/subscription versus calling the API and paying usage prices. I definitely noticed performance issues recently, too, so I suspect the problems had more to do with subscription-only degradation and/or hastily shipped harness changes.


Replies

b--l today at 12:27 AM

"but most of our samples are from the last 2 months."

There's your major issue. That's well within the brutal quantization window.