
kasey_junk • yesterday at 1:50 PM

Fwiw I run this eval every week on a set of known prompts, and I believe the within-group differences are bigger than the between-group ones.

That is, I get more variance between Opus 4.6 and itself than I do between the SOTA models.

I don’t have the budget for statistical significance, but I’m convinced that people claiming broad differences are either just vibing or hitting cases where agent features make a big difference.
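
For anyone who wants to check this on their own runs, here is a minimal sketch of a within-model vs. between-model variance comparison. It is not the eval described above: the function name, the model labels, and the scores are all made up for illustration, and it assumes each model has a few repeated-run scores on the same prompt set.

```python
from statistics import mean, pvariance


def within_between_variance(scores_by_model):
    """Compare run-to-run variance with model-to-model variance.

    scores_by_model: dict mapping a model label to a list of scores,
    one score per repeated run of the same prompt set (hypothetical layout).
    Returns (within, between).
    """
    # Within-group: average of each model's variance across its own runs.
    within = mean(pvariance(runs) for runs in scores_by_model.values())
    # Between-group: variance of the per-model mean scores.
    model_means = [mean(runs) for runs in scores_by_model.values()]
    between = pvariance(model_means)
    return within, between


# Made-up scores, three runs per model, purely illustrative.
scores = {
    "model_a": [0.60, 0.80, 0.70],
    "model_b": [0.65, 0.75, 0.70],
    "model_c": [0.70, 0.90, 0.80],
}

within, between = within_between_variance(scores)
print(f"within-model variance:  {within:.4f}")
print(f"between-model variance: {between:.4f}")
print("within > between:", within > between)
```

With only a handful of runs per model these numbers are noisy themselves, so treat the comparison as a sanity check and not as a significance test.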
