I recognize the sarcasm. However, the data I can find says it's performing at baseline?

MattSayar • last Thursday at 4:58 PM

ACCount37 • last Thursday at 5:07 PM

Yeah, that's my point. Humans are not reliable LLM evaluators. "Secret model nerfs" happen in "vibes" far more often than they do in any reality.
