BobbyJo • yesterday at 11:45 PM

It's really hard to tell. Almost all the models have the benchmarks in their training data, which leaves us ranking model capabilities on vibes. I think the OSS models tend to do worse on things outside their corpus, but DeepSeek specifically has done insanely good work on efficiency and scaling, which is verifiable in a way capabilities are not.
