So the benchmark is: two models with different harnesses produced very different results. GLM gam...

maxdo • today at 1:06 PM

So the benchmark is: two models with different harnesses produced very different results.

The GLM game was completely broken. The Opus game looked OK at first glance, but it also had bugs.

Different models with different costs produced different, imperfect results. How is that “close”? :)

Also, on costs: GLM burns more tokens on average than Opus. GPT-5.5, surprisingly, burns fewer.
