SpicyLemonZest • yesterday at 3:52 PM

It doesn't sound to me like this benchmark is attempting to measure architecture design. As far as I can see in the paper, they do not evaluate the architectural quality of a task completion, only whether the model is capable of completing the task at all.