Hacker News

jbellis · yesterday at 7:35 PM

For coding, qwen 3.6 35b a3b solved 11/98 of the Power Ranking tasks (best-of-two), compared to 10/98 for the same-size qwen 3.5. So it's at best very slightly improved, and not at all in the class of the dense qwen 3.5 27b (26 solved), let alone opus 4.6 (95/98 solved).


Replies

kristianp · yesterday at 9:32 PM

This has similar problems to SWE-bench, in that models are likely trained on the same open-source projects the benchmark uses.

https://blog.brokk.ai/introducing-the-brokk-power-ranking/

__natty__ · yesterday at 8:09 PM

You're comparing a tiny model for local inference against a proprietary, expensive frontier model. It would be fairer to compare against a similarly priced model, or against small frontier models like haiku, flash, or gpt nano.
