Hacker News

XCSme yesterday at 1:38 AM

Yup, they do quite poorly on random non-coding tasks:

https://aibenchy.com/compare/minimax-minimax-m2-7-medium/moo...


Replies

usagisushi yesterday at 4:37 AM

Interesting benchmark. It is notable that Gemini-3-Flash outperforms 3.1 Pro. My experience using Flash via Opencode over the past month suggests it is quite underrated.

Needless to say, benchmarks are limited and impressions vary widely by problem domain, harness, written language, and personal preference (simplicity vs detail, tone, etc.). If personal experience is the only true measure, as with wine, solving this discovery gap is an interesting challenge (LLM sommelier!), even if model evolution eventually makes the choice trivial. (I prefer Gemini 3 for its wide knowledge, Sonnet 4.6 for balance, and GLM-5 for simplicity.)

rmi_ yesterday at 7:54 AM

Wild benchmark. Opus 4.6 is ranked #29, Gemini 3 Flash is #1, ahead of Pro.

I'm not saying it's bad, but it's definitely different than the others.

wizee yesterday at 3:27 AM

It’s also worth comparing Qwen 3.5; it’s a very strong model. Different benchmarks give different results, but in general Qwen 3.5, GLM 5, and Kimi K2.5 are all excellent models, and not too far from current SOTA models in capability/intelligence. In my own non-coding tests, they were better than Gemini 3.1 flash. They’re comparable to the best American models from 6 months ago.

raincole yesterday at 9:51 AM

I can't imagine anyone looking at this benchmark without laughing. It's so disconnected.

scotty79 yesterday at 10:01 AM

GLM 5 here is significantly better than GPT-5.4

comboy yesterday at 8:23 AM

Not really related, but does anybody know if somebody's tracking the same models' performance on some benchmarks over time? Sometimes I feel like I'm being A/B tested.
