Yup, they do quite poorly on random non-coding tasks:
https://aibenchy.com/compare/minimax-minimax-m2-7-medium/moo...
Wild benchmark. Opus 4.6 is ranked #29, while Gemini 3 Flash is #1, ahead of Pro.
I'm not saying it's bad, but it's definitely different than the others.
It’s also worth comparing Qwen 3.5; it’s a very strong model. Different benchmarks give different results, but in general Qwen 3.5, GLM 5, and Kimi K2.5 are all excellent models, and not too far from current SOTA models in capability/intelligence. In my own non-coding tests, they were better than Gemini 3.1 Flash. They’re comparable to the best American models from 6 months ago.
I can't imagine anyone looking at this benchmark without laughing. It's so disconnected.
Not really related, but does anybody know if someone is tracking the same models' performance on benchmarks over time? Sometimes I feel like I'm being A/B tested.
Interesting benchmark. It is notable that Gemini-3-Flash outperforms 3.1 Pro. My experience using Flash via Opencode over the past month suggests it is quite underrated.
Needless to say, benchmarks are limited and impressions vary widely by problem domain, harness, written language, and personal preference (simplicity vs detail, tone, etc.). If personal experience is the only true measure, as with wine, solving this discovery gap is an interesting challenge (LLM sommelier!), even if model evolution eventually makes the choice trivial. (I prefer Gemini 3 for its wide knowledge, Sonnet 4.6 for balance, and GLM-5 for simplicity.)