SwellJoe • yesterday at 10:30 PM • 4 replies

I've been doing benchmarking of various models for finding hard security bugs, and my faith in Haiku (and Sonnet, even) has dropped precipitously in the process. Self-hosted Qwen 3.6 27B consistently outperforms both for finding security bugs, which was a shocking result. I expected Qwen to be around Haiku level, maybe a little worse, and I definitely expected it to be worse than Sonnet.

And, DeepSeek and MiMo perform much better than Haiku and Sonnet, near Opus/GPT 5.5 levels, at a fraction of the cost.

There's seemingly no reason to ever use Haiku or Sonnet, unless you're getting them for free or as part of a subscription (that you don't usually saturate).

Replies

SyneRyder • today at 2:57 PM

I don't suppose you've had a chance to benchmark MiniMax V3 yet? I've only just started testing other models after being an Anthropic fan. I haven't put MiniMax V3 to coding tasks yet, but something about my early simple tests has impressed me. The MiniMax API pricing is about 7% of Anthropic API prices (about matching Anthropic's subscription pricing).

gwerbin • yesterday at 10:48 PM

I don't think that's what these small models are for. They are for things like text summarization and generating a title for your AI session. Maybe Haiku occupies a weird zone where it's overpowered for those tasks but underpowered for anything more sophisticated. But, for example, I used it on an agentic reasoning task recently (reading a chunk of information and drawing a written conclusion, not writing code) and it did just fine. A more powerful model would have been a waste of money.

canpan • today at 5:35 AM

Same opinion. Opus is best for coding, but Qwen 3.6 27b Q8 is next, before Sonnet.

Sonnet might have more knowledge and is maybe good for making Excel sheets, but it does not write good code and does not follow instructions well.

But 27b Q8 needs a very beefy PC (48GB VRAM or more), so it is not an option many people can use. And DS4F is so cheap right now, if you are open to externally hosted models.
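
A rough back-of-the-envelope check on that VRAM figure, as a sketch only. It assumes a nominal 8 bits per weight and counts the weights alone. Real 8-bit quantization formats store extra scale data, and the KV cache and runtime buffers come on top, which is presumably where the headroom up to 48GB goes. The function name is made up for illustration and nothing here is measured.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Assumption: 27B parameters at a nominal 8 bits per weight.
# Quantization scales, KV cache, and runtime overhead are NOT included.
weights_gb = weight_memory_gb(27, 8.0)
print(f"weights only: {weights_gb:.1f} GB")
```

So the weights by themselves already come to roughly 27 GB, more than a single 24GB card holds before any context is allocated.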

egeozcan • today at 4:38 AM

DeepSeek competes with Sonnet, not significantly worse or better. It tends to do weird things in codebases on the bigger side.
